1 Introduction

Clustering is a powerful tool that aims to group similar objects into the same cluster and dissimilar objects into different clusters by identifying dominant structures in the data. It has remained a widely studied research area in machine learning [1, 2] and has applications in diverse domains such as computer vision, text mining, bioinformatics and signal processing [3–6].

Traditional point-based clustering methods such as k-means [1] and k-median [7] partition the data into clusters based on cluster prototype points. These methods perform poorly when the data are not distributed around such cluster points. In contrast, plane-based clustering methods such as k-plane clustering [8], proximal plane clustering [9] and local k-proximal plane clustering [10] have been proposed in the literature. These methods compute k cluster center planes and partition the data into k clusters according to the proximity of the data points to these k planes.

Jayadeva et al. [11] proposed the twin support vector machine (TWSVM) classifier for binary data classification, in which the two hyperplanes are obtained by solving two related smaller-sized quadratic programming problems (QPPs), as compared to the single large-sized QPP in the conventional support vector machine (SVM). Mehrkanoon et al. [12] introduced a general framework of non-parallel support vector machines, which involves a regularization term, a scatter loss and a misclassification loss. Taking motivation from [11], Xie and Sun [13] proposed multi-view twin support vector machines in the semi-supervised learning framework, which combine two views by introducing a similarity constraint between the two one-dimensional projections identifying the two distinct TWSVMs from the two feature spaces. An inherent shortcoming of twin support vector machines is that the resulting hyperplanes are very sensitive to outliers in the data. To overcome this disadvantage, Xie and Sun [14] proposed multitask centroid twin support vector machines. More recent extensions and developments of TWSVMs are discussed in [15, 16].

Recently, Wang et al. [17] proposed a novel plane-based clustering method, namely twin support vector clustering (TWSVC). The method is based on the twin support vector machine (TWSVM) [11] and exploits information from both within and between clusters. Different from TWSVM, the formulation of TWSVC is modified so that each cluster plane is close to the points of its own cluster and, at the same time, far away from the points of the other clusters on both sides of the cluster plane. Experimental results in [17] show the superiority of the method over existing plane-based methods.

Working along the lines of [18], in this paper, we first extend TWSVC to least squares TWSVC (LS-TWSVC) and further propose a fuzzy extension of LS-TWSVC, termed F-LS-TWSVC, by incorporating a fuzzy matrix that represents the membership value of each data point to each of the available clusters. The key features of F-LS-TWSVC are listed below:

  • We modify the quadratic programming problem (QPP)-based formulation of TWSVC in the least squares sense, which leads to an optimization problem with equality constraints.

  • A regularization term is introduced in the objective function, which accounts for the structural risk along with the empirical risk associated with the data samples.

  • The solution of LS-TWSVC requires solving a series of systems of linear equations, as opposed to a series of QPPs along with systems of linear equations in the case of TWSVC.

  • We incorporate a fuzzy membership matrix, assigning each data sample a membership value to each cluster, in order to extend LS-TWSVC to F-LS-TWSVC. The initial fuzzy membership matrix is obtained using a fuzzy nearest neighbor algorithm [19] (as discussed in Sect. 5.3).

  • Experimental results on several benchmark UCI datasets indicate that the proposed F-LS-TWSVC achieves similar or better clustering accuracy than TWSVC, with considerably lower computational time, for both the linear and nonlinear cases.

  • We also perform experiments on image segmentation as an application of the proposed formulation.

The paper is organized as follows. In Sect. 2, we briefly discuss k-means and TWSVC. Section 3 presents the formulations of LS-TWSVC and F-LS-TWSVC along with the algorithm in detail. Section 4 discusses the nonlinear extensions of LS-TWSVC and F-LS-TWSVC, respectively. A computational comparison of the proposed formulations with other plane-based formulations is carried out in Sect. 5. Section 6 provides the concluding remarks.

2 Background and related work

The samples are denoted by a set of m row vectors \(X = \{ x_{1};x_{2};\ldots ;x_{m} \}\) in the n-dimensional real space \(\mathbb {R}^{n}\), where the j th sample is \(x_{j} = (x_{j1},x_{j2},\ldots ,x_{jn})\). We assume that these samples belong to k clusters with their corresponding cluster labels in \(\{ 1,2,\ldots ,k \}\). Let \(X_{i}\) denote the set of samples belonging to cluster label i and \(\overline{X_{i}}\) the set of samples belonging to the remaining cluster labels, where \(i = 1,2,\ldots ,k\). The fuzzy memberships of the samples are denoted by k column vectors \(\{ s_{1},s_{2},\ldots ,s_{k} \}\), where \(s_j\), \(j=1,\ldots,k\), contains the fuzzy membership values of all samples in the j th cluster. Let \(S_{i}\) and \(\overline{S_{i}}\) denote the diagonal fuzzy membership matrices corresponding to the samples belonging to cluster label i and to the remaining cluster labels, respectively, where \(i = 1,2,\ldots ,k\); their diagonal entries represent the degree of association of each pattern with cluster i and with the remaining clusters, respectively.

2.1 k-Means

Consider the clustering problem with a set X of m unlabeled data samples in \(\mathbb {R}^{n}\). k-means [1] partitions X into k clusters \(X_{1},X_{2},\ldots ,X_{k}\) such that the data samples are close to their respective cluster center points \(\mu _{1},\mu _{2},\ldots ,\mu _{k}\). It aims to minimize the following objective function

$$\begin{aligned} \underset{(\mu _{1},\mu _{2},\ldots ,\mu _{k} ,X_{1},X_{2},\ldots ,X_{k})}{\mathrm{min}} \quad \sum \limits _{i=1}^{k} \sum \limits _{j=1}^{m_{i}} ||X_{i}(j) - \mu _{i} ||_{2}, \end{aligned}$$
(1)

where \(X_{i}(j)\) represents the j th sample in \(X_{i}\), \(m_{i}\) is the number of samples in \(X_{i}\) so that \(m_{1}+m_{2}+\cdots +m_{k} = m\), and \(||. ||_{2}\) denotes the \(L_{2}\) norm.

In practice, an iterative relocation algorithm is followed which minimizes (1) locally. Given an initial set of k cluster center points, each sample x is assigned to its nearest cluster center by

$$\begin{aligned} y = \underset{i}{\arg \min }\; \{ ||x - \mu _{i} ||,\quad i = 1,2,\ldots ,k \}. \end{aligned}$$
(2)

Then the k cluster center points are updated as the means of the corresponding cluster samples, since for a given assignment \(X_{i}\) the mean of the cluster samples is the solution to (1). At each iteration, the cluster centers and sample labels are updated until some convergence criterion is satisfied.
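For illustration, a minimal NumPy sketch of this relocation scheme is given below. It is illustrative only (the function name, initialization strategy and tolerance are ours, not the implementation used in the experiments).

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: X is (m, n); returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(max_iter):
        # assignment step (2): label each sample with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned samples
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:  # convergence criterion
            break
        centers = new_centers
    return labels, centers
```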

2.2 TWSVC

Working along the lines of TWSVM, Wang et al. [17] proposed TWSVC. In TWSVC, the following problem is considered in order to obtain k cluster center planes \(w_{i}^{\mathrm{T}}x + b_{i} = 0\), \(i = 1,2,\ldots ,k\), one for each cluster:

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{(w_{i},\;b_{i},\;q_{i},X_{i})}}&\frac{1}{2} \Vert (X_{i}w_{i}+b_{i}e)\Vert _2^2 + Ce^{\mathrm{T}}q_{i} \nonumber \\ \text {s.t.}&\nonumber \\&|(\overline{X_{i}}w_{i}+b_{i}e)|+q_{i}\ge e,\nonumber \\&q_{i} \ge 0, \end{aligned}$$
(3)

where \(C>0\) is a penalty parameter and \(q_{i}\) is a slack vector corresponding to the i th cluster. Here, \(|\cdot |\) denotes the absolute value and expresses the condition that the i th cluster center plane is required to be close to the patterns of cluster \(X_i\) and far away from the patterns of the remaining clusters \(\overline{X_i}\) on both sides.

Each of the k hyperplanes is close to the samples of its own cluster and far away from the samples of the other clusters on both sides, unlike the One-Against-All (OAA)-based multi-class TWSVM, which yields hyperplanes that are close to the samples of their own class but far from the samples of the other classes on one side only.

For a given \(X_{i}\), Wang et al. [17] solved (3) by the concave–convex procedure (CCCP) [20], which decomposes it into a series of convex quadratic subproblems with an initial \(w_{i}^{0}\) and \(b_{i}^{0}\) as follows:

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{(w_{i}^{j+1},\;b_{i}^{j+1},\;q_{i}^{j+1})}}&\frac{1}{2} \Vert (X_{i}w_{i}^{j+1}+b_{i}^{j+1}e)\Vert _2^2 + Ce^{\mathrm{T}}q_{i}^{j+1} \nonumber \\ \text {s.t.}&\nonumber \\&T(|(\overline{X_{i}}w_{i}^{j+1}+b_{i}^{j+1}e)|)+q_{i}^{j+1}\ge e,\nonumber \\&q_{i}^{j+1} \ge 0, \end{aligned}$$
(4)

where the index of the subproblem is \(j=0,1,2,\ldots\), and T(.) denotes the first-order Taylor expansion.

Wang et al. [17] showed that the above problem (4) is equivalent to the following optimization problem:

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{(w_{i}^{j+1},\;b_{i}^{j+1},\;q_{i}^{j+1})}}&\frac{1}{2} \Vert (X_{i}w_{i}^{j+1}+b_{i}^{j+1}e)\Vert _2^2 + Ce^{\mathrm{T}}q_{i}^{j+1} \nonumber \\ \text {s.t.}& \nonumber \\&{\mathrm{diag}}({\mathrm{sign}}(\overline{X_{i}}w_{i}^{j}+b_{i}^{j}e))(\overline{X_{i}}w_{i}^{j+1}+b_{i}^{j+1}e)+q_{i}^{j+1}\ge e,\nonumber \\& q_{i}^{j+1} \ge 0. \end{aligned}$$
(5)

The solution of (5) is obtained by solving its dual problem

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{\alpha }}&\frac{1}{2} \alpha ^{\mathrm{T}}G(H^{\mathrm{T}}H)^{-1}G^{\mathrm{T}}\alpha - e^{\mathrm{T}}\alpha \nonumber \\ \text {s.t.}&\nonumber \\&0 \le \alpha \le Ce, \end{aligned}$$
(6)

where \(G = {\mathrm{diag}}({\mathrm{sign}}(\overline{X_{i}}w_{i}^{j}+b_{i}^{j}e)) [\overline{X_{i}} \quad e]\), \(H = [X_{i} \quad e]\) and \(\alpha \in \mathbb {R}^{m-m_{i}}\) is the Lagrangian multiplier vector.

Once the solution of (6) is obtained, the decision variables \([w_i^{j+1};b_i^{j+1}]\) are obtained by solving the system of linear equations

$$\begin{aligned}{}[w_{i}^{j+1} ; b_{i}^{j+1}]^{\mathrm{T}} = (H^{\mathrm{T}}H)^{-1} G^{\mathrm{T}}\alpha . \end{aligned}$$
(7)

In short, for each \(i=1,2,\ldots ,k\), we select an initial \(w_{i}^{0}\) and \(b_{i}^{0}\), solve for \([w_{i}^{j+1} ; b_{i}^{j+1}]\) by (7) for \(j=0,1,2,\ldots\), and stop when \(||[w_{i}^{j+1} ; b_{i}^{j+1}] - [w_{i}^{j} ; b_{i}^{j}]||\) is small enough. We then set \(w_{i} = w_{i}^{j+1}\), \(b_{i} = b_{i}^{j+1}\).

Given any initial cluster assignment of X, TWSVC alternates between updating the cluster center planes by solving (3) with a given \(X_{i}\) and updating the cluster assignments by relabeling each sample via \(y = \underset{i}{\arg \min } \{ |w_{i}^{\mathrm{T}}x + b_{i}|,\; i=1,2,\ldots ,k \}\). The iterations are repeated until some convergence criterion is met.
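The relabeling step of this alternation can be written compactly; the NumPy helper below is a hypothetical sketch (names are ours) that assigns every sample to the plane minimizing \(|w_{i}^{\mathrm{T}}x + b_{i}|\).

```python
import numpy as np

def assign_to_planes(X, W, b):
    """Relabel samples by the nearest cluster plane.

    X : (m, n) data matrix, W : (k, n) plane normals, b : (k,) offsets.
    Returns, for each sample, the index i minimizing |w_i^T x + b_i|.
    """
    scores = np.abs(X @ W.T + b)   # (m, k) matrix of |w_i^T x + b_i|
    return scores.argmin(axis=1)
```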

It is to be noted that the solution of (5) requires solving a QPP with \(m-m_{i}\) variables and, in addition, the inversion of a matrix of size \((n+1)\times (n+1)\), where \(n \ll m\).

TWSVC was also extended in [17] to handle the nonlinear case by considering k cluster center kernel-generated surfaces, for \(i = 1,2,\ldots ,k\),

$$\begin{aligned} K(x,X)u_{i} + \gamma _{i} = 0, \end{aligned}$$
(8)

where K is an arbitrary kernel, \(u_{i} \in \mathbb {R}^{m}\) and \(\gamma_i \in \mathbb {R}\). The kernel counterpart of (3) for \(i=1,2,\ldots ,k\) is

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{(u_{i},\;\gamma _{i},\;\eta _{i},X_{i})}}&\frac{1}{2} \Vert (K(X_{i},X)u_{i}+\gamma _{i}e)\Vert _2^2 + Ce^T\eta _{i} \nonumber \\ \text {s.t.}&\nonumber \\&|(K(\overline{X_{i}},X)u_{i}+\gamma _{i}e)|+\eta _{i}\ge e,\nonumber \\&\eta _{i} \ge 0, \end{aligned}$$
(9)

where \(\eta _{i}\) is a slack vector. The above problem is solved in a manner similar to the linear case by CCCP. However, it is worth mentioning that for each i \((i= 1,2,\ldots ,k)\), the solution of nonlinear TWSVC is decomposed into a series of subproblems, each of which requires the inversion of a matrix of size \((m+1)\times (m+1)\) along with solving a QPP, where m is the total number of patterns.

3 Fuzzy least squares twin support vector clustering

Taking motivation from [21], we first propose a least squares version of TWSVC and then extend it to fuzzy LS-TWSVC. Here, we modify the primal problem of linear TWSVC (3) in the least squares sense, replacing the inequality constraints by equality constraints and adding a regularization term to the objective function to incorporate the structural risk minimization (SRM) principle. Thus, for cluster i \((i=1,2,\ldots ,k)\), the optimization problem is given as:

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{(w_{i},\;b_{i},\;q_{i},X_{i})}}&\frac{1}{2} \Vert (X_{i}w_{i} +b_{i}e)\Vert _2^2 + \frac{\nu }{2}\left( \Vert w_{i}\Vert ^2_{2} + b_{i}^{2}\right) + \frac{C}{2}\Vert q_{i}\Vert _{2}^2 \nonumber \\ \text {s.t.}&\nonumber \\&|(\overline{X_{i}}w_{i}+b_{i}e)|+q_{i}=e, \end{aligned}$$
(10)

where \(\nu >0\) is a parameter. Note that QPP (10) uses the square of the \(L_2\)-norm of the slack variable \(q_{i}\) instead of the \(L_1\)-norm of \(q_{i}\) in (3), which makes the constraint \(q_{i} \ge 0\) redundant [18]. Solving (10) amounts to solving a system of linear equations.

Further, we introduce the fuzzy matrices \(S_i\) and \(\overline{S_i}\) in (10), which indicate the fuzzy membership value of each data point to the available clusters, as follows:

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{(w_{i},\;b_{i},\;q_{i},X_{i})}}&\frac{1}{2} \Vert ((S_iX_{i})w_{i}+b_{i}e)\Vert _2^2 + \frac{\nu }{2}\left( \Vert w_{i}\Vert ^2_{2} + b_{i}^{2}\right) + \frac{C}{2}\Vert q_{i}\Vert _{2}^2 \nonumber \\ \text {s.t.}&\nonumber \\&|((\overline{S_i}\overline{X_{i}})w_{i}+b_{i}e)|+q_{i}= e. \end{aligned}$$
(11)

Similar to the TWSVC formulation [17], the above optimization problem can be solved using the concave–convex procedure (CCCP) [20], which decomposes it into a series of quadratic subproblems indexed by j \((j=0,1,2,\ldots)\), with an initial \(w_i^0\) and \(b_i^0\), as follows:

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{\left( w_{i}^{j+1},\;b_{i}^{j+1},\;q_{i}^{j+1}\right) }}&\;\frac{1}{2} \Vert \left( (S_iX_{i})w_{i}^{j+1}+b_{i}^{j+1}e\right) \Vert _2^2 + \frac{\nu }{2}\left( \Vert w_{i}^{j+1}\Vert ^2_{2} + (b_{i}^{j+1})^{2}\right) + \frac{C}{2}\Vert q_{i}^{j+1}\Vert _{2}^2 \nonumber \\ \text {s.t.}&\nonumber \\&T\left( |(\overline{S_i}\overline{X_{i}})w_{i}^{j+1} +b_{i}^{j+1}e|\right) +q_{i}^{j+1} = e, \end{aligned}$$
(12)

where T(.) denotes the first-order Taylor expansion.

Working along the lines of [17], problem (12) reduces to

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{\left( w_{i}^{j+1},\;b_{i}^{j+1},\;q_{i}^{j+1}\right) }}&\; \frac{1}{2} \Vert \left( (S_iX_{i})w_{i}^{j+1}+b_{i}^{j+1}e\right) \Vert _2^2 + \frac{\nu }{2}\left( \Vert w_{i}^{j+1}\Vert ^2_{2} + (b_{i}^{j+1})^{2}\right) + \frac{C}{2}\Vert q_{i}^{j+1}\Vert _{2}^2 \nonumber \\ \text {s.t.}&\nonumber \\&{\mathrm{diag}}\left( {\mathrm{sign}}((\overline{S_i}\overline{X_{i}})w_{i}^{j}+b_{i}^{j}e)\right) \left( (\overline{S_i}\overline{X_{i}})w_{i}^{j+1}+b_{i}^{j+1}e\right) +q_{i}^{j+1}= e. \end{aligned}$$
(13)

Substituting the error variable \(q_i^{j+1}\) from the constraint into the objective function of (13) leads to the following optimization problem:

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{\left( w_{i}^{j+1},\;b_{i}^{j+1}\right) }}&\; \frac{1}{2} \Vert \left( (S_iX_{i})w_{i}^{j+1}+b_{i}^{j+1}e\right) \Vert _2^2 + \frac{\nu }{2}\left( \Vert w_{i}^{j+1}\Vert ^2_{2} + (b_{i}^{j+1})^{2}\right) +\nonumber \\&\frac{C}{2}\Vert {\mathrm{diag}}\left( {\mathrm{sign}}((\overline{S_i}\overline{X_{i}})w_{i}^{j}+b_{i}^{j}e)\right) \left( (\overline{S_i}\overline{X_{i}})w_{i}^{j+1}+b_{i}^{j+1}e\right) - e \Vert _2^2. \end{aligned}$$
(14)

Setting the gradients of (14) with respect to \(w_{i}^{j+1}\) and \(b_{i}^{j+1}\) to zero gives:

$$\begin{aligned}&(S_iX_{i})^{\mathrm{T}}\left[ H_{1}z_{i}^{j+1}\right] + \nu w_{i}^{j+1} + C(\overline{S_i}\overline{X_{i}})^{\mathrm{T}} G^{\mathrm{T}}\left[ G(H_{2}z_{i}^{j+1}) - e\right] = 0, \end{aligned}$$
(15)
$$\begin{aligned}&e^{\mathrm{T}}\left[ H_{1}z_{i}^{j+1}\right] + \nu b_{i}^{j+1} + Ce^{\mathrm{T}} G^{\mathrm{T}}\left[ G(H_{2}z_{i}^{j+1}) - e \right] = 0, \end{aligned}$$
(16)

where \(H_{1} = [{S_i}X_{i}\quad e]\), \(H_{2} = [\overline{S_i}\overline{X_{i}} \quad e]\), \(z_{i}^{j+1} = [w_{i}^{j+1}; b_{i}^{j+1}]\) and \(G = {\mathrm{diag}}({\mathrm{sign}}(H_{2}z_{i}^{j}))\). Rearranging the above equations, we obtain the following system of linear equations:

$$\begin{aligned} \left( H_{1}^{\mathrm{T}}H_{1} + I\nu + CH_{2}^{\mathrm{T}}H_{2}\right) z_{i}^{j+1} = CH_{2}^{\mathrm{T}}G^{T}e, \end{aligned}$$
(17)

which gives the solution for \(z_{i}^{j+1}\):

$$\begin{aligned} z_{i}^{j+1} = \left[w_{i}^{j+1} ; b_{i}^{j+1}\right] = C(H_{1}^{\mathrm{T}}H_{1} + I\nu + CH_{2}^{{\mathrm{T}}}H_{2})^{-1} H_{2}^{{\mathrm{T}}}G^{T}e. \end{aligned}$$
(18)
Algorithm 1: The F-LS-TWSVC algorithm

Finally, it can be observed that our algorithm only requires the solution of (18), which involves the inversion of a small matrix of size \((n+1)\times (n+1)\), whereas TWSVC additionally requires the solution of a QPP. The details of the proposed algorithm are described in Algorithm 1.
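To make the update concrete, the following NumPy sketch performs one CCCP step via (17)–(18) for a single cluster. It is an illustrative sketch (names are ours, not the MATLAB implementation used in the experiments) and assumes the rows of the two data blocks have already been weighted by the fuzzy memberships \(S_i\) and \(\overline{S_i}\).

```python
import numpy as np

def flstwsvc_update(SXi, SXbar, z_prev, C, nu):
    """One CCCP step of linear F-LS-TWSVC for cluster i, cf. (17)-(18).

    SXi   : (m_i, n) own-cluster samples, rows already weighted by S_i
    SXbar : (m - m_i, n) remaining samples, rows weighted by S_i-bar
    z_prev: previous solution [w_i^j ; b_i^j] of length n + 1
    """
    H1 = np.hstack([SXi, np.ones((SXi.shape[0], 1))])       # H1 = [S_i X_i  e]
    H2 = np.hstack([SXbar, np.ones((SXbar.shape[0], 1))])   # H2 = [S_i-bar X_i-bar  e]
    g = np.sign(H2 @ z_prev)                                 # diagonal entries of G
    lhs = H1.T @ H1 + nu * np.eye(H1.shape[1]) + C * (H2.T @ H2)
    rhs = C * (H2.T @ g)                                     # C H2^T G^T e, G = diag(g)
    return np.linalg.solve(lhs, rhs)                         # z_i^{j+1} = [w_i^{j+1}; b_i^{j+1}]
```

Iterating this step until \(\Vert z_i^{j+1}-z_i^{j}\Vert\) is small enough yields the plane of cluster i; setting the memberships to one (identity \(S_i\) and \(\overline{S_i}\)) recovers the LS-TWSVC update.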

4 Nonlinear fuzzy least squares twin support vector clustering

Working along the lines of [11], we extend F-LS-TWSVC to the nonlinear case by considering k cluster center kernel-generated surfaces for \(i = 1,2,\ldots ,k\):

$$\begin{aligned} K(x,X)u_{i} + \gamma _{i} = 0, \end{aligned}$$
(19)

where K is an arbitrary kernel, \(u_{i} \in \mathbb {R}^{m}\) and \(\gamma_i \in \mathbb {R}\). The primal QPP of nonlinear TWSVC (9) is modified in the least squares sense, with the fuzzy membership matrices incorporated, as follows for \(i = 1,2,\ldots ,k\):

$$\begin{aligned} {\mathop {\mathrm{Min}}\limits _{(u_{i},\;\gamma _{i},\;\eta _{i})}}&\; \frac{1}{2} \Vert \left( (S_iK(X_{i},X)u_{i})+\gamma _{i}e\right) \Vert _2^2 + \frac{\nu }{2}\left( ||u_{i}||_2^{2} + \gamma _{i}^{2}\right) +C\eta _{i}^T\eta _{i} \nonumber \\ \text {s.t.}& \nonumber \\&|\left( (\overline{S_i}K(\overline{X_{i}},X) u_{i})+\gamma _{i}e \right) |+\eta _{i}= e. \end{aligned}$$
(20)

Similar to the linear case, for each \(i = 1,2,\ldots ,k\) the above problem is decomposed into a series of quadratic subproblems indexed by \(j = 0,1,2,\ldots\), whose solution can be derived to be:

$$\begin{aligned} \left[ u_{i}^{j+1} ; \gamma _{i}^{j+1}\right] = C\left( E_{1}^{\mathrm{T}}E_{1} + I\nu + CE_{2}^{\mathrm{T}}E_{2}\right) ^{-1} E_{2}^{\mathrm{T}}F^{T}e, \end{aligned}$$
(21)

where \(E_{1} = [{S_i}K(X_{i},X) \quad e]\), \(E_{2} = [\overline{S_i}K(\overline{X_{i}},X) \quad e]\) and \(F = {\mathrm{diag}}({\mathrm{sign}}(E_{2}[u_{i}^{j} ; \gamma_{i}^{j}]))\).

The overall algorithm remains the same as in the linear case, except that we now solve for the k kernel-generated surface parameters \(u_{i}\), \(\gamma _{i}\), \(i=1,2,\ldots ,k\).

It can be noted that the nonlinear algorithm requires the solution of (21), which involves calculating the inverse of a matrix of order \((m+1)\times (m+1)\). However, (21) can be solved by computing the inverses of two smaller matrices by using the Sherman–Morrison–Woodbury (SMW) formula [22]. Therefore, the matrix inversion in (21) can be carried out as

$$\begin{aligned} \left[ u_{i}^{j+1} ; \gamma _{i}^{j+1}\right] = C \left( Y - YE_{1}^{\mathrm{T}}(I+E_{1}YE_{1}^{\mathrm{T}})^{-1}E_{1}Y \right) E_{2}^{\mathrm{T}}F^{T}e, \end{aligned}$$
(22)

where \(Y = \frac{1}{\nu } \left(I - E_{2}^{\mathrm{T}}(\frac{\nu }{C}I + E_{2}E_{2}^{\mathrm{T}})^{-1}E_{2}\right)\). This involves matrix inverses of sizes \((m_{i} \times m_{i})\) and \(((m-m_{i}) \times (m-m_{i}))\), respectively, for \(i = 1,2,\ldots ,k\).
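A hedged NumPy sketch of this SMW-based update (22) is given below; it assumes the kernel blocks \(E_1\) and \(E_2\) have already been formed and weighted by the fuzzy memberships, and the variable names are ours.

```python
import numpy as np

def flstwsvc_kernel_update(E1, E2, z_prev, C, nu):
    """One CCCP step of nonlinear F-LS-TWSVC via the SMW identity, cf. (22).

    E1 : (m_i, m + 1)     = [S_i K(X_i, X)  e]
    E2 : (m - m_i, m + 1) = [S_i-bar K(X_i-bar, X)  e]
    z_prev : previous [u_i^j ; gamma_i^j] of length m + 1
    """
    f = np.sign(E2 @ z_prev)                        # diagonal entries of F
    # Y = (1/nu)(I - E2^T ((nu/C) I + E2 E2^T)^{-1} E2): an (m - m_i) x (m - m_i) inverse
    small2 = (nu / C) * np.eye(E2.shape[0]) + E2 @ E2.T
    Y = (np.eye(E2.shape[1]) - E2.T @ np.linalg.solve(small2, E2)) / nu
    # (I + E1 Y E1^T)^{-1}: an m_i x m_i inverse
    small1 = np.eye(E1.shape[0]) + E1 @ Y @ E1.T
    inv_total = Y - Y @ E1.T @ np.linalg.solve(small1, E1 @ Y)
    return C * inv_total @ (E2.T @ f)               # [u_i^{j+1} ; gamma_i^{j+1}]
```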

5 Experimental results

The TWSVC, LS-TWSVC and F-LS-TWSVC clustering methods were implemented in MATLAB 8.1 [23] on a PC with an Intel 3.40 GHz processor and 16 GB of RAM. The methods were evaluated on several benchmark datasets from the UCI Machine Learning Repository [24].

5.1 Performance measure for UCI datasets

To compare the performance of the various clustering algorithms, we use the metric accuracy [17] as the performance criterion for the UCI datasets. Given cluster labels \(y_i\), \(i=1,\ldots ,m\), compute the corresponding similarity matrix \(M\in \mathbb{R}^{m\times m}\), where

$$\begin{aligned} M(i,j)= {\left\{ \begin{array}{ll} 1 &{} \text {if } y_i=y_j,\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(23)

Let \(M_t\) be the similarity matrix computed from the true cluster labels of the dataset and \(M_p\) the one computed from the labels predicted by the clustering method. Then, the metric accuracy of the clustering method is defined as

$$\begin{aligned} {\mathrm{Metric ~Accuracy}} = \frac{n_{00}+n_{11}-m}{m^2-m}\times 100\,\%, \end{aligned}$$
(24)

where \(n_{00}\) is the number of positions at which both \(M_p\) and \(M_t\) are zero, and \(n_{11}\) is the number of positions at which both are one.
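For reference, the following NumPy sketch (with names of our own) computes the metric accuracy of (23)–(24) directly from two label vectors.

```python
import numpy as np

def metric_accuracy(y_true, y_pred):
    """Pairwise similarity-based accuracy of (23)-(24), in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    m = len(y_true)
    Mt = (y_true[:, None] == y_true[None, :])   # true similarity matrix
    Mp = (y_pred[:, None] == y_pred[None, :])   # predicted similarity matrix
    n11 = np.sum(Mt & Mp)                       # ones in both matrices (includes diagonal)
    n00 = np.sum(~Mt & ~Mp)                     # zeros in both matrices
    return (n00 + n11 - m) / (m**2 - m) * 100.0
```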

5.2 Performance measure for BSD

To establish the validity of our proposed formulations, we also perform experiments on the Berkeley Segmentation Dataset (BSD) [25] and for comparison we have used \(F{-}{\mathrm{measure}}\)  [26] and error rate (ER)  [27] as the performance criteria.

  • \(F{-}{\mathrm{measure}}\) can be calculated as

    $$\begin{aligned} F{-}{\mathrm{measure}}=\frac{2\times {\mathrm{Precision}} \times {\mathrm{Recall}}}{{\mathrm{Precision}} + {\mathrm{Recall}}}, \end{aligned}$$
    (25)

    with respect to human ground truth boundaries. Here,

    $$\begin{aligned} {\mathrm{Precision}}=\frac{\mathrm{TP}}{{\mathrm{TP+FP}}}, \end{aligned}$$

    and

    $$\begin{aligned} {\mathrm{Recall}}=\frac{{\mathrm{TP}}}{{\mathrm{TP+FN}}}. \end{aligned}$$
  • ER can be calculated as

    $$\begin{aligned} {\mathrm{ER}}=\frac{{\mathrm{FP+FN}}}{{\mathrm{TT}}}, \end{aligned}$$
    (26)

where TP is the number of correctly detected object pixels, FP is the number of pixels falsely detected as object, FN is the number of object pixels that were missed, and TT is the total number of pixels in the image.
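As a simplified illustration, the NumPy sketch below computes both measures from binary object masks (names are ours); evaluating the F-measure against human boundary ground truth on BSD involves additional boundary matching that is omitted here.

```python
import numpy as np

def f_measure_and_er(pred_mask, gt_mask):
    """F-measure (25) and error rate (26) from binary object masks.

    pred_mask, gt_mask : boolean arrays of the same shape (True = object pixel).
    """
    tp = np.sum(pred_mask & gt_mask)     # correctly detected object pixels
    fp = np.sum(pred_mask & ~gt_mask)    # pixels falsely detected as object
    fn = np.sum(~pred_mask & gt_mask)    # object pixels that were missed
    tt = pred_mask.size                  # total number of pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    er = (fp + fn) / tt
    return f, er
```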

For our simulations, we consider the RBF kernel; the parameters C, \(\nu\) and sigma (the kernel parameter) are selected from the set of values \(\left\{ 2^{i}|i= -9, -8,\ldots ,0\right\}\) using the cross-validation methodology [28]. The initial cluster labels and fuzzy membership values are obtained from the FNNG initialization discussed in Sect. 5.3.

5.3 Steps involved in initialization of initial fuzzy membership matrix via fuzzy NNG

Traditionally, the initial labels for clustering can be generated randomly. However, our algorithm, as described in Algorithm 1, uses a fuzzy membership matrix as the initial input. In [17], the authors showed experimentally that the results of plane-based clustering methods strongly depend on the initial class labels. Hence, taking motivation from the initialization algorithm based on NNG [17], we implement a fuzzy NNG (FNNG) and provide its output, in the form of a fuzzy membership matrix, as the initial input to our algorithm. The main process of calculating FNNG is as follows (a code sketch of these steps is given after the list):

  1. For the given dataset and a parameter p, construct a p-nearest-neighbor undirected graph whose edges represent the distances between each \(x_i\) (\(i = 1,\ldots,m\)) and its p nearest neighbors.

  2. From the graph, t clusters are obtained by connecting the nearest samples. Further, construct a fuzzy membership matrix with entries \(S_{ij}\), \(i=1,\ldots ,m\), \(j=1,\ldots ,t\), calculated as

    $$\begin{aligned} S_{ij}=\frac{1}{d_{ij}}, \end{aligned}$$
    (27)

where \(d_{ij}\) is the Euclidean distance of sample i from the j th cluster. If the current number of clusters t is equal to k, then stop; otherwise go to step 3 or step 4 accordingly.

  3. If \(t<k\), disconnect the two connected samples with the largest edge distance and go to step 2.

  4. If \(t>k\), compute the Hausdorff distance [29] between every pair of the t clusters and sort the pairs in ascending order. Merge the nearest pair of clusters into one, repeating until k clusters are obtained, where the Hausdorff distance between two sets of samples \(S_1\) and \(S_2\) is defined as

    $$\begin{aligned} h(S_1,S_2)=\max \Big \{ \underset{i\in S_1}{\max }\,\underset{j\in S_2}{\min }\,||i-j||,\; \underset{i\in S_2}{\max }\,\underset{j\in S_1}{\min }\,||i-j|| \Big \}. \end{aligned}$$
    (28)
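For concreteness, a minimal Python/SciPy sketch of these steps is given below. It is illustrative only: the function name fnng_init, the default choice of p, and the interpretation of the distance \(d_{ij}\) in (27) as the distance from sample i to the centroid of cluster j are our assumptions, since the text does not fix these details.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def hausdorff(A, B):
    """Hausdorff distance (28) between two sample sets."""
    D = cdist(A, B)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def fnng_init(X, k, p=5, eps=1e-8):
    """Sketch of FNNG initialization: initial labels and an (m, k) membership matrix."""
    m = X.shape[0]
    D = cdist(X, X)
    # step 1: p-nearest-neighbor undirected graph
    A = np.zeros((m, m), dtype=bool)
    for i in range(m):
        nn = np.argsort(D[i])[1:p + 1]          # p nearest neighbors of x_i (skip itself)
        A[i, nn] = A[nn, i] = True
    weights = np.where(A, D, 0.0)
    t, labels = connected_components(csr_matrix(weights > 0), directed=False)
    # step 3: too few clusters -> cut the longest remaining edge and re-check
    while t < k and weights.max() > 0:
        i, j = np.unravel_index(weights.argmax(), weights.shape)
        weights[i, j] = weights[j, i] = 0.0
        t, labels = connected_components(csr_matrix(weights > 0), directed=False)
    # step 4: too many clusters -> merge the pair with the smallest Hausdorff distance
    while t > k:
        ids = np.unique(labels)
        pairs = [(hausdorff(X[labels == a], X[labels == b]), a, b)
                 for ai, a in enumerate(ids) for b in ids[ai + 1:]]
        _, a, b = min(pairs)
        labels[labels == b] = a
        t -= 1
    # step 2: membership (27); distance to a cluster is taken to its centroid (our assumption)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    S = 1.0 / (cdist(X, centroids) + eps)       # eps guards against zero distance
    return np.searchsorted(ids, labels), S
```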

For solving F-LS-TWSVC via CCCP, we also need to initialize \([w_i^0\quad b_i^0]\). To obtain it, we observe the similarity between F-LS-TWSVC and F-LS-TWSVM: once the labels are known, solving F-LS-TWSVC is the same as solving F-LS-TWSVM [30]. Thus, the initial value of \([w_i^0\quad b_i^0]\) can be obtained as the solution of the F-LS-TWSVM classifier.

5.4 Computational complexity

In [11], the authors showed that TWSVM is approximately four times faster than SVM; the computational complexity of TWSVM is \(O(m^3/4)\), where m is the total number of training samples. In [18], the authors showed that the solution of LS-TWSVM only requires systems of linear equations to be solved, as opposed to the solution of TWSVM, which requires systems of linear equations along with two QPPs.

Along similar lines, our algorithm F-LS-TWSVC essentially differs from TWSVC in the optimization problem involved: in order to obtain the k cluster plane parameters in the linear case, we only compute two matrix inverses of size \((n+1)\times (n+1)\), whereas TWSVC solves systems of linear equations along with two QPPs. Table 6 shows the training time comparison among the different algorithms with the linear kernel on the UCI datasets.

For nonlinear F-LS-TWSVC, the solution requires the inversion of matrices of order \((m+1)\), which can further be obtained via (22) using the SMW formula, where we instead invert two smaller matrices of dimensions \((m_i\times m_i)\) and \(((m-m_i)\times (m-m_i))\). Table 7 shows the training time comparison among the different techniques with the nonlinear kernel on the UCI datasets. Table 9 shows the training time comparison among the different techniques for image segmentation on the BSD dataset.

5.5 Experimental results on UCI datasets

In this section, we perform experiments on different UCI datasets with TWSVC and compare its efficacy with that of the proposed algorithms LS-TWSVC and F-LS-TWSVC, respectively. A summary of the UCI datasets is given in Table 1.

Table 1 Summary of UCI datasets

In [17], the authors reported clustering accuracy by using the whole dataset for learning the cluster hyperplanes. In our presentation of results, we report the clustering accuracy on the whole dataset as well as the training clustering accuracy and the out-of-sample testing clustering accuracy. Accordingly, the results are presented in the two subsections discussed below.

5.5.1 Part 1 results

Tables 2 and 3 summarize the clustering accuracy results of the proposed algorithms F-LS-TWSVC and LS-TWSVC along with TWSVC on several UCI benchmark datasets using the linear and nonlinear kernel, respectively. These tables show that the metric accuracies of LS-TWSVC and TWSVC are comparable to each other, and that the accuracy further increases by approximately 2–5 % on each dataset after incorporating the fuzzy membership matrix. In Tables 2 and 3, the results of kPC [8], PPC [9] and FCM [31] are taken from [17].

Table 2 Clustering accuracy with linear kernel on UCI datasets
Table 3 Clustering accuracy with nonlinear kernel on UCI datasets

Figure 1 shows the relation between the parameters and the clustering accuracy (vertical axis) of linear F-LS-TWSVC on the above datasets. It can be seen from Fig. 1 that the accuracy of F-LS-TWSVC is affected by both p and c, and higher accuracy is reached with smaller values of p for most datasets.

Figure 2 shows the relation between the parameters and the accuracy (vertical axis) of nonlinear F-LS-TWSVC on two datasets. In Fig. 2, the x-axis, y-axis and z-axis correspond to the kernel parameter, the parameter c and the clustering accuracy, respectively. From Fig. 2, it can be seen that F-LS-TWSVC performs better for \(c<1\) and \(\sigma <1\). From our experiments, we have also observed that F-LS-TWSVC is insensitive to the value of \(\nu\).

Fig. 1 Illustration of the effectiveness of linear F-LS-TWSVC on UCI datasets with different parameters: a zoo, b iris, c glass, d dermatology, e E. coli, f compound, g Haberman and h libras

Fig. 2 Illustration of the effectiveness of nonlinear F-LS-TWSVC with different parameters: a–c Dermatology dataset, d–f Zoo dataset

5.5.2 Part 2 results

In this part, clustering accuracy was determined by following the standard fivefold cross-validation methodology [28]. Tables 4 and 5 summarize the testing clustering accuracy results of our proposed algorithms F-LS-TWSVC and LS-TWSVC along with TWSVC on several UCI benchmark datasets (Tables 6, 7).

Table 4 Testing clustering accuracy with linear kernel on UCI datasets
Table 5 Testing clustering accuracy comparison with nonlinear kernel on UCI datasets
Table 6 Average training time (in s) with linear kernel on UCI datasets
Table 7 Average training time (in s) with nonlinear kernel on UCI datasets

5.6 Experimental results on BSD datasets

In this section, we perform image segmentation on the BSD dataset with the proposed algorithm F-LS-TWSVC. Texture is one of the common features used in image segmentation. Hence, we extract pixel-level texture features from the images with the help of Gabor filters. The Gabor filter [32] belongs to a class of filters in which a filter of arbitrary orientation and scale is synthesized as a linear combination of a set of "basis filters." It allows one to adaptively "steer" a filter to any orientation and scale, and to determine analytically the filter output as a function of orientation and scale. In our experiments, we use three levels of scale (0.5, 1.0, 2.0) and four levels of orientation \((0^{\circ }, 45^{\circ }, 90^{\circ }, 135^{\circ })\). As a result, we have \(12\,(3\times 4)\) coefficients for each pixel of the image. Finally, we use the maximum (in absolute value) of the 12 coefficients for each pixel, which represents the pixel-level Gabor feature of the image. This feature is then used as the input to FNNG, which gives us the initial membership matrix of every pixel to the different clusters. We also use these Gabor features to identify the number of clusters present in the image.
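As an illustration of this feature extraction step, a possible scikit-image sketch is given below. The mapping from the three scales to filter frequencies and the use of the response magnitude are our assumptions; the original experiments were run in MATLAB and may have used a different Gabor parameterization.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gabor

def gabor_pixel_features(image, scales=(0.5, 1.0, 2.0),
                         thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Pixel-wise Gabor feature: max |response| over the 12 scale/orientation filters.

    The mapping scale -> frequency = 1 / (2 * scale) is our assumption; the paper
    does not state how the three scales parameterize the filter bank.
    """
    gray = rgb2gray(image) if image.ndim == 3 else image
    responses = []
    for s in scales:
        for theta in thetas:
            real, imag = gabor(gray, frequency=1.0 / (2.0 * s), theta=theta)
            responses.append(np.abs(real + 1j * imag))    # magnitude of the response
    return np.max(np.stack(responses, axis=-1), axis=-1)  # (H, W) feature map
```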

Table 8 compares the performance of F-LS-TWSVC and TWSVC on the Berkeley Segmentation Dataset. Note that better segmentation corresponds to a higher value of the \(F{-}{\mathrm{measure}}\) and a lower value of ER. Table 8 shows that F-LS-TWSVC attains a higher F-measure and a lower ER than TWSVC (Table 9).

Table 8 F-measure and error rate on BSD dataset
Table 9 Training time (in s) comparison of Image segmentation on BSD dataset

Figures 3, 4, 5 and 6 show the segmentation results obtained with F-LS-TWSVC and TWSVC.

Fig. 3 Segmentation results: a original image (ImageID-296059), b segmented image with F-LS-TWSVC and c segmented image with TWSVC

Fig. 4 Segmentation results: a original image (ImageID-86016), b segmented image with F-LS-TWSVC and c segmented image with TWSVC

Fig. 5 Segmentation results: a original image (ImageID-71046), b segmented image with F-LS-TWSVC and c segmented image with TWSVC

Fig. 6 Segmentation results: a original image (ImageID-198023), b segmented image with F-LS-TWSVC and c segmented image with TWSVC

6 Conclusions

In this paper, we have extended the TWSVM-based clustering method TWSVC to a least squares version, termed LS-TWSVC, for both the linear and nonlinear cases. A further extension of LS-TWSVC to the fuzzy setting, obtained by introducing a fuzzy membership matrix, has also been considered. The proposed algorithms yield the k cluster center planes by solving a series of systems of linear equations, as opposed to the series of QPPs along with systems of linear equations required by the TWSVC algorithm. The experimental results on several UCI benchmark datasets show that our proposed method achieves better clustering accuracy than TWSVC with considerably less computational time. We have also validated our algorithm on image segmentation, where the images were taken from the BSD image dataset.

Currently, in the FNNG initialization method, we have used the \(1/d_{ij}\) membership function, where \(d_{ij}\) represents the distance between sample i and cluster j, which may not be suitable for noisy data such as the Haberman dataset. Therefore, in future work, we would like to incorporate other, more robust membership functions that can also handle noisy data.