1 Introduction

Clustering, which discovers the relationships among data samples, has been applied to many real-world problems, e.g., marketing, text mining and web analysis [1,2,3,4,5,6,7]. In particular, partition-based clustering methods [1, 8, 9] are widely used in real applications for their simplicity, e.g., the classical kmeans [10]. Partition-based clustering supposes there are k clusters, each with a cluster center, and assigns the data samples to these clusters according to their distances from the cluster centers. Kmeans assumes the cluster centers are points in the sample space, but other types of centers may fit a dataset better, e.g., planes in the “cross-planes” dataset [11,12,13]. As an extension of point centers, plane centers are able to capture richer structures in the sample space. Therefore, partition-based clustering with planes as cluster centers, called plane-based clustering, has been studied widely [14,15,16,17,18,19,20].

In plane-based clustering, starting from an initialization, the clustering result is obtained iteratively through alternating cluster assignment and cluster update steps. In the cluster assignment step, the samples are assigned to clusters according to their distances from the cluster center planes. In the cluster update step, the cluster center planes are obtained as the solutions of k optimization problems, which differ across the various plane-based methods. In general, the optimization problems capture two aspects: within-cluster compactness and between-cluster separation. The first proposed k-plane clustering (kPC) [14] measured the within-cluster scatter by a quadratic function, and proximal planes clustering (PPC) [15, 16] measured both the within-cluster and between-cluster scatters by the same quadratic function. Subsequently, twin support vector clustering (TWSVC) [17] improved the between-cluster scatter with the hinge function [12, 21], while robust twin support vector clustering (RTWSVC) [18] and fast robust twin support vector k-plane clustering (FRTWSVC) [18] improved both scatters with the absolute value function. Figure 1 exhibits the metric functions used in these methods. Since these plane-based clustering methods measure the within-cluster and/or between-cluster scatter by unbounded functions, some samples may influence the cluster center planes significantly, e.g., samples far from their corresponding cluster centers.

Fig. 1

Metric functions used in kPC, PPC, TWSVC, RTWSVC, FRTWSVC and RampTWSVC to measure the within-cluster and between-cluster scatters. The horizontal axis denotes the deviation of a sample from the cluster center, and the vertical axis denotes the cost to fit that sample. When a cost value is negative, the cost becomes a reward

To further reduce the influence of the samples far from the cluster center, a better choice is to employ bounded functions, instead of unbounded ones, to measure the within-cluster and between-cluster scatters. Therefore, we introduce the ramp functions [22] into plane-based clustering; ramp functions are bounded piecewise linear functions that have been applied successfully in semi-supervised and supervised learning [23,24,25]. In this paper, a ramp-based twin support vector clustering method, called RampTWSVC, is proposed by using ramp functions to measure the within-cluster and between-cluster scatters (see Fig. 1). RampTWSVC reduces the influence of the samples far from the cluster centers and thereby obtains stable cluster center planes. The optimization problems in our RampTWSVC are nonconvex programming problems that are more complicated than those of other plane-based clustering methods. These problems are recast as mixed-integer programming problems of the same form. An iterative algorithm is then proposed to solve the mixed-integer programming problem, and we prove that it terminates in a finite number of iterations at a local solution. In addition, RampTWSVC is extended to the nonlinear case via a kernel trick to cope with manifold clustering [26, 27]. Experimental results on benchmark datasets show that the proposed RampTWSVC performs better than other plane-based clustering methods.

2 Review of plane-based clustering

In this paper, we consider m data samples \(\{x_1,\,x_2,\ldots ,x_m\}\) in the n-dimensional real vector space \({\mathbb {R}}^{n}\). Assume these m samples belong to k clusters with corresponding labels \(y\in \{1,\,2,\ldots ,k\}\), and they are represented by a matrix \(X=(x_1,x_2,\ldots ,x_m)\in {\mathbb {R}}^{n\times m}\). We further organize the samples from X with the current label i into a matrix \(X_i\) and those with the remaining labels into a matrix \({\hat{X}}_i\), for \(i=1,\,2,\ldots ,k\). For readers’ convenience, the symbols X, \(X_i\), and \({\hat{X}}_i\) will also refer to the corresponding sets, depending on the context in which they appear. For example, the symbol X can be understood as a matrix in \({\mathbb {R}}^{n\times m}\) or as a set containing m samples. The cluster center planes are defined as

$$\begin{aligned} \begin{array}{l} f_i(x)=w_i^\top x+b_i=0,\,\,i=1,\ldots ,k, \end{array} \end{aligned}$$
(1)

where \(w_i\in {\mathbb {R}}^n\) is the weight vector and \(b_i\in {\mathbb {R}}\) is the bias.

The following plane-based clustering methods share the same clustering procedure. Starting from an initial assignment of the m samples to k clusters, the cluster center planes (1) are constructed from the current cluster assignment. Once the cluster center planes are obtained, the m samples are reassigned by

$$\begin{aligned} \begin{array}{l} y=\underset{i}{\arg \min }\,\,|w_i^\top x+b_i|, \end{array} \end{aligned}$$
(2)

where \(|\cdot |\) denotes the absolute value. The cluster center planes and the sample labels are updated alternately until some termination conditions are satisfied. In the following, we briefly describe the different constructions of the cluster center planes in kPC, PPC, TWSVC and RTWSVC.
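To make the procedure concrete, the reassignment rule (2) amounts to a few lines of linear algebra. The following is a minimal NumPy sketch (the variable names are ours, chosen for illustration, and this is not the released code):

```python
import numpy as np

def assign_labels(X, W, b):
    """Rule (2): assign each sample to the cluster whose plane is nearest.

    X: n x m samples (columns); W: n x k with columns w_i; b: length-k biases.
    """
    dev = np.abs(W.T @ X + b[:, None])   # k x m deviations |w_i^T x_j + b_i|
    return np.argmin(dev, axis=0)        # label of each sample
```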

2.1 kPC and PPC

kPC [14] requires each cluster center plane to be close to the samples of its own cluster. PPC [15] further requires it to be far from the samples of the other clusters. Therefore, the ith (\(i=1,\ldots ,k\)) cluster center plane in PPC is constructed by solving the problem

$$\begin{aligned} \begin{array}{l} \underset{w_i,b_i}{\min }\,\, ||X_i^\top w_i+b_ie||^2-c||{\hat{X}}_i^\top w_i+b_ie||^2\\ \mathrm{s.t.}\,\,\,\,||w_i||=1, \end{array} \end{aligned}$$
(3)

where \(||\cdot ||\) denotes the \(L_2\)-norm, e is a column vector of ones with appropriate dimension and \(c>0\) is a user-set parameter. The optimization problem in kPC omits the second term in the objective of (3).

From the objective of (3), it is clear that within-cluster samples receive quadratic costs and between-cluster samples receive quadratic rewards. Therefore, samples far from the cluster center have a large impact on the resulting cluster center planes, and PPC may obtain a cluster center plane that lies far from the current cluster.
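For intuition, the kPC subproblem (the objective of (3) without its second term) admits a well-known closed-form update [14]: for fixed \(w_i\) the optimal bias is \(b_i=-{\bar{x}}^\top w_i\), where \({\bar{x}}\) is the cluster mean, and \(w_i\) is then an eigenvector of the centered within-cluster scatter matrix corresponding to its smallest eigenvalue. A minimal NumPy sketch of this update (illustrative, not the authors' code):

```python
import numpy as np

def kpc_plane(Xi):
    """kPC update for one cluster; Xi is n x m_i with samples as columns."""
    xbar = Xi.mean(axis=1, keepdims=True)   # cluster mean
    C = Xi - xbar                           # centered samples
    B = C @ C.T                             # within-cluster scatter matrix
    vals, vecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    w = vecs[:, 0]                          # eigenvector of smallest eigenvalue
    b = -float(xbar[:, 0] @ w)              # optimal bias b = -xbar^T w
    return w, b
```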

2.2 TWSVC and RTWSVC

In contrast, TWSVC [17] caps the rewards of the samples from different clusters by solving, for \(i=1,\ldots ,k\), the problem

$$\begin{aligned} \begin{array}{l} \underset{w_{i},b_{i}}{\min }\,\,\frac{1}{2}\Vert X_i^\top w_{i}+b_{i}e\Vert ^{2}+ce^{\top }(e-|{\hat{X}}_i^\top w_{i}+b_{i}e|)_+, \end{array} \end{aligned}$$
(4)

where \((\cdot )_+\) replaces the negative parts by zeros.

From the second part of (4), it is clear that between-cluster samples affect the cluster center plane only when their deviations lie in [0, 1). Thus, TWSVC is more robust than PPC [17]. However, the same issue persists for the within-cluster samples because of the quadratic function in the first part of (4). RTWSVC [18] was therefore proposed to decrease this influence by replacing the \(L_2\)-norm in (4) with the \(L_1\)-norm. Note from Fig. 1 that the within-cluster metric function of RTWSVC is still unbounded. In order to eliminate the influence of the samples far from the cluster center, it is reasonable to employ a bounded metric function for the within-cluster scatter, following the same principle as the between-cluster metric function used in TWSVC.

3 RampTWSVC

Like the plane-based clustering methods reviewed in Sect. 2, our RampTWSVC starts with initial sample labels and then updates the cluster center planes and the sample labels alternately, until some termination condition is satisfied. In the following, we consider how to obtain one of the cluster center planes for given samples and labels.

3.1 Formulation

To obtain the ith (\(i=1,\ldots ,k\)) cluster center plane, our RampTWSVC considers the following problem

$$\begin{aligned} \begin{array}{l} \underset{w_i,b_i}{\min }\,\,c_1\sum \limits _{x_j\in X_i}R_{1}(x_j)+c_2\sum \limits _{x_j\in {\hat{X}}_i}R_{2}({x}_j)+\frac{1}{2}(||w_i||^2+||b_i||^2), \end{array} \end{aligned}$$
(5)

where \(c_1,c_2>0\) are trade-off parameters. \(R_{1}(x)\) and \(R_{2}(x)\) are two piecewise linear functions of the deviation \(|f_i(x)|=|w_i^\top x+b_i|\) (see Fig. 1):

$$\begin{aligned} R_{1}(x)=\left\{ \begin{array}{ll} 0&\quad \text {if}\,\,|f_i(x)|\le 1-\Delta ,\\ 1-s&\quad \text {if}\,\,|f_i(x)|\ge 2-\Delta -s,\\ |f_i(x)|-1+\Delta &\quad \text {otherwise}, \end{array}\right. \end{aligned}$$
(6)

and

$$\begin{aligned} R_{2}(x)=\left\{ \begin{array}{ll} 2+2\Delta &\quad \text {if}\,\,|f_i(x)|\le -s,\\ 1+\Delta -s&\quad \text {if}\,\,|f_i(x)|\ge 1+\Delta ,\\ -|f_i(x)|+2+2\Delta -s&\quad \text {otherwise}, \end{array}\right. \end{aligned}$$
(7)

where \(\Delta \in [0,1)\) and \(s\in (-1,0]\) are two constants that control the shape of the functions (for instance, \(\Delta =0.3\) and \(s=-0.2\) [23]). The terms \(||w_i||^2\) and \(||b_i||^2\) are regularization terms that control the model complexity.
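To make the piecewise definitions (6) and (7) concrete, the following is a direct NumPy transcription, using the illustrative constants \(\Delta =0.3\) and \(s=-0.2\) mentioned above (a sketch for exposition only):

```python
import numpy as np

def ramp_within(f, delta=0.3, s=-0.2):
    """R1 in (6): bounded cost of within-cluster deviations f = f_i(x)."""
    a = np.abs(f)
    return np.where(a <= 1 - delta, 0.0,
           np.where(a >= 2 - delta - s, 1 - s, a - 1 + delta))

def ramp_between(f, delta=0.3, s=-0.2):
    """R2 in (7): bounded cost of between-cluster deviations."""
    a = np.abs(f)
    return np.where(a <= -s, 2 + 2 * delta,
           np.where(a >= 1 + delta, 1 + delta - s, -a + 2 + 2 * delta - s))
```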

It is obvious that both \(R_1(x)\) (for the within-cluster set \(X_i\)) and \(R_2(x)\) (for the between-cluster set \({\hat{X}}_i\)) are bounded above and below. Thus, samples far from the cluster center plane cannot exert an arbitrarily large influence on it. This property indicates that our RampTWSVC is more robust than PPC, TWSVC and RTWSVC.

In the following, we extend our RampTWSVC to nonlinear manifold clustering, and the solutions to the linear and nonlinear RampTWSVC are elaborated in the next subsection.

The plane-based clustering methods can be extended to nonlinear manifold clustering easily via kernel tricks [13, 28]. For instance, by introducing a pre-defined kernel function \(K(\cdot ,X)\) [11, 17, 21], plane-based nonlinear clustering seeks k cluster center manifolds in the kernel-generated space as

$$\begin{aligned} \begin{array}{l} g_i(x)=K(x,X)^\top w_i+b_i=0,\,i=1,\ldots ,k. \end{array} \end{aligned}$$
(8)

Then, nonlinear RampTWSVC introduces the ramp functions into plane-based nonlinear clustering. Replacing \(f_i(x)\) with \(g_i(x)\) in (6) and (7) and substituting them into (5) yields the optimization problems that construct the cluster center manifolds (8). Once the k cluster centers (8) are obtained, a sample x is assigned to a cluster by

$$\begin{aligned} y=\underset{i}{\arg \min }\,|K(x,X)^\top w_i+b_i|. \end{aligned}$$
(9)

The procedure of the nonlinear case is the same as the linear one, so the details are omitted.
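As an illustration, the assignment rule (9) under the Gaussian kernel used in Sect. 4 can be sketched as follows. Here \(K(x,X)\) is the m-vector of kernel evaluations against the training samples, as in (8); the helper names are ours and \(\mu\) is a user-set parameter:

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Gaussian kernel matrix; A: n x p, B: n x q -> p x q."""
    sq = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-mu * sq)

def assign_label_nonlinear(x, X, W, b, mu):
    """Rule (9): X is n x m training samples, W is m x k (columns w_i)."""
    kx = gaussian_kernel(X, x[:, None], mu)[:, 0]   # the m-vector K(x, X)
    dev = np.abs(W.T @ kx + b)                      # |K(x,X)^T w_i + b_i|
    return int(np.argmin(dev))
```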

3.2 Solution

In this subsection, we study the solution to problem (5); the corresponding problem for nonlinear RampTWSVC is solved similarly. For convenience, let \(u_i=(w_i^\top ,b_i)^\top\), let \(Z_i\) be the matrix whose jth column \(z_j\) is \(x_j\) augmented with an additional feature 1 (for each \(x_j\) belonging to the ith cluster), and let \({\hat{Z}}_i\) be the analogous matrix built from the samples that do not belong to the ith cluster. After some algebra, problem (5) is recast as

$$\begin{aligned} \begin{array}{l} \underset{u_i}{\min }\,\,\frac{1}{2}||u_i||^2+c_1e^\top (-1+\Delta -Z_i^\top u_i)_+ +c_1e^\top (-1+\Delta \\ \quad +\,Z_i^\top u_i)_++c_2e^\top (1+\Delta -{\hat{Z}}_i^\top u_i)_+ +c_2e^\top (1+\Delta \\ \quad +\,{\hat{Z}}_i^\top u_i)_+-c_1e^\top (s-2+\Delta -Z_i^\top u_i)_+ -c_1e^\top (s-2+\Delta \\ \quad +\,Z_i^\top u_i)_+-c_2e^\top (s-{\hat{Z}}_i^\top u_i)_+ -c_2e^\top (s+{\hat{Z}}_i^\top u_i)_+. \end{array} \end{aligned}$$
(10)

It is easy to see that the above problem is a nonconvex programming problem because of the concave parts \(-(\cdot )_+\). By introducing two auxiliary vectors \(p_1\in \{-1,0,1\}^{m_i}\) and \(p_2\in \{-1,0,1\}^{m-m_i}\), where \(m_i\) is the number of samples in the current ith cluster, the above problem is equivalent to the following mixed-integer programming problem

$$\begin{aligned} \begin{array}{l} \underset{u_i,p_1,p_2}{\min }\,\,\frac{1}{2}||u_i||^2+c_1e^\top (-1+\Delta -Z_i^\top u_i)_+\\ \quad +\,c_1e^\top (-1+\Delta +Z_i^\top u_i)_++c_2e^\top (1+\Delta -{\hat{Z}}_i^\top u_i)_+\\ \quad +\,c_2e^\top (1+\Delta +{\hat{Z}}_i^\top u_i)_++c_1p_1^\top Z_i^\top u_i +c_2p_2^\top {\hat{Z}}_i^\top u_i\\ \mathrm{s.t.}\,\,\,\,p_1(j)=\left\{ \begin{array}{ll} -1&\quad \text {if}\,\,z_j^\top u_i>2-\Delta -s,\\ 1&\quad \text {if}\,\,z_j^\top u_i<-2+\Delta +s,\\ 0&\quad \text {otherwise}, \end{array}\right. \forall z_j\in Z_i\\ \quad \,\,\,\,\,\,p_2(j)=\left\{ \begin{array}{ll} -1&\quad \text {if}\,\, z_j^\top u_i>-s,\\ 1&\quad \text {if}\,\,z_j^\top u_i<s,\\ 0&\quad \text {otherwise}, \end{array}\right. \forall z_j\in {\hat{Z}}_i \end{array} \end{aligned}$$
(11)

where \(p_1(j)\) and \(p_2(j)\) are the corresponding jth elements of \(p_1\) and \(p_2\), respectively.

Here, we propose an alternating iteration algorithm to solve the mixed-integer programming problem (11). Starting with an initialized \(u_i^{(0)}\), it is easy to compute \(p_1^{(0)}\) and \(p_2^{(0)}\) from the constraints of (11). For fixed \(p_1^{(t-1)}\) and \(p_2^{(t-1)}\) (\(t=1,2,\ldots\)), (11) becomes an unconstrained convex problem whose solution can be obtained easily by many algorithms (e.g., sequential minimal optimization (SMO) [29] or the fast Newton–Armijo algorithm [30]). Once \(u_i^{(t)}\) is obtained, \(p_1^{(t)}\) and \(p_2^{(t)}\) are updated again. The loop continues until the objective of (11) no longer decreases.
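For clarity, the following is a minimal NumPy sketch of the alternating iteration, in which a plain subgradient method stands in for the SMO or Newton–Armijo solvers cited above. This is an illustrative reimplementation under our own naming, not the released MATLAB code:

```python
import numpy as np

def plus(v):
    """Elementwise (.)_+: replace negative parts by zeros."""
    return np.maximum(v, 0.0)

def update_p(u, Z, Zh, delta, s):
    """Integer update: evaluate the constraints of (11) at the current u."""
    zu, zhu = Z.T @ u, Zh.T @ u
    p1 = np.where(zu > 2 - delta - s, -1.0, np.where(zu < -2 + delta + s, 1.0, 0.0))
    p2 = np.where(zhu > -s, -1.0, np.where(zhu < s, 1.0, 0.0))
    return p1, p2

def G(u, Z, Zh, p1, p2, c1, c2, delta):
    """Objective of (11) for fixed p1, p2."""
    zu, zhu = Z.T @ u, Zh.T @ u
    return (0.5 * u @ u
            + c1 * np.sum(plus(-1 + delta - zu) + plus(-1 + delta + zu))
            + c2 * np.sum(plus(1 + delta - zhu) + plus(1 + delta + zhu))
            + c1 * p1 @ zu + c2 * p2 @ zhu)

def solve_convex(u0, Z, Zh, p1, p2, c1, c2, delta, steps=2000):
    """Subgradient descent on the convex subproblem (stand-in for SMO/Newton)."""
    u, best, best_val = u0.copy(), u0.copy(), np.inf
    for t in range(1, steps + 1):
        zu, zhu = Z.T @ u, Zh.T @ u
        g = (u
             + c1 * Z @ ((zu > 1 - delta).astype(float)
                         - (zu < -1 + delta).astype(float) + p1)
             + c2 * Zh @ ((zhu > -1 - delta).astype(float)
                          - (zhu < 1 + delta).astype(float) + p2))
        u = u - g / np.sqrt(t)                 # diminishing step size
        val = G(u, Z, Zh, p1, p2, c1, c2, delta)
        if val < best_val:
            best, best_val = u.copy(), val
    return best

def ramp_plane(Z, Zh, c1, c2, delta=0.3, s=-0.2, max_iter=50, tol=1e-6):
    """Alternate between u and (p1, p2) until the objective stops decreasing.

    Z, Zh: (n+1) x m_i and (n+1) x (m - m_i); columns are samples (x_j, 1).
    """
    u, prev = 0.01 * np.ones(Z.shape[0]), np.inf
    for _ in range(max_iter):
        p1, p2 = update_p(u, Z, Zh, delta, s)
        u = solve_convex(u, Z, Zh, p1, p2, c1, c2, delta)
        val = G(u, Z, Zh, p1, p2, c1, c2, delta)
        if prev - val < tol:
            break
        prev = val
    return u                                   # u = (w_i^T, b_i)^T
```

Any convex solver can replace `solve_convex`; the integer update `update_p` mirrors the constraints of (11) exactly, which is what drives the finite termination argument in Theorem 3.1 below.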

Theorem 3.1

The above alternating iteration algorithm terminates in a finite number of iterations at a local optimal point, where a local optimal point of the mixed-integer programming problem (11) is defined as a point \((u_i^*,p_1^*,p_2^*)\) such that \(u_i^*\) is the global solution to problem (11) with fixed \((p_1^*,p_2^*)\), and vice versa.

Proof

From the procedure of the alternating iteration algorithm, it is clear that the global solutions to (11) with fixed \(u_i\) or fixed \((p_1,p_2)\) are obtained in each iteration. Since there are only finitely many choices of \(p_1\) and \(p_2\), there exist two finite numbers \(0<r_1<r_2\) such that \((p_1^{(r_1)},p_2^{(r_1)})=(p_1^{(r_2)},p_2^{(r_2)})\). Thus, we have \(u_i^{(r_1)}=u_i^{(r_2)}\), i.e., the objective values in the \(r_1\)th and \(r_2\)th iterations are equal. Since \(p_1^\top Z_i^\top u_i\le 0\) and \(p_2^\top {\hat{Z}}_i^\top u_i\le 0\) always hold, the objective value of (11) is nonincreasing over the iterations. Therefore, the objective is invariant after the \(r_1\)th iteration, and the algorithm terminates at the \(r_1\)th iteration.

Let us consider the point \((u_i^{(r_1)},p_1^{(r_1)},p_2^{(r_1)})\). From the above proof, we have \(G(u_i^{(r_1)},p_1^{(r_1)},p_2^{(r_1)})=G(u_i^{(r_1)},p_1^{(r_1+1)},p_2^{(r_1+1)})\), where \(G(\cdot )\) is the objective value of (11). If there is more than one global solution to (11) with fixed \(u_i\), we always select the same one for the same \(u_i\). Thus, we have \((p_1^{(r_1)},p_2^{(r_1)})=(p_1^{(r_1+1)},p_2^{(r_1+1)})\), which indicates \((u_i^{(r_1)},p_1^{(r_1)},p_2^{(r_1)})\) is a local optimal point. \(\square\)

Table 1 Details of the benchmark datasets, where m is the number of samples, n is the number of dimensions and k is the number of classes
Table 2 Linear clustering on benchmark datasets
Table 3 Nonlinear clustering on benchmark datasets

4 Experimental results

In this section, we analyze the performance of our RampTWSVC compared with kmeans [10], kPC [14], PPC [15], TWSVC [17], RTWSVC [18] and FRTWSVC [18] on several benchmark datasets [31, 32]. All the methods were implemented in MATLAB 2017a on a PC with an Intel dual-core 4.2 GHz processor and 16 GB RAM. The parameter c in PPC, TWSVC, RTWSVC and FRTWSVC and the parameters \(c_1\), \(c_2\) in RampTWSVC were selected from \(\{2^i|i=-8,-7,\ldots ,-1\}\). For the nonlinear case, the Gaussian kernel \(K(x_1,x_2)=\exp \{-\mu ||x_1-x_2||^2\}\) [28] was used, and its parameter \(\mu\) was selected from \(\{2^i|i=-10,-9,\ldots ,1\}\). Random initialization was used for kmeans, and the nearest neighbor graph (NNG) initialization [17] was used for the other methods. In the experiments, we used the accuracy (AC) [17] and mutual information (MI) [33] metrics to measure the performance of these methods. We ran kmeans ten times on each dataset and recorded the mean and standard deviation of its metrics.
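For reproducibility, the two metrics can be computed as below, assuming AC is the best-match accuracy obtained by the Hungarian algorithm (a common convention for the AC of [17]) and MI is taken as normalized mutual information; this is an illustrative sketch using SciPy and scikit-learn, not the code used in our experiments:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """AC: accuracy under the best one-to-one matching of cluster labels.

    Assumes labels are integers starting from 0.
    """
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=int)          # contingency table
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)   # maximize matched pairs
    return count[rows, cols].sum() / len(y_true)

def mutual_information(y_true, y_pred):
    """MI: normalized mutual information between the two labelings."""
    return normalized_mutual_info_score(y_true, y_pred)
```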

Table 4 Linear clustering rank in Table 2
Table 5 Nonlinear clustering rank in Table 3
Fig. 2

Illustration of parameter influence of linear RampTWSVC on four datasets, where the parameters include \(c_1\) and \(c_2\)

Table 1 shows the details of the benchmark datasets. Tables 2 and 3 exhibit the linear and nonlinear clustering results on the benchmark datasets, respectively, with the highest metrics on each dataset in bold. From Table 2, it can be seen that our linear RampTWSVC performs better than the other linear methods on seven datasets in terms of both AC and MI, and it is more accurate than the other methods on six further datasets. On the remaining nine datasets, our linear RampTWSVC is also competitive with the best method. From Table 3, it is obvious that our nonlinear RampTWSVC achieves much higher AC and MI than the other methods on most of the datasets. More precisely, Tables 4 and 5 report the ranks of the seven methods on these datasets, derived from Tables 2 and 3, respectively, where equal numbers imply equal performance and smaller numbers imply better performance. The mean ranks are reported in the last two columns of these tables. Our RampTWSVC has the smallest mean ranks among these methods, which indicates that it performs better than the other methods on these datasets overall. It is worth noticing that there are several datasets on which other methods perform better than RampTWSVC. For example, kmeans obtains much higher performance than the others on dataset (g) in Table 2, which implies that points fit the cluster centers of dataset (g) better than planes do. Among the plane-based methods, our RampTWSVC cuts off the influence of the samples far from the cluster centers, and the cutoff bounds are decided by the parameters, so parameter selection is very important in RampTWSVC.

To evaluate the parameter influence in our RampTWSVC, we report, as examples, the AC metric under different parameter settings on four datasets. Figure 2 exhibits the results of linear RampTWSVC with parameters \(c_1\) and \(c_2\). It can be seen that these two parameters play an important role in this method; hence, they should be selected carefully. Figure 3 exhibits the results of nonlinear RampTWSVC with parameters \(c_1\), \(c_2\) and \(\mu\), where \(c_1=c_2\) for simplicity. From Fig. 3, we observe that a desirable result cannot be obtained if \(\mu\) is improper, which indicates that \(\mu\) is more important than the other two parameters for nonlinear RampTWSVC. This is because the kernel mapping plays a more important role in the nonlinear case.

Fig. 3

Illustration of parameter influence of nonlinear RampTWSVC on four datasets, where the parameters include \(c_1\), \(c_2\) and \(\mu\), and we set \(c_1=c_2\)

5 Conclusions

A plane-based clustering method (RampTWSVC) has been proposed in this paper. RampTWSVC employs bounded ramp functions to reduce the influence of the samples far from the cluster centers, which makes it robust. The nonconvex programming problems in RampTWSVC are recast as a series of mixed-integer programming problems and solved by a proposed alternating iteration algorithm, and convergence to local solutions of the nonconvex problems is guaranteed in theory. RampTWSVC has both linear and nonlinear formulations. Experimental results have indicated that our RampTWSVC performs better than other plane-based clustering methods on many benchmark datasets. For practical convenience, the corresponding MATLAB code for RampTWSVC has been uploaded to http://www.optimal-group.org/Resources/Code/RampTWSVC.html. Future work includes parameter regulation and efficient solver design for RampTWSVC.