1 Introduction

An increasing number of machine learning approaches have become prevalent in recent years, such as artificial neural networks [1], multi-objective optimization [2] and the support vector machine (SVM) [3], each with its own pros and cons. The SVM, proposed by Vapnik, is based on statistical learning theory. Guided by the VC dimension and the structural risk minimization principle, it achieves good generalization ability through a compromise between empirical risk and model complexity. It is very effective for small-sample, nonlinear and high-dimensional problems, and it also mitigates the curse of dimensionality and over-fitting to some extent. Nowadays, SVM has been applied to many fields, including text classification [4], disease detection [5, 6], driver fatigue detection [7] and object detection [8], etc. Nevertheless, SVM still has some shortcomings, and many improvements have been put forward in recent years.

One drawback of classic SVM is its slow computational speed. To alleviate this problem, Jayadeva et al. [9] proposed the twin SVM (TSVM) for binary classification problems. TSVM generates two nonparallel hyperplanes by solving two smaller-sized quadratic programming problems (QPPs) rather than one large QPP, which in theory makes it approximately four times faster than classic SVM. Meanwhile, TSVM requires the data points of each class to be as close as possible to one hyperplane and as far as possible from the other. Many algorithms [10,11,12,13,14,15,16,17] have been proposed on the basis of TSVM.

Most TSVM-based algorithms require a matrix inverse operation. However, the matrix may not be invertible, and computing a matrix inverse is expensive. To remedy this problem, the twin hypersphere support vector machine (THSVM) [18] was proposed. It finds two independent hyperspheres rather than two nonparallel hyperplanes by solving a pair of smaller-sized QPPs, and it requires each hypersphere to capture as many data points of one class as possible. Because two centers and two radii must be found, THSVM is still computationally costly. In addition, it cannot handle imbalanced data classification problems well.

To decrease the computational cost and handle imbalanced data classification more efficiently, Xu [19] proposed the maximum margin of twin spheres support vector machine (MMTSSVM). It finds two homocentric hyperspheres rather than two independent hyperspheres by solving one QPP and one linear programming problem (LPP).

All the algorithms mentioned above use the Hinge loss function, which makes the models sensitive to outliers. Outliers typically receive the largest Hinge loss values, so the decision hyperplane is drawn toward them, which degrades generalization performance. Hence, Huang et al. [20] proposed the ramp loss support vector machine (RSVM). With the Ramp loss function, outliers receive a fixed loss value, so RSVM is less sensitive to them. Because RSVM leads to a non-convex optimization problem, the authors adopt the effective Concave-Convex Procedure (CCCP) [21] to solve it. Several algorithms, such as RLSSVM [22], have since been proposed on the basis of RSVM.

In daily life, multi-class classification problems are more common. Researchers have proposed a great number of multi-class classification strategies, such as One-vs.-One (OVO) [23, 24], One-vs.-Rest (OVR) [25, 26], Rest-vs.-One (RVO) [27, 28] and One-vs.-One-vs.-Rest (OVOVR) [29,30,31]. For a K-class problem, OVO combines the K classes in pairs: the i-th class is chosen as the positive class and the j-th class as the negative class to generate one classifier, so K(K−1)/2 classifiers are obtained in total. In the OVR method, the i-th class is chosen as the positive class and the remaining classes as the negative class, which yields K classifiers. RVO is the opposite of OVR: it takes the i-th class as the negative class and the remaining classes as the positive class, also yielding K classifiers. OVOVR is similar to OVO in that it also combines the classes in pairs, selecting the i-th class as the positive class, the j-th class as the negative class and the remaining classes as the rest class for each classifier, so it also generates K(K−1)/2 classifiers. All of these strategies determine the class of a new data point by a ‘voting’ scheme.

To improve the generalization performance of MMTSSVM, we propose a Ramp loss maximum margin of twin spheres support vector machine (Ramp-MMTSSVM) in this article. We adopt MMTSSVM as the core of our algorithm because of its fast computational speed and its effectiveness on imbalanced data classification problems. Meanwhile, we replace the Hinge loss function with the Ramp loss function to decrease the sensitivity of MMTSSVM to outliers. After introducing the Ramp loss function, the optimization problem becomes a non-differentiable non-convex problem, which cannot be solved by quadratic programming directly; therefore, we use the CCCP approach to deal with it. We also discuss the properties of the parameters in Ramp-MMTSSVM, which helps determine their ranges. In addition, we extend Ramp-MMTSSVM to multi-class classification problems by the RVO strategy, termed the multicategory Ramp loss maximum margin of twin spheres support vector machine (MRMMTSSVM). On twenty benchmark datasets, we compare Ramp-MMTSSVM with THSVM, SSLM and MMTSSVM, and MRMMTSSVM with OVO-THSVM, THKSVM and OVR-MMTSSVM. The experimental results indicate that our algorithms outperform the other algorithms.

The paper is organized as follows. Section 2 gives a brief introduction to MMTSSVM and RSVM. In Section 3, we introduce Ramp-MMTSSVM. Section 4 presents our multi-class algorithm, MRMMTSSVM. Artificial experiments and experiments on twenty benchmark datasets that verify the effectiveness of our algorithms are reported in Section 5. Section 6 concludes the paper.

2 Background

In this section, we review the basics of MMTSSVM and RSVM.

To make the presentation easier to follow, we first introduce some notation. Given a training set {(x1,y1),…,(xl,yl)}, where \(\mathbf {x}_{i} \in \mathcal {R}^{m}\) and yi ∈{+ 1,− 1}, i = 1,…,l, the set contains \(l^{+}\) positive data points and \(l^{-}\) negative data points. In addition, \(I^{+}\) and \(I^{-}\) denote the positive and negative index sets, respectively. ϕ is a nonlinear mapping from the input space into a higher-dimensional feature space, and the kernel function K(xi,xj) = (ϕ(xi) ⋅ ϕ(xj)) is selected in advance.
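
For example, with the Gaussian kernel used later in the experiments, the kernel matrix can be built as follows (a minimal sketch; the function name and the value of σ are illustrative):

```python
import numpy as np

def rbf_kernel_matrix(X, Z, sigma=1.0):
    """Gaussian kernel K[i, j] = exp(-||X[i] - Z[j]||^2 / sigma^2)."""
    # squared distances via ||x - z||^2 = ||x||^2 - 2 x.z + ||z||^2
    sq = (X ** 2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z ** 2).sum(1)[None, :]
    return np.exp(-sq / sigma ** 2)

X = np.random.randn(5, 2)              # five 2-D points
K = rbf_kernel_matrix(X, X, sigma=2.0)
```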

2.1 Maximum margin of twin spheres support vector machine

MMTSSVM is good at dealing with imbalanced data classification problems. It aims to find two homocentric spheres: the small sphere should cover as many positive data points as possible, while the large sphere should push out as many negative data points as possible. In addition, it follows the maximum margin principle, which requires the margin between the small sphere and the large sphere to be as large as possible. The optimization problems of MMTSSVM are as follows,

$$\begin{array}{@{}rcl@{}} \min \limits_{R^{2}, C, \xi_{i}} &&R^{2} - \frac{\nu}{l^{-}} \sum\limits_{j \in I^{-}} \| \phi (\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1} l^{+}} \sum\limits_{i \in I^{+}} \xi_{i}\\ s.t. &&\| \phi (\mathbf{x}_{i}) - C \|^{2} \leq R^{2} + \xi_{i},\\ && \xi_{i} \geq 0, i \in I^{+}, \end{array} $$
(1)

and

$$\begin{array}{@{}rcl@{}} \min \limits_{\rho^{2}, \eta_{j}} &&R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-}} \sum\limits_{j \in I^{-}} \eta_{j}\\ s.t. &&\|\phi (\mathbf{x}_{j}) - C \|^{2} \geq R^{2} + \rho^{2} - \eta_{j},\\ &&\eta_{j} \geq 0, j \in I^{-}, \end{array} $$
(2)

where ν,ν1 and ν2 are parameters chosen a priori. C and R are the center and the radius of the small sphere, respectively. \(\sqrt {R^{2}+\rho ^{2}}\) is the radius of the large sphere.

The decision function of MMTSSVM is shown as follows,

$$\begin{array}{@{}rcl@{}} f (\mathbf{x}) = \left\{ \begin{array}{rcl} 1, & & {if \| \phi (\mathbf{x}) -C \| < \frac{R + \sqrt{R^{2} + \rho^{2}}}{2}}\\ -1, & & {else}. \end{array} \right. \end{array} $$
(3)
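
As an illustration of decision rule (3), a minimal sketch follows; here dist_sq stands for ∥ϕ(x) − C∥², which in practice is computed through the kernel expansion of the center C (the function name is ours):

```python
import numpy as np

def mmtssvm_predict(dist_sq, R_sq, rho_sq):
    """Decision rule (3).

    dist_sq : array of ||phi(x) - C||^2 for the test points (assumed precomputed
              from the kernel expansion of the center C)
    R_sq    : R^2, squared radius of the small sphere
    rho_sq  : rho^2, so the large sphere has radius sqrt(R^2 + rho^2)
    """
    threshold = 0.5 * (np.sqrt(R_sq) + np.sqrt(R_sq + rho_sq))
    return np.where(np.sqrt(dist_sq) < threshold, 1, -1)
```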

2.2 Ramp loss SVM

Classic SVM employs the Hinge loss function \(H_{s}(z)=\max (0, s-z)\), where s indicates the position of the hinge point, to penalize examples classified with an insufficient margin. With the Hinge loss function, the objective function can be written as follows,

$$\begin{array}{@{}rcl@{}} \min_{\mathbf{w}, b} \quad \frac{1}{2} \| \mathbf{w} \|^{2} + c\sum\limits_{i = 1}^{l} H_{1} \left( y_{i}f(\mathbf{x}_{i})\right), \end{array} $$
(4)

where \(f(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}+b\) is the decision function.

From Fig. 1, we can see that outliers tend to receive the largest loss values under the Hinge loss function. Therefore, outliers have a negative influence on constructing the hyperplane, and the model is sensitive to them. To mitigate this problem, researchers proposed the Ramp loss function, also known as the robust Hinge loss. The Ramp loss function is defined as follows,

$$\begin{array}{@{}rcl@{}} R_{s}(z) = \left\{ \begin{array}{rcl} 0, & & {z > 1}\\ 1 - z, & & {s \leq z \leq 1}\\ 1 - s, & & {z < s}, \end{array} \right. \end{array} $$
(5)

where s < 1 is given a priori. As shown in Fig. 1, the Ramp loss function can be decomposed into the sum of a convex Hinge loss H1(z) and a concave loss −Hs(z), i.e., Rs(z) = H1(z) − Hs(z). After introducing the Ramp loss function, RSVM is formulated as follows,

$$\begin{array}{@{}rcl@{}} &&\min_{\mathbf{w}, b} \quad \frac{1}{2}\|\mathbf{w}\|^{2} + c\sum\limits_{i = 1}^{l} R_{s}\left( y_{i}f(\mathbf{x}_{i})\right)\\ &=&\frac{1}{2}\|\mathbf{w}\|^{2} + c\sum\limits_{i = 1}^{l} H_{1}\left( y_{i}f(\mathbf{x}_{i})\right) - c\sum\limits_{i = 1}^{l} H_{s}\left( y_{i}f(\mathbf{x}_{i})\right). \end{array} $$
(6)

This is a non-differentiable non-convex problem, which is solved by CCCP approach.
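
As a quick illustration (a minimal sketch; the variable names are ours), the Ramp loss can be computed either from its piecewise definition (5) or as the difference H1(z) − Hs(z), and both give the same capped loss for an outlier:

```python
import numpy as np

def hinge(z, s=1.0):
    """Hinge loss with hinge point s: H_s(z) = max(0, s - z)."""
    return np.maximum(0.0, s - z)

def ramp(z, s=-1.0):
    """Ramp loss (5): equals H_1(z) - H_s(z) and is capped at 1 - s."""
    return np.minimum(hinge(z, 1.0), 1.0 - s)

z = np.array([2.0, 0.5, -3.0])           # margins y_i * f(x_i); -3.0 mimics an outlier
print(ramp(z, s=-1.0))                   # [0.  0.5 2. ]  -- outlier loss capped at 1 - s = 2
print(hinge(z, 1.0) - hinge(z, -1.0))    # identical values, verifying R_s = H_1 - H_s
```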

Fig. 1 When s = −1, the geometric decomposition diagram of the Ramp loss function

3 Ramp loss maximum margin of twin spheres support vector machine

Because MMTSSVM is sensitive to outliers, we substitute the Hinge loss function with the Ramp loss function to improve this problem.

We set \(u_{1}=R^{2}-\|\phi (\mathbf {x}_{i})-C\|^{2}\) and \(u_{2}=\|\phi (\mathbf {x}_{j})-C\|^{2}-(R^{2}+\rho ^{2})\). Since the Hinge loss function is \(H_{s}(z)=\max (0,s-z)\), the objective functions of problems (1) and (2) are equal to

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} \quad R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}} H(u_{1}), \end{array} $$
(7)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} } H(u_{2}). \end{array} $$
(8)

3.1 Primal Formulations

After replacing the Hinge loss function with the Ramp loss function, we get the primal formulations as follows,

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} \quad R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}} Q(u_{1}), \end{array} $$
(9)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} } Q(u_{2}), \end{array} $$
(10)

where Q(u1) and Q(u2) are Ramp-type loss functions, defined as follows,

$$\begin{array}{@{}rcl@{}} Q(u_{1})=\left\{ \begin{array}{rcl} 0, & & {u_{1}\geq0}\\ -u_{1}, & & {-k_{1}R^{2}<u_{1}<0}\\ k_{1}R^{2}, & & {u_{1}\leq-k_{1}R^{2}}, \end{array} \right. \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} Q(u_{2})=\left\{ \begin{array}{rcl} 0, & & {u_{2}\geq0}\\ -u_{2}, & & {-k_{2}(R^{2}+\rho^{2})<u_{2}<0}\\ k_{2}(R^{2}+\rho^{2}), & & {u_{2}\leq-k_{2}(R^{2}+\rho^{2})}, \end{array} \right. \end{array} $$
(12)

where k1 and k2 are chosen a priori.
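
For clarity, (11) and (12) simply clip the Hinge-type losses −u1 and −u2 at the caps k1R² and k2(R² + ρ²); a minimal sketch (the function names are ours) is:

```python
import numpy as np

def Q1(u1, k1, R_sq):
    """Loss (11): -u1 on (-k1*R^2, 0), zero for u1 >= 0, capped at k1*R^2."""
    return np.clip(-u1, 0.0, k1 * R_sq)

def Q2(u2, k2, R_sq, rho_sq):
    """Loss (12): same shape, capped at k2*(R^2 + rho^2)."""
    return np.clip(-u2, 0.0, k2 * (R_sq + rho_sq))
```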

Then, the problems (9) and (10) can be rewritten as

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} &&R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2}\\ &&+ \frac{1}{\nu_{1}l^{+}} Q\left( R^{2}-\|\phi(\mathbf{x}_{i})-C\|^{2}\right), \end{array} $$
(13)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} } Q\left( \|\phi(\mathbf{x}_{j}) - C\|^{2} - (R^{2}+\rho^{2})\right). \end{array} $$
(14)

From (11) and (12), we can see that the Ramp loss function limits the maximum loss value, which reduces the influence of outliers on the optimal solution to the problem. Therefore, the model is less sensitive to outliers.

3.2 Dual problems

The Ramp loss can be decomposed into the sum of a convex Hinge loss and a concave function, i.e.,

$$\begin{array}{@{}rcl@{}} Q(u_{i})=H_{1}(u_{i})-H_{2}(u_{i}), \end{array} $$
(15)

where \(H_{1}(u_{i})=\max (-u_{i},0)\) for i = 1,2, \(H_{2}(u_{1})=\max (-u_{1}-k_{1}R^{2},0)\) and \(H_{2}(u_{2})=\max (-u_{2}-k_{2}(R^{2}+\rho ^{2}),0)\).

Problems (13) and (14) can then be rewritten as follows,

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} &&\underbrace{R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}H_{1}(u_{1})}_{J_{vex}}\\ &&\underbrace{-\frac{1}{\nu_{1}l^{+}}H_{2}(u_{1})}_{J_{cav}}, \end{array} $$
(16)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad \underbrace{R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} }H_{1}(u_{2})}_{J_{vex}} \underbrace{-\frac{1}{\nu_{2}l^{-} }H_{2}(u_{2})}_{J_{cav}}. \end{array} $$
(17)

We can see that QPP (16) and LPP (17) both consist of a convex part and a concave part, so they cannot be solved directly by quadratic programming. For this reason, we adopt the CCCP approach, which has a simple iterative structure, to solve them.

We first consider QPP (16). Its objective function is the sum of a convex function u(x) and a concave function v(x). CCCP solves such a problem by iteratively solving a sequence of convex problems in which the concave part is linearized, i.e.,

$$\begin{array}{@{}rcl@{}} \mathbf{x}^{t + 1}=argmin_{\mathbf{x}}{u(\mathbf{x})+\mathbf{x}^{T}\nabla v(\mathbf{x}^{t})}, \end{array} $$
(18)

where t denotes the iteration index. The convex part of QPP (16) is shown as follows,

$$\begin{array}{@{}rcl@{}} J_{vex}(C,R^{2}) = R^{2} - \frac{\nu}{l^{-}}\!\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}H_{1}(u_{1}),\\ \end{array} $$
(19)

and the concave part is shown as follows,

$$\begin{array}{@{}rcl@{}} J_{cav}= -\frac{1}{\nu_{1}l^{+}}H_{2}(u_{1}). \end{array} $$
(20)

Following (18), at each CCCP iteration the concave part Jcav is linearized at the current solution and the resulting convex surrogate is minimized. The surrogate problem can be written as follows:

$$\begin{array}{@{}rcl@{}} \min_{C,R^{2},\xi_{i}} &&R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}\sum\limits_{i\in I^{+}} \xi_{i}\\ &&+J_{cav}^{\prime}\left( R^{2} - \|\phi(\mathbf{x}_{i}) - C\|^{2}\right)\!\cdot\!\left( R^{2} - \|\phi(\mathbf{x}_{i}) - C\|^{2}\right)\\ s.t. &&\|\phi(\mathbf{x}_{i})-C\|^{2}\!\leq\! R^{2}+\xi_{i},\\ && \xi_{i}\geq 0, i\in I^{+}. \end{array} $$
(22)

To simplify the process, we introduce the following symbols,

$$\begin{array}{@{}rcl@{}} \theta_{i}&=&-\frac{1}{\nu_{1}l^{+}}\frac{\partial H_{2}(u_{1})}{\partial u_{1}}\\ &=&\left\{ \begin{array}{rcl} \frac{1}{\nu_{1}l^{+}}, & & {if \|\phi(\mathbf{x}_{i})-C\|^{2}> R^{2}+k_{1}R^{2}}\\ 0, & & {else}. \end{array}\right. \end{array} $$
(23)

Therefore, problem (22) can be rewritten as follows,

$$\begin{array}{@{}rcl@{}} \min_{C,R^{2},\xi_{i}} &&R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}\sum\limits_{i\in I^{+}} \xi_{i}\\ &&+ \sum\limits_{i\in I^{+}} \theta_{i}\left( R^{2}-\|\phi(\mathbf{x}_{i})-C\|^{2}\right)\\ s.t. &&\|\phi(\mathbf{x}_{i})-C\|^{2}\leq R^{2}+\xi_{i},\\ && \xi_{i}\geq 0, i\in I^{+}. \end{array} $$
(24)

Similarly, the CCCP structure of problem (17) can also be given. The convex part of this problem is shown as follows:

$$\begin{array}{@{}rcl@{}} J_{vex}(\rho^{2})=R^{2}-\rho^{2}+\frac{1}{\nu_{2}l^{-}}H_{1}(u_{2}), \end{array} $$
(25)

and the concave part is shown as follows:

$$\begin{array}{@{}rcl@{}} J_{cav}(\rho^{2})=-\frac{1}{\nu_{2}l^{-} }H_{2}(u_{2}). \end{array} $$
(26)

Analogously, linearizing the concave part of problem (17) at the current solution yields the following convex surrogate problem:

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2},\eta_{j}} &&R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} }\sum\limits_{j\in I^{-}}\eta_{j}+J_{cav}^{\prime}\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}\right.\\ &&-\left.\left( R^{2}+\rho^{2}\right)\right)\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}-\left( R^{2}+\rho^{2}\right)\right)\\ s.t. &&\|\phi(\mathbf{x}_{j})-C\|^{2}\geq R^{2}+\rho^{2}-\eta_{j},\\ && \eta_{j}\geq 0, j\in I^{-}. \end{array} $$
(28)

To simplify the process, we introduce the following symbols,

$$\begin{array}{@{}rcl@{}} \delta_{j}&=& -\frac{1}{\nu_{2}l^{-}}\frac{\partial H_{2}(u_{2})}{\partial u_{2}}\\ &=&\left\{ \begin{array}{rcl} \frac{1}{\nu_{2}l^{-}}, & & {if \|\phi(\mathbf{x}_{j})-C\|^{2} < R^{2}+\rho^{2}}\\ \quad & & {-k_{2}\left( R^{2}+\rho^{2}\right)}\\ 0, & & {else}. \end{array}\right. \end{array} $$
(29)

Therefore, problem (28) can be rewritten as follows,

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2},\eta_{j}} &&R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} }\sum\limits_{j\in I^{-}}\eta_{j}\\ &&+\sum\limits_{j\in I^{-}}\delta_{j}\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}-\left( R^{2}+\rho^{2}\right)\right)\\ s.t. &&\|\phi(\mathbf{x}_{j})-C\|^{2}\geq R^{2}+\rho^{2}-\eta_{j},\\ && \eta_{j}\geq 0, j\in I^{-}. \end{array} $$
(30)

To solve the problem (24), we introduce the Lagrange function which is shown as follows,

$$\begin{array}{@{}rcl@{}} L_{1}&=&R^{2}- \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}\sum\limits_{i\in I^{+}} \xi_{i}\\ &&+ \sum\limits_{i\in I^{+}} \theta_{i}\left( R^{2}-\|\phi(\mathbf{x}_{i})-C\|^{2}\right)\\ &&+\sum\limits_{i\in I^{+}}\alpha_{i}\left( \|\phi(\mathbf{x}_{i}) - C\|^{2}-R^{2} - \xi_{i}\right) - \sum\limits_{i\in I^{+}}\beta_{i}\xi_{i}. \end{array} $$
(31)

where αi ≥ 0 and βi ≥ 0 are Lagrange multipliers. Differentiating the Lagrangian L1 with respect to the variables R², C and ξi yields the following Karush-Kuhn-Tucker (KKT) conditions:

$$\begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial R^{2}}= 1+\sum\limits_{i\in I^{+}}\theta_{i}-\sum\limits_{i\in I^{+}}\alpha_{i}= 0, \end{array} $$
(32)
$$\begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial C}&= &\frac{2\nu}{l^{-}}\sum\limits_{j \in I^{-}}\left( \phi(\mathbf{x}_{j})-C\right)+ 2\sum\limits_{i\in I^{+}}\theta_{i}\left( \phi(\mathbf{x}_{i})\right.\\ &&-\left.C\right)-2\sum\limits_{i\in I^{+}}\alpha_{i}\left( \phi(\mathbf{x}_{i})-C\right)= 0, \end{array} $$
(33)
$$\begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial \xi_{i}}=\frac{1}{\nu_{1}l^{+}}-\alpha_{i}-\beta_{i}= 0. \end{array} $$
(34)

From (32), we can get

$$\begin{array}{@{}rcl@{}} \sum\limits_{i\in I^{+}}\left( \alpha_{i}-\theta_{i}\right)= 1. \end{array} $$
(35)

From (33), we can obtain the center C as follows:

$$\begin{array}{@{}rcl@{}} C=\frac{1}{1-\nu}\left( \sum\limits_{i\in I^{+}}(\alpha_{i}-\theta_{i})\phi(\mathbf{x}_{i})-\frac{\nu}{l^{-}}\sum\limits_{j\in I^{-}}\phi(\mathbf{x}_{j})\right). \end{array} $$
(36)

Then

$$\begin{array}{@{}rcl@{}} <C,C>&=& \frac{1}{(1-\nu)^{2}}\left( \sum\limits_{i\in I^{+}}\sum\limits_{j\in I^{+}}(\alpha_{i}-\theta_{i})\right.\\ &&\boldsymbol\cdot (\alpha_{j}-\theta_{j})K(\mathbf{x}_{i},\mathbf{x}_{j})\\ &&+\left( \frac{\nu}{l^{-}}\right)^{2}\sum\limits_{i\in I^{-}}\sum\limits_{j\in I^{-}}K(\mathbf{x}_{i},\mathbf{x}_{j})\\ &&-\left.\frac{2\nu}{l^{-}}\sum\limits_{i\in I^{+}}\sum\limits_{j\in I^{-}}(\alpha_{i}-\theta_{i})K(\mathbf{x}_{i},\mathbf{x}_{j})\right). \end{array} $$
(37)

Finally, we derive the dual formulation as follows:

$$\begin{array}{@{}rcl@{}} \max_{\boldsymbol{\alpha}} &&-\sum\limits_{i\in I^{+}}\sum\limits_{j\in I^{+}}(\alpha_{i}-\theta_{i})(\alpha_{j}-\theta_{j})K(\mathbf{x}_{i},\mathbf{x}_{j})\\ &&+\sum\limits_{i\in I^{+}}(\alpha_{i}-\theta_{i})\left( \frac{2\nu}{l^{-}}\sum\limits_{j\in I^{-}}K(\mathbf{x}_{j},\mathbf{x}_{i})\right.\\ &&+\left.(1-\nu)K(\mathbf{x}_{i},\mathbf{x}_{i})\vphantom{\sum\limits_{j\in I^{-}}}\right)\\ s.t. &&0\leq \alpha_{i}\leq \frac{1}{\nu_{1}l^{+}}, \sum\limits_{i\in I^{+}}\left( \alpha_{i}-\theta_{i}\right)= 1. \end{array} $$
(38)
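
For illustration, the convex subproblem (38) for fixed θ is a standard QP; a minimal sketch using NumPy and the CVXOPT solver (the helper name solve_alpha and the arguments K_pp, K_pn for the positive-positive and positive-negative kernel blocks are our notation) might look as follows:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_alpha(K_pp, K_pn, theta, nu, nu1, l_pos, l_neg):
    """Solve the QPP (38) in alpha for fixed theta (one CCCP step).

    K_pp : kernel matrix among positive points (l+ x l+)
    K_pn : kernel matrix between positive and negative points (l+ x l-)
    """
    v = (2.0 * nu / l_neg) * K_pn.sum(axis=1) + (1.0 - nu) * np.diag(K_pp)
    # standard form: minimize (1/2) a' P a + q' a
    P = matrix(2.0 * K_pp)
    q = matrix(-(2.0 * K_pp @ theta + v).reshape(-1, 1))
    # box constraints 0 <= alpha_i <= 1 / (nu1 * l+)
    G = matrix(np.vstack([-np.eye(l_pos), np.eye(l_pos)]))
    h = matrix(np.hstack([np.zeros(l_pos),
                          np.full(l_pos, 1.0 / (nu1 * l_pos))]).reshape(-1, 1))
    # equality constraint sum_i (alpha_i - theta_i) = 1
    A = matrix(np.ones((1, l_pos)))
    b = matrix(np.array([[1.0 + theta.sum()]]))
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```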

In order to obtain the radius of the small sphere, we use the positive points xi whose Lagrange multipliers αi satisfy \(0<\alpha _{i}<\frac {1}{\nu _{1}l^{+}}\). Suppose the number of such points is np. According to the KKT condition \(\alpha_{i}(\|\phi(\mathbf{x}_{i})-C\|^{2}-R^{2}-\xi_{i})=0\), we can obtain the squared radius of the small sphere as

$$\begin{array}{@{}rcl@{}} R^{2}\! &=&\!\frac{1}{n_{p}}\sum\limits_{i = 1}^{n_{p}}\|\phi(\mathbf{x}_{i})-C\|^{2}=\frac{1}{n_{p}}\sum\limits_{i = 1}^{n_{p}}\left( \vphantom{\sum\limits_{j\in I^{-}}}K(\mathbf{x}_{i},\mathbf{x}_{i})+C^{2}\right.\\ && - \left.\frac{2}{1 - \nu}\!\!\left( \sum\limits_{j\in I^{+}}\!(\alpha_{j} - \theta_{j}){\kern-.5pt}K\!(\mathbf{x}_{i}{\kern-.5pt},{\kern-.5pt}\mathbf{x}_{j}{\kern-.5pt}) - \frac{\nu}{l^{-}}\!\sum\limits_{j\in I^{-}}\!K\!(\mathbf{x}_{i}{\kern-.5pt},{\kern-.5pt}\mathbf{x}_{j}{\kern-.5pt})\!\right)\!\!\right)\!.\\ \end{array} $$
(39)

To solve the problem (30), we introduce the Lagrange function which is shown as follows,

$$\begin{array}{@{}rcl@{}} L_{2}&=&R^{2}-\rho^{2}+\frac{1}{\nu_{2}l^{-}}\sum\limits_{j\in I^{-}}\eta_{j}+\sum\limits_{j\in I^{-}}\delta_{j}\left( \| \phi(\mathbf{x}_{j}) - C \|^{2}\right.\\ &&-\left.(R^{2}+\rho^{2})\right)-\sum\limits_{j\in I^{-}}\gamma_{j}\left( \| \phi(\mathbf{x}_{j}) - C \|^{2} - R^{2} - \rho^{2}\right.\\ &&+\left.\eta_{j}\right)-\sum\limits_{j\in I^{-}}\lambda_{j}\eta_{j}. \end{array} $$
(40)

where γj ≥ 0 and λj ≥ 0 are Lagrange multipliers. Differentiating the Lagrangian L2 with respect to the variables ρ² and ηj yields the following KKT conditions:

$$\begin{array}{@{}rcl@{}} \frac{\partial L_{2}}{\partial \rho^{2}}=-1-\sum\limits_{j\in I^{-}}\delta_{j}+\sum\limits_{j\in I^{-}}\gamma_{j}= 0, \end{array} $$
(41)
$$\begin{array}{@{}rcl@{}} \frac{\partial L_{2}}{\partial \eta_{j}}=\frac{1}{\nu_{2}l^{-}}-\gamma_{j}-\lambda_{j}= 0. \end{array} $$
(42)

From (41), we can obtain

$$\begin{array}{@{}rcl@{}} \sum\limits_{j\in I^{-}}\left( \gamma_{j}-\delta_{j}\right)= 1. \end{array} $$
(43)

Finally, we derive the dual formulation as follows:

$$\begin{array}{@{}rcl@{}} \max_{\boldsymbol{\gamma}} &&-\sum\limits_{j\in I^{-}}\left( \gamma_{j}-\delta_{j}\right)\left( \| \phi(\mathbf{x}_{j}) - C \|^{2}\right)\\ s.t. &&\sum\limits_{j\in I^{-}}(\gamma_{j}-\delta_{j})= 1, 0\leq \gamma_{j}\leq \frac{1}{\nu_{2}l^{-}}. \end{array} $$
(44)
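
Similarly, for fixed δ the dual (44) is a linear program in γ; a minimal sketch using scipy.optimize.linprog (the names solve_gamma and dist_sq_neg are our notation) could be:

```python
import numpy as np
from scipy.optimize import linprog

def solve_gamma(dist_sq_neg, delta, nu2, l_neg):
    """Solve the LPP (44) in gamma for fixed delta (one CCCP step).

    dist_sq_neg : ||phi(x_j) - C||^2 for every negative point (length l-)
    """
    c = dist_sq_neg                          # minimize sum_j gamma_j * d_j (constant dropped)
    A_eq = np.ones((1, l_neg))
    b_eq = np.array([1.0 + delta.sum()])     # from sum_j (gamma_j - delta_j) = 1
    bounds = [(0.0, 1.0 / (nu2 * l_neg))] * l_neg
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x
```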

To obtain ρ², we use the negative points xj whose Lagrange multipliers γj satisfy \(0<\gamma _{j}<\frac {1}{\nu _{2}l^{-}}\). Suppose the number of such points is nn. According to the KKT condition \(\gamma_{j}(\|\phi(\mathbf{x}_{j})-C\|^{2}-R^{2}-\rho^{2}+\eta_{j})=0\), we obtain

$$\begin{array}{@{}rcl@{}} \rho^{2}=\frac{1}{n_{n}}\sum\limits_{j = 1}^{n_{n}}\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}-R^{2}\right). \end{array} $$
(45)

The CCCP procedure for Ramp-MMTSSVM can then be summarized as follows: starting from initial weights (e.g., θi = 0 and δj = 0), we solve the convex subproblems (38) and (44), recover R² from (39) and ρ² from (45), update θi by (23) and δj by (29), and repeat until the weights no longer change.
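
Since the original algorithm box is not reproduced here, the following sketch outlines the whole CCCP loop; it reuses the solve_alpha and solve_gamma sketches above, and the kernel expansion of ∥ϕ(x) − C∥² follows (36)-(37). The initialization and the convergence check on the weights are our assumptions.

```python
import numpy as np

def dist_sq_to_center(K_xp, K_xn, K_xx_diag, beta, K_pp, K_pn, K_nn, nu, l_neg):
    """||phi(x) - C||^2 with C expanded as in (36) and <C, C> as in (37).

    beta = alpha - theta; K_xp / K_xn are the kernel blocks between the query
    points and the positive / negative training points."""
    cross = (K_xp @ beta - (nu / l_neg) * K_xn.sum(axis=1)) / (1.0 - nu)
    CC = (beta @ K_pp @ beta
          + (nu / l_neg) ** 2 * K_nn.sum()
          - 2.0 * nu / l_neg * beta @ K_pn.sum(axis=1)) / (1.0 - nu) ** 2
    return K_xx_diag - 2.0 * cross + CC

def ramp_mmtssvm_cccp(K_pp, K_pn, K_nn, nu, nu1, nu2, k1, k2, max_iter=50):
    """CCCP loop of Ramp-MMTSSVM (sketch)."""
    l_pos, l_neg = K_pp.shape[0], K_nn.shape[0]
    theta, delta = np.zeros(l_pos), np.zeros(l_neg)   # start from the Hinge-loss solution
    for _ in range(max_iter):
        alpha = solve_alpha(K_pp, K_pn, theta, nu, nu1, l_pos, l_neg)        # QPP (38)
        beta = alpha - theta
        d_pos = dist_sq_to_center(K_pp, K_pn, np.diag(K_pp), beta,
                                  K_pp, K_pn, K_nn, nu, l_neg)
        d_neg = dist_sq_to_center(K_pn.T, K_nn, np.diag(K_nn), beta,
                                  K_pp, K_pn, K_nn, nu, l_neg)
        sv_p = (alpha > 1e-8) & (alpha < 1.0 / (nu1 * l_pos) - 1e-8)
        R_sq = d_pos[sv_p].mean()                                            # (39)
        gamma = solve_gamma(d_neg, delta, nu2, l_neg)                        # LPP (44)
        sv_n = (gamma > 1e-8) & (gamma < 1.0 / (nu2 * l_neg) - 1e-8)
        rho_sq = (d_neg[sv_n] - R_sq).mean()                                 # (45)
        theta_new = np.where(d_pos > (1.0 + k1) * R_sq,
                             1.0 / (nu1 * l_pos), 0.0)                       # update (23)
        delta_new = np.where(d_neg < (1.0 - k2) * (R_sq + rho_sq),
                             1.0 / (nu2 * l_neg), 0.0)                       # update (29)
        if np.array_equal(theta_new, theta) and np.array_equal(delta_new, delta):
            break                               # weights unchanged: CCCP has converged
        theta, delta = theta_new, delta_new
    return beta, R_sq, rho_sq
```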

3.3 Properties of parameters ν1 and ν2

In Ramp-MMTSSVM, the parameters ν1 and ν2 have theoretical significance. In the following, we analyze their properties.

To make it easier to understand, we first give the following two definitions.

Definition 1

With respect to the small sphere, the positive data points can be divided into four sets: S+ = {i | ∥ϕ(xi) − C∥² < R²}, B+ = {i | ∥ϕ(xi) − C∥² = R²}, R+ = {i | R² < ∥ϕ(xi) − C∥² < R² + k1R²} and W+ = {i | ∥ϕ(xi) − C∥² ≥ R² + k1R²}.

Definition 2

With respect to the large sphere, the negative data points can be divided into four sets: S− = {j | ∥ϕ(xj) − C∥² ≤ R² + ρ² − k2(R² + ρ²)}, B− = {j | R² + ρ² − k2(R² + ρ²) < ∥ϕ(xj) − C∥² < R² + ρ²}, R− = {j | ∥ϕ(xj) − C∥² = R² + ρ²} and W− = {j | ∥ϕ(xj) − C∥² > R² + ρ²}.

According to the two definitions above, we can obtain the following propositions.

Proposition 1

Let n1 represent the number of positive samples in B+, R+ and W+, n2 represent the number of positive samples in R+ and W+, and n3 represent the number of positive samples in W+. Thus, we can obtain \(\frac {n_{2}-n_{3}}{l^{+}}\leq \nu _{1} \leq \frac {n_{1}-n_{3}}{l^{+}}\).

Proof

If xi (i = 1,2,…,l+) is a positive data point in S+, then ξi = 0, and according to the KKT condition αi(∥ϕ(xi) − C∥² − R² − ξi) = 0 we get αi = 0; from (23), θi = 0. If xi is a positive data point in B+, then ξi = 0 and, by the KKT condition βiξi = 0, βi ≥ 0; therefore \(0\leq \alpha _{i}\leq \frac {1}{\nu _{1}l^{+}}\) and θi = 0. If xi is a positive data point in R+, then ξi ≠ 0, so βi = 0, \(\alpha _{i}=\frac {1}{\nu _{1}l^{+}}\) and θi = 0. If xi is a positive data point in W+, then \(\alpha _{i}=\frac {1}{\nu _{1}l^{+}}\) and \(\theta _{i}=\frac {1}{\nu _{1}l^{+}}\). From the above, \(\frac {n_{2}}{\nu _{1}l^{+}}\leq {\sum }_{i\in I^{+}}\alpha _{i} \leq \frac {n_{1}}{\nu _{1}l^{+}}\) and \(-{\sum }_{i\in I^{+}}\theta _{i}=-\frac {n_{3}}{\nu _{1}l^{+}}\), so \(\frac {n_{2}-n_{3}}{\nu _{1}l^{+}}\leq {\sum }_{i\in I^{+}}(\alpha _{i}-\theta _{i}) \leq \frac {n_{1}-n_{3}}{\nu _{1}l^{+}}\). From (35), \(\frac {n_{2}-n_{3}}{\nu _{1}l^{+}}\leq 1 \leq \frac {n_{1}-n_{3}}{\nu _{1}l^{+}}\), and therefore \(\frac {n_{2}-n_{3}}{l^{+}}\leq \nu _{1} \leq \frac {n_{1}-n_{3}}{l^{+}}\). □

Proposition 2

Let m1 represent the number of negative samples in S−, B− and R−, m2 represent the number of negative samples in S− and B−, and m3 represent the number of negative samples in S−. Thus, we can obtain \(\frac {m_{2}-m_{3}}{l^{-}}\leq \nu _{2} \leq \frac {m_{1}-m_{3}}{l^{-}}\).

Proof

If xj (j = 1,2,…,l−) is a negative data point in S−, then ηj ≠ 0, and by the KKT condition λjηj = 0 we get λj = 0; from (42), \(\gamma _{j}=\frac {1}{\nu _{2}l^{-}}\), and from (29), \(\delta _{j}=\frac {1}{\nu _{2}l^{-}}\). If xj is a negative data point in B−, then \(\gamma _{j}=\frac {1}{\nu _{2}l^{-}}\) and δj = 0. If xj is a negative data point in R−, then ηj = 0, so λj ≥ 0, \(0\leq \gamma _{j}\leq \frac {1}{\nu _{2}l^{-}}\) and δj = 0. If xj is a negative data point in W−, then ηj = 0, γj = 0 and δj = 0. From the above, \(\frac {m_{2}}{\nu _{2}l^{-}}\leq {\sum }_{j\in I^{-}}\gamma _{j} \leq \frac {m_{1}}{\nu _{2}l^{-}}\) and \(-{\sum }_{j\in I^{-}}\delta _{j}=-\frac {m_{3}}{\nu _{2}l^{-}}\), so \(\frac {m_{2}-m_{3}}{\nu _{2}l^{-}}\leq {\sum }_{j\in I^{-}}(\gamma _{j}-\delta _{j}) \leq \frac {m_{1}-m_{3}}{\nu _{2}l^{-}}\). From (43), \(\frac {m_{2}-m_{3}}{\nu _{2}l^{-}}\leq 1 \leq \frac {m_{1}-m_{3}}{\nu _{2}l^{-}}\), and therefore \(\frac {m_{2}-m_{3}}{l^{-}}\leq \nu _{2} \leq \frac {m_{1}-m_{3}}{l^{-}}\). □

From Propositions above, we can acquire the range of ν1 and ν2: 0 < ν1,ν2 < 1.

4 Multicategory Ramp loss maximum margin of twin spheres support vector machine

Because multi-class classification problems are more common, we extend Ramp-MMTSSVM to multi-class classification by the RVO strategy. For a K-class problem, we choose the k th class as the negative class and the remaining classes as the positive class to generate one Ramp-MMTSSVM, so K classifiers are generated in total. The optimization problems of MRMMTSSVM are as follows:

$$\begin{array}{@{}rcl@{}} \min_{C_{k},{R_{k}^{2}},\xi_{i}} &&{R_{k}^{2}} - \frac{\nu_{k}}{l_{k}}\sum\limits_{j \in A_{k}} \| \phi(\mathbf{x}_{j}) - C_{k} \|^{2} + \frac{1}{\nu_{1k}l_{K-1}}\sum\limits_{i\in B_{k}} \xi_{i}\\ &&+ \sum\limits_{i\in B_{k}} \theta_{i}\left( {R_{k}^{2}}-\|\phi(\mathbf{x}_{i})-C_{k}\|^{2}\right)\\ s.t.&&\|\phi(\mathbf{x}_{i})-C_{k}\|^{2}\leq {R_{k}^{2}}+\xi_{i},\\ && \xi_{i}\geq 0, i\in B_{k}. \end{array} $$
(46)

and

$$\begin{array}{@{}rcl@{}} \min_{{\rho_{k}^{2}},\eta_{j}} &&{R_{k}^{2}} - {\rho_{k}^{2}} + \frac{1}{\nu_{2k}l_{k} }\sum\limits_{j\in A_{k}}\eta_{j}\\ &&+\sum\limits_{j\in A_{k}}\delta_{j}\left( \|\phi\left( \mathbf{x}_{j}\right)-C_{k}\|^{2}-\left( {R_{k}^{2}}+{\rho_{k}^{2}}\right)\right)\\ s.t. &&\|\phi\left( \mathbf{x}_{j}\right)-C_{k}\|^{2}\geq {R_{k}^{2}}+{\rho_{k}^{2}}-\eta_{j},\\ && \eta_{j}\geq 0, j\in A_{k}. \end{array} $$
(47)

where lk denotes the number of data points of k th class and lK− 1 denotes the number of data points of the rest classes, \(A_{k} \in R^{l_{k}\times n}(k = 1, \ldots , K)\), and \(B_{k}=[{A_{1}^{T}}, {\ldots } A_{k-1}^{T}, A_{k + 1}^{T}, {\ldots } {A_{K}^{T}}]^{T}\).

To solve each classifier, we also use the CCCP approach in the same way as in the binary classification case. Finally, we adopt the ‘voting’ scheme to determine the class of each new data point; a sketch of one possible tallying rule is given below.
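
The text only states that a ‘voting’ scheme is used; the following minimal sketch shows one plausible tally for the RVO outputs (the rule of voting for all other classes when a classifier answers +1 is our assumption, not necessarily the exact scheme used in the experiments):

```python
import numpy as np

def rvo_vote(predictions, K):
    """Tally votes from the K RVO classifiers for one test point.

    predictions[k] = output of the k-th Ramp-MMTSSVM, +1 for the 'rest'
    (positive) side and -1 for the k-th (negative) class."""
    votes = np.zeros(K)
    for k, pred in enumerate(predictions):
        if pred == -1:
            votes[k] += 1                  # classifier k claims the point for class k
        else:
            votes[np.arange(K) != k] += 1  # the point looks like one of the other classes
    return int(np.argmax(votes))
```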

5 Numerical experiments

In this section, we conduct experiments on one artificial dataset and twenty benchmark datasets to demonstrate the validity of the algorithms we proposed.

5.1 Experiments on one artificial dataset

We generate a 2-D artificial dataset containing 80 positive data points and 10 negative data points. The positive data points are generated from N(0,0,3,3) and the negative data points from N(3,3,2,2). The result of linear Ramp-MMTSSVM on this dataset is shown in Fig. 2.

Fig. 2 Illustration of linear Ramp-MMTSSVM

In this part, we compare linear Ramp-MMTSSVM and linear MMTSSVM on the artificial dataset, as shown in Figs. 2 and 3. From the figures, it is easy to see that the small sphere captures more positive data points in Ramp-MMTSSVM than in MMTSSVM, which indicates that Ramp-MMTSSVM performs better than MMTSSVM on this artificial dataset.

Fig. 3 Illustration of linear MMTSSVM

To verify that the Ramp loss can reduce the effect of outliers, we add five noise points to the positive and negative classes, respectively. After adding the noise, the performance of Ramp-MMTSSVM and MMTSSVM is shown in Figs. 4 and 5, respectively. From Fig. 5, it is obvious that the boundary of MMTSSVM expands outward, whereas the boundary of Ramp-MMTSSVM in Fig. 4 hardly changes. That is to say, when handling noise and outliers, Ramp-MMTSSVM performs better than MMTSSVM.

Fig. 4 Illustration of linear Ramp-MMTSSVM when handling the noise data points

Fig. 5 Illustration of linear MMTSSVM when handling the noise data points

In addition, to better illustrate the property of ν1 discussed in Section 3, we plot Fig. 6, where the x-axis represents the values of the parameter ν1. In the figure, the black dotted line shows how the proportion of n1 − n3 among the positive data points changes as ν1 varies, the red solid line represents the value of ν1 itself, and the blue dotted line shows how the proportion of n2 − n3 among the positive data points changes.

Fig. 6 The property of parameter ν1

From the figure, we can see that the parameter ν1 is a lower bound on the fraction of n1 − n3 in the positive data points and an upper bound on the fraction of n2 − n3 in the positive data points, which is helpful for selecting the range of the parameters.

5.2 Experiments on benchmark datasets

In this part, we conduct experiments on twenty benchmark datasets collected from the UCI machine learning repository. The detailed information of the twenty datasets is shown in Table 1.

Table 1 The characteristics of twenty benchmark datasets

In the binary classification case, Ramp-MMTSSVM is compared with THSVM, SSLM and MMTSSVM. In the multi-class classification case, MRMMTSSVM is compared with OVO-THSVM, THKSVM and OVR-MMTSSVM.

We adopt 5-fold cross-validation in each experiment to make the results more convincing. All experiments are carried out in Matlab R2014a on Windows 7, running on a PC with an Intel Core i3-4160 CPU (3.60 GHz) and 4.00 GB of RAM.

5.2.1 Parameters selection

The approaches mentioned above, i.e., THSVM, SSLM, MMTSSVM, THKSVM, Ramp-MMTSSVM and MRMMTSSVM, all depend heavily on parameter selection. In these experiments, we choose the Gaussian kernel function k(xi,xj) = exp(−∥xi − xj∥²/σ²) and obtain the optimal parameter values by the grid search method. The Gaussian kernel parameter σ is selected from the set {2^i | i = −3,−1,…,7}.
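
A minimal sketch of such a grid search with 5-fold cross-validation is given below (fit_and_score is a hypothetical stand-in for training and testing one Ramp-MMTSSVM; the sketch is illustrative rather than the Matlab code actually used):

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

sigmas = [2.0 ** i for i in range(-3, 8, 2)]     # {2^-3, 2^-1, ..., 2^7}
nus = [0.1 * i for i in range(1, 10)]            # {0.1, ..., 0.9}

def grid_search(X, y, fit_and_score):
    """Pick (sigma, nu) by 5-fold cross-validated accuracy."""
    best_params, best_acc = None, -np.inf
    for sigma, nu in product(sigmas, nus):
        folds = KFold(n_splits=5, shuffle=True, random_state=0).split(X)
        acc = np.mean([fit_and_score(X[tr], y[tr], X[te], y[te], sigma, nu)
                       for tr, te in folds])
        if acc > best_acc:
            best_params, best_acc = (sigma, nu), acc
    return best_params, best_acc
```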

In THSVM, to reduce the computational complexity, we set c1 = c2 and ν1 = ν2, which are chosen from the sets {2^i | i = 0,2,…,8} and {0.1,0.2,…,0.9}, respectively.

In SSLM, we set ν1 = ν2, which are selected from the set {0.001,0.01} and the range of ν is {0.1,0.2,…,0.9}.

In MMTSSVM, we set ν1 = ν2. All the ν are selected from the set {0.1,0.2,…,0.9}.

In THKSVM, the range of νk is {2^i | i = −4,−2,…,4}, and dk is selected from the set {2^i | i = 1,3,…,9}.

In our approaches, we set k1 = k2 with range {0.5, 1, 1.5, 2}. The ranges of the other parameters are the same as in MMTSSVM.

5.2.2 Result analysis

In Tables 2 and 3, ‘Accuracy’ denotes the average testing accuracy plus or minus the corresponding standard deviation, and ‘Time’ denotes the mean running time, including training time and testing time.

Table 2 The experimental results for binary classification case on ten benchmark datasets
Table 3 The experimental results for multi-class classification case on ten benchmark datasets

From Table 2, we can see that in the binary classification case the experimental results of our approach are better than those of the other algorithms on most datasets, i.e., the Sonar, Breast cancer, Heart, Pima, Spectf heart and Abalone datasets. MMTSSVM obtains the best performance on the Ionosphere, Banknote and Monks datasets; however, the accuracy of our algorithm is only slightly lower than that of MMTSSVM and higher than those of THSVM and SSLM. THSVM performs best on the Liver disorder dataset, where our approach ranks second and is only slightly worse. In the binary classification case, SSLM never achieves the best performance on the ten datasets.

From Table 3, we can see that in the multi-class classification case MRMMTSSVM has the best experimental results on most datasets, i.e., the Soybean, Ecoli, Optidigits, Hayes-roth, Teaching, Balance and Image datasets. On the Seeds dataset, THKSVM and MRMMTSSVM achieve the same highest accuracy. THKSVM obtains the best performance on the Iris dataset, where MRMMTSSVM ranks second and is only slightly worse. On the Cmc dataset, OVO-THSVM obtains the best result; although the accuracy of MRMMTSSVM is not the highest there, it is better than those of THKSVM and OVR-MMTSSVM. In the multi-class classification case, OVR-MMTSSVM never obtains the highest accuracy on the ten datasets.

In conclusion, in both the binary and multi-class classification cases, our algorithms perform better than the other algorithms.

5.2.3 Friedman test

To analyse the experimental performance of the eight algorithms statistically, we introduce the Friedman test. Tables 4 and 5 show the average ranks of the four algorithms in the binary and multi-class classification cases, respectively.

Table 4 Average ranks of four algorithms in binary classification case
Table 5 Average ranks of four algorithms in multi-class classification case

Under the null-hypothesis that all the algorithms are equivalent, one can compute the Friedman statistic according to:

$$\begin{array}{@{}rcl@{}} {\chi_{F}^{2}}=\frac{12N}{k(k + 1)}\left[\sum\limits_{j}{R_{j}^{2}}-\frac{k(k + 1)^{2}}{4}\right], \end{array} $$
(48)

where \(R_{j}=\frac {1}{N}{\sum }_{i}{r_{i}^{j}}\) and \({r_{i}^{j}}\) denotes the rank of the j th of k algorithms on the i th of N datasets. Friedman’s \({\chi _{F}^{2}}\) is undesirably conservative, and a better statistic is

$$\begin{array}{@{}rcl@{}} F_{F}=\frac{(N-1){\chi_{F}^{2}}}{N(k-1)-{\chi_{F}^{2}}}, \end{array} $$
(49)

which is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.
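
For reference, (48) and (49) can be computed directly from the rank matrix; a minimal sketch follows (the example ranks are random and only demonstrate the call):

```python
import numpy as np

def friedman_statistics(ranks):
    """chi_F^2 from (48) and F_F from (49), given an N x k matrix of ranks."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)                   # average rank of each algorithm
    chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
    return chi2_F, F_F

# example: 4 algorithms ranked on 10 datasets (random ranks, just to show the call)
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation(4) + 1 for _ in range(10)], dtype=float)
print(friedman_statistics(ranks))
```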

According to (48) and (49), in the binary classification case we obtain \({\chi _{F}^{2}}= 18.9300\) and FF = 15.3902, where FF follows the F-distribution with (3, 27) degrees of freedom. The critical value of F(3,27) is 2.96 at significance level α = 0.05, 3.65 at α = 0.025 and 4.60 at α = 0.01. The value of FF is much larger than these critical values, which means that the differences among the algorithms in the binary classification case are significant. In addition, Table 4 shows that the average rank of our approach is much lower than those of the remaining algorithms, which means our approach is more effective than the other three algorithms in the binary classification case.

Similarly, in the multi-class classification case we obtain \({\chi _{F}^{2}}= 16.1700\) and FF = 10.5228. The value of FF is again much larger than the critical value. From Table 5, we can see that the average rank of MRMMTSSVM is the lowest. Both results indicate that MRMMTSSVM performs better than the other three algorithms.

In conclusion, our approaches are more effective than the other algorithms in both the binary and multi-class classification cases.

5.3 The effect of parameters on the performance

In Ramp-MMTSSVM and MRMMTSSVM, there are three main parameters, i.e., k, ν and σ. Different values of these parameters have a great influence on the experimental accuracy. Therefore, in this part we conduct experiments on the Sonar dataset to investigate their influence on the performance. The results are shown in Figs. 7, 8 and 9.

Fig. 7 The effect of parameter k on the performance

Fig. 8 The effect of parameter ν on the performance

Fig. 9 The effect of parameter σ on the performance

The x-axis of each figure denotes the values of the corresponding parameter, and the y-axis denotes the resulting accuracy. From Fig. 7, we can see that on the Sonar dataset the accuracy increases as k goes from 0.5 to 1.5 and then decreases at k = 2.0. Figure 8 shows that the accuracy is stable when ν is small and changes greatly as ν increases. Figure 9 shows that the accuracy changes greatly when σ is small and tends to become stable as σ increases. In conclusion, the three parameters k, ν and σ all influence the experimental results of Ramp-MMTSSVM and MRMMTSSVM, and their proper ranges differ across datasets.

6 Conclusions

In this paper, we propose a Ramp loss maximum margin of twin spheres support vector machine (Ramp-MMTSSVM). It reduces the sensitivity to outliers and improves generalization performance. Because the resulting problem is a non-differentiable non-convex optimization problem, we adopt the CCCP approach to solve it. Furthermore, we prove the properties of the parameters ν1 and ν2 and verify them with an artificial experiment. In addition, we extend Ramp-MMTSSVM to multi-class classification problems by the RVO strategy. The experimental results on twenty benchmark datasets show that our approaches perform better than the other algorithms in both the binary and multi-class classification cases. During the experiments, we found that the computational speed of our algorithms is somewhat slower than that of the other algorithms; therefore, how to accelerate the computation is our future work.