1 Introduction

An increasing number of machine learning approaches have become prevalent in recent years, such as artificial neural networks [1], multi-objective optimization [2] and the support vector machine (SVM) [3], each with its own pros and cons. The SVM, proposed by Vapnik, is based on statistical learning theory. Guided by the VC dimension and the structural risk minimization principle, it achieves good generalization ability through a compromise between empirical risk and model complexity. It is very effective for small-sample, nonlinear and high-dimensional problems, and it also mitigates the curse of dimensionality and over-fitting to some extent. Nowadays, SVM has been applied to many fields, including text classification [4], disease detection [5, 6], driver fatigue detection [7] and object detection [8], etc. Nevertheless, SVM still has some shortcomings, and many improvements have been put forward in recent years.

One drawback of classic SVM is its slow computational speed. To alleviate this problem, Jayadeva et al. [9] proposed the twin SVM (TSVM) for binary classification problems. TSVM generates two nonparallel hyperplanes by solving two smaller-sized quadratic programming problems (QPPs) rather than one large QPP, which in theory makes it approximately four times faster than classic SVM. Meanwhile, TSVM requires the data points of each class to be as close as possible to one hyperplane and as far as possible from the other. Many algorithms [10,11,12,13,14,15,16,17] have been proposed on the basis of TSVM.

Most TSVM-based algorithms require a matrix inverse operation. However, the matrix may not be invertible, and computing a matrix inverse is expensive. To remedy this problem, the twin hypersphere support vector machine (THSVM) [18] was proposed. It finds two independent hyperspheres rather than two nonparallel hyperplanes by solving a pair of smaller-sized QPPs, and it requires each hypersphere to capture as many data points of one class as possible. Because two centers and two radii must be found, THSVM is still computationally costly. In addition, it cannot handle imbalanced data classification problems well.

To decrease the computational cost and handle imbalanced data classification more efficiently, Xu [19] proposed the maximum margin of twin spheres support vector machine (MMTSSVM). It finds two homocentric hyperspheres rather than two independent hyperspheres by solving one QPP and one linear programming problem (LPP).

All the algorithms mentioned above use the Hinge loss function, which makes the models sensitive to outliers. Outliers typically receive the largest Hinge loss values, so the decision hyperplane is drawn toward them, which degrades generalization performance. Hence, Huang et al. [20] proposed the ramp loss support vector machine (RSVM). With the Ramp loss function, outliers receive a fixed loss value, so RSVM is less sensitive to them. Because RSVM leads to a non-convex optimization problem, the authors adopt the effective Concave-Convex Procedure (CCCP) [21] to solve it. Several algorithms, such as RLSSVM [22], have since been proposed on the basis of RSVM.

In daily life, multi-class classification problems are more common. Researchers have proposed a great number of multi-class classification strategies, such as One-vs.-One (OVO) [23, 24], One-vs.-Rest (OVR) [25, 26], Rest-vs.-One (RVO) [27, 28] and One-vs.-One-vs.-Rest (OVOVR) [29,30,31]. For a K-class problem, OVO combines the K classes in pairs: the i-th class is chosen as the positive class and the j-th class as the negative class to generate one classifier, so K(K−1)/2 classifiers are obtained in total. In the OVR method, the i-th class is chosen as the positive class and the remaining classes as the negative class, which yields K classifiers. RVO is the opposite of OVR: it takes the i-th class as the negative class and the remaining classes as the positive class, also yielding K classifiers. OVOVR is similar to OVO in that it also combines the classes in pairs, selecting the i-th class as the positive class, the j-th class as the negative class and the remaining classes as the rest class for each classifier, so it also generates K(K−1)/2 classifiers. All of these strategies determine the class of a new data point by a ‘voting’ scheme.

To improve the generalization performance of MMTSSVM, we propose a Ramp loss maximum margin of twin spheres support vector machine (Ramp-MMTSSVM) in this article. We adopt MMTSSVM as the core of our algorithm because of its fast computational speed and its effectiveness on imbalanced data classification problems. Meanwhile, we replace the Hinge loss function with the Ramp loss function to decrease the sensitivity of MMTSSVM to outliers. After introducing the Ramp loss function, the optimization problem becomes a non-differentiable non-convex problem, which cannot be solved by quadratic programming directly; therefore, we use the CCCP approach to deal with it. We also discuss the properties of the parameters in Ramp-MMTSSVM, which helps determine their ranges. In addition, we extend Ramp-MMTSSVM to multi-class classification problems by the RVO strategy, termed the multicategory Ramp loss maximum margin of twin spheres support vector machine (MRMMTSSVM). On twenty benchmark datasets, we compare Ramp-MMTSSVM with THSVM, SSLM and MMTSSVM, and MRMMTSSVM with OVO-THSVM, THKSVM and OVR-MMTSSVM. The experimental results indicate that our algorithms outperform the other algorithms.

The paper is organized as follows. Section 2 gives a brief introduction to MMTSSVM and RSVM. In Section 3, we introduce Ramp-MMTSSVM. Section 4 presents our multi-class algorithm, MRMMTSSVM. Artificial experiments and experiments on twenty benchmark datasets that verify the effectiveness of our algorithms are reported in Section 5. Section 6 concludes the paper.

2 Background

In this section, we review the basics of MMTSSVM and RSVM.

To make the presentation easier to follow, we first introduce some notation. Given a training set {(x1,y1),…,(xl,yl)}, where \(\mathbf {x}_{i} \in \mathcal {R}^{m}\) and yi ∈{+ 1,− 1}, i = 1,…,l, the set contains \(l^{+}\) positive data points and \(l^{-}\) negative data points. In addition, \(I^{+}\) and \(I^{-}\) denote the positive and negative index sets, respectively. ϕ is a nonlinear mapping from the input space into a higher-dimensional feature space, and the kernel function K(xi,xj) = (ϕ(xi) ⋅ ϕ(xj)) is selected in advance.
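
For example, with the Gaussian kernel used later in the experiments, the kernel matrix can be built as follows (a minimal sketch; the function name and the value of σ are illustrative):

```python
import numpy as np

def rbf_kernel_matrix(X, Z, sigma=1.0):
    """Gaussian kernel K[i, j] = exp(-||X[i] - Z[j]||^2 / sigma^2)."""
    # squared distances via ||x - z||^2 = ||x||^2 - 2 x.z + ||z||^2
    sq = (X ** 2).sum(1)[:, None] - 2.0 * X @ Z.T + (Z ** 2).sum(1)[None, :]
    return np.exp(-sq / sigma ** 2)

X = np.random.randn(5, 2)              # five 2-D points
K = rbf_kernel_matrix(X, X, sigma=2.0)
```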

2.1 Maximum margin of twin spheres support vector machine

MMTSSVM is good at dealing with imbalanced data classification problems. It aims to find two homocentric spheres: the small sphere should cover as many positive data points as possible, while the large sphere should push out as many negative data points as possible. In addition, it follows the maximum margin principle, which requires the margin between the small sphere and the large sphere to be as large as possible. The optimization problems of MMTSSVM are as follows,

$$\begin{array}{@{}rcl@{}} \min \limits_{R^{2}, C, \xi_{i}} &&R^{2} - \frac{\nu}{l^{-}} \sum\limits_{j \in I^{-}} \| \phi (\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1} l^{+}} \sum\limits_{i \in I^{+}} \xi_{i}\\ s.t. &&\| \phi (\mathbf{x}_{i}) - C \|^{2} \leq R^{2} + \xi_{i},\\ && \xi_{i} \geq 0, i \in I^{+}, \end{array} $$
(1)

and

$$\begin{array}{@{}rcl@{}} \min \limits_{\rho^{2}, \eta_{j}} &&R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-}} \sum\limits_{j \in I^{-}} \eta_{j}\\ s.t. &&\|\phi (\mathbf{x}_{j}) - C \|^{2} \geq R^{2} + \rho^{2} - \eta_{j},\\ &&\eta_{j} \geq 0, j \in I^{-}, \end{array} $$
(2)

where ν,ν1 and ν2 are parameters chosen a priori. C and R are the center and the radius of the small sphere, respectively. \(\sqrt {R^{2}+\rho ^{2}}\) is the radius of the large sphere.

The decision function of MMTSSVM is shown as follows,

$$\begin{array}{@{}rcl@{}} f (\mathbf{x}) = \left\{ \begin{array}{rcl} 1, & & {if \| \phi (\mathbf{x}) -C \| < \frac{R + \sqrt{R^{2} + \rho^{2}}}{2}}\\ -1, & & {else}. \end{array} \right. \end{array} $$
(3)
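
As an illustration of decision rule (3), a minimal sketch follows; here dist_sq stands for ∥ϕ(x) − C∥², which in practice is computed through the kernel expansion of the center C (the function name is ours):

```python
import numpy as np

def mmtssvm_predict(dist_sq, R_sq, rho_sq):
    """Decision rule (3).

    dist_sq : array of ||phi(x) - C||^2 for the test points (assumed precomputed
              from the kernel expansion of the center C)
    R_sq    : R^2, squared radius of the small sphere
    rho_sq  : rho^2, so the large sphere has radius sqrt(R^2 + rho^2)
    """
    threshold = 0.5 * (np.sqrt(R_sq) + np.sqrt(R_sq + rho_sq))
    return np.where(np.sqrt(dist_sq) < threshold, 1, -1)
```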

2.2 Ramp loss SVM

Classic SVM employs the Hinge loss function \(H_{s}(z)=\max (0, s-z)\), where s indicates the position of the hinge point, to penalize examples classified with an insufficient margin. With the Hinge loss function, the objective function can be written as follows,

$$\begin{array}{@{}rcl@{}} \min_{\mathbf{w}, b} \quad \frac{1}{2} \| \mathbf{w} \|^{2} + c\sum\limits_{i = 1}^{l} H_{1} \left( y_{i}f(\mathbf{x}_{i})\right), \end{array} $$
(4)

where \(f(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}+b\) is the decision function.

From Fig. 1, we can see that outliers tend to receive the largest loss values under the Hinge loss function. Therefore, outliers have a negative influence on constructing the hyperplane, and the model is sensitive to them. To mitigate this problem, researchers proposed the Ramp loss function, also known as the robust Hinge loss. The Ramp loss function is defined as follows,

$$\begin{array}{@{}rcl@{}} R_{s}(z) = \left\{ \begin{array}{rcl} 0, & & {z > 1}\\ 1 - z, & & {s \leq z \leq 1}\\ 1 - s, & & {z < s}, \end{array} \right. \end{array} $$
(5)

where s < 1 is given a priori. As shown in Fig. 1, the Ramp loss function can be decomposed into the sum of a convex Hinge loss H1(z) and a concave loss −Hs(z), i.e., Rs(z) = H1(z) − Hs(z). After introducing the Ramp loss function, RSVM is formulated as follows,

$$\begin{array}{@{}rcl@{}} &&\min_{\mathbf{w}, b} \quad \frac{1}{2}\|\mathbf{w}\|^{2} + c\sum\limits_{i = 1}^{l} R_{s}\left( y_{i}f(\mathbf{x}_{i})\right)\\ &=&\frac{1}{2}\|\mathbf{w}\|^{2} + c\sum\limits_{i = 1}^{l} H_{1}\left( y_{i}f(\mathbf{x}_{i})\right) - c\sum\limits_{i = 1}^{l} H_{s}\left( y_{i}f(\mathbf{x}_{i})\right). \end{array} $$
(6)

This is a non-differentiable non-convex problem, which is solved by CCCP approach.
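
As a quick illustration (a minimal sketch; the variable names are ours), the Ramp loss can be computed either from its piecewise definition (5) or as the difference H1(z) − Hs(z), and both give the same capped loss for an outlier:

```python
import numpy as np

def hinge(z, s=1.0):
    """Hinge loss with hinge point s: H_s(z) = max(0, s - z)."""
    return np.maximum(0.0, s - z)

def ramp(z, s=-1.0):
    """Ramp loss (5): equals H_1(z) - H_s(z) and is capped at 1 - s."""
    return np.minimum(hinge(z, 1.0), 1.0 - s)

z = np.array([2.0, 0.5, -3.0])           # margins y_i * f(x_i); -3.0 mimics an outlier
print(ramp(z, s=-1.0))                   # [0.  0.5 2. ]  -- outlier loss capped at 1 - s = 2
print(hinge(z, 1.0) - hinge(z, -1.0))    # identical values, verifying R_s = H_1 - H_s
```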

Fig. 1 When s = −1, the geometric decomposition diagram of the Ramp loss function

3 Ramp loss maximum margin of twin spheres support vector machine

Because MMTSSVM is sensitive to outliers, we substitute the Hinge loss function with the Ramp loss function to improve this problem.

We set \(u_{1}=R^{2}-\|\phi (\mathbf {x}_{i})-C\|^{2}\) and \(u_{2}=\|\phi (\mathbf {x}_{j})-C\|^{2}-(R^{2}+\rho ^{2})\). Since the Hinge loss function is \(H_{s}(z)=\max (0,s-z)\), the objective functions of problems (1) and (2) are equal to

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} \quad R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}} H(u_{1}), \end{array} $$
(7)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} } H(u_{2}). \end{array} $$
(8)

3.1 Primal Formulations

After replacing the Hinge loss function with the Ramp loss function, we get the primal formulations as follows,

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} \quad R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}} Q(u_{1}), \end{array} $$
(9)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} } Q(u_{2}), \end{array} $$
(10)

where Q(u1) and Q(u2) are Ramp-type loss functions, defined as follows,

$$\begin{array}{@{}rcl@{}} Q(u_{1})=\left\{ \begin{array}{rcl} 0, & & {u_{1}\geq0}\\ -u_{1}, & & {-k_{1}R^{2}<u_{1}<0}\\ k_{1}R^{2}, & & {u_{1}\leq-k_{1}R^{2}}, \end{array} \right. \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} Q(u_{2})=\left\{ \begin{array}{rcl} 0, & & {u_{2}\geq0}\\ -u_{2}, & & {-k_{2}(R^{2}+\rho^{2})<u_{2}<0}\\ k_{2}(R^{2}+\rho^{2}), & & {u_{2}\leq-k_{2}(R^{2}+\rho^{2})}, \end{array} \right. \end{array} $$
(12)

where k1 and k2 are chosen a priori.
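
For clarity, (11) and (12) simply clip the Hinge-type losses −u1 and −u2 at the caps k1R² and k2(R² + ρ²); a minimal sketch (the function names are ours) is:

```python
import numpy as np

def Q1(u1, k1, R_sq):
    """Loss (11): -u1 on (-k1*R^2, 0), zero for u1 >= 0, capped at k1*R^2."""
    return np.clip(-u1, 0.0, k1 * R_sq)

def Q2(u2, k2, R_sq, rho_sq):
    """Loss (12): same shape, capped at k2*(R^2 + rho^2)."""
    return np.clip(-u2, 0.0, k2 * (R_sq + rho_sq))
```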

Then, the problems (9) and (10) can be rewritten as

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} &&R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2}\\ &&+ \frac{1}{\nu_{1}l^{+}} Q\left( R^{2}-\|\phi(\mathbf{x}_{i})-C\|^{2}\right), \end{array} $$
(13)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} } Q\left( \|\phi(\mathbf{x}_{j}) - C\|^{2} - (R^{2}+\rho^{2})\right). \end{array} $$
(14)

From (11) and (12), we can see that the Ramp loss function limits the maximum loss value, which reduces the influence of outliers on the optimal solution to the problem. Therefore, the model is less sensitive to outliers.

3.2 Dual problems

The Ramp loss can be decomposed into the sum of a convex Hinge loss and a concave function, i.e.,

$$\begin{array}{@{}rcl@{}} Q(u_{i})=H_{1}(u_{i})-H_{2}(u_{i}), \end{array} $$
(15)

where \(H_{1}(u_{i})=\max (-u_{i},0)\) for i = 1,2, \(H_{2}(u_{1})=\max (-u_{1}-k_{1}R^{2},0)\) and \(H_{2}(u_{2})=\max (-u_{2}-k_{2}(R^{2}+\rho ^{2}),0)\).

Problems (13) and (14) can then be rewritten as follows,

$$\begin{array}{@{}rcl@{}} \min_{R^{2}, C} &&\underbrace{R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}H_{1}(u_{1})}_{J_{vex}}\\ &&\underbrace{-\frac{1}{\nu_{1}l^{+}}H_{2}(u_{1})}_{J_{cav}}, \end{array} $$
(16)

and

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2}} \quad \underbrace{R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} }H_{1}(u_{2})}_{J_{vex}} \underbrace{-\frac{1}{\nu_{2}l^{-} }H_{2}(u_{2})}_{J_{cav}}. \end{array} $$
(17)

We can see that QPP (16) and LPP (17) both consist of a convex part and a concave part, so they cannot be solved directly by quadratic programming. For this reason, we adopt the CCCP approach, which has a simple iterative structure, to solve them.

We first consider QPP (16). Its objective function is the sum of a convex function u(x) and a concave function v(x). CCCP solves such a problem by iteratively solving a sequence of convex problems in which the concave part is linearized, i.e.,

$$\begin{array}{@{}rcl@{}} \mathbf{x}^{t + 1}=argmin_{\mathbf{x}}{u(\mathbf{x})+\mathbf{x}^{T}\nabla v(\mathbf{x}^{t})}, \end{array} $$
(18)

where t denotes the iteration index. The convex part of QPP (16) is shown as follows,

$$\begin{array}{@{}rcl@{}} J_{vex}(C,R^{2}) = R^{2} - \frac{\nu}{l^{-}}\!\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}H_{1}(u_{1}),\\ \end{array} $$
(19)

and the concave part is shown as follows,

$$\begin{array}{@{}rcl@{}} J_{cav}= -\frac{1}{\nu_{1}l^{+}}H_{2}(u_{1}). \end{array} $$
(20)

Following (18), at each CCCP iteration the concave part Jcav is linearized at the current solution and the resulting convex surrogate is minimized. The surrogate problem can be written as follows:

$$\begin{array}{@{}rcl@{}} \min_{C,R^{2},\xi_{i}} &&R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}\sum\limits_{i\in I^{+}} \xi_{i}\\ &&+J_{cav}^{\prime}\left( R^{2} - \|\phi(\mathbf{x}_{i}) - C\|^{2}\right)\!\cdot\!\left( R^{2} - \|\phi(\mathbf{x}_{i}) - C\|^{2}\right)\\ s.t. &&\|\phi(\mathbf{x}_{i})-C\|^{2}\!\leq\! R^{2}+\xi_{i},\\ && \xi_{i}\geq 0, i\in I^{+}. \end{array} $$
(22)

To simplify the process, we introduce the following symbols,

$$\begin{array}{@{}rcl@{}} \theta_{i}&=&-\frac{1}{\nu_{1}l^{+}}\frac{\partial H_{2}(u_{1})}{\partial u_{1}}\\ &=&\left\{ \begin{array}{rcl} \frac{1}{\nu_{1}l^{+}}, & & {if \|\phi(\mathbf{x}_{i})-C\|^{2}> R^{2}+k_{1}R^{2}}\\ 0, & & {else}. \end{array}\right. \end{array} $$
(23)

Therefore, problem (22) can be rewritten as follows,

$$\begin{array}{@{}rcl@{}} \min_{C,R^{2},\xi_{i}} &&R^{2} - \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}\sum\limits_{i\in I^{+}} \xi_{i}\\ &&+ \sum\limits_{i\in I^{+}} \theta_{i}\left( R^{2}-\|\phi(\mathbf{x}_{i})-C\|^{2}\right)\\ s.t. &&\|\phi(\mathbf{x}_{i})-C\|^{2}\leq R^{2}+\xi_{i},\\ && \xi_{i}\geq 0, i\in I^{+}. \end{array} $$
(24)

Similarly, the CCCP structure of problem (17) can also be given. The convex part of this problem is shown as follows:

$$\begin{array}{@{}rcl@{}} J_{vex}(\rho^{2})=R^{2}-\rho^{2}+\frac{1}{\nu_{2}l^{-}}H_{1}(u_{2}), \end{array} $$
(25)

and the concave part is shown as follows:

$$\begin{array}{@{}rcl@{}} J_{cav}(\rho^{2})=-\frac{1}{\nu_{2}l^{-} }H_{2}(u_{2}). \end{array} $$
(26)

Analogously, linearizing the concave part of problem (17) at the current solution yields the following convex surrogate problem:

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2},\eta_{j}} &&R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} }\sum\limits_{j\in I^{-}}\eta_{j}+J_{cav}^{\prime}\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}\right.\\ &&-\left.\left( R^{2}+\rho^{2}\right)\right)\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}-\left( R^{2}+\rho^{2}\right)\right)\\ s.t. &&\|\phi(\mathbf{x}_{j})-C\|^{2}\geq R^{2}+\rho^{2}-\eta_{j},\\ && \eta_{j}\geq 0, j\in I^{-}. \end{array} $$
(28)

To simplify the process, we introduce the following symbols,

$$\begin{array}{@{}rcl@{}} \delta_{j}&=& -\frac{1}{\nu_{2}l^{-}}\frac{\partial H_{2}(u_{2})}{\partial u_{2}}\\ &=&\left\{ \begin{array}{rcl} \frac{1}{\nu_{2}l^{-}}, & & {if \|\phi(\mathbf{x}_{j})-C\|^{2} < R^{2}+\rho^{2}}\\ \quad & & {-k_{2}\left( R^{2}+\rho^{2}\right)}\\ 0, & & {else}. \end{array}\right. \end{array} $$
(29)

Therefore, problem (28) can be rewritten as follows,

$$\begin{array}{@{}rcl@{}} \min_{\rho^{2},\eta_{j}} &&R^{2} - \rho^{2} + \frac{1}{\nu_{2}l^{-} }\sum\limits_{j\in I^{-}}\eta_{j}\\ &&+\sum\limits_{j\in I^{-}}\delta_{j}\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}-\left( R^{2}+\rho^{2}\right)\right)\\ s.t. &&\|\phi(\mathbf{x}_{j})-C\|^{2}\geq R^{2}+\rho^{2}-\eta_{j},\\ && \eta_{j}\geq 0, j\in I^{-}. \end{array} $$
(30)

To solve the problem (24), we introduce the Lagrange function which is shown as follows,

$$\begin{array}{@{}rcl@{}} L_{1}&=&R^{2}- \frac{\nu}{l^{-}}\sum\limits_{j \in I^{-}} \| \phi(\mathbf{x}_{j}) - C \|^{2} + \frac{1}{\nu_{1}l^{+}}\sum\limits_{i\in I^{+}} \xi_{i}\\ &&+ \sum\limits_{i\in I^{+}} \theta_{i}\left( R^{2}-\|\phi(\mathbf{x}_{i})-C\|^{2}\right)\\ &&+\sum\limits_{i\in I^{+}}\alpha_{i}\left( \|\phi(\mathbf{x}_{i}) - C\|^{2}-R^{2} - \xi_{i}\right) - \sum\limits_{i\in I^{+}}\beta_{i}\xi_{i}. \end{array} $$
(31)

where αi ≥ 0 and βi ≥ 0 are Lagrange multipliers. Differentiating the Lagrangian L1 with respect to the variables R², C and ξi yields the following Karush-Kuhn-Tucker (KKT) conditions:

$$\begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial R^{2}}= 1+\sum\limits_{i\in I^{+}}\theta_{i}-\sum\limits_{i\in I^{+}}\alpha_{i}= 0, \end{array} $$
(32)
$$\begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial C}&= &\frac{2\nu}{l^{-}}\sum\limits_{j \in I^{-}}\left( \phi(\mathbf{x}_{j})-C\right)+ 2\sum\limits_{i\in I^{+}}\theta_{i}\left( \phi(\mathbf{x}_{i})\right.\\ &&-\left.C\right)-2\sum\limits_{i\in I^{+}}\alpha_{i}\left( \phi(\mathbf{x}_{i})-C\right)= 0, \end{array} $$
(33)
$$\begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial \xi_{i}}=\frac{1}{\nu_{1}l^{+}}-\alpha_{i}-\beta_{i}= 0. \end{array} $$
(34)

From (32), we can get

$$\begin{array}{@{}rcl@{}} \sum\limits_{i\in I^{+}}\left( \alpha_{i}-\theta_{i}\right)= 1. \end{array} $$
(35)

From (33), we can obtain the center C as follows:

$$\begin{array}{@{}rcl@{}} C=\frac{1}{1-\nu}\left( \sum\limits_{i\in I^{+}}(\alpha_{i}-\theta_{i})\phi(\mathbf{x}_{i})-\frac{\nu}{l^{-}}\sum\limits_{j\in I^{-}}\phi(\mathbf{x}_{j})\right). \end{array} $$
(36)

Then

$$\begin{array}{@{}rcl@{}} <C,C>&=& \frac{1}{(1-\nu)^{2}}\left( \sum\limits_{i\in I^{+}}\sum\limits_{j\in I^{+}}(\alpha_{i}-\theta_{i})\right.\\ &&\boldsymbol\cdot (\alpha_{j}-\theta_{j})K(\mathbf{x}_{i},\mathbf{x}_{j})\\ &&+\left( \frac{\nu}{l^{-}}\right)^{2}\sum\limits_{i\in I^{-}}\sum\limits_{j\in I^{-}}K(\mathbf{x}_{i},\mathbf{x}_{j})\\ &&-\left.\frac{2\nu}{l^{-}}\sum\limits_{i\in I^{+}}\sum\limits_{j\in I^{-}}(\alpha_{i}-\theta_{i})K(\mathbf{x}_{i},\mathbf{x}_{j})\right). \end{array} $$
(37)

Finally, we derive the dual formulation as follows:

$$\begin{array}{@{}rcl@{}} \max_{\boldsymbol{\alpha}} &&-\sum\limits_{i\in I^{+}}\sum\limits_{j\in I^{+}}(\alpha_{i}-\theta_{i})(\alpha_{j}-\theta_{j})K(\mathbf{x}_{i},\mathbf{x}_{j})\\ &&+\sum\limits_{i\in I^{+}}(\alpha_{i}-\theta_{i})\left( \frac{2\nu}{l^{-}}\sum\limits_{j\in I^{-}}K(\mathbf{x}_{j},\mathbf{x}_{i})\right.\\ &&+\left.(1-\nu)K(\mathbf{x}_{i},\mathbf{x}_{i})\vphantom{\sum\limits_{j\in I^{-}}}\right)\\ s.t. &&0\leq \alpha_{i}\leq \frac{1}{\nu_{1}l^{+}}, \sum\limits_{i\in I^{+}}\left( \alpha_{i}-\theta_{i}\right)= 1. \end{array} $$
(38)
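
For illustration, the convex subproblem (38) for fixed θ is a standard QP; a minimal sketch using NumPy and the CVXOPT solver (the helper name solve_alpha and the arguments K_pp, K_pn for the positive-positive and positive-negative kernel blocks are our notation) might look as follows:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_alpha(K_pp, K_pn, theta, nu, nu1, l_pos, l_neg):
    """Solve the QPP (38) in alpha for fixed theta (one CCCP step).

    K_pp : kernel matrix among positive points (l+ x l+)
    K_pn : kernel matrix between positive and negative points (l+ x l-)
    """
    v = (2.0 * nu / l_neg) * K_pn.sum(axis=1) + (1.0 - nu) * np.diag(K_pp)
    # standard form: minimize (1/2) a' P a + q' a
    P = matrix(2.0 * K_pp)
    q = matrix(-(2.0 * K_pp @ theta + v).reshape(-1, 1))
    # box constraints 0 <= alpha_i <= 1 / (nu1 * l+)
    G = matrix(np.vstack([-np.eye(l_pos), np.eye(l_pos)]))
    h = matrix(np.hstack([np.zeros(l_pos),
                          np.full(l_pos, 1.0 / (nu1 * l_pos))]).reshape(-1, 1))
    # equality constraint sum_i (alpha_i - theta_i) = 1
    A = matrix(np.ones((1, l_pos)))
    b = matrix(np.array([[1.0 + theta.sum()]]))
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```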

In order to obtain the radius of the small sphere, we use the positive points xi whose Lagrange multipliers αi satisfy \(0<\alpha _{i}<\frac {1}{\nu _{1}l^{+}}\). Suppose the number of such points is np. According to the KKT condition \(\alpha_{i}(\|\phi(\mathbf{x}_{i})-C\|^{2}-R^{2}-\xi_{i})=0\), we can obtain the squared radius of the small sphere as

$$\begin{array}{@{}rcl@{}} R^{2}\! &=&\!\frac{1}{n_{p}}\sum\limits_{i = 1}^{n_{p}}\|\phi(\mathbf{x}_{i})-C\|^{2}=\frac{1}{n_{p}}\sum\limits_{i = 1}^{n_{p}}\left( \vphantom{\sum\limits_{j\in I^{-}}}K(\mathbf{x}_{i},\mathbf{x}_{i})+C^{2}\right.\\ && - \left.\frac{2}{1 - \nu}\!\!\left( \sum\limits_{j\in I^{+}}\!(\alpha_{j} - \theta_{j}){\kern-.5pt}K\!(\mathbf{x}_{i}{\kern-.5pt},{\kern-.5pt}\mathbf{x}_{j}{\kern-.5pt}) - \frac{\nu}{l^{-}}\!\sum\limits_{j\in I^{-}}\!K\!(\mathbf{x}_{i}{\kern-.5pt},{\kern-.5pt}\mathbf{x}_{j}{\kern-.5pt})\!\right)\!\!\right)\!.\\ \end{array} $$
(39)

To solve the problem (30), we introduce the Lagrange function which is shown as follows,

$$\begin{array}{@{}rcl@{}} L_{2}&=&R^{2}-\rho^{2}+\frac{1}{\nu_{2}l^{-}}\sum\limits_{j\in I^{-}}\eta_{j}+\sum\limits_{j\in I^{-}}\delta_{j}\left( \| \phi(\mathbf{x}_{j}) - C \|^{2}\right.\\ &&-\left.(R^{2}+\rho^{2})\right)-\sum\limits_{j\in I^{-}}\gamma_{j}\left( \| \phi(\mathbf{x}_{j}) - C \|^{2} - R^{2} - \rho^{2}\right.\\ &&+\left.\eta_{j}\right)-\sum\limits_{j\in I^{-}}\lambda_{j}\eta_{j}. \end{array} $$
(40)

where γj ≥ 0 and λj ≥ 0 are Lagrange multipliers. Differentiating the Lagrangian L2 with respect to the variables ρ² and ηj yields the following KKT conditions:

$$\begin{array}{@{}rcl@{}} \frac{\partial L_{2}}{\partial \rho^{2}}=-1-\sum\limits_{j\in I^{-}}\delta_{j}+\sum\limits_{j\in I^{-}}\gamma_{j}= 0, \end{array} $$
(41)
$$\begin{array}{@{}rcl@{}} \frac{\partial L_{2}}{\partial \eta_{j}}=\frac{1}{\nu_{2}l^{-}}-\gamma_{j}-\lambda_{j}= 0. \end{array} $$
(42)

From (41), we can obtain

$$\begin{array}{@{}rcl@{}} \sum\limits_{j\in I^{-}}\left( \gamma_{j}-\delta_{j}\right)= 1. \end{array} $$
(43)

Finally, we derive the dual formulation as follows:

$$\begin{array}{@{}rcl@{}} \max_{\boldsymbol{\gamma}} &&-\sum\limits_{j\in I^{-}}\left( \gamma_{j}-\delta_{j}\right)\left( \| \phi(\mathbf{x}_{j}) - C \|^{2}\right)\\ s.t. &&\sum\limits_{j\in I^{-}}(\gamma_{j}-\delta_{j})= 1, 0\leq \gamma_{j}\leq \frac{1}{\nu_{2}l^{-}}. \end{array} $$
(44)
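
Similarly, for fixed δ the dual (44) is a linear program in γ; a minimal sketch using scipy.optimize.linprog (the names solve_gamma and dist_sq_neg are our notation) could be:

```python
import numpy as np
from scipy.optimize import linprog

def solve_gamma(dist_sq_neg, delta, nu2, l_neg):
    """Solve the LPP (44) in gamma for fixed delta (one CCCP step).

    dist_sq_neg : ||phi(x_j) - C||^2 for every negative point (length l-)
    """
    c = dist_sq_neg                          # minimize sum_j gamma_j * d_j (constant dropped)
    A_eq = np.ones((1, l_neg))
    b_eq = np.array([1.0 + delta.sum()])     # from sum_j (gamma_j - delta_j) = 1
    bounds = [(0.0, 1.0 / (nu2 * l_neg))] * l_neg
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x
```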

To obtain ρ², we use the negative points xj whose Lagrange multipliers γj satisfy \(0<\gamma _{j}<\frac {1}{\nu _{2}l^{-}}\). Suppose the number of such points is nn. According to the KKT condition \(\gamma_{j}(\|\phi(\mathbf{x}_{j})-C\|^{2}-R^{2}-\rho^{2}+\eta_{j})=0\), we obtain

$$\begin{array}{@{}rcl@{}} \rho^{2}=\frac{1}{n_{n}}\sum\limits_{j = 1}^{n_{n}}\left( \|\phi\left( \mathbf{x}_{j}\right)-C\|^{2}-R^{2}\right). \end{array} $$
(45)

The CCCP procedure for Ramp-MMTSSVM can then be summarized as follows: starting from initial weights (e.g., θi = 0 and δj = 0), we solve the convex subproblems (38) and (44), recover R² from (39) and ρ² from (45), update θi by (23) and δj by (29), and repeat until the weights no longer change.
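
Since the original algorithm box is not reproduced here, the following sketch outlines the whole CCCP loop; it reuses the solve_alpha and solve_gamma sketches above, and the kernel expansion of ∥ϕ(x) − C∥² follows (36)-(37). The initialization and the convergence check on the weights are our assumptions.

```python
import numpy as np

def dist_sq_to_center(K_xp, K_xn, K_xx_diag, beta, K_pp, K_pn, K_nn, nu, l_neg):
    """||phi(x) - C||^2 with C expanded as in (36) and <C, C> as in (37).

    beta = alpha - theta; K_xp / K_xn are the kernel blocks between the query
    points and the positive / negative training points."""
    cross = (K_xp @ beta - (nu / l_neg) * K_xn.sum(axis=1)) / (1.0 - nu)
    CC = (beta @ K_pp @ beta
          + (nu / l_neg) ** 2 * K_nn.sum()
          - 2.0 * nu / l_neg * beta @ K_pn.sum(axis=1)) / (1.0 - nu) ** 2
    return K_xx_diag - 2.0 * cross + CC

def ramp_mmtssvm_cccp(K_pp, K_pn, K_nn, nu, nu1, nu2, k1, k2, max_iter=50):
    """CCCP loop of Ramp-MMTSSVM (sketch)."""
    l_pos, l_neg = K_pp.shape[0], K_nn.shape[0]
    theta, delta = np.zeros(l_pos), np.zeros(l_neg)   # start from the Hinge-loss solution
    for _ in range(max_iter):
        alpha = solve_alpha(K_pp, K_pn, theta, nu, nu1, l_pos, l_neg)        # QPP (38)
        beta = alpha - theta
        d_pos = dist_sq_to_center(K_pp, K_pn, np.diag(K_pp), beta,
                                  K_pp, K_pn, K_nn, nu, l_neg)
        d_neg = dist_sq_to_center(K_pn.T, K_nn, np.diag(K_nn), beta,
                                  K_pp, K_pn, K_nn, nu, l_neg)
        sv_p = (alpha > 1e-8) & (alpha < 1.0 / (nu1 * l_pos) - 1e-8)
        R_sq = d_pos[sv_p].mean()                                            # (39)
        gamma = solve_gamma(d_neg, delta, nu2, l_neg)                        # LPP (44)
        sv_n = (gamma > 1e-8) & (gamma < 1.0 / (nu2 * l_neg) - 1e-8)
        rho_sq = (d_neg[sv_n] - R_sq).mean()                                 # (45)
        theta_new = np.where(d_pos > (1.0 + k1) * R_sq,
                             1.0 / (nu1 * l_pos), 0.0)                       # update (23)
        delta_new = np.where(d_neg < (1.0 - k2) * (R_sq + rho_sq),
                             1.0 / (nu2 * l_neg), 0.0)                       # update (29)
        if np.array_equal(theta_new, theta) and np.array_equal(delta_new, delta):
            break                               # weights unchanged: CCCP has converged
        theta, delta = theta_new, delta_new
    return beta, R_sq, rho_sq
```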

3.3 Properties of parameters ν1 and ν2

In Ramp-MMTSSVM, the parameters ν1 and ν2 have theoretical significance. In the following, we analyze their properties.

To make it easier to understand, we first give the following two definitions.

Definition 1

With respect to the small sphere, the positive data points can be divided into four sets: S+ = {i | ∥ϕ(xi) − C∥² < R²}, B+ = {i | ∥ϕ(xi) − C∥² = R²}, R+ = {i | R² < ∥ϕ(xi) − C∥² < R² + k1R²} and W+ = {i | ∥ϕ(xi) − C∥² ≥ R² + k1R²}.

Definition 2

With respect to the large sphere, the negative data points can be divided into four sets: S− = {j | ∥ϕ(xj) − C∥² ≤ R² + ρ² − k2(R² + ρ²)}, B− = {j | R² + ρ² − k2(R² + ρ²) < ∥ϕ(xj) − C∥² < R² + ρ²}, R− = {j | ∥ϕ(xj) − C∥² = R² + ρ²} and W− = {j | ∥ϕ(xj) − C∥² > R² + ρ²}.

According to the two definitions above, we can obtain the following propositions.

Proposition 1

Let n1 represent the number of positive samples in B+, R+ and W+, n2 represent the number of positive samples in R+ and W+, and n3 represent the number of positive samples in W+. Thus, we can obtain \(\frac {n_{2}-n_{3}}{l^{+}}\leq \nu _{1} \leq \frac {n_{1}-n_{3}}{l^{+}}\).

Proof

If xi (i = 1,2,…,l+) is a positive data point in S+, then ξi = 0, and according to the KKT condition αi(∥ϕ(xi) − C∥² − R² − ξi) = 0 we get αi = 0; from (23), θi = 0. If xi is a positive data point in B+, then ξi = 0 and, by the KKT condition βiξi = 0, βi ≥ 0; therefore \(0\leq \alpha _{i}\leq \frac {1}{\nu _{1}l^{+}}\) and θi = 0. If xi is a positive data point in R+, then ξi ≠ 0, so βi = 0, \(\alpha _{i}=\frac {1}{\nu _{1}l^{+}}\) and θi = 0. If xi is a positive data point in W+, then \(\alpha _{i}=\frac {1}{\nu _{1}l^{+}}\) and \(\theta _{i}=\frac {1}{\nu _{1}l^{+}}\). From the above, \(\frac {n_{2}}{\nu _{1}l^{+}}\leq {\sum }_{i\in I^{+}}\alpha _{i} \leq \frac {n_{1}}{\nu _{1}l^{+}}\) and \(-{\sum }_{i\in I^{+}}\theta _{i}=-\frac {n_{3}}{\nu _{1}l^{+}}\), so \(\frac {n_{2}-n_{3}}{\nu _{1}l^{+}}\leq {\sum }_{i\in I^{+}}(\alpha _{i}-\theta _{i}) \leq \frac {n_{1}-n_{3}}{\nu _{1}l^{+}}\). From (35), \(\frac {n_{2}-n_{3}}{\nu _{1}l^{+}}\leq 1 \leq \frac {n_{1}-n_{3}}{\nu _{1}l^{+}}\), and therefore \(\frac {n_{2}-n_{3}}{l^{+}}\leq \nu _{1} \leq \frac {n_{1}-n_{3}}{l^{+}}\). □

Proposition 2

Let m1 represent the number of negative samples in S−, B− and R−, m2 represent the number of negative samples in S− and B−, and m3 represent the number of negative samples in S−. Thus, we can obtain \(\frac {m_{2}-m_{3}}{l^{-}}\leq \nu _{2} \leq \frac {m_{1}-m_{3}}{l^{-}}\).

Proof

If xj (j = 1,2,…,l−) is a negative data point in S−, then ηj ≠ 0, and by the KKT condition λjηj = 0 we get λj = 0; from (42), \(\gamma _{j}=\frac {1}{\nu _{2}l^{-}}\), and from (29), \(\delta _{j}=\frac {1}{\nu _{2}l^{-}}\). If xj is a negative data point in B−, then \(\gamma _{j}=\frac {1}{\nu _{2}l^{-}}\) and δj = 0. If xj is a negative data point in R−, then ηj = 0, so λj ≥ 0, \(0\leq \gamma _{j}\leq \frac {1}{\nu _{2}l^{-}}\) and δj = 0. If xj is a negative data point in W−, then ηj = 0, γj = 0 and δj = 0. From the above, \(\frac {m_{2}}{\nu _{2}l^{-}}\leq {\sum }_{j\in I^{-}}\gamma _{j} \leq \frac {m_{1}}{\nu _{2}l^{-}}\) and \(-{\sum }_{j\in I^{-}}\delta _{j}=-\frac {m_{3}}{\nu _{2}l^{-}}\), so \(\frac {m_{2}-m_{3}}{\nu _{2}l^{-}}\leq {\sum }_{j\in I^{-}}(\gamma _{j}-\delta _{j}) \leq \frac {m_{1}-m_{3}}{\nu _{2}l^{-}}\). From (43), \(\frac {m_{2}-m_{3}}{\nu _{2}l^{-}}\leq 1 \leq \frac {m_{1}-m_{3}}{\nu _{2}l^{-}}\), and therefore \(\frac {m_{2}-m_{3}}{l^{-}}\leq \nu _{2} \leq \frac {m_{1}-m_{3}}{l^{-}}\). □

From Propositions above, we can acquire the range of ν1 and ν2: 0 < ν1,ν2 < 1.

4 Multicategory Ramp loss maximum margin of twin spheres support vector machine

Because multi-class classification problems are more common, we extend Ramp-MMTSSVM to multi-class classification by the RVO strategy. For a K-class problem, we choose the k th class as the negative class and the remaining classes as the positive class to generate one Ramp-MMTSSVM, so K classifiers are generated in total. The optimization problems of MRMMTSSVM are as follows:

$$\begin{array}{@{}rcl@{}} \min_{C_{k},{R_{k}^{2}},\xi_{i}} &&{R_{k}^{2}} - \frac{\nu_{k}}{l_{k}}\sum\limits_{j \in A_{k}} \| \phi(\mathbf{x}_{j}) - C_{k} \|^{2} + \frac{1}{\nu_{1k}l_{K-1}}\sum\limits_{i\in B_{k}} \xi_{i}\\ &&+ \sum\limits_{i\in B_{k}} \theta_{i}\left( {R_{k}^{2}}-\|\phi(\mathbf{x}_{i})-C_{k}\|^{2}\right)\\ s.t.&&\|\phi(\mathbf{x}_{i})-C_{k}\|^{2}\leq {R_{k}^{2}}+\xi_{i},\\ && \xi_{i}\geq 0, i\in B_{k}. \end{array} $$
(46)

and

$$\begin{array}{@{}rcl@{}} \min_{{\rho_{k}^{2}},\eta_{j}} &&{R_{k}^{2}} - {\rho_{k}^{2}} + \frac{1}{\nu_{2k}l_{k} }\sum\limits_{j\in A_{k}}\eta_{j}\\ &&+\sum\limits_{j\in A_{k}}\delta_{j}\left( \|\phi\left( \mathbf{x}_{j}\right)-C_{k}\|^{2}-\left( {R_{k}^{2}}+{\rho_{k}^{2}}\right)\right)\\ s.t. &&\|\phi\left( \mathbf{x}_{j}\right)-C_{k}\|^{2}\geq {R_{k}^{2}}+{\rho_{k}^{2}}-\eta_{j},\\ && \eta_{j}\geq 0, j\in A_{k}. \end{array} $$
(47)

where lk denotes the number of data points of k th class and lK− 1 denotes the number of data points of the rest classes, \(A_{k} \in R^{l_{k}\times n}(k = 1, \ldots , K)\), and \(B_{k}=[{A_{1}^{T}}, {\ldots } A_{k-1}^{T}, A_{k + 1}^{T}, {\ldots } {A_{K}^{T}}]^{T}\).

To solve each classifier, we also use the CCCP approach in the same way as in the binary classification case. Finally, we adopt the ‘voting’ scheme to determine the class of each new data point; a sketch of one possible tallying rule is given below.
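
The text only states that a ‘voting’ scheme is used; the following minimal sketch shows one plausible tally for the RVO outputs (the rule of voting for all other classes when a classifier answers +1 is our assumption, not necessarily the exact scheme used in the experiments):

```python
import numpy as np

def rvo_vote(predictions, K):
    """Tally votes from the K RVO classifiers for one test point.

    predictions[k] = output of the k-th Ramp-MMTSSVM, +1 for the 'rest'
    (positive) side and -1 for the k-th (negative) class."""
    votes = np.zeros(K)
    for k, pred in enumerate(predictions):
        if pred == -1:
            votes[k] += 1                  # classifier k claims the point for class k
        else:
            votes[np.arange(K) != k] += 1  # the point looks like one of the other classes
    return int(np.argmax(votes))
```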

5 Numerical experiments

In this section, we conduct experiments on one artificial dataset and twenty benchmark datasets to demonstrate the validity of the algorithms we proposed.

5.1 Experiments on one artificial dataset

We generate a 2-D artificial dataset containing 80 positive data points and 10 negative data points. The positive data points are generated from N(0,0,3,3) and the negative data points from N(3,3,2,2). The result of linear Ramp-MMTSSVM on this dataset is shown in Fig. 2.

Fig. 2 Illustration of linear Ramp-MMTSSVM

In this part, we compare linear Ramp-MMTSSVM and linear MMTSSVM on the artificial dataset, as shown in Figs. 2 and 3. From the figures, it is easy to see that the small sphere captures more positive data points in Ramp-MMTSSVM than in MMTSSVM, which indicates that Ramp-MMTSSVM performs better than MMTSSVM on this artificial dataset.

Fig. 3 Illustration of linear MMTSSVM

To verify that the Ramp loss can reduce the effect of outliers, we add five noise points to the positive and negative classes, respectively. After adding the noise, the performance of Ramp-MMTSSVM and MMTSSVM is shown in Figs. 4 and 5, respectively. From Fig. 5, it is obvious that the boundary of MMTSSVM expands outward, whereas the boundary of Ramp-MMTSSVM in Fig. 4 hardly changes. That is to say, when handling noise and outliers, Ramp-MMTSSVM performs better than MMTSSVM.

Fig. 4 Illustration of linear Ramp-MMTSSVM when handling the noise data points

Fig. 5 Illustration of linear MMTSSVM when handling the noise data points

In addition, to better illustrate the property of ν1 discussed in Section 3, we plot Fig. 6, where the x-axis represents the values of the parameter ν1. In the figure, the black dotted line shows how the proportion of n1 − n3 among the positive data points changes as ν1 varies, the red solid line represents the value of ν1 itself, and the blue dotted line shows how the proportion of n2 − n3 among the positive data points changes.

Fig. 6 The property of parameter ν1

From the figure, we can see that the parameter ν1 is a lower bound on the fraction of n1 − n3 in the positive data points and an upper bound on the fraction of n2 − n3 in the positive data points, which is helpful for selecting the range of the parameters.

5.2 Experiments on benchmark datasets

In this part, we conduct experiments on twenty benchmark datasets collected from the UCI machine learning repository. The detailed information of the twenty datasets is shown in Table 1.

Table 1 The characteristics of twenty benchmark datasets

In the binary classification case, Ramp-MMTSSVM is compared with THSVM, SSLM and MMTSSVM. In the multi-class classification case, MRMMTSSVM is compared with OVO-THSVM, THKSVM and OVR-MMTSSVM.

We adopt 5-fold cross-validation in each experiment to make the results more convincing. All experiments are carried out in Matlab R2014a on Windows 7, running on a PC with an Intel Core i3-4160 CPU (3.60 GHz) and 4.00 GB of RAM.

5.2.1 Parameters selection

The approaches mentioned above, i.e., THSVM, SSLM, MMTSSVM, THKSVM, Ramp-MMTSSVM and MRMMTSSVM, all depend heavily on parameter selection. In these experiments, we choose the Gaussian kernel function k(xi,xj) = exp(−∥xi − xj∥²/σ²) and obtain the optimal parameter values by the grid search method. The Gaussian kernel parameter σ is selected from the set {2^i | i = −3,−1,…,7}.
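
A minimal sketch of such a grid search with 5-fold cross-validation is given below (fit_and_score is a hypothetical stand-in for training and testing one Ramp-MMTSSVM; the sketch is illustrative rather than the Matlab code actually used):

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

sigmas = [2.0 ** i for i in range(-3, 8, 2)]     # {2^-3, 2^-1, ..., 2^7}
nus = [0.1 * i for i in range(1, 10)]            # {0.1, ..., 0.9}

def grid_search(X, y, fit_and_score):
    """Pick (sigma, nu) by 5-fold cross-validated accuracy."""
    best_params, best_acc = None, -np.inf
    for sigma, nu in product(sigmas, nus):
        folds = KFold(n_splits=5, shuffle=True, random_state=0).split(X)
        acc = np.mean([fit_and_score(X[tr], y[tr], X[te], y[te], sigma, nu)
                       for tr, te in folds])
        if acc > best_acc:
            best_params, best_acc = (sigma, nu), acc
    return best_params, best_acc
```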

In THSVM, to reduce the computational complexity, we set c1 = c2 and ν1 = ν2, which are chosen from the sets {2^i | i = 0,2,…,8} and {0.1,0.2,…,0.9}, respectively.

In SSLM, we set ν1 = ν2, which are selected from the set {0.001,0.01} and the range of ν is {0.1,0.2,…,0.9}.

In MMTSSVM, we set ν1 = ν2. All the ν are selected from the set {0.1,0.2,…,0.9}.

In THKSVM, the range of νk is {2^i | i = −4,−2,…,4}, and dk is selected from the set {2^i | i = 1,3,…,9}.

In our approaches, we set k1 = k2 with range {0.5, 1, 1.5, 2}. The ranges of the other parameters are the same as in MMTSSVM.

5.2.2 Result analysis

In Tables 2 and 3, ‘Accuracy’ denotes the average testing accuracy plus or minus the corresponding standard deviation, and ‘Time’ denotes the mean running time, including training time and testing time.

Table 2 The experimental results for binary classification case on ten benchmark datasets
Table 3 The experimental results for multi-class classification case on ten benchmark datasets

From Table 2, we can see that in the binary classification case the experimental results of our approach are better than those of the other algorithms on most datasets, i.e., the Sonar, Breast cancer, Heart, Pima, Spectf heart and Abalone datasets. MMTSSVM obtains the best performance on the Ionosphere, Banknote and Monks datasets; however, the accuracy of our algorithm is only slightly lower than that of MMTSSVM and higher than those of THSVM and SSLM. THSVM performs best on the Liver disorder dataset, where our approach ranks second and is only slightly worse. In the binary classification case, SSLM never achieves the best performance on the ten datasets.

From Table 3, we can see that in the multi-class classification case MRMMTSSVM has the best experimental results on most datasets, i.e., the Soybean, Ecoli, Optidigits, Hayes-roth, Teaching, Balance and Image datasets. On the Seeds dataset, THKSVM and MRMMTSSVM achieve the same highest accuracy. THKSVM obtains the best performance on the Iris dataset, where MRMMTSSVM ranks second and is only slightly worse. On the Cmc dataset, OVO-THSVM obtains the best result; although the accuracy of MRMMTSSVM is not the highest there, it is better than those of THKSVM and OVR-MMTSSVM. In the multi-class classification case, OVR-MMTSSVM never obtains the highest accuracy on the ten datasets.

In conclusion, in both the binary and multi-class classification cases, our algorithms perform better than the other algorithms.

5.2.3 Friedman test

To analyse the experimental performance of the eight algorithms statistically, we introduce the Friedman test. Tables 4 and 5 show the average ranks of the four algorithms in the binary and multi-class classification cases, respectively.

Table 4 Average ranks of four algorithms in binary classification case
Table 5 Average ranks of four algorithms in multi-class classification case

Under the null-hypothesis that all the algorithms are equivalent, one can compute the Friedman statistic according to:

$$\begin{array}{@{}rcl@{}} {\chi_{F}^{2}}=\frac{12N}{k(k + 1)}\left[\sum\limits_{j}{R_{j}^{2}}-\frac{k(k + 1)^{2}}{4}\right], \end{array} $$
(48)

where \(R_{j}=\frac {1}{N}{\sum }_{i}{r_{i}^{j}}\) and \({r_{i}^{j}}\) denotes the rank of the j th of k algorithms on the i th of N datasets. Friedman’s \({\chi _{F}^{2}}\) is undesirably conservative, and a better statistic is

$$\begin{array}{@{}rcl@{}} F_{F}=\frac{(N-1){\chi_{F}^{2}}}{N(k-1)-{\chi_{F}^{2}}}, \end{array} $$
(49)

which is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.
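
For reference, (48) and (49) can be computed directly from the rank matrix; a minimal sketch follows (the example ranks are random and only demonstrate the call):

```python
import numpy as np

def friedman_statistics(ranks):
    """chi_F^2 from (48) and F_F from (49), given an N x k matrix of ranks."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)                   # average rank of each algorithm
    chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
    return chi2_F, F_F

# example: 4 algorithms ranked on 10 datasets (random ranks, just to show the call)
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation(4) + 1 for _ in range(10)], dtype=float)
print(friedman_statistics(ranks))
```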

According to (48) and (49), in the binary classification case we obtain \({\chi _{F}^{2}}= 18.9300\) and FF = 15.3902, where FF follows the F-distribution with (3, 27) degrees of freedom. The critical value of F(3,27) is 2.96 at significance level α = 0.05, 3.65 at α = 0.025 and 4.60 at α = 0.01. The value of FF is much larger than these critical values, which means that the differences among the algorithms in the binary classification case are significant. In addition, Table 4 shows that the average rank of our approach is much lower than those of the remaining algorithms, which means our approach is more effective than the other three algorithms in the binary classification case.

Similarly, in the multi-class classification case we obtain \({\chi _{F}^{2}}= 16.1700\) and FF = 10.5228. The value of FF is again much larger than the critical value. From Table 5, we can see that the average rank of MRMMTSSVM is the lowest. Both results indicate that MRMMTSSVM performs better than the other three algorithms.

In conclusion, our approaches are more effective than the other algorithms in both the binary and multi-class classification cases.

5.3 The effect of parameters on the performance

In Ramp-MMTSSVM and MRMMTSSVM, there are three main parameters, i.e., k, ν and σ. Different values of these parameters have a great influence on the experimental accuracy. Therefore, in this part we conduct experiments on the Sonar dataset to investigate their influence on the performance. The results are shown in Figs. 7, 8 and 9.

Fig. 7 The effect of parameter k on the performance

Fig. 8 The effect of parameter ν on the performance

Fig. 9 The effect of parameter σ on the performance

The x-axis of each figure denotes the values of the corresponding parameter, and the y-axis denotes the resulting accuracy. From Fig. 7, we can see that on the Sonar dataset the accuracy increases as k goes from 0.5 to 1.5 and then decreases at k = 2.0. Figure 8 shows that the accuracy is stable when ν is small and changes greatly as ν increases. Figure 9 shows that the accuracy changes greatly when σ is small and tends to become stable as σ increases. In conclusion, the three parameters k, ν and σ all influence the experimental results of Ramp-MMTSSVM and MRMMTSSVM, and their proper ranges differ across datasets.

6 Conclusions

In this paper, we propose a Ramp loss maximum margin of twin spheres support vector machine (Ramp-MMTSSVM). It reduces the sensitivity to outliers and improves generalization performance. Because the resulting problem is a non-differentiable non-convex optimization problem, we adopt the CCCP approach to solve it. Furthermore, we prove the properties of the parameters ν1 and ν2 and verify them with an artificial experiment. In addition, we extend Ramp-MMTSSVM to multi-class classification problems by the RVO strategy. The experimental results on twenty benchmark datasets show that our approaches perform better than the other algorithms in both the binary and multi-class classification cases. During the experiments, we found that the computational speed of our algorithms is somewhat slower than that of the other algorithms; therefore, how to accelerate the computation is our future work.