1 Introduction

The support vector machine (SVM) is an excellent tool for binary data classification [14]. This learning strategy, introduced by Vapnik [3, 4], is a principled and powerful method in machine learning. Within a few years of its introduction, the SVM had already outperformed most other systems in a wide variety of applications, spanning research areas that range from pattern recognition [5, 6] and text categorization [7] to biomedicine [8], brain–computer interfaces [9], and financial applications [10].

The theory of the SVM proposed by Vapnik et al. is based on the structural risk minimization (SRM) principle [14]. In its simplest form, the SVM for linearly separable two-class problems finds an optimal hyperplane that maximizes the margin between the two classes. The hyperplane is obtained by solving a dual quadratic programming problem (QPP). For nonlinearly separable cases, the input vectors are first mapped into a high-dimensional feature space using a nonlinear kernel function [1, 4]. A linear classifier is then implemented in that feature space to classify the data.

One of the main challenges for the classical SVM is the large computational complexity of the QPP. In addition, the performance of a trained SVM also depends on the optimal parameter set, which is usually found by cross-validation on a training set. The long training time of the QPP not only causes the SVM to take a long time to train on a large database, but also prevents the SVM from locating the optimal parameter set on a very fine grid of parameters over a large span. To reduce the learning complexity of the SVM, various algorithms and versions of the SVM with comparable classification ability have been reported, including the chunking algorithm [11], the decomposition method [12], the sequential minimal optimization (SMO) approach [13, 14], the least squares support vector machine (LS-SVM) [15, 16], and geometric algorithms [18–20].

All the above classifiers discriminate a point by determining in which half-space it lies. Mangasarian and Wild [21] proposed a nonparallel-plane classifier for binary data classification, named the generalized eigenvalue proximal support vector machine (GEPSVM). In this approach, the data points of each class are proximal to one of two nonparallel planes. The nonparallel planes are determined by the eigenvectors corresponding to the smallest eigenvalues of two related generalized eigenvalue problems. The GEPSVM is very fast, as it solves two generalized eigenvalue problems of the order of the input space dimension, but its performance is only comparable with the classical SVM, and in many cases it gives low classification accuracies. Recently, Jayadeva et al. [22] proposed a twin support vector machine (TSVM) classifier for binary data classification. The TSVM aims at generating two nonparallel hyperplanes such that each one is closer to one class and at a distance of at least one from the other class for any given binary data set. The strategy of solving a pair of smaller sized QPPs instead of a single large one as in classical SVMs makes the TSVM approximately four times faster than the classical SVM. In terms of generalization, the TSVM compares favorably with the classical SVM. Extensions of the TSVM include the least squares TSVM (LS-TSVM) [23], smooth TSVM [24], nonparallel-plane proximal classifier (NPPC) [25, 26], geometric algorithms [27], localized TSVM (LCTSVM) [28], twin support vector regression (TSVR) [29, 30], twin parametric-margin SVM (TPMSVM) [31], and twin parametric insensitive support vector regression (TPISVR) [32].

The experimental results in [22] show that the TSVM achieves a learning speed about four times that of the classical SVM. This is because the TSVM differs from the SVM in one fundamental way: a pair of smaller sized QPPs is solved, whereas in the classical SVM a single QPP is solved. In addition, the TSVM compares favorably with the SVM and GEPSVM in terms of generalization performance. However, the TSVM has some shortcomings. First, for the nonlinear TSVM, the objective functions of its dual QPPs require the inversion of two matrices of size (l + 1) × (l + 1), where l is the number of training samples, which increases the computational cost of optimizing them. Moreover, in order to extend efficient SVM algorithms, such as the SMO and decomposition algorithms, to the TSVM, the inverse matrices need to be cached, which makes the TSVM unsuitable for large-scale problems. Second, the TSVM uses a pair of nonparallel hyperplanes to depict the characteristics of the two classes of samples, such that each hyperplane is closest to the data points of the corresponding class and at a distance of at least 1 from the data points of the other class. Obviously, this description is not reasonable in general; for instance, it is impossible to use two planes to reflect binary classification samples coming from two different Gaussian distributions. Third, unlike the classical SVM, the TSVM only aims at minimizing the empirical risks of the training samples: in each QPP, the \(L_2\)-norm empirical risk of the samples of one class and the \(L_1\)-norm empirical risk of the samples of the other class are minimized simultaneously with a trade-off parameter. A direct consequence is that over-fitting appears on the nonparallel hyperplanes, which reduces the generalization performance.

In this paper, we propose a new hypersphere classifier, termed the twin support vector hypersphere (TSVH), for binary pattern recognition. Our TSVH is motivated by the support vector data description (SVDD) [33, 34], a classical one-class support vector machine (OCSVM) for unsupervised learning (see, e.g., [35] for an introduction to the OCSVM). Basically, the TSVH aims at generating two hyperspheres such that each hypersphere contains as many samples of one class as possible and is as far as possible from those of the other class. Similar to the TSVM, the TSVH solves two smaller sized QPPs instead of a single large one as in the traditional SVM. The TSVH has several merits compared with the TSVM. First, the formulation of the TSVH is totally different from that of the TSVM. Specifically, the two inversions of matrices of size (l + 1) × (l + 1) in the dual QPPs of the TSVM are avoided, which gives the TSVH a lower learning cost and allows it to be easily extended to large-scale problems with efficient learning algorithms. Second, the TSVH uses a pair of hyperspheres to describe the training samples, such that each hypersphere covers as many samples of one class as possible while staying far from the samples of the other class. This strategy is more reasonable than that of the TSVM for most real cases, which means the TSVH can deal with more general pattern recognition problems than the TSVM. Computational comparisons with the TSVM and SVM in terms of classification accuracy and learning time are made on several artificial and UCI data sets, showing that the proposed TSVH is not only faster than the other two methods but also obtains comparable generalization.

The remainder of this paper is organized as follows: Sect. 2 briefly reviews the SVM, SVDD, and TSVM. Section 3 introduces the novel twin support vector hypersphere. Section 4 discusses the learning algorithm for the proposed TSVH. Experiments on several artificial and benchmark data sets are discussed in Sect. 5. Finally, Sect. 6 summarizes and concludes the paper.

2 Background

In this section, we briefly introduce the classical SVM, SVDD, and TSVM. Without loss of generality, we consider a classification problem with the data set \({D}=\left\{(\user2{x}_1,y_1),\ldots,(\user2{x}_l,y_l)\right\}, \) where \({\user2{x}_i\in{\mathcal{X}}\subset\mathcal{R}^d}\) and \(y_i\in\{-1,1\}. \) Further, we denote by \({{\mathcal{I}}^\pm}\) the sets of indices i such that \(y_i=\pm1\) and \({{\mathcal{I}}={\mathcal{I}}^+\bigcup {\mathcal{I}}^-, }\) and denote by \({D}^\pm\) the positive and negative sets, i.e., \({{D}^\pm=\{\user2{x}_i|i\in{\mathcal{I}}^\pm\}}\).

2.1 Support vector machine

Generally, the SVM finds the best (maximal margin) separating hyperplane \(H(\user2{w},b)\) between two classes of samples in the feature space \(\mathcal{H}\)

$$ H(\user2{w},b):\user2{w}^T{\varvec{\varphi}}(\user2{x})+b=0 $$
(1)

maximizing the total (interclass) margin \(2/||\user2{w}||\) and satisfying \({y_i\left(\user2{w}^T{\varvec{\varphi}}(\user2{x}_i)+b\right)-1\geq0,\,i\in{\mathcal{I}}, }\) where \({{\varvec{\varphi}}(\cdot):{\mathcal{X}}\rightarrow\mathcal{H}}\) maps \({{\mathcal{X}}}\) into \(\mathcal{H},\,\user2{w}\in \mathcal{H}, \) and \(||\cdot||\) is the \(L_2\)-norm. For the linear case, we have \({\varvec{\varphi}}({\user2{x}})={\user2{x}}\). Under the Mercer theorem [3, 4, 36], it is possible to use a kernel \(k(\user2{u},\user2{v})\) to represent the inner product in \(\mathcal{H}, \) i.e., \(k(\user2{u},\user2{v})={\varvec{\varphi}}(\user2{u})^T{\varvec{\varphi}}(\user2{v}), \) such as the Gaussian kernel \(k(\user2{u},\user2{v})=\exp\{-\gamma||\user2{u}-\user2{v}||^2\}\) with γ > 0. To fit practical applications, one way is to add a slack variable \(\xi_i\) for each \(\user2{x}_i, \) which allows a controlled violation of the constraint. Therefore, the SVM can be expressed as the following mathematical model:
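As a concrete illustration, the Gaussian kernel above can be evaluated in a few lines; a minimal sketch in plain Python (the function name and the choice γ = 0.5 are ours, for illustration only):

```python
import math

def gaussian_kernel(u, v, gamma=0.5):
    """Gaussian kernel k(u, v) = exp(-gamma * ||u - v||^2), gamma > 0."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)

# k(u, u) = 1 for any u; the value decays toward 0 as u and v move apart.
print(gaussian_kernel([1.0, 2.0], [1.0, 2.0]))  # -> 1.0
```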

$$ \begin{aligned} \hbox{min}&\, \frac{1}{2}\user2{w}^T\user2{w}+C\sum\limits_{i\in{{\mathcal{I}}}}\xi_i\\ \hbox{s.t.}&\,y_i\left(\user2{w}^T{\varvec{\varphi}}(\user2{x}_i)+b\right)\geq1-\xi_i,\\ &\,\xi_i\geq0,\; i\in{{\mathcal{I}}}, \end{aligned} $$
(2)

which is a QPP with linear inequalities in \(\mathcal{H}\), where C > 0 is a pre-specified regularization factor. By introducing the Lagrangian multipliers \(\alpha_i\), we derive its dual QPP as follows:

$$ \begin{aligned} &\hbox{max}\quad \sum\limits_{i\in {{\mathcal{I}}}}\alpha_i-\frac{1}{2}\sum\limits_{i,j\in I}\alpha_i\alpha_jy_iy_jk(\user2{x}_i,\user2{x}_j)\\ &\hbox{s.t.}\quad0\leq\alpha_i\leq C,\; i\in{{\mathcal{I}}},\\ &\,\sum\limits_{i\in{{\mathcal{I}}}}\alpha_iy_i=0.\\ \end{aligned} $$
(3)

After optimizing this dual QPP, we obtain the following decision function:

$$ f(\user2{x})=\hbox{sgn}\left(\sum\limits_{i\in{{\mathcal{I}}}} \alpha_iy_ik(\user2{x}_i,\user2{x})+b\right), $$
(4)

where \(\user2{w}\) in \(H(\user2{w},b)\) is derived by the Karush–Kuhn–Tucker (KKT) optimality conditions, which is:

$$ \user2{w}=\sum\limits_{i\in {{\mathcal{I}}}}y_i\alpha_i{\varvec{\varphi}} (\user2{x}_i)=\sum\limits_{i\in {{\mathcal{I}}}^+}\alpha_i{\varvec{\varphi}} (\user2{x}_i)-\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j{\varvec{\varphi}}(\user2{x}_j). $$
(5)
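For concreteness, once the dual variables are known, the decision function (4) is a direct kernel expansion; a minimal sketch, assuming the \(\alpha_i\) and b have already been obtained from some QP solver applied to (3) (all names and the toy data below are illustrative):

```python
def svm_decision(x, train_x, train_y, alphas, b, kernel):
    """Evaluate f(x) = sgn(sum_i alpha_i * y_i * k(x_i, x) + b), Eq. (4)."""
    s = sum(a * y * kernel(xi, x)
            for a, y, xi in zip(alphas, train_y, train_x)) + b
    return 1 if s >= 0 else -1

# Toy 1-D check with a linear kernel: x = -1 labeled -1, x = +1 labeled +1;
# alpha = (0.5, 0.5) and b = 0 satisfy the dual constraints of (3).
dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
print(svm_decision([2.0], [[-1.0], [1.0]], [-1, 1], [0.5, 0.5], 0.0, dot))  # -> 1
```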

Note that optimizing the dual QPP (3) is time-consuming for large-scale problems because of the enormous matrix storage and intensive matrix operations. Thus, a major research topic related to the SVM focuses on fast learning, e.g., [11, 13, 14, 17–20].

2.2 Twin support vector machine

The TSVM is a binary classifier that performs classification using two nonparallel hyperplanes, instead of the single hyperplane used in the classical SVM [22]; the two hyperplanes are obtained by solving two smaller sized QPPs.

For the linear case, the TSVM finds the following pair of nonparallel positive and negative hyperplanes in \(\mathcal{R}^d: \)

$$ \user2{w}_+^T\user2{x}+b_+=0,\quad\user2{w}_-^T\user2{x}+b_-=0, $$
(6)

such that each hyperplane is closer to the samples of one class and is as far as possible from those of the other class. A new point is assigned to class +1 (positive) or −1 (negative) depending upon its proximity to the above two nonparallel hyperplanes. Formally, the linear TSVM solves the following two QPPs for finding the positive and negative hyperplanes, respectively:

$$ \begin{aligned} \hbox{min}&\, \frac{1}{2}\sum\limits_{i\in {{\mathcal{I}}}^+} \left(\user2{w}_+^T\user2{x}_i+b_+\right)^2\,+\,C_1\sum\limits_{j\in{{\mathcal{I}}}^-}\xi_j\\ \hbox{s.t.}&\,-\user2{w}_+^T\user2{x}_j-b_+\geq1-\xi_j,\\ &\,\xi_j\geq0,\; j\in {{\mathcal{I}}}^-,\\ \end{aligned} $$
(7)
$$ \begin{aligned} \hbox{min}&\, \frac{1}{2}\sum\limits_{j\in {{\mathcal{I}}}^-}\left(\user2{w}_-^T\user2{x}_j+b_-\right)^2\,+\,C_2\sum\limits_{i\in{{\mathcal{I}}}^+}\xi_i\\ \hbox{s.t.}&\,\user2{w}_-^T\user2{x}_i+b_-\geq1-\xi_i,\\ &\,\xi_i\geq0,\; i\in{{\mathcal{I}}}^+, \\ \end{aligned} $$
(8)

where \(C_1, C_2>0\) are the pre-specified trade-off factors, and \({\xi_j,\,j\in {\mathcal{I}}^-,\,\xi_i,\,i\in {\mathcal{I}}^+}\) are the slack variables. By introducing the Lagrangian multipliers, we obtain the following dual QPPs:

$$ \begin{aligned} &\hbox{max}\quad\sum\limits_{j\in{{\mathcal{I}}}^-}\alpha_j-\frac{1}{2} \sum\limits_{j_1,j_2\in {{\mathcal{I}}}^-}\alpha_{j_1}\alpha_{j_2}\user2{z}^T_{j_1} \left(\sum\limits_{i\in{{\mathcal{I}}}^+}\user2{z}_{i}\user2{z}^T_{i}\right)^{-1} \user2{z}_{j_2}\\ &\hbox{s.t.}\quad 0\leq\alpha_j\leq C_1,\; j\in{{\mathcal{I}}}^-, \\ \end{aligned} $$
(9)

and

$$ \begin{aligned} \hbox{max}&\,\sum\limits_{i\in{{\mathcal{I}}}^+}\alpha_i-\frac{1}{2} \sum\limits_{i_1,i_2\in {{\mathcal{I}}}^+}\alpha_{i_1}\alpha_{i_2}\user2{z}^T_{i_1} \left(\sum\limits_{j\in{{\mathcal{I}}}^-}\user2{z}_{j} \user2{z}^T_{j}\right)^{-1}\user2{z}_{i_2}\\ \hbox{s.t.}&\,\,0\leq \alpha_i\leq C_2,\; i\in{{\mathcal{I}}}^+, \\ \end{aligned} $$
(10)

where \({\user2{z}_k=(\user2{x}_k;1),\,k\in{\mathcal{I}}}\). After optimizing (9) and (10), we obtain the augmented vectors of the two non-parallel hyperplanes, which are:

$$ \user2{v}^T_+=(\user2{w}_+^T,\,b_+)= -\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j\user2{z}^T_{j}\left(\sum\limits_{i\in {{\mathcal{I}}}^+}\user2{z}_{i}\user2{z}^T_{i}\right)^{-1}, $$
(11)
$$ {\user2{v}} ^T_-=(\user2{w}_-^T,\,b_-)= \sum\limits_{i\in {{\mathcal{I}}}^+}\alpha_i\user2{z}^T_{i}\left(\sum\limits_{j\in {{\mathcal{I}}}^-}\user2{z}_{j}\user2{z}^T_{j}\right)^{-1}. $$
(12)

A new data point \(\user2{x}\) is then assigned to the class + or −, depending on which of the two hyperplanes given by (6) it lies closest to. Thus,

$$ f(\user2{x})=\arg\min_{+,-}\left\{d_\pm(\user2{x})\right\}, $$

where

$$ d_\pm(\user2{x})=\frac{\left|\user2{w}^T_\pm \user2{x}+b_\pm\right|}{||\user2{w}_\pm||}. $$
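The decision rule above reduces to comparing two point-to-hyperplane distances; a minimal sketch (the function name and the toy planes are illustrative):

```python
import math

def tsvm_predict(x, w_plus, b_plus, w_minus, b_minus):
    """Assign x to the nearer of the two nonparallel hyperplanes in (6)."""
    def dist(w, b):
        num = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return num / math.sqrt(sum(wi * wi for wi in w))
    return 1 if dist(w_plus, b_plus) <= dist(w_minus, b_minus) else -1

# Positive plane x1 = 0, negative plane x1 = 4; (1, 0) is nearer the positive one.
print(tsvm_predict([1.0, 0.0], [1.0, 0.0], 0.0, [1.0, 0.0], -4.0))  # -> 1
```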

For the nonlinear case, the nonparallel positive and negative hyperplanes are expressed as follows:

$$ \sum\limits_{k\in {{\mathcal{I}}}}w_{k,+}k(\user2{x}_k,\user2{x})+b_+=0\quad \hbox{and}\,\,\sum\limits_{k\in {{\mathcal{I}}}}w_{k,-}k(\user2{x}_k,\user2{x})+b_-=0, $$
(13)

with \(\user2{w}_\pm=({w}_{1,\pm};{w}_{2,\pm};\ldots;{w}_{l,\pm})\), where \(k(\user2{u},\user2{v})\) is some kernel. It is easy to obtain the primal QPPs for this case, which are similar to (7) and (8). If we define \(\user2{z}_k=\left(k(\user2{x}_1,\user2{x}_k);\ldots;k(\user2{x}_l,\user2{x}_k);1\right)\) for each \({k \in {\mathcal{I}}, }\) then we have the same dual QPPs as (9) and (10) and the same augmented vectors as (11) and (12) for (13).

The QPPs (9) and (10) have lower complexity than (3) since each has only \({l^\mp=|{\mathcal{I}}^\mp|}\) parameters, while (3) has \(l=l^++l^-\) parameters; this makes the TSVM approximately four times faster than the classical SVM. The reason is that the complexity of the SVM is no more than \(\mathcal{O}(l^3), \) while the TSVM solves two QPPs, each of which is roughly of size l/2, so the ratio of runtimes is approximately \(l^3/\left(2(l/2)^3\right)=4. \) It is necessary to point out that, in order to optimize (9) and (10), the linear TSVM requires the inversion of two matrices of size (d + 1) × (d + 1), whereas the nonlinear TSVM requires the inversion of two matrices of size (l + 1) × (l + 1). Further, the samples with \({0<\alpha_j\leq C_1,\,j\in {\mathcal{I}}^-}\) or \({0<\alpha_i\leq C_2,\,i\in {\mathcal{I}}^+}\) are defined as SVs, since they are significant in determining the two hyperplanes (6). The experiments show that the TSVM obtains good generalization performance and a faster learning speed than the classical SVM. However, as discussed above, the TSVM still has some shortcomings, which limit its range of application and its generalization.

2.3 Support vector data description

A one-class problem is usually understood as computing a binary function that captures regions in input space where the probability density lies, that is, a function such that most of the samples lie in the region where the function is nonzero (see Refs. [33–35]). From the viewpoint of learning, a one-class problem is an unsupervised one, where the samples are unlabeled or the label of each sample is taken to be the sample itself. Over the last decade, the SVM has been generalized to solve one-class problems. One popular OCSVM is the SVDD proposed by Tax and Duin [33, 34].

The SVDD identifies a sphere with minimum volume that captures the given normal data set. The sphere is characterized by its center \(\user2{c}\) and radius R in some feature space. Minimization of the volume is achieved by minimizing \(R^2, \) which represents the structural error:

$$ \begin{aligned} &\min\quad R^2\\ &\hbox{s.t.}\quad ||{\varvec{\varphi}}(\user2{x}_i)-\user2{c}||^2\leq R^2,\; \forall i.\\ \end{aligned} $$
(14)

The above constraints do not allow any point to fall outside the hypersphere. To make provision in the model for potential outliers in the training set, a penalty cost is introduced for samples that lie outside the hypersphere:

$$ \begin{aligned} &\min\quad R^2+C\sum\limits_i\xi_i\\ &\hbox{s.t.}\quad ||{\varvec{\varphi}}(\user2{x}_i)- \user2{c}||^2\leq R^2+\xi_i,\;\xi_i\geq0,\; \forall i,\\ \end{aligned} $$
(15)

where C is the penalty coefficient for each outlier and \(\xi_i\) is the distance between the ith sample and the hypersphere. This is a quadratic optimization problem and can be solved efficiently by introducing Lagrange multipliers for the constraints. Formally, it can be solved by optimizing the following dual problem:

$$ \begin{aligned} &\max\quad \sum\limits_i\alpha_ik(\user2{x}_i,\user2{x}_i) -\sum\limits_{i,j}\alpha_i\alpha_jk(\user2{x}_i,\user2{x}_j)\\ &\hbox{s.t.}\quad 0\leq\alpha_i\leq C,\; \forall i,\\ \,&\,\sum\limits_i\alpha_i=1. \\ \end{aligned} $$
(16)

This problem can be solved rather easily using well-established quadratic programming algorithms and will lead to the representation of "normal" data. The assessment of whether a data point is inside or outside the SVDD hypersphere is based on the sign of the following function:

$$ \begin{aligned} f(\user2{x})&=\hbox{sgn}\left(R^2-\left|\left|{\varvec{\varphi}}(\user2{x}) -\user2{c}\right|\right|^2\right)\\ &=\hbox{sgn}\left(R^2-\sum\limits_{i,j}\alpha_i\alpha_jk(\user2{x}_i, \user2{x}_j)+2\sum\limits_i\alpha_ik(\user2{x}_i,\user2{x}) -k(\user2{x},\user2{x})\right),\\ \end{aligned} $$
(17)

where \(R^2\) is computed according to the KKT conditions, and the center \({\user2{c}}\) is

$$ {\user2{c}}=\sum\limits_i\alpha_i{\varvec{\varphi}}(\user2{x}_i). $$
(18)

A positive (negative) sign implies that the distance of the data point to the center of the hypersphere is less (greater) than the radius of the hypersphere.
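The sign test (17) can be evaluated purely through kernel values; a minimal sketch, assuming the \(\alpha_i\) and \(R^2\) have already been obtained from the dual (16) (the names and toy data are illustrative):

```python
def svdd_decision(x, train_x, alphas, r2, kernel):
    """Sign test of Eq. (17): +1 inside or on the hypersphere, -1 outside."""
    cross = 2.0 * sum(a * kernel(xi, x) for a, xi in zip(alphas, train_x))
    center_norm = sum(ai * aj * kernel(pi, pj)
                      for ai, pi in zip(alphas, train_x)
                      for aj, pj in zip(alphas, train_x))
    val = r2 - center_norm + cross - kernel(x, x)
    return 1 if val >= 0 else -1

# Toy check with a linear kernel: alphas = (0.5, 0.5) put the center at [1.0],
# and r2 = 1.0 is the squared distance from each training point to the center.
dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
print(svdd_decision([1.0], [[0.0], [2.0]], [0.5, 0.5], 1.0, dot))  # -> 1
```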

3 Twin support vector hypersphere

In this section, we introduce a novel approach to SVM classification which we have termed the twin support vector hypersphere classifier.

As mentioned earlier, the TSVH is similar in spirit to the TSVM: it also solves two QPPs for binary pattern recognition, and each QPP has an SVM-type formulation, except that not all patterns appear in the constraints of either problem at the same time. However, it is based on an entirely different assumption, namely that the samples of the two classes form clusters that can be covered by two hyperspheres, respectively; that is, the samples of one class are covered by one hypersphere, whereas the samples of the other class are as far as possible from this hypersphere. In short, it combines the ideas of the TSVM and the SVDD.

The TSVH classifier is obtained by solving the following pair of QPPs:

$$ \begin{aligned} &\min\quad \frac{1}{2}\sum\limits_{i\in {{\mathcal{I}}}^+} \left|\left|{\varvec{\varphi}}(\user2{x}_i)-\user2{c}_+\right|\right|^2 -\nu_1R_+^2 +C_1\sum\limits_{j\in {{\mathcal{I}}}^-}\xi_j\\ &\hbox{s.t.}\quad\left|\left|{\varvec{\varphi}}(\user2{x}_j) -\user2{c}_+\right|\right|^2\geq R_+^2-\xi_j,\\ &\,R_+^2\geq0,\;\xi_j\geq0,\;j\in {{\mathcal{I}}}^-, \\ \end{aligned} $$
(19)
$$ \begin{aligned} &\min\quad \frac{1}{2}\sum\limits_{j\in {{\mathcal{I}}}^-} \left|\left|{\varvec{\varphi}}(\user2{x}_j)-\user2{c}_-\right|\right|^2 -\nu_2R_-^2+C_2\sum\limits_{i\in {{\mathcal{I}}}^+}\xi_i\\ &\hbox{s.t.}\quad\left|\left|{\varvec{\varphi}}(\user2{x}_i) -\user2{c}_-\right|\right|^2\geq R_-^2-\xi_i,\\ &\,R_-^2\geq0,\;\xi_i\geq0,\;i\in {{\mathcal{I}}}^+, \\ \end{aligned} $$
(20)

where \(C_1, C_2>0\) and \(\nu_1, \nu_2>0\) are the pre-specified penalty factors, and \(\user2{c}_\pm\in\mathcal{H}\) and \(R_\pm\) are the centers and radii of the corresponding hyperspheres, respectively.

This model finds two hyperspheres, one for each class, and classifies points according to which hypersphere a given point is closest to. The first term in the objective function of (19) or (20) is the sum of squared distances from the center of the hypersphere to the points of one class; minimizing it tends to keep the center close to the points of that class. The second term maximizes the squared radius of this hypersphere, making the hypersphere as large as possible. The first set of constraints requires the center of the hypersphere to be at a distance of at least its radius from the points of the opposite class; a set of error variables measures the errors wherever points of the opposite class are covered by this hypersphere, that is, wherever their distances to the center are less than the radius. The last term of the objective function of (19) or (20) minimizes the sum of the error variables, thus attempting to minimize misclassification due to points belonging to the opposite class.

In short, similar to the TSVM, the TSVH comprises a pair of QPPs such that, in each QPP, the objective function corresponds to a particular class and the constraints are determined by the samples of the other class. Thus, the TSVH gives rise to two smaller sized QPPs instead of the single large QPP of the classical SVM, making the TSVH approximately four times faster than the classical SVM when learning its hyperspheres. However, unlike the TSVM, in whose formulation the data points of each class cluster around the corresponding hyperplane, our TSVH, in the spirit of the SVDD, uses a pair of hyperspheres to depict the two classes. In the problem (19), the positive samples are covered by a hypersphere and clustered around the center \(\user2{c}_+\), and in the problem (20), the negative samples are covered by a hypersphere and clustered around the center \(\user2{c}_-\). For most cases, this strategy is more reasonable than that of the TSVM.

The corresponding Lagrangian function of (19) is

$$ \begin{aligned} L(\user2{c}_+,R^2_+,\user2{\xi},\user2{\alpha},\user2{r},s)&=\frac{1}{2} \sum\limits_{i\in{{\mathcal{I}}}^+}\left|\left|{\varvec{\varphi}}(\user2{x}_i) -\user2{c}_+\right|\right|^2-\nu_1R_+^2+C_1\sum\limits_{j\in {{\mathcal{I}}}^-}\xi_j\\ &\quad-\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j \left(\left|\left|{\varvec{\varphi}}(\user2{x}_j)-\user2{c}_+\right|\right|^2 - R_+^2+\xi_j\right)\\ &\quad-\sum\limits_{j\in {{\mathcal{I}}}^-}r_j\xi_j-sR_+^2,\\ \end{aligned} $$
(21)

where \({\alpha_j,\,r_j,\,j\in {\mathcal{I}}^-}\) and s are the Lagrangian multipliers, with \({\varvec{\alpha}}=(\alpha_{j_1};\alpha_{j_2};\ldots;\alpha_{j_{l^-}}), \) \(\user2{r}=(r_{j_1};r_{j_2};\ldots;r_{j_{l^-}}), \) and \(\user2{\xi}=(\xi_{j_1};\xi_{j_2};\ldots;\xi_{j_{l^-}}), \,j_k\in {\mathcal{I}}^-,\, k=1,\ldots,l^-. \) Differentiating the Lagrangian function (21) with respect to \(\user2{c}_+,\,R^2_+, \) and \(\xi_j,\,j\in{\mathcal{I}}^-\) yields the following KKT necessary and sufficient optimality conditions:

$$ \begin{aligned} \frac{\partial L}{\partial\user2{c}_+}&=-\sum\limits_{i\in {{\mathcal{I}}}^+} \left({\varvec{\varphi}}(\user2{x}_i)-\user2{c}_+\right) +2\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j\left({\varvec{\varphi}}(\user2{x}_j) -\user2{c}_+\right)=0\\ &\quad\rightarrow\user2{c}_+=\frac{1}{l^+-2\sum\limits_{j\in {{\mathcal{I}}}^-} \alpha_j}\left(\sum\limits_{i\in {{\mathcal{I}}}^+}{\varvec{\varphi}}(\user2{x}_i) -2\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j{\varvec{\varphi}}(\user2{x}_j)\right),\\ \end{aligned} $$
(22)
$$ \frac{\partial L}{\partial R^2_+}=-\nu_1+\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j-s=0\rightarrow\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j\geq\nu_1, $$
(23)
$$ \frac{\partial L}{\partial \xi_j}=C_1-\alpha_j-r_j=0\rightarrow0\leq\alpha_j\leq C_1,\quad j\in {{\mathcal{I}}}^-, $$
(24)
$$ \left|\left|{\varvec{\varphi}}(\user2{x}_j)-\user2{c}_+\right|\right|^2\geq R_+^2-\xi_j,\quad j\in {{\mathcal{I}}}^-, $$
(25)
$$ \alpha_j\left(\left|\left|{\varvec{\varphi}}(\user2{x}_j)-\user2{c}_+\right|\right|^2- R_+^2+\xi_j\right)=0,\quad \alpha_j\geq0,\quad j\in {{\mathcal{I}}}^-, $$
(26)
$$ r_j\xi_j=0,\quad \xi_j\geq0,\quad r_j\geq0,\quad j\in {{\mathcal{I}}}^-, $$
(27)
$$ sR_+^2=0,\quad R_+^2\geq0,\quad s\geq0. $$
(28)

Since \(R_+^2>0\) holds at the optimum of problem (19) given a suitable parameter \(\nu_1, \) from (23) and (28) we have

$$ \sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j=\nu_1. $$
(29)

Substituting (29) into (22) leads to

$$ \user2{c}_+=\frac{1}{l^+-2\nu_1}\left(\sum\limits_{i\in {{\mathcal{I}}}^+}{\varvec{\varphi}}(\user2{x}_i)-2\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_j{\varvec{\varphi}}(\user2{x}_j)\right), $$
(30)

which yields as a by-product the condition \(2\nu_1<l^+\) for choosing \(\nu_1. \) Substituting (23), (24), and (30) into (21), we obtain the dual QPP of (19), which is:

$$ \begin{aligned} &\max\quad\frac{1}{2}\sum\limits_{i\in {{\mathcal{I}}}^+}k(\user2{x}_i,\user2{x}_i)-\frac{1}{2(l^+-2\nu_1)} \sum\limits_{i_1,i_2\in {{\mathcal{I}}}^+}k(\user2{x}_{i_1},\user2{x}_{i_2})-\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_{j}k(\user2{x}_{j},\user2{x}_{j})\\ &\quad -\frac{2}{l^+-2\nu_1}\sum\limits_{j_1,j_2\in {{\mathcal{I}}}^-}\alpha_{j_1}\alpha_{j_2}k(\user2{x}_{j_1},\user2{x}_{j_2})+\frac{2}{l^+-2\nu_1}\sum\limits_{i\in {{\mathcal{I}}}^+,j\in {{\mathcal{I}}}^-}\alpha_{j}k(\user2{x}_{j},\user2{x}_{i})\\ &\hbox{s.t.}\quad\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_{j}=\nu_1,\,0\leq\alpha_j\leq C_1,\,j\in {{\mathcal{I}}}^-. \\ \end{aligned} $$
(31)

Defining \(t_1=\frac{2}{l^+-2\nu_1}\) and discarding the constant terms in the objective function of (31), the dual QPP (31) can be simplified as follows:

$$ \begin{aligned} &\max\quad-t_1\sum\limits_{j_1,j_2\in {{\mathcal{I}}}^-}\alpha_{j_1}\alpha_{j_2}k (\user2{x}_{j_1},\user2{x}_{j_2})+\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_{j} \left(t_1\sum\limits_{i\in {{\mathcal{I}}}^+}k(\user2{x}_{j},\user2{x}_{i}) -k(\user2{x}_{j},\user2{x}_{j})\right)\\ &\hbox{s.t.}\quad\sum\limits_{j\in {{\mathcal{I}}}^-}\alpha_{j}=\nu_1,\,0\leq\alpha_j \leq C_1,\,j\in {{\mathcal{I}}}^-. \end{aligned} $$
(32)

Similarly, we consider (20) and obtain its dual QPP as

$$ \begin{aligned} &\max\quad -t_2\sum\limits_{i_1,i_2\in {{\mathcal{I}}}^+}\beta_{i_1}\beta_{i_2}k (\user2{x}_{i_1},\user2{x}_{i_2})+\sum\limits_{i\in {{\mathcal{I}}}^+}\beta_{i} \left(t_2\sum\limits_{j\in {{\mathcal{I}}}^-}k(\user2{x}_{i},\user2{x}_{j}) -k(\user2{x}_{i},\user2{x}_{i})\right)\\ &\hbox{s.t.}\quad \sum\limits_{i\in {{\mathcal{I}}}^+}\beta_{i}=\nu_2,\,0\leq\beta_i \leq C_2,\,i\in {{\mathcal{I}}}^+. \\ \end{aligned} $$
(33)

Here, \({t_2=\frac{2}{l^--2\nu_2},\,\beta_i,\,i\in {\mathcal{I}}^+}\) are the Lagrangian multipliers, and the center \(\user2{c}_-\) is

$$ \user2{c}_-=\frac{1}{l^--2\nu_2}\left(\sum\limits_{j\in {{\mathcal{I}}}^-}{\varvec{\varphi}}(\user2{x}_j)-2\sum\limits_{i\in {{\mathcal{I}}}^+}\beta_i{\varvec{\varphi}}(\user2{x}_i)\right). $$
(34)

Next, define the index sets

$$ I^+_R=\left\{i|0<\beta_i<C_2,\,i\in {{\mathcal{I}}}^+\right\}\quad \hbox{and}\,\,\; I^-_R=\left\{j|0<\alpha_j<C_1,\,j\in {{\mathcal{I}}}^-\right\}, $$
(35)

then, according to the KKT conditions, we derive the squared radii \(R^2_\pm\) as

$$ R^2_+=\frac{1}{|I^-_R|}\sum\limits_{j\in I^-_R}\left|\left|{\varvec{\varphi}}(\user2{x}_j)-\user2{c}_+\right|\right|^2, $$
(36)
$$ R^2_-=\frac{1}{|I^+_R|}\sum\limits_{i\in I^+_R}\left|\left|{\varvec{\varphi}}(\user2{x}_i)-\user2{c}_-\right|\right|^2. $$
(37)

Once the centers \(\user2{c}_\pm\) and squared radii \(R^2_\pm\) are known from (30), (34), (36), and (37), the two hyperspheres

$$ \left|\left|{\varvec{\varphi}}(\user2{x})-\user2{c}_+\right|\right|^2\leq R^2_+\quad \hbox{and}\,\,\;\left|\left|{\varvec{\varphi}}(\user2{x}) -\user2{c}_-\right|\right|^2\leq R^2_- $$
(38)

are obtained. A new test sample \(\user2{x}\) is assigned to class + or −, depending on which of the two hyperspheres given by (38) it lies closest to, i.e.,

$$ f(\user2{x})=\arg\min_{+,-} \left\{\frac{\left|\left|{\varvec{\varphi}}(\user2{x})-\user2{c}_+\right|\right|^2} {R^2_+},\frac{\left|\left|{\varvec{\varphi}}(\user2{x})-\user2{c}_-\right|\right|^2} {R^2_-}\right\}. $$
(39)
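Since both centers (30) and (34) are linear combinations of mapped training points, the squared distances in (39) can be computed entirely from kernel evaluations; a minimal sketch (each center is passed as a pair of points and expansion coefficients; all names and the toy usage are illustrative):

```python
def dist2_to_center(x, points, coeffs, kernel):
    """||phi(x) - c||^2 for a center c = sum_k coeffs[k] * phi(points[k]),
    expanded entirely in terms of kernel evaluations."""
    cross = sum(c * kernel(p, x) for c, p in zip(coeffs, points))
    cnorm = sum(ci * cj * kernel(pi, pj)
                for ci, pi in zip(coeffs, points)
                for cj, pj in zip(coeffs, points))
    return kernel(x, x) - 2.0 * cross + cnorm

def tsvh_predict(x, pos_center, neg_center, r2_pos, r2_neg, kernel):
    """Decision rule (39): each center is a (points, coeffs) pair, with the
    coefficients built from (30) or (34)."""
    d_pos = dist2_to_center(x, *pos_center, kernel) / r2_pos
    d_neg = dist2_to_center(x, *neg_center, kernel) / r2_neg
    return 1 if d_pos <= d_neg else -1

# Toy usage: centers phi([0.0]) and phi([4.0]) with unit radii, linear kernel.
dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
print(tsvh_predict([1.0], ([[0.0]], [1.0]), ([[4.0]], [1.0]), 1.0, 1.0, dot))  # -> 1
```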

Comparing the TSVM with the TSVH for the nonlinear case, the former requires the inversion of two matrices of size (l + 1) × (l + 1) in addition to solving two dual QPPs, whereas the latter only needs to solve two SVM-type dual QPPs without any matrix inversion in their objective functions, making the TSVH surpass the TSVM in learning speed for large-scale problems. Further, this difference means that efficient SVM learning algorithms, such as the SMO [13, 14] and geometric algorithms [18, 19], can be easily extended to the TSVH.

In the next section, we describe a learning algorithm for the TSVH based on Gilbert's algorithm.

4 Learning algorithm for TSVH

In this section, we describe a learning algorithm for the TSVH. Only (32) is considered here, because the treatment of (33) is analogous.

Without loss of generality, let us rewrite (32) in matrix form as follows:

$$ \begin{aligned} &\max\quad-t_1\user2{\alpha}^T{\bf K}_-\user2{\alpha}+t_1\user2{\alpha}^T\user2{u}\\ &\hbox{s.t.}\quad\user2{\alpha}^T\user2{e}_2=\nu_1,\,0\leq\user2{\alpha} \leq C_1\user2{e}_2,\\ \end{aligned} $$
(40)

where \(\user2{e}_2\) is a vector of ones of dimension \(l^-, \) the kernel matrix \({\bf K}_-=\left(k_{j_1j_2}\right)=\left(k(\user2{x}_{j_1},\user2{x}_{j_2})\right)\in \mathcal{R}^{l^-\times l^-}, \) and \(\user2{u}=(u_1;u_2;\ldots;u_{l^-})\) with \({u_j=\sum\nolimits_{i\in {\mathcal{I}}^+}k(\user2{x}_i,\user2{x}_j)-\frac{1}{t_1}k(\user2{x}_j,\user2{x}_j),\,j=1,\ldots,l^-, }\) so that the objective of (40) matches that of (32). Note that \(t_1\) is a positive constant (since \(2\nu_1<l^+\)); therefore, the problem (40) can be rewritten as

$$ \begin{aligned} &\min\quad\user2{\alpha}^T{\bf K}_-\user2{\alpha}-\user2{\alpha}^T\user2{u}\\ &\hbox{s.t.}\quad\user2{\alpha}^T\user2{e}_2=\nu_1,\,0\leq\user2{\alpha}\leq C_1\user2{e}_2.\\ \end{aligned} $$
(41)

According to the constraint \(\user2{\alpha}^T\user2{e}_2=\nu_1, \) the linear term in the objective function of (41) can be expressed as

$$ \begin{aligned} \user2{\alpha}^T\user2{u}&=\left(\frac{1}{\nu_1}\nu_1\right)\user2{\alpha}^T\user2{u} =\frac{1}{\nu_1}\left(\user2{\alpha}^T\user2{u}\nu_1\right) =\frac{1}{\nu_1}\left(\user2{\alpha}^T\user2{u}\user2{e}_2^T\user2{\alpha}\right)\\ &=\frac{1}{2\nu_1}\left(\left(\user2{\alpha}^T\user2{u} \user2{e}_2^T\user2{\alpha}\right)+ \left(\user2{\alpha}^T\user2{e}_2\user2{u}^T \user2{\alpha}\right)\right)\\ &=\frac{1}{2\nu_1}\user2{\alpha}^T\left(\user2{u}\user2{e}_2^T +\user2{e}_2\user2{u}^T\right)\user2{\alpha}. \\ \end{aligned} $$
(42)

Substituting (42) into the objective function of (41) leads to

$$ \begin{aligned} \user2{\alpha}^T{\bf K}_-\user2{\alpha}- \user2{\alpha}^T\user2{u}&=\user2{\alpha}^T{\bf K}_-\user2{\alpha} - \frac{1}{2\nu_1}\user2{\alpha}^T\left(\user2{u}\user2{e}_2^T +\user2{e}_2\user2{u}^T\right)\user2{\alpha} \\ &=\user2{\alpha}^T\left({\bf K}_{-}-\frac{1}{2\nu_1}\left(\user2{u} \user2{e}_2^T+\user2{e}_2\user2{u}^T\right)\right)\user2{\alpha}. \\ \end{aligned} $$

Denote \({\bf G}={\bf K}_--\frac{1}{2\nu_1}\left(\mathbf{u}\mathbf{e}_2^T+\mathbf{e}_2\mathbf{u}^T\right)=(g_{j_1j_2})\), where

$$ \begin{aligned} g_{j_1j_2}&=k_{j_1j_2}-\frac{1}{2\nu_1}(u_{j_1}+u_{j_2}) \\ &=k(\mathbf{x}_{j_1},\mathbf{x}_{j_2})+\frac{1}{2\nu_1t_1} \left(k(\mathbf{x}_{j_1},\mathbf{x}_{j_1})+k(\mathbf{x}_{j_2},\mathbf{x}_{j_2})\right) \\ &\quad-\frac{1}{2\nu_1}\sum\limits_{i\in {{\mathcal{I}}}^+} \left(k(\mathbf{x}_{i},\mathbf{x}_{j_1})+k(\mathbf{x}_{i},\mathbf{x}_{j_2})\right), \\ \end{aligned} $$

and set \(\alpha_j:=\alpha_j/\nu_1,\;j\in {\mathcal{I}}^-\); then problem (41) is equivalent to the following QPP:

$$ \begin{aligned} &\min\quad\boldsymbol{\alpha}^T{\bf G}\boldsymbol{\alpha}\\ &\hbox{s.t.}\quad\boldsymbol{\alpha}^T\mathbf{e}_2=1,\;0\leq\boldsymbol{\alpha}\leq \mu_1\mathbf{e}_2,\\ \end{aligned} $$
(43)

where \(\mu_1=C_1/\nu_1\).

To optimize (43), let us first consider the minimum-norm problem on the data set \(X=\{\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n\}\), which can be expressed as

$$ \begin{aligned} &\min\quad\left|\left|\sum\limits_{k=1}^n\alpha_k\mathbf{x}_k\right|\right|^2= \boldsymbol{\alpha}^T{\bf M}\boldsymbol{\alpha}\\ &\hbox{s.t.}\quad\boldsymbol{\alpha}^T\mathbf{e}=1,\;0\leq\boldsymbol{\alpha}\leq\mathbf{e}, \end{aligned} $$
(44)

where \(\mathbf{e}\) is a vector of ones of dimension n, and \({\bf M}=(m_{ij})\in\mathcal{R}^{n\times n}\) with \(m_{ij}=\mathbf{x}_i^T\mathbf{x}_j,\;i,j=1,\ldots,n\). For this problem, Gilbert [39] proposed an iterative geometric method, known as Gilbert's algorithm, which obtains an \(\varepsilon\)-optimal solution of (44) represented in terms of the extreme points of the convex hull of X. Gilbert's algorithm consists of the following three steps:

Algorithm 1: Gilbert’s algorithm

\(1^{\circ}\) Initialization: Set the initial point \(\mathbf{x}\) to any point of X, e.g., \(\mathbf{x}=\mathbf{x}_1\);

\(2^{\circ}\) Stopping condition: Find the point \(\mathbf{x}_r=\arg\min_{\mathbf{x}_k \in X}\left\{\langle \mathbf{x}_{k},\mathbf{x}\rangle\right\}\). If the \(\varepsilon\)-optimality condition \(||\mathbf{x}||^2-\langle\mathbf{x}_{r},\mathbf{x}\rangle<\varepsilon\) holds, then the point \(\mathbf{x}\) is the \(\varepsilon\)-optimal solution. Otherwise, go to Step \(3^{\circ}\);

\(3^{\circ}\) Adaptation: Set \(q=\frac{\langle\mathbf{x},\,\mathbf{x}-\mathbf{x}_r\rangle}{||\mathbf{x}-\mathbf{x}_r||^2}\) if \(\mathbf{x}-\mathbf{x}_r\neq{0}\), otherwise q = 0; set \(\mathbf{x}=\mathbf{x}+q(\mathbf{x}_r-\mathbf{x})\). Continue with Step \(2^{\circ}\).
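A direct transcription of Algorithm 1 for problem (44), sketched in Python with NumPy (the tolerance and iteration cap are illustrative; the step size is additionally clamped to 1 so the iterate stays inside the convex hull):

```python
import numpy as np

def gilbert(X, eps=1e-8, max_iter=1000):
    """Gilbert's algorithm: eps-optimal minimum-norm point of conv(X).
    X is an (n, d) array whose rows are the points x_1, ..., x_n."""
    x = X[0].copy()                      # 1: initialization, x = x_1
    for _ in range(max_iter):
        r = int(np.argmin(X @ x))        # point minimizing <x_k, x>
        if x @ x - X[r] @ x < eps:       # 2: eps-optimality condition
            break
        d = x - X[r]                     # 3: adaptation step
        q = min(1.0, (x @ d) / (d @ d)) if d @ d > 0 else 0.0
        x = x + q * (X[r] - x)
    return x
```

On the two points (1, 0) and (0, 1), a single adaptation step already reaches (0.5, 0.5), the closest point of their convex hull to the origin.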

Obviously, (43) is similar to (44). Thus, we consider solving (43) with Gilbert's algorithm. We first introduce the notion of the reduced convex hull (RCH) [18, 40] of a point set:

Definition

Let X be a set of n distinct points. For a real number 0 < μ < 1, the reduced convex hull (RCH) of X, denoted RCH(X, μ), is defined as

$$ {\text{RCH}}(X,\mu)=\left\{\mathbf{x}\,\left|\,\mathbf{x}=\sum\limits_{i=1}^na_i \mathbf{x}_i,\;\mathbf{x}_i\in X,\;\sum\limits_{i=1}^na_i=1,\; 0\leq a_i\leq\mu\right.\right\}. $$
(45)

Therefore, (43) can be regarded as the minimum-norm problem on \(\mathrm{RCH}(D_-,\mu_1)\) under an unknown distance. Fortunately, this distance need not be made explicit, since Gilbert's algorithm uses only the inner product \(\langle\cdot,\cdot\rangle\). For brevity, we use \(\langle\mathbf{u}_{j_1},\mathbf{u}_{j_2}\rangle_{\bf G}\) to denote the inner product of \({\boldsymbol{\varphi}}(\mathbf{x}_{j_1})\) and \({\boldsymbol{\varphi}}(\mathbf{x}_{j_2})\) under this unknown distance \(||\cdot||_{\bf G}\), i.e., \(g_{j_1j_2}=\langle\mathbf{u}_{j_1},\mathbf{u}_{j_2}\rangle_{\bf G}\), where \(\mathbf{u}={\boldsymbol{\varphi}}(\mathbf{x})\). Mavroforakis and Theodoridis [18] pointed out that the extreme points of RCH(X, μ) are combinations of points in X.

Proposition

The extreme points of RCH(X, μ) have coefficients \(a_i\) belonging to the set \(S=\{0,\mu,1-\lfloor1/\mu\rfloor\mu\}\); that is, each extreme point of RCH(X, μ) is a convex combination of \(m=\lceil1/\mu\rceil\) different points in X. Furthermore, if \(1/\mu=\lceil1/\mu\rceil\), then all \(a_i=\mu\); otherwise, \(a_i=\mu\) for \(i=1,\ldots,m-1\), and \(a_m=1-\lfloor1/\mu\rfloor\mu\).

This proposition provides a necessary but not sufficient condition for determining the extreme points of the RCH, so the number of points satisfying the condition is clearly larger than the number of extreme points, and it is impractical to inspect every candidate when searching for the closest point. In fact, the geometric algorithm need not consider all possible extreme points of the RCH: it suffices to compute the projections of all negative points in \({\mathcal{I}}^-\) and choose the \(\lceil1/\mu_1\rceil\) points with the smallest projections to update the current point.
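The candidate construction just described can be sketched as follows (the function name and the placeholder projection values are ours):

```python
import math
import numpy as np

def rch_candidate(projections, mu):
    """Coefficients b of the RCH extreme-point candidate minimizing the
    projection: weight mu on the m-1 points with the smallest projections
    and the remainder lambda = 1 - floor(1/mu)*mu on the m-th smallest
    (all m points get weight mu when 1/mu is an integer)."""
    m = math.ceil(1 / mu)
    lam = 1 - math.floor(1 / mu) * mu
    order = np.argsort(projections)       # smallest projections first
    b = np.zeros(len(projections))
    b[order[:m - 1]] = mu
    b[order[m - 1]] = lam if lam > 0 else mu
    return b
```

For μ = 0.4 and projections (0.5, 0.1, 0.9, 0.3), the candidate puts weight 0.4 on the two smallest projections and the remainder 0.2 on the third smallest.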

Based on the above discussion on RCH, we describe the learning algorithm for TSVH based on the Gilbert’s one as follows.

Algorithm 2: Learning algorithm for TSVH

\(1^{\circ}\) Initialization: Set the parameters \(C_1\), \(\nu_1\), and the kernel; set \(\mu_1=\min\{1,C_1/\nu_1\}\), \(\lambda_1=1-\lfloor1/\mu_1\rfloor\mu_1\), and \(m_1=\lceil1/\mu_1\rceil\); set the initial point \(\mathbf{u}\) as the centroid of \(\mathrm{RCH}(D_-,\mu_1)\), i.e., set \(\alpha_j=1/|{\mathcal{I}}^-|,\;j\in {\mathcal{I}}^-\).

\(2^{\circ}\) Stopping condition: Find the point \(\mathbf{u}_{r}^{opt}=\arg\min_{r}\left\{\langle \mathbf{u}_{r},\mathbf{u}\rangle_{\bf G}\right\}\), where \(\mathbf{u}_{r}=\sum\nolimits_{j\in {\mathcal{I}}^-}b_j\mathbf{u}_j\), \(b_j\in\{0,\lambda_1,\mu_1\}\), \(\sum\nolimits_{j\in {\mathcal{I}}^-}b_j=1\). If the \(\varepsilon\)-optimality condition \(||\mathbf{u}||_{\bf G}^2-\langle\mathbf{u}_{r}^{opt},\mathbf{u}\rangle_{\bf G}<\varepsilon\) holds, then the point \(\mathbf{u}\) is the \(\varepsilon\)-optimal solution. Otherwise, go to Step \(3^{\circ}\).

\(3^{\circ}\) Adaptation: Set \(q=\frac{\langle\mathbf{u},\,\mathbf{u}-\mathbf{u}_{r}^{opt} \rangle_{\bf G}}{||\mathbf{u}-\mathbf{u}_{r}^{opt}||_{\bf G}^2}\) if \(\mathbf{u}-\mathbf{u}_{r}^{opt}\neq{0}\), otherwise q = 0; update \(\mathbf{u}=\mathbf{u}+q(\mathbf{u}_{r}^{opt}-\mathbf{u})\), i.e., set \(\alpha_j=\alpha_j+q(b_j-\alpha_j),\;j\in {\mathcal{I}}^-\). Continue with Step \(2^{\circ}\).

This algorithm has almost the same complexity as Gilbert's algorithm; the same caching scheme can be used, with only \(\mathcal{O}(l^\pm)\) storage requirements.
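Putting the pieces together, Algorithm 2 can be sketched entirely in coefficient space, since every inner product reduces to an entry of \(\bf G\): \(\langle\mathbf{u}_j,\mathbf{u}\rangle_{\bf G}=({\bf G}\boldsymbol{\alpha})_j\) and \(||\mathbf{u}||_{\bf G}^2=\boldsymbol{\alpha}^T{\bf G}\boldsymbol{\alpha}\). The sketch below assumes \(\bf G\) is precomputed and symmetric positive semidefinite; names and defaults are ours:

```python
import math
import numpy as np

def tsvh_gilbert(G, mu, eps=1e-8, max_iter=1000):
    """Gilbert-style learning for problem (43): minimize a^T G a over the
    simplex with box bound mu (a sketch, not the authors' exact code)."""
    n = G.shape[0]
    m, lam = math.ceil(1 / mu), 1 - math.floor(1 / mu) * mu
    a = np.full(n, 1.0 / n)                    # 1: start at the centroid
    for _ in range(max_iter):
        proj = G @ a                           # <u_j, u>_G for each j
        order = np.argsort(proj)
        b = np.zeros(n)                        # RCH extreme-point candidate
        b[order[:m - 1]] = mu
        b[order[m - 1]] = lam if lam > 0 else mu
        if a @ proj - b @ proj < eps:          # 2: eps-optimality in ||.||_G
            break
        d = a - b                              # 3: adaptation step
        dGd = d @ G @ d
        q = min(1.0, (a @ G @ d) / dGd) if dGd > 0 else 0.0
        a = a + q * (b - a)
    return a
```

With \(\mu=1\) the reduced hull is the full convex hull; e.g., for G = diag(2, 1) the minimizer of \(a_1^2\cdot2+a_2^2\) on the simplex is (1/3, 2/3), which the sketch reaches in one adaptation step.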

5 Experiments

To validate the effectiveness of the proposed method, we compare the performances of TSVH, TSVM, and SVM on several data sets. All algorithms are implemented in Matlab 7.8 [41] on Windows XP, running on a PC with an Intel(R) Core(TM)2 Duo CPU (2.26 GHz) and 2 GB of RAM. We evaluate them on several artificial and UCI benchmark data sets that are commonly used for testing machine learning algorithms. Note that we only consider the Gaussian kernel in the experiments. To compare these algorithms fairly, we report both the accuracy and the CPU time for learning. Another important issue is parameter selection. We select the parameters C and ν from the set \(\{2^i\cdot l^\pm\mid i=-9,-8,\ldots,10\}\) and the kernel parameter γ from the set \(\{2^i\mid i=-9,-8,\ldots,10\}\) by tuning on a subset comprising roughly 10–30 % of the samples chosen at random. For the learning of the SVM and TSVM, we employ geometric methods to learn their separating hyperplanes; see Refs. [18, 19, 27]. In addition, the kernel matrices (or the inverses of the kernel matrices) of these algorithms are cached in memory before learning.

5.1 Toy examples

To graphically illustrate the effectiveness of our TSVH, we first test its ability to learn Ripley's synthetic data set [42] in two dimensions. In this data set, both class "+" and class "−" are generated from mixtures of two Gaussians, with the classes overlapping to the extent that the Bayes accuracy is around 92 %.

Figure 1 shows the classification results of SVM, TSVM, and TSVH with Gaussian kernels on this Ripley's data set. It can be seen that the hyperspheres of TSVH effectively cover the two classes of samples, and the corresponding separating hyperplane obtains the best classification result. In contrast, the nonparallel hyperplanes of TSVM do not effectively reflect the distributions of the two classes; they merely pass through as many samples of the corresponding classes as possible. Table 1 also lists the test accuracies of these algorithms. Our TSVH obtains the best accuracy (about 92.1 %) among these methods, which is close to the Bayes classification result, whereas the TSVM obtains the worst test accuracy (about 90.2 %). This confirms the above analysis of TSVM: its nonparallel hyperplanes do not effectively describe the characteristics of the two classes of samples. A likely reason for the good performance of TSVH is that its hyperspheres depict the class information of the data more effectively.

Fig. 1
figure 1

Classification results of SVM (a), TSVM (b), and TSVH (c) on Ripley's data set. The two classes of training samples are marked by "×" and "+," the separating hyperplanes are shown as solid curves, and the nonparallel hyperplanes/hyperspheres of TSVM/TSVH are shown as dashed curves

Table 1 Performance comparisons on Ripley and checkerboard data sets

Another artificial example is the checkerboard problem, which consists of uniformly distributed red and blue points drawn from the 4 × 4 alternately colored squares of a checkerboard in \(\mathcal{R}^2\). This is a challenging test case for nonlinear classifiers. In the simulations, we consider one non-overlapping and two overlapping cases of the checkerboard problem; that is, three sets of 800 randomly generated training points (Class +: 400, Class −: 400) on a two-dimensional checkerboard of 4 × 4 cells are used. Each sample attribute ranges from 0 to 4, and the margins are 0, −0.05, and −0.1, respectively (a negative value indicates overlap between the classes, i.e., overlap of the cells); the 3,200 test samples are randomly generated with margin 0.
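A generator for the training data just described can be sketched as follows; the treatment of a negative margin (relabeling points within |margin| of a cell boundary at random) is our own plausible reading of the overlap construction, not necessarily the authors' generator:

```python
import numpy as np

def checkerboard(n, margin=0.0, cells=4, rng=None):
    """Sample n labeled points on a cells-by-cells checkerboard in [0, cells]^2.
    Labels follow the parity of the cell indices; margin > 0 leaves a gap
    around the grid lines, margin < 0 relabels boundary points at random
    (one plausible reading of the overlapping construction)."""
    rng = np.random.default_rng(rng)
    pts, labels = [], []
    while len(pts) < n:
        p = rng.uniform(0, cells, size=2)
        d = float(np.min(np.abs(p - np.round(p))))  # distance to nearest grid line
        if margin > 0 and d < margin:
            continue                                # enforce the gap between classes
        y = 1 if int(p[0]) % 2 == int(p[1]) % 2 else -1
        if margin < 0 and d < -margin:
            y = 1 if rng.random() < 0.5 else -1     # overlapping cells
        pts.append(p)
        labels.append(y)
    return np.array(pts), np.array(labels)
```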

Figures 2 and 3 show the simulation results of these three algorithms on the first two training sets (the third is omitted for space). It can be seen that our TSVH still obtains the best classification results, while TSVM obtains the worst results among these methods, especially in the overlapping case. Specifically, the two hyperspheres of TSVH effectively cover the corresponding classes of samples, while the hyperplanes of TSVM do not reflect the structure of the samples: they still pass through as many samples of the corresponding classes as possible, which leads TSVM to the worst separating hyperplane. However, a pair of smoother hyperplanes can be obtained for the TSVM if we add the regularization terms \(\lambda_i(||\mathbf{w}_\pm||^2+b_\pm^2)/{2},\;\lambda_i>0,\;i=1,2\) to the objective functions of the TSVM. Table 1 also gives the test accuracies and learning CPU times of these methods on these cases over 30 independent runs, which effectively confirm our observations. As for the learning CPU time, Table 1 shows that the TSVH and TSVM are about four times faster than the SVM, while the TSVH is slightly faster than the TSVM. Moreover, if we take into account the extra time for matrix inversions in the TSVM, the learning time of the TSVM becomes clearly larger than that of the TSVH. On the other hand, we should point out that the TSVH introduces an extra parameter compared with the TSVM, so the total CPU time required for cross-validation may be much larger than for the TSVM and SVM. Hence, an important direction for further work is a suitable method for determining the parameters of TSVH.

Fig. 2
figure 2

Classification results of SVM (a), TSVM (b), and TSVH (c) with Gaussian kernel on checkerboard problem with margin 0. The separating hyperplanes are shown with solid curves, and the nonparallel hyperplanes/hyperspheres of TSVM/TSVH are shown with dashed curves

Fig. 3
figure 3

Classification results of SVM (a), TSVM (b), and TSVH (c) with Gaussian kernel on checkerboard problem with margin −0.05. The separating hyperplanes are shown with solid curves, and the nonparallel hyperplanes/hyperspheres of TSVM/TSVH are shown with dashed curves

5.2 Benchmark data sets

To further test the performance of TSVH, we run these models on several publicly available benchmark data sets that are commonly used in testing machine learning algorithms, and investigate the results in terms of accuracy and training CPU time. We use the one-vs-one method to handle the multi-class data sets and take the average CPU time as the learning time. Table 2 lists the descriptions of these benchmark data sets. Note that we use tenfold cross-validation on the data sets in Table 2.

Table 2 Descriptions of the UCI benchmark data sets

Table 3 lists the average learning results of SVM, TSVM, and TSVH on these benchmark data sets with tenfold cross-validation, including the accuracies and learning CPU times. It can be seen that the TSVH obtains generalization comparable to the SVM and TSVM. As for the learning CPU time, both the TSVH and TSVM are about four times faster than the SVM, because they solve two smaller-sized QPPs instead of the single QPP of the classical SVM. Moreover, as in the artificial examples, the learning time of the TSVH is similar to that of the TSVM, for the same reason. However, the CPU time for model selection of the TSVH is larger than for the TSVM and SVM, since an extra penalty factor is introduced.

Table 3 Performance comparisons on the UCI benchmark data sets

We now use the Wilcoxon signed-ranks test [43], a simple yet safe and robust nonparametric test for statistical comparisons of classifiers, to compare the generalization performance of the TSVH, TSVM, and SVM. Let \(d_i\) be the difference between the test accuracies of the two classifiers on the ith of N data sets. The differences are ranked according to their absolute values. Let \(r^+\) be the sum of ranks for the data sets on which the second algorithm outperformed the first, and \(r^-\) the sum of ranks for the opposite. Ranks of \(d_i=0\) are split evenly among the sums; if there is an odd number of them, one is ignored. Let T be the smaller of the sums, \(T=\min(r^+,r^-)\). Then the statistic

$$ Z=\frac{T-N(N+1)/4}{\sqrt{N(N+1)(2N+1)/24}} $$

is approximately normally distributed under the null hypothesis that the two classifiers perform equally well.
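The statistic can be computed directly from the per-data-set accuracy differences; a minimal sketch of the procedure just described (tied absolute differences receive arbitrary consecutive ranks here, where average ranks would be more standard):

```python
import numpy as np

def wilcoxon_z(acc_first, acc_second):
    """Wilcoxon signed-ranks Z statistic for two classifiers' accuracies
    over N data sets, splitting ranks of zero differences evenly between
    the two sums (ignoring one zero if their count is odd)."""
    d = np.asarray(acc_second, float) - np.asarray(acc_first, float)
    if np.count_nonzero(d == 0) % 2 == 1:
        d = np.delete(d, np.flatnonzero(d == 0)[0])  # ignore one zero
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0  # ranks of |d_i|
    half_zero = 0.5 * ranks[d == 0].sum()
    r_plus = ranks[d > 0].sum() + half_zero   # second algorithm better
    r_minus = ranks[d < 0].sum() + half_zero  # first algorithm better
    t = min(r_plus, r_minus)
    return (t - n * (n + 1) / 4) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
```

With α = 0.05, |Z| > 1.96 rejects the hypothesis that the two classifiers perform equally well.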

Table 4 shows the comparisons on the benchmark data sets between TSVH and TSVM, TSVH and SVM, and TSVM and SVM, respectively. It can be seen that, for α = 0.05 (i.e., \(Z_{\alpha/2}=-1.96\)), our TSVH achieves clearly better generalization than the TSVM and SVM, while the TSVM obtains generalization comparable to the SVM. Therefore, our TSVH is a better learning algorithm for classification than the SVM and TSVM, which confirms that the hyperspheres of TSVH describe the data information more effectively than the nonparallel hyperplanes of TSVM.

Table 4 Statistics comparisons of accuracies of TSVH, TSVM, and SVM

6 Conclusions

In this paper, we have proposed a novel SVM approach to data classification, termed the TSVH. The TSVH is motivated by the SVDD, a special case of the OCSVM, and by the TSVM: it solves two QPPs of smaller sizes instead of one large QPP as in the classical SVM, where each QPP is based on the SVDD and requires the samples of one class to be covered by a hypersphere as much as possible while the samples of the other class are kept as far as possible from this hypersphere. Solving two smaller-sized QPPs, each roughly of size l/2, makes the TSVH almost four times faster than a classical SVM classifier. Compared with the TSVM, the TSVH has some clear merits. First, it does not need any matrix inversions in its pair of dual QPPs, whereas the TSVM needs the inverses of two matrices of size (l + 1) × (l + 1); this makes the TSVH more suitable for large-scale problems in combination with efficient SVM algorithms. Second, the TSVH uses two hyperspheres rather than hyperplanes to describe the samples of the two classes, which makes it more suitable for general cases; for instance, the hyperplanes of the TSVM cannot effectively describe the samples in Ripley's data set. Computational comparisons with the TSVM and SVM in terms of classification accuracy and learning time have shown that the proposed TSVH is not only faster than the other two methods, but also obtains comparable generalization.

Further work includes validating the performance of TSVH on more real classification problems, such as face recognition, image classification, and text categorization. Another important direction is a theoretical analysis of TSVH, since TSVH is an indirect classifier. A third is to find an efficient prediction algorithm for TSVH, because of the loss of sparsity in TSVH. Last, TSVH introduces more parameters, which lowers the efficiency of model selection; thus, it is necessary to find an efficient method for determining the parameters of TSVH.