1 Introduction

Image classification, which assigns one or more category labels to an image, is one of the most fundamental problems in computer vision and pattern recognition. It is, however, a complex process that may be affected by many factors, and it involves several stages such as preprocessing, sample selection, feature extraction and classifier design, among which feature extraction and classifier design are especially important. For feature extraction, many different methods are available, such as PCA [1], LDA [2], 2DPCA [3], STL [4] and GTDA [5]. Since varied features describe different properties of the same image, multi-view learning [6, 7] and multiple feature fusion [8, 9] have been widely studied in order to make full use of various features. In classifier design, the support vector machine (SVM) [10, 11], as a powerful tool for pattern classification and regression, has been widely applied to a variety of real-world problems such as pattern recognition [12, 13], bioinformatics [14], text categorization [15, 16] and financial applications [17]. Moreover, several related works on SVM are also worthy of attention [18–20]. In order to improve the performance of SVM-based relevance feedback in image retrieval, asymmetric bagging and random subspace for SVM were proposed [18]. To address the problems in Laplacian-regularization-based (LR-based) image annotation, Liu and Tao [19] proposed a multiview SVM framework for image annotation. Furthermore, aiming to solve the class separation problem, Tao et al. [20] discussed subspace learning for SVM. However, SVM is computationally expensive because its solution requires solving a quadratic programming problem (QPP) with numerous constraints, especially when handling large-scale data. In addition, SVM seeks only a single separating hyperplane, so it cannot cope well with complex XOR problems.

Recently, in order to improve the generalization ability and computational speed of SVM, multi-hyperplane support vector machines, which seek two nonparallel hyperplanes instead of the single hyperplane of SVM, have been widely investigated. In 2006, Mangasarian and Wild [21] proposed the generalized eigenvalue proximal support vector machine (GEPSVM), originally motivated by solving the XOR problem and reducing the computing time of SVM. In this approach, the training points of each class are proximal to one of two nonparallel hyperplanes. The following year, Jayadeva, Khemchandani and Chandra [22] proposed another nonparallel hyperplane classifier named the twin support vector machine (TWSVM), which seeks two nonparallel hyperplanes such that each hyperplane is close to one of the two classes and is as far as possible from the other. Experimental results in [22] and [23] have shown the effectiveness of TWSVM over both standard SVC and GEPSVM; at the same time, TWSVM can also handle XOR problems well. Since then, methods for constructing nonparallel-hyperplane support vector machines have been extensively studied, such as the twin bounded support vector machine (TBSVM) [24], TPMSVM [25], RTSVM [26], STSVM [27], PPSVC [28] and NHSVM [29]. Similarly, methods that seek two projection directions [30–34] or two hyperspheres [35, 36], such as MVSVM [30], PTSVM [31] and TSVH [36], have also been widely investigated.

However, the above nonparallel hyperplane methods [22–24, 29] almost all seek two nonparallel hyperplanes \(w_1^T x+b_1 =0\) and \(w_2^T x+b_2 =0\) by minimizing \(||Aw_1 +e_1 b_1 ||_2^2 \) or \(||Bw_2 +e_2 b_2 ||_2^2 \). Inspired by [37], in this paper we propose a novel twin support vector machine that seeks two hyperplanes \(w_1^T x+b_1 =1\) and \(w_2^T x+b_2 =-1\) by minimizing \(||Aw_1 +e_1 b_1 -e_1 ||_2^2 \) or \(||Bw_2 +e_2 b_2 +e_2 ||_2^2 \), so that the two hyperplanes are separated as much as possible. Furthermore, a regularization term similar to that in [24] is used to overcome the overfitting and singularity problems. Thus, the structural risk minimization principle is implemented.

The rest of this paper is organized as follows. Section 2 gives a brief overview of TWSVM and TBSVM. Section 3 proposes our linear novel twin support vector machine (NTSVM) and extends it to the nonlinear case. Experimental results are described in Sect. 4, and conclusions are given in Sect. 5.

2 Brief Reviews of TWSVM and TBSVM

2.1 TWSVM

Consider a binary classification problem with \(m_1 \) data points in the positive class and \(m_2 \) data points in the negative class. Suppose that the data points in the positive class are denoted by a matrix \(A\in R^{m_1 \times n}\) and the data points in the negative class are denoted by a matrix \(B\in R^{m_2 \times n}\), where \(A_i \), the i-th row of A, is a row vector in \(R^{n}\). For the linear case, TWSVM [22] determines two nonparallel hyperplanes:

$$\begin{aligned} w_1^T x+b_1 =0 \quad \hbox {and}\quad w_2^T x+b_2 =0 \end{aligned}$$
(1)

where \(w_i \in R^{n},b_i \in R,i=1,2\). TWSVM seeks a pair of nonparallel hyperplanes such that each hyperplane is close to the data points of one class and at a distance of at least one from the data points of the other class. The primal problems of TWSVM are

$$\begin{aligned} \min _{w_1 ,b_1 ,\xi _2} \frac{1}{2}\bigg |\bigg |Aw_1 +e_1 b_1 \bigg |\bigg |_2^2 +c_1 e_2^T \xi _2 \quad s.t.\;\;-(Bw_1 +e_2 b_1)+\xi _2 \ge e_2 , \quad \xi _2 \ge 0 \end{aligned}$$
(2)
$$\begin{aligned} \min _{w_2 ,b_2 ,\xi _1} \frac{1}{2}\bigg |\bigg |Bw_2 +e_2 b_2 \bigg |\bigg |_2^2 +c_2 e_1^T \xi _1 \quad s.t.\;\;(Aw_2 +e_1 b_2)+\xi _1 \ge e_1 , \quad \xi _1 \ge 0 \end{aligned}$$
(3)

where \(c_1 \) and \(c_2 \) are positive penalty parameters, \(\xi _1\) and \(\xi _2 \) are nonnegative slack variables, and \(e_1 \) and \(e_2 \) are vectors of ones of appropriate dimensions. By introducing Lagrangian multipliers, the Wolfe duals of the quadratic programming problems (QPPs) (2) and (3) can be obtained as follows, respectively.

$$\begin{aligned} \max _\alpha \;e_2^T \alpha -\frac{1}{2}\alpha ^{T}G(H^{T}H)^{-1}G^{T}\alpha \quad s.t.\;\;0\le \alpha \le c_1 e_2 \end{aligned}$$
(4)
$$\begin{aligned} \max _\gamma \;e_1^T \gamma -\frac{1}{2}\gamma ^{T}H(G^{T}G)^{-1}H^{T}\gamma \quad s.t.\;\;0\le \gamma \le c_2 e_1 \end{aligned}$$
(5)

where \(G=[B\;e_2] ,\;H=[A\;e_1]\) and \(\alpha \in R^{m_2} ,\gamma \in R^{m_1} \) are Lagrangian multipliers.

The nonparallel hyperplanes (1) can be obtained from the solutions \(\alpha \) and \(\gamma \) to the optimization problems (4) and (5) by

$$\begin{aligned} v_1= & {} -(H^{T}H)^{-1}G^{T}\alpha , \quad \hbox { where } \quad v_1 =\bigg [w_1^T \;b_1\bigg ]^{T}, \end{aligned}$$
(6)
$$\begin{aligned} v_2= & {} (G^{T}G)^{-1}H^{T}\gamma , \quad \hbox { where } \quad v_2 =\bigg [w_2^T \;b_2\bigg ]^{T}, \end{aligned}$$
(7)

A new data point \(x\in R^{n}\) is then assigned to the positive class \(W_1 \) or negative class \(W_2 \), depending on which of the hyperplanes in (1) it lies closer to, i.e.

$$\begin{aligned} x\in W_i ,\quad i= \arg \min _{i=1,2} \;\bigg |w_i^T x+b_i \bigg | \end{aligned}$$
(8)

where \(|\cdot |\) is the absolute value.
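To make the recovery steps (6)–(8) concrete, the following NumPy sketch (an illustration only, not the original MATLAB implementation) rebuilds the two hyperplanes from given dual solutions and classifies new points; the dual vectors \(\alpha\) and \(\gamma\) are assumed to have been obtained from (4) and (5) by any box-constrained QP solver.

```python
import numpy as np

def twsvm_planes(A, B, alpha, gamma):
    """Recover v1 = [w1; b1] and v2 = [w2; b2] from TWSVM dual solutions,
    following (6) and (7). A, B hold the positive / negative training points
    row-wise; alpha, gamma are the dual vectors of (4) and (5).
    (In practice H'H or G'G may be ill-conditioned; TBSVM below adds a ridge.)"""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    H = np.hstack([A, e1])                          # H = [A  e1]
    G = np.hstack([B, e2])                          # G = [B  e2]
    v1 = -np.linalg.solve(H.T @ H, G.T @ alpha)     # (6)
    v2 = np.linalg.solve(G.T @ G, H.T @ gamma)      # (7)
    return v1, v2

def twsvm_predict(X, v1, v2):
    """Assign each row of X to the closer hyperplane, cf. (8)."""
    w1, b1 = v1[:-1], v1[-1]
    w2, b2 = v2[:-1], v2[-1]
    d1 = np.abs(X @ w1 + b1)
    d2 = np.abs(X @ w2 + b2)
    return np.where(d1 <= d2, +1, -1)
```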

2.2 TBSVM

Following the basic idea of SVC and TWSVM, TBSVM [24] also seeks a pair of nonparallel hyperplanes such that each hyperplane is proximal to the data points of one class and far from the data points of the other class. The main difference between TBSVM and TWSVM is that the structural risk minimization principle is implemented in TBSVM by adding a regularization term. The primal problems of TBSVM are

$$\begin{aligned}&\min _{w_1 ,b_1 ,\xi _2} \;\frac{1}{2}c_3 \bigg (||w_1 ||^{2}+b_1^2\bigg )+\frac{1}{2}\bigg |\bigg |Aw_1 +e_1 b_1 \bigg |\bigg |_2^2 +c_1 e_2^T \xi _2 \nonumber \\&\quad s.t.\;\;-(Bw_1 +e_2 b_1)+\xi _2 \ge e_2 , \quad \xi _2 \ge 0 \end{aligned}$$
(9)
$$\begin{aligned}&\min _{w_2 ,b_2 ,\xi _1} \;\frac{1}{2}c_4 \bigg (||w_2 ||^{2}+b_2^2\bigg )+\frac{1}{2}\bigg |\bigg |Bw_2 +e_2 b_2 \bigg |\bigg |_2^2 +c_2 e_1^T \xi _1 \nonumber \\&\quad s.t.\;\;(Aw_2 +e_1 b_2)+\xi _1 \ge e_1 , \quad \xi _1 \ge 0 \end{aligned}$$
(10)

where \(c_1 \) and \(c_2 \) are positive penalty parameters, \(c_3\) and \(c_4 \) are positive trade-off parameters, \(\xi _1 \) and \(\xi _2\) are nonnegative slack variables, and \(e_1 \) and \(e_2 \) are vectors of ones of appropriate dimensions. By introducing Lagrangian multipliers, the Wolfe duals of the QPPs (9) and (10) can be obtained as follows.

$$\begin{aligned} \mathop {\max } \limits _\alpha \;e_2^T \alpha -\frac{1}{2}\alpha ^{T}G\bigg (H^{T}H+c_3 I\bigg )^{-1}G^{T}\alpha \quad s.t.\;\;0\le \alpha \le c_1 e_2 \end{aligned}$$
(11)
$$\begin{aligned} \mathop {\max } \limits _\gamma \;e_1^T \gamma -\frac{1}{2}\gamma ^{T}H\bigg (G^{T}G+c_4 I\bigg )^{-1}H^{T}\gamma \quad s.t.\;\;0\le \gamma \le c_2 e_1 \end{aligned}$$
(12)

where \(G=[B\;e_2],\;H=[A\;e_1]\), \(I\) is an identity matrix, and \(\alpha \in R^{m_2}, \gamma \in R^{m_1}\) are Lagrangian multipliers.

The nonparallel hyperplanes (1) can be obtained from the solutions \(\alpha \) and \(\gamma \) to the optimization problems (11) and (12) by

$$\begin{aligned} v_1= & {} -\bigg (H^{T}H+c_3 I\bigg )^{-1}G^{T}\alpha , \quad \hbox { where } \quad v_1 =\bigg [w_1^T \;b_1\bigg ]^{T} \end{aligned}$$
(13)
$$\begin{aligned} v_2= & {} \bigg (G^{T}G+c_4 I\bigg )^{-1}H^{T}\gamma , \quad \hbox { where } \quad v_2 =\bigg [w_2^T \;b_2\bigg ]^{T} \end{aligned}$$
(14)

Once the solutions \([w_1^T \;b_1]\) and \([w_2^T \;b_2]\) to (11) and (12) are obtained from (13) and (14), a new data point \(x\in R^{n}\) is assigned to the positive class \(W_1 \) or negative class \(W_2 \), depending on which of the hyperplanes in (1) it lies closer to, i.e.

$$\begin{aligned} x\in W_i ,\quad i=\mathop {\arg \min } \limits _{i=1,2} \;\frac{|w_i^T x+b_i |}{||w_i ||}\; \end{aligned}$$
(15)

where \(|\cdot |\) is the absolute value.
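Compared with the TWSVM recovery step sketched above, only the linear systems change: (13) and (14) add the ridge terms \(c_3 I\) and \(c_4 I\), and the decision rule (15) additionally normalizes the distance by \(||w_i||\). A hedged sketch of the modified recovery step:

```python
import numpy as np

def tbsvm_planes(H, G, alpha, gamma, c3, c4):
    """Recover the TBSVM augmented vectors via (13) and (14).
    H = [A e1], G = [B e2]; alpha, gamma solve the duals (11) and (12)."""
    I = np.eye(H.shape[1])
    v1 = -np.linalg.solve(H.T @ H + c3 * I, G.T @ alpha)   # (13)
    v2 = np.linalg.solve(G.T @ G + c4 * I, H.T @ gamma)    # (14)
    return v1, v2
```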

3 Novel Twin Support Vector Machines

3.1 Linear NTSVM

Now, let us construct our linear NTSVM. Following the idea of PSVM [37] and TBSVM [24], we propose our method, which seeks two nonparallel hyperplanes:

$$\begin{aligned} w_1^T x+b_1 =1 \quad \hbox {and} \quad w_2^T x+b_2 =-1 \end{aligned}$$
(16)

by considering the following primal problems.

$$\begin{aligned}&\mathop {\min } \limits _{w_1 ,b_1 ,\xi _2} \;\frac{1}{2}c_3 \bigg (||w_1 ||_2^2 +b_1^2\bigg )+\frac{1}{2}\bigg |\bigg |Aw_1 +e_1 b_1 -e_1 \bigg |\bigg |_2^2 +c_1 e_2^T \xi _2 \nonumber \\&\quad s.t.\;\;-(Bw_1 +e_2 b_1)+\xi _2 \ge e_2 , \quad \xi _2 \ge 0 \end{aligned}$$
(17)
$$\begin{aligned}&\mathop {\min } \limits _{w_2 ,b_2 ,\xi _1} \;\frac{1}{2}c_4 \bigg (||w_2 ||_2^2 +b_2^2\bigg )+\frac{1}{2}\bigg |\bigg |Bw_2 +e_2 b_2 +e_2 \bigg |\bigg |_2^2 +c_2 e_1^T \xi _1 \nonumber \\&\quad s.t.\;\;(Aw_2 +e_1 b_2)+\xi _1 \ge e_1 , \quad \xi _1 \ge 0 \end{aligned}$$
(18)

where \(c_1 \) and \(c_2 \) are the positive penalty parameters, \(c_3 \) and \(c_4 \) are the positive trade-off parameters, \(\xi _1 \) and \(\xi _2 \) are nonnegative slack variables, \(e_1 \) and \(e_2 \) are vectors of all ones of appropriate dimensions.

We now discuss the difference between the primal problems of TBSVM and our NTSVM by comparing the optimization problems (9) and (17). Obviously, minimizing \(||Aw_1 +e_1 b_1 -e_1 ||_2^2 \) in (17) instead of \(||Aw_1 +e_1 b_1 ||_2^2 \) in (9) means that our NTSVM seeks \([w_1^T \;b_1]\) such that the positive data points are proximal to the hyperplane \(w_1^T x+b_1 =1\) while the negative data points are kept far away from it. With this change, our optimization problem resembles a regression problem: we construct a function \(f_1 (x)=w_1^T x+b_1 \) to fit the positive class label \(+1\), and another function \(f_2 (x)=w_2^T x+b_2 \) to fit the negative class label \(-1\). In addition, similar to TBSVM, our NTSVM adds a regularization term to the objective function in (17), so the structural risk minimization principle is taken into account.

In order to get the solutions to problems (17) and (18), we need to derive their dual problems. For (17), by introducing the Lagrangian multipliers, the Lagrangian function is given by

$$\begin{aligned} L(w_1 ,b_1 ,\xi _2 ,\alpha ,\beta )= & {} \frac{1}{2}c_3 \bigg (||w_1 ||_2^2 +b_1^2\bigg )+\frac{1}{2}\bigg |\bigg |Aw_1 +e_1 b_1 -e_1 \bigg |\bigg |_2^2 +c_1 e_2^T \xi _2 \nonumber \\&-\,\alpha ^{T}\bigg [-(Bw_1 +e_2 b_1)+\xi _2 -e_2\bigg ]-\beta ^{T}\xi _2 \end{aligned}$$
(19)

where \(\alpha \in R^{m_2}\) and \(\beta \in R^{m_2} \) are the vectors of Lagrangian multipliers. By using Karush-Kuhn-Tucker (KKT) conditions, we can get

$$\begin{aligned}&\nabla _{w_1} =c_3 w_1 +A^{T}(Aw_1 +e_1 b_1 -e_1)+B^{T}\alpha =0 \end{aligned}$$
(20)
$$\begin{aligned}&\nabla _{b_1} =c_3 b_1 +e_1^{T}(Aw_1 +e_1 b_1 -e_1)+e_2^T \alpha =0 \end{aligned}$$
(21)
$$\begin{aligned}&\nabla _{\xi _2} =c_1 e_2 -\alpha -\beta =0 \end{aligned}$$
(22)
$$\begin{aligned}&\qquad \ \quad -(Bw_1 +e_2 b_1)+\xi _2 \ge e_2 , \quad \xi _2 \ge 0 \end{aligned}$$
(23)
$$\begin{aligned}&\alpha ^{T}(Bw_1 +e_2 b_1 -\xi _2 +e_2)=0, \quad \beta ^{T}\xi _2 =0 \end{aligned}$$
(24)
$$\begin{aligned}&\alpha \ge 0, \quad \beta \ge 0 \end{aligned}$$
(25)

Since \(\beta \ge 0\), from (22) and (25), we can get

$$\begin{aligned} 0\le \alpha \le c_1 e_2 \end{aligned}$$
(26)

Let \(G=[B\;e_2],\;H=[A\;e_1]\) and \(v_1 =[w_1^T \;b_1]^{T}\); then (20) and (21) imply that

$$\begin{aligned} \bigg (H^{T}H+c_3 I\bigg )v_1 -H^{T}e_1 +G^{T}\alpha =0 \end{aligned}$$
(27)

Thus, we can get the augmented vector

$$\begin{aligned} v_1 =\bigg (H^{T}H+c_3 I\bigg )^{-1}\bigg [H^{T}e_1 -G^{T}\alpha \bigg ] \end{aligned}$$
(28)

where I is an identity matrix. Then, substituting (28) into (19), using (20)–(22) and omitting the constant term, we obtain the Wolfe dual problem of (17) as follows

$$\begin{aligned}&\mathop {\max } \limits _\alpha \;\bigg [e_2^T +e_1^T H\bigg (H^{T}H+c_3 I\bigg )^{-1}G^{T}\bigg ]\alpha -\frac{1}{2}\alpha ^{T}G\bigg (H^{T}H+c_3 I\bigg )^{-1}G^{T}\alpha \nonumber \\&\quad s.t.\;\;0\le \alpha \le c_1 e_2 \end{aligned}$$
(29)

In the same way, we can obtain the Wolfe dual problem of (18) as follows

$$\begin{aligned}&\mathop {\max } \limits _\gamma \;\bigg [e_1^T +e_2^T G\bigg (G^{T}G+c_4 I\bigg )^{-1}H^{T}\bigg ]\gamma -\frac{1}{2}\gamma ^{T}H\bigg (G^{T}G+c_4 I\bigg )^{-1}H^{T}\gamma \nonumber \\&\quad s.t.\;\;0\le \gamma \le c_2 e_1 \end{aligned}$$
(30)

where \(\alpha \in R^{m_2} ,\gamma \in R^{m_1} \) are Lagrangian multipliers.

The nonparallel hyperplanes (16) can be obtained from the solutions \(\alpha \) and \(\gamma \) to the optimization problems (29) and (30) by

$$\begin{aligned}&v_1 =\bigg (H^{T}H+c_3 I\bigg )^{-1}\bigg [H^{T}e_1 -G^{T}\alpha \bigg ], \quad \hbox {where} \quad v_1 =\bigg [w_1^T \;b_1\bigg ]^{T} \end{aligned}$$
(31)
$$\begin{aligned}&v_2 =\bigg (G^{T}G+c_4 I\bigg )^{-1}\bigg [-G^{T}e_2 +H^{T}\gamma \bigg ], \quad \hbox {where} \quad v_2 =\bigg [w_2^T \;b_2\bigg ]^{T} \end{aligned}$$
(32)

Once \(v_1 \) and \(v_2 \) are obtained from (31) and (32), the two nonparallel hyperplanes (16) are known. A new data point \(x\in R^{n}\) is then assigned to the positive class \(W_1 \) or negative class \(W_2 \), depending on which of the hyperplanes in (16) it lies closer to, i.e.

$$\begin{aligned} x\in W_i ,\quad i=\mathop {\arg \min } \limits _{i=1,2} \bigg \{\;\bigg |w_1^T x+b_1 -1\bigg |,\bigg |w_2^T x+b_2 +1\bigg |\bigg \} \end{aligned}$$
(33)

where \(|\cdot |\) is the absolute value.
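A minimal end-to-end sketch of linear NTSVM, assembled directly from (29)–(33), is given below. It is an illustration only: the box-constrained duals are solved here with SciPy's bound-constrained L-BFGS-B routine for brevity, whereas the paper itself uses the SOR solver described in Sect. 3.3.

```python
import numpy as np
from scipy.optimize import minimize

def _solve_box_qp(Q, f, c, m):
    """min 0.5*a'Qa - f'a  s.t. 0 <= a <= c  (illustrative solver only;
    the SOR algorithm of Sect. 3.3 is what the paper actually uses)."""
    obj = lambda a: 0.5 * a @ Q @ a - f @ a
    grad = lambda a: Q @ a - f
    res = minimize(obj, np.zeros(m), jac=grad, method="L-BFGS-B",
                   bounds=[(0.0, c)] * m)
    return res.x

def ntsvm_linear_train(A, B, c1, c2, c3, c4):
    """Linear NTSVM: build the duals (29)-(30) and recover v1, v2 via (31)-(32)."""
    m1, m2 = A.shape[0], B.shape[0]
    e1, e2 = np.ones(m1), np.ones(m2)
    H = np.hstack([A, e1[:, None]])                 # H = [A  e1]
    G = np.hstack([B, e2[:, None]])                 # G = [B  e2]
    I = np.eye(H.shape[1])
    S1 = np.linalg.inv(H.T @ H + c3 * I)            # (H'H + c3 I)^{-1}
    S2 = np.linalg.inv(G.T @ G + c4 * I)            # (G'G + c4 I)^{-1}
    # Dual (29): max [e2' + e1' H S1 G'] a - 0.5 a' G S1 G' a,  0 <= a <= c1
    Q1 = G @ S1 @ G.T
    f1 = e2 + G @ S1 @ (H.T @ e1)
    alpha = _solve_box_qp(Q1, f1, c1, m2)
    # Dual (30): the symmetric problem in gamma,  0 <= gamma <= c2
    Q2 = H @ S2 @ H.T
    f2 = e1 + H @ S2 @ (G.T @ e2)
    gamma = _solve_box_qp(Q2, f2, c2, m1)
    v1 = S1 @ (H.T @ e1 - G.T @ alpha)              # (31)
    v2 = S2 @ (-G.T @ e2 + H.T @ gamma)             # (32)
    return v1, v2

def ntsvm_linear_predict(X, v1, v2):
    """Decision rule (33): compare distances to w1'x+b1=1 and w2'x+b2=-1."""
    w1, b1 = v1[:-1], v1[-1]
    w2, b2 = v2[:-1], v2[-1]
    d1 = np.abs(X @ w1 + b1 - 1.0)
    d2 = np.abs(X @ w2 + b2 + 1.0)
    return np.where(d1 <= d2, +1, -1)
```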

3.2 Nonlinear NTSVM

In this subsection, we show that our NTSVM can be extended to the nonlinear case. Here, we consider the following kernel-generated hyperplanes

$$\begin{aligned} K(x^{T},C^{T})w_1 +b_1 =1 \quad \hbox {and} \quad K(x^{T},C^{T})w_2 +b_2 =-1 \end{aligned}$$
(34)

where \(C=[A\;B]\in R^{(m_1 +m_2)\times n}\) and K is an appropriate kernel. Similar to the linear case, the nonlinear optimization problems can be expressed as

$$\begin{aligned}&\mathop {\min } \limits _{w_1 ,b_1 ,\eta _2} \;\frac{1}{2}c_3 \bigg (||w_1 ||_2^2 +b_1^2\bigg )+\frac{1}{2}\bigg |\bigg |K(A,C^{T})w_1 +e_1 b_1 -e_1 \bigg |\bigg |_2^2 +c_1 e_2^T \eta _2 \nonumber \\&\quad s.t.\;\;-\bigg (K(B,C^{T})w_1 +e_2 b_1\bigg )+\eta _2 \ge e_2 , \quad \eta _2 \ge 0 \end{aligned}$$
(35)
$$\begin{aligned}&\mathop {\min } \limits _{w_2 ,b_2 ,\eta _1} \;\frac{1}{2}c_4 \bigg (||w_2 ||_2^2 +b_2^2\bigg )+\frac{1}{2}\bigg |\bigg |K(B,C^{T})w_2 +e_2 b_2 +e_2 \bigg |\bigg |_2^2 +c_2 e_1^T \eta _1 \nonumber \\&\quad s.t.\;\;\bigg (K(A,C^{T})w_2 +e_1 b_2\bigg )+\eta _1 \ge e_1 , \quad \eta _1 \ge 0 \end{aligned}$$
(36)

By using the Lagrangian method and KKT conditions, we can obtain Wolfe dual problems as follows

$$\begin{aligned}&\mathop {\max } \limits _\alpha \;\bigg [e_2^T +e_1^T \tilde{H}\bigg (\tilde{H}^{T}\tilde{H}+c_3 I\bigg )^{-1}\tilde{G}^{T}\bigg ]\alpha -\frac{1}{2}\alpha ^{T}\tilde{G}\bigg (\tilde{H}^{T}\tilde{H}+c_3 I\bigg )^{-1}\tilde{G}^{T}\alpha \nonumber \\&\quad s.t.\;\;0\le \alpha \le c_1 e_2 \end{aligned}$$
(37)
$$\begin{aligned}&\mathop {\max } \limits _\gamma \;\bigg [e_1^T +e_2^T \tilde{G}\bigg (\tilde{G}^{T}\tilde{G}+c_4 I\bigg )^{-1}\tilde{H}^{T}\bigg ]\gamma -\frac{1}{2}\gamma ^{T}\tilde{H}\bigg (\tilde{G}^{T}\tilde{G}+c_4 I\bigg )^{-1}\tilde{H}^{T}\gamma \nonumber \\&\quad s.t.\;\;0\le \gamma \le c_2 e_1 \end{aligned}$$
(38)

where \(\tilde{G}=[K(B,C^{T})\;e_2],\;\tilde{H}=[K(A,C^{T})\;e_1]\). From (35)–(38), the augmented vectors \(v_1 =[w_1^T \;b_1]^{T}\) and \(v_2 =[w_2^T \;b_2]^{T}\) can be obtained by

$$\begin{aligned} v_1= & {} \bigg (\tilde{H}^{T}\tilde{H}+c_3 I\bigg )^{-1}\bigg [\tilde{H}^{T}e_1 -\tilde{G}^{T}\alpha \bigg ] \end{aligned}$$
(39)
$$\begin{aligned} v_2= & {} \bigg (\tilde{G}^{T}\tilde{G}+c_4 I\bigg )^{-1}\bigg [-\tilde{G}^{T}e_2 +\tilde{H}^{T}\gamma \bigg ] \end{aligned}$$
(40)

Once the augmented vectors \(v_1 \) and \(v_2 \) are obtained from (39) and (40), the two nonparallel hyperplanes (34) are known. A new data point \(x\in R^{n}\) is then assigned to the positive class \(W_1 \) or negative class \(W_2 \), depending on which of the hyperplanes in (34) it lies closer to, i.e.

$$\begin{aligned} x\in W_i ,\quad i=\mathop {\arg \min } \limits _{i=1,2} \bigg \{\;\bigg |w_1^T K(x^{T},C^{T})+b_1 -1\bigg |,\bigg |w_2^T K(x^{T},C^{T})+b_2 +1\bigg |\bigg \} \end{aligned}$$
(41)

where \(|\cdot |\) is the absolute value.
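For the nonlinear case only the data blocks change: A and B are replaced by \(K(A,C^{T})\) and \(K(B,C^{T})\) in (35)–(40). Below is a hedged sketch with the Gaussian kernel (the kernel used in the experiments of Sect. 4); it reuses the ntsvm_linear_train routine sketched in Sect. 3.1, since the algebra is identical.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """K(X, C^T) with K(x, y) = exp(-||x - y||^2 / (2*sigma^2))."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(C**2, axis=1)[None, :]
          - 2.0 * X @ C.T)
    return np.exp(-sq / (2.0 * sigma**2))

def ntsvm_kernel_train(A, B, c1, c2, c3, c4, sigma):
    """Nonlinear NTSVM: same algebra as the linear case, with A, B replaced
    by the kernel blocks K(A, C^T), K(B, C^T) of (35)-(36). Here C stacks all
    training points row-wise, and ntsvm_linear_train is the linear sketch
    from Sect. 3.1 (assumed available)."""
    C = np.vstack([A, B])                       # C holds all m1 + m2 points
    KA = gaussian_kernel(A, C, sigma)           # K(A, C^T)
    KB = gaussian_kernel(B, C, sigma)           # K(B, C^T)
    v1, v2 = ntsvm_linear_train(KA, KB, c1, c2, c3, c4)
    return v1, v2, C

def ntsvm_kernel_predict(X, v1, v2, C, sigma):
    """Decision rule (41) on the kernel-generated surfaces (34)."""
    K = gaussian_kernel(X, C, sigma)            # rows: K(x^T, C^T)
    w1, b1 = v1[:-1], v1[-1]
    w2, b2 = v2[:-1], v2[-1]
    d1 = np.abs(K @ w1 + b1 - 1.0)
    d2 = np.abs(K @ w2 + b2 + 1.0)
    return np.where(d1 <= d2, +1, -1)
```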

3.3 Implementation

In this subsection, we discuss the implementation of our proposed NTSVM. In NTSVM, the dual problems (29), (30), (37) and (38) can be rewritten in the following unified form

$$\begin{aligned}&\mathop {\min } \limits _\alpha \quad \frac{1}{2}\alpha ^{T}Q\alpha -f^{T}\alpha \nonumber \\&\quad s.t.\quad \;0\le \alpha \le ce \end{aligned}$$
(42)

where Q is a positive definite matrix and f is a vector. For example, if we choose \(Q=G(H^{T}H+c_3 I)^{-1}G^{T}\) and \(f=[e_2^T +e_1^T H(H^{T}H+c_3 I)^{-1}G^{T}]^{T}\), problem (42) becomes problem (29). Similarly, problem (42) becomes problem (37) when we choose \(Q=\tilde{G}(\tilde{H}^{T}\tilde{H}+c_3 I)^{-1}\tilde{G}^{T}\) and \(f=[e_2^T +e_1^T \tilde{H}(\tilde{H}^{T}\tilde{H}+c_3 I)^{-1}\tilde{G}^{T}]^{T}\).
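As a small illustration of this mapping (linear case; not the authors' code), Q and f for the dual (29) can be assembled as follows; the choice for (37) is identical with \(\tilde{H}\), \(\tilde{G}\) in place of H, G.

```python
import numpy as np

def unified_qp_data(H, G, e1, e2, c3):
    """Build (Q, f) of (42) so that it coincides with the dual (29):
    Q = G (H'H + c3 I)^{-1} G',  f = e2 + G (H'H + c3 I)^{-1} H' e1."""
    S = np.linalg.inv(H.T @ H + c3 * np.eye(H.shape[1]))
    Q = G @ S @ G.T
    f = e2 + G @ S @ (H.T @ e1)
    return Q, f
```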

Algorithm 1. SOR Algorithm

Step 1. Select the parameter \(t\in (0,2)\) and the initial vector \(\alpha ^{0}\), set \(k=0\);

Step 2. Compute \(\alpha ^{k+1}\) by \(\alpha ^{k+1}=(\alpha ^{k}-t\cdot D^{-1}(Q\alpha ^{k}-f+L(\alpha ^{k+1}-\alpha ^{k})))_\# \), where Q and f are set according to (29), (30), (37) or (38), and \((u)_\# \) denotes the 2-norm projection onto the feasible region, that is

$$\begin{aligned} ((u)_\#)_i =\left\{ {\begin{array}{ll} 0,&{}\quad \mathrm{if}\;u_i \le 0 \\ u_i,&{}\quad \mathrm{if}\;0<u_i <c \\ c,&{}\quad \mathrm{if}\;u_i \ge c \\ \end{array}} \right. \end{aligned}$$

Here \(Q=L+D+L^{T}\), where L and D are the strictly lower triangular part and the diagonal part of Q, respectively;

Step 3. Stop if \(||\alpha ^{k+1}-\alpha ^{k}||<\varepsilon \), where \(\varepsilon \) is the desired tolerance; otherwise, replace \(\alpha ^{k}\) by \(\alpha ^{k+1}\) and k by \(k+1\), and go to Step 2.
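A direct transcription of Algorithm 1 into NumPy is sketched below (an illustration of the update rule, not the original implementation); it assumes Q has a positive diagonal, which holds for the matrices arising from (29), (30), (37) and (38).

```python
import numpy as np

def sor_box_qp(Q, f, c, t=1.0, tol=1e-5, max_iter=1000):
    """Successive overrelaxation (SOR) for the box-constrained QP (42):
        min 0.5*a'Qa - f'a   s.t.  0 <= a <= c,
    following Algorithm 1; t in (0, 2) is the relaxation factor."""
    m = Q.shape[0]
    diag = np.diag(Q)                    # D in Q = L + D + L'
    a = np.zeros(m)                      # alpha^0
    for _ in range(max_iter):
        a_old = a.copy()
        for i in range(m):
            # Gauss-Seidel sweep: entries j < i already hold alpha^{k+1}
            r = Q[i] @ a - f[i]
            a[i] = min(max(a[i] - t * r / diag[i], 0.0), c)   # projection (.)#
        if np.linalg.norm(a - a_old) < tol:                   # Step 3
            break
    return a
```

For the linear dual (29), for instance, one would call sor_box_qp with the pair (Q, f) assembled as in the unified form (42), e.g. via the unified_qp_data sketch above.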

Table 1 The parameters for different classifiers for linear kernel

In our proposed NTSVM, most of the computational cost is incurred in solving the dual QPP (42). In order to solve this QPP quickly, we use a very efficient optimization technique, the successive overrelaxation (SOR) algorithm [24, 29, 38]. SOR is an excellent QPP solver because it can handle very large datasets that need not reside in memory [38]. The experimental results in the following section will show that the SOR technique has a remarkable acceleration effect on our proposed NTSVM. Note that the SOR technique can also be used in the original TBSVM [24] and NHSVM [29]. In practice, for nonlinear NTSVM, if the number of training points in the positive or negative class is large, the rectangular kernel technique [39] can be applied to reduce the dimensionality of (37) and (38).
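A hedged sketch of the reduced (rectangular) kernel idea of [39]: the kernel columns are restricted to a random subset \(\bar{C}\) of the training points, so the kernel blocks become \(K(A,\bar{C}^{T})\) and \(K(B,\bar{C}^{T})\). The ratio is a tunable choice (Sect. 4.3 uses 10 % of the data points).

```python
import numpy as np

def reduced_kernel_basis(A, B, ratio=0.1, seed=0):
    """Pick a random subset C_bar of the training points C = [A; B], so that
    K(A, C_bar^T) and K(B, C_bar^T) replace the full kernel blocks, shrinking
    the column dimension of (37)-(38) roughly by the factor `ratio` (cf. [39])."""
    C = np.vstack([A, B])
    rng = np.random.default_rng(seed)
    idx = rng.choice(C.shape[0], size=max(1, int(ratio * C.shape[0])),
                     replace=False)
    return C[idx]
```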

4 Experimental Results

In order to evaluate the proposed NTSVM, we investigate its classification accuracy and computational efficiency on two artificial datasets, a number of real-world UCI benchmark datasets and five NDC datasets. In the experiments, we focus on the comparison between our proposed NTSVM and some state-of-the-art classifiers, namely GEPSVM [21], TWSVM [22], TBSVM [24] and NHSVM [29]. All methods were implemented in MATLAB R2013a on a personal computer (PC) with an Intel (R) Core (TM) processor (3.40 GHz) and 4 GB of random-access memory (RAM). TBSVM, NHSVM and our proposed NTSVM are solved by the SOR algorithm. The eigenvalue problem in GEPSVM is solved by the MATLAB function 'eig.m', and the QPPs in TWSVM are solved by the QP routine in the MATLAB optimization toolbox. To evaluate the classifiers, we use the accuracy measure defined as Accuracy = (TP+TN)/(TP+FP+TN+FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. Note that all of these classifiers involve different parameters, which were selected from the set \(\{2^{-8}, \ldots , 2^{8}\}\) by the standard 10-fold cross-validation methodology [40]. The parameters used with the different classifiers for the linear case are listed in Table 1.
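For completeness, a small sketch of the accuracy measure and the parameter grid described above (illustrative only; the fold splitting and training routines are assumed to be provided elsewhere):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN), labels in {+1, -1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    return (tp + tn) / (tp + fp + tn + fn)

# Candidate values for c1..c4 (and sigma in the nonlinear case): 2^-8 .. 2^8
param_grid = [2.0 ** k for k in range(-8, 9)]
```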

4.1 Toy Examples

In this subsection, two artificial datasets, the crossplane (XOR) dataset and Ripley's synthetic dataset [41], are used to show that our proposed NTSVM can deal with linearly inseparable problems. Ripley's synthetic dataset contains 250 training points and 1000 test points. The average results of linear GEPSVM, TWSVM, TBSVM, NHSVM and our proposed NTSVM on the crossplane (XOR) dataset are reported in Table 2, and the average results of nonlinear GEPSVM, TWSVM, TBSVM, NHSVM and our proposed nonlinear NTSVM with Gaussian kernels on Ripley's synthetic dataset are listed in Table 3.

Table 2 Classification accuracy (%) on the crossplane (XOR) dataset
Table 3 Classification accuracy (%) on Ripley's synthetic dataset
Fig. 1 Crossplane (XOR) dataset and the hyperplanes of our proposed NTSVM

From Table 2, we can observe that all of the above methods achieve good performance on the crossplane (XOR) problem and that our proposed NTSVM achieves the best performance. From Table 3, we can see that the accuracy of our proposed NTSVM is 89.20 %, which is slightly lower than that of NHSVM but better than those of the other methods. Figure 1 illustrates the crossplane (XOR) dataset and the hyperplanes of our proposed NTSVM. Ripley's dataset and the hyperplanes of GEPSVM, TWSVM, TBSVM, NHSVM and our proposed nonlinear NTSVM are shown in Fig. 2.

Fig. 2 Ripley's dataset and results of the nonlinear methods on Ripley's dataset

4.2 UCI Datasets

To further compare our proposed NTSVM with GEPSVM, TWSVM and TBSVM, we choose 13 datasets from the UCI machine learning repository [42]. The numerical results of their linear versions are given in Table 4. We also list the mean accuracy of each classifier, with the best accuracy shown in boldface. According to the win-tie-loss (W-T-L) summarization, the accuracy of our proposed linear NTSVM is better than that of GEPSVM, TWSVM and TBSVM on most of the datasets. For example, on the Australian dataset, the accuracy of our NTSVM is 87.54 %, while that of GEPSVM is 85.65 %, that of TWSVM is 86.81 % and that of TBSVM is 86.96 %.

Table 4 Ten-fold testing percentage accuracy of linear classifiers

Table 5 displays the experimental results for nonlinear GEPSVM, TWSVM, TBSVM and our NTSVM on the UCI datasets chosen above. The Gaussian kernel \(K(x,y)=e^{-\frac{||x-y||^{2}}{2\sigma ^{2}}}\) is used, and the kernel parameter \(\sigma \) is obtained by searching the range from \(2^{-8}\) to \(2^{8}\). The accuracy and the training CPU time for these methods are also listed. The results in Table 5 are similar to those in Table 4 and therefore confirm the observations above.

Table 5 Ten-fold testing percentage accuracy of nonlinear classifiers

To compare the performance of the different classifiers, two-dimensional scatter plots, as used in [19, 26, 28, 30], are shown in Fig. 3 for the Australian UCI dataset using about 20 % of the data points, where the coordinates \((d_i^+ ,d_i^-)\) are the respective distances of a test point \(x_i \) as given in (33). In Fig. 3, panels (a)–(d) show the results obtained by GEPSVM, TWSVM, TBSVM and our NTSVM, respectively.

Fig. 3 Two-dimensional projections of test points from the Australian dataset

As noted above, there are four parameters \(c_i (i=1,2,3,4)\) in our linear NTSVM and an additional kernel parameter \(\sigma \) in our nonlinear NTSVM. We analyze the effect of \(c_i \) on linear NTSVM and the effect of the kernel parameter \(\sigma \) on nonlinear NTSVM. All the experiments in this subsection are carried out on four benchmark datasets: "Australian", "House votes", "Heart-c" and "Ionosphere". We first evaluate the influence of the parameters \(c_i (i=1,2,3,4)\). In the experiments, for convenience, we set \(c_1 =c_2 \) and \(c_3 =c_4 \) for each dataset. The results are depicted in Fig. 4. As we can see, proper selection of the parameters \(c_i \) improves the performance of the classifier to a certain extent. We next illustrate the relationship between the kernel parameter \(\sigma \) and the classification accuracy in the nonlinear case. To analyze the effect of \(\sigma \) on the performance of nonlinear NTSVM, the comparison is carried out with all parameters \(c_i \) set to 1 for convenience. To clarify the influence of the kernel parameter, we draw the parameter-accuracy curves of the above four datasets with \(\sigma \) taken from the set \(\{2^{-8},2^{-7},\cdots ,2^{7},2^{8}\}\). Based on the results in Fig. 5, we can see that the performance of NTSVM is closely related to the choice of kernel parameter.
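A hedged sketch of the sensitivity experiment for the \(c_i\): with \(c_1=c_2\) and \(c_3=c_4\), the grid is swept and the cross-validated accuracy is recorded. The helper cross_val_accuracy is hypothetical and stands in for whatever 10-fold evaluation routine is used.

```python
import numpy as np

# Assumed: cross_val_accuracy(A, B, c1, c2, c3, c4) returns the 10-fold
# accuracy of linear NTSVM for one parameter setting (hypothetical helper).
def sweep_c_parameters(A, B, cross_val_accuracy):
    grid = [2.0 ** k for k in range(-8, 9)]
    scores = np.zeros((len(grid), len(grid)))
    for i, c12 in enumerate(grid):          # c1 = c2
        for j, c34 in enumerate(grid):      # c3 = c4
            scores[i, j] = cross_val_accuracy(A, B, c12, c12, c34, c34)
    return scores
```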

Fig. 4 Relationship between the parameters \(c_i \) and the classification accuracy

Fig. 5 Relationship between the kernel parameter \(\sigma \) and the classification accuracy

4.3 NDC Datasets

In this subsection, we conduct experiments on some large datasets generated using David Musicant's NDC Data Generator [43]. For the experiments with the NDC datasets, we fixed the parameters of all classifiers to the same values (i.e., \(c_i =1\)). The training times and accuracies are reported in Tables 6 and 7. Table 6 shows the comparison results for linear TWSVM, TBSVM, NHSVM and our NTSVM on the NDC datasets. From Table 6, it is not hard to see that our NTSVM obtains comparable or better accuracy and trains faster than the other classifiers on most datasets.

For the nonlinear case, Table 7 shows the comparison of the computing times and accuracies of all four classifiers on several NDC datasets with the Gaussian kernel. For the NDC-2k, NDC-3k and NDC-5k datasets, the rectangular kernel technique [39] is applied using 10 % of the total data points. The results on these datasets show that our NTSVM and TBSVM are much faster than TWSVM and NHSVM. This is because the QPPs in TWSVM are solved directly by the QP routine in the MATLAB optimization toolbox, while our NTSVM and TBSVM are solved by the SOR algorithm. Although the QPP in NHSVM is also solved by the SOR algorithm, its computing time is greater than that of NTSVM and TBSVM, mainly because the size of the QPP in NHSVM is larger than the sizes of the QPPs in NTSVM and TBSVM. Tables 6 and 7 demonstrate the good generalization ability and efficiency of our NTSVM when dealing with large-scale problems.

Table 6 Comparison on NDC datasets for linear classifiers
Table 7 Comparison on NDC datasets for nonlinear classifiers

5 Conclusions

In this paper, an improved twin support vector machine (NTSVM) based on TWSVM is proposed for binary classification. Unlike TWSVM, our proposed NTSVM seeks two nonparallel hyperplanes \(w_1^T x+b_1 =1\) and \(w_2^T x+b_2 =-1\). Similar to TBSVM, a regularization term is used to overcome the overfitting and singularity problems. In addition, the SOR technique is utilized to speed up the training process. Experimental results illustrate that our NTSVM is feasible and effective. It should be pointed out that there are four parameters in our NTSVM, so parameter selection is a practical issue that should be investigated in the future. Furthermore, expanding the applications of our NTSVM to practical scenarios and extending it to multi-view learning [44] are also interesting directions.