1 Introduction

Support vector machine (SVM), introduced by Vapnik and coworkers [6, 44], is an excellent kernel-based tool for binary data classification. Over the past decades, SVM has played an important role in solving problems arising in the pattern recognition and machine learning communities. Its applications cover a wide spectrum of research areas, ranging from pattern recognition [30] and text categorization [16] to biomedicine [5]. SVMs have shown excellent performance on a wide variety of problems [15, 30] owing to the way they construct a separating hyperplane: the band between two parallel supporting hyperplanes separates the two classes and the distance between them is maximized, which leads to the introduction of a regularization term. Unlike other machine learning methods such as artificial neural networks (ANNs), training an SVM amounts to solving a linearly constrained quadratic programming problem (QPP) with a unique optimal solution. Combining this uniqueness with good generalization performance, SVM has become one of the most popular methods for solving classification problems. One of the main challenges for SVM is the large computational cost associated with the QPPs. To reduce the learning complexity of SVM, various algorithms with comparable classification ability have been reported, see [2, 6, 15, 19, 20, 24, 33, 43]. In particular, Fung and Mangasarian [11] proposed an implicit Lagrangian formulation for SVM and solved it using a Newton method; their method is simple, fast and applicable to problems with large-scale data.

To improve the computational speed, Jayadeva et al. [14] proposed the twin support vector machine (TWSVM) for binary classification in the spirit of the proximal SVM [10] and the generalized eigenvalue proximal support vector machine (GEPSVM) [26]. TWSVM generates two nonparallel hyperplanes by solving a pair of smaller-sized QPPs such that each hyperplane is closest to one of the classes and as far as possible from the other class. A fundamental difference between TWSVM and SVM is that TWSVM solves two small QPPs rather than one large QPP, which makes the learning speed of TWSVM approximately four times faster than that of the standard SVM. TWSVM has therefore become popular and widely used because of its low computational complexity. Recently, several variants of TWSVM have been proposed to reduce its time complexity while keeping its effectiveness, see [3, 4, 12, 17, 29, 31, 32, 34, 37–39, 41, 42, 45, 46]. In particular, the least squares twin SVM (LSTWSVM) [18] replaces the hinge loss of TWSVM with the squared loss, leading to very fast training since the two QPPs are replaced by two systems of linear equations.

One of the principal advantages of SVM is the implementation of the structural risk minimization (SRM) principle [36]. However, in the primal formulations of TWSVM and LSTWSVM [18], only the empirical risk is minimized. Moreover, the inverse of \(G^t G\) appears in the dual formulation of TWSVM, and an extra regularization term \(\delta I\) has to be added so that \(G^t G+\delta I\) is nonsingular. This is not entirely satisfactory from the theoretical point of view, although it has been handled by modifying the dual problems technically and elegantly [37]. Recently, Shao et al. [37] proposed the twin bounded support vector machine (TBSVM) based on TWSVM. Unlike TWSVM, the SRM principle is implemented in TBSVM, and a successive overrelaxation (SOR) technique is applied to speed up the computation.

Motivated by the works of [11, 24, 37], we propose in this paper an implicit Lagrangian formulation for twin support vector machine (TWSVM) classifiers, formulated as a pair of unconstrained minimization problems in dual variables whose solutions are obtained by a finite Newton method. There are some differences between our formulation and TWSVM. The idea of our formulation is to reformulate TWSVM as a strongly convex problem by incorporating regularization techniques, in order to improve the training speed and robustness. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm. It is also worth mentioning that the proposed formulation does not need any specialized optimization packages. The results of experiments conducted on both artificial and publicly available benchmark datasets confirm its efficacy and feasibility.

In this work, all vectors are taken as column vectors. The inner product of two vectors x, y in the n-dimensional real space \(R^n\) will be denoted by \(x^t y,\) where \(x^t\) is the transpose of x. Whenever x is orthogonal to y, we write \(x\perp y\). For \(x=(x_1,x_2,\ldots ,x_n)^t \in R^n\), the plus function \(x_+\) is defined as \((x_+)_i = \max \{0,x_i\},\) where \(i=1,2,\ldots ,n\). The 2-norm of a vector x and a matrix Q will be denoted by \(\Vert x \Vert\) and \(\Vert Q \Vert\), respectively; for simplicity, we drop the 2 from \(\Vert x \Vert _2\). We denote the vector of ones of dimension m by e and the identity matrix of appropriate size by I. If f is a real-valued function of the variable \(x=(x_1,x_2,\ldots ,x_n)^t \in R^n\), then the gradient of f is denoted by \(\nabla f = (\partial f/ \partial x_1,\ldots ,\partial f/ \partial x_n)^t\) and the Hessian matrix of f by \(\nabla ^2 f=(\partial ^2 f/ \partial x_i \partial x_j)_{i,j=1,2,\ldots ,n}.\)
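For concreteness, the plus function (and the componentwise step function \((\cdot )_*\) used later in Sect. 3.3) can be realized as in the following NumPy sketch. The experiments in this paper were carried out in MATLAB, so this fragment is purely illustrative.

```python
import numpy as np

def plus(x):
    """Componentwise plus function: (x_+)_i = max(0, x_i)."""
    return np.maximum(x, 0.0)

def step(x):
    """Componentwise step function (.)_* used in Sect. 3.3: 1 where x_i > 0, else 0
    (the value at x_i = 0 is a convention; here it is taken as 0)."""
    return (x > 0).astype(float)
```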

The remainder of this paper is organized as follows. Section 2 reviews the TWSVM formulation. Our proposed formulation and the related theoretical analysis are presented in Sect. 3. Numerical experiments performed on a number of interesting synthetic and real-world benchmark datasets, with results compared against other SVMs, are reported in Sect. 4. Finally, concluding remarks and future work are given in Sect. 5.

2 Twin support vector machine for classification

In this section, we give a brief description of the TWSVM formulation. For details, the interested reader is referred to [8, 14].

In 2007, Jayadeva et al. [14] proposed a new non-parallel support vector machine for binary classification, termed TWSVM, in the spirit of GEPSVM. The formulation of TWSVM is similar to that of the standard SVM [6, 44], except that not all the patterns appear in the constraints of either problem at the same time. This makes TWSVM [14] faster than the standard SVM.

Suppose that all the data points in class +1 are denoted by a matrix \(A\in R^{m_{1} \times n},\) whose ith row \(A_{i} \in R^{n}\) is a data point, and that the matrix \(B\in R^{m_{2} \times n}\) represents the data points of class \(-1\). Unlike SVM, the linear TWSVM [14] seeks a pair of non-parallel hyperplanes

$$\begin{aligned} f_{1} (x)=w_{1} ^{t} x+b_{1} \quad \text{ and } \quad f_{2} (x)=w_{2} ^{t} x+b_{2}. \end{aligned}$$
(1)

The idea in the linear TWSVM is to solve the following two QPPs with objective functions corresponding to one of the two classes and constraints corresponding to the other class:

$$\begin{aligned} \mathop {\min _{\left( w_1,b_1\right) \in R^{n+1}}}&\quad \frac{1}{2} \left\| Aw_{1} +e_{2} b_{1} \right\| ^{2} +C_{1} e_{1}^{t} \xi _{1} \nonumber \\ s.t.&\quad -\left( Bw_{1} +e_{1} b_{1} \right) +\xi _{1} \ge e_{1} ,\quad \xi _{1} \ge 0,\end{aligned}$$
(2)
$$\begin{aligned} \mathop {\min _{\left( w_2,b_2\right) \in R^{n+1}}}&\quad \frac{1}{2} \left\| Bw_{2} +e_{1} b_{2} \right\| ^{2} +C_{2} e_{2}^{t} \xi _{2} \nonumber \\ s.t.&\quad \left( Aw_{2} +e_{2} b_{2} \right) +\xi _{2} \ge e_{2} ,\quad \xi _{2} \ge 0, \end{aligned}$$
(3)

where \(C_{1} ,C_{2} >0\) are parameters and \(e_{1} ,e_{2}\) are vectors of ones of appropriate dimensions. The idea in TWSVM is thus to solve the two QPPs (2) and (3); each of them is a typical SVM formulation, except that not all data points appear in the constraints of either problem [14].

In order to derive the corresponding dual formulation, TWSVM assumes that the matrices \(G^{t} G\) and \(H^{t} H\), where \(G=[A\, \, \, e_{2} ]\) and \(H=[B\, \, \, e_{1} ],\) are non-singular. The dual QPPs are

$$\begin{aligned} \mathop {\min _{u_1 \in R^{m_2}}}&\quad \frac{1}{2} u_1 ^{t} H\left( G^{t} G\right) ^{-1} H^{t} u_1 - e_{1} ^{t} u_1 \nonumber \\ s.t.&\quad 0\le u_1 \le C_{1}, \end{aligned}$$
(4)
$$\begin{aligned} \mathop {\min _{u_2 \in R^{m_1}}}&\quad \frac{1}{2} u_2 ^{t} G\left( H^{t} H\right) ^{-1} G^{t} u_2 - e_{2} ^{t} u_2\nonumber \\ s.t.&\quad 0\le u_2 \le C_{2}. \end{aligned}$$
(5)

Remark 1

The matrices \(G^{t} G\) and \(H^{t} H\) appearing in the dual formulation (4) and (5) may be singular. To avoid the possible ill conditioning, the inverse matrices \((G^{t} G)^{-1}\) and \((H^{t} H)^{-1}\) are approximately replaced by \((G^{t} G+\delta I)^{-1}\) and \((H^{t} H+\delta I)^{-1}\), where \(\delta\) is a very small positive scalar and I is an identity matrix of appropriate dimensions.

Thus the nonparallel proximal hyperplanes are obtained from the solutions \(u_1\) and \(u_2\) of (4) and (5) by

$$\begin{aligned} \left[ \begin{array}{l} {w_1}\\ {b_1} \end{array} \right] = - {\left( {{G^t}G + \delta I} \right) ^{ - 1}}{H^t}u_1 \quad \text{ and } \quad \left[ \begin{array}{l} {w_2}\\ {b_2} \end{array} \right] = {\left( {{H^t}H + \delta I} \right) ^{ - 1}}{G^t}u_2. \end{aligned}$$
(6)

The dual problems (4) and (5) are derived and solved in [14]. Experimental results of Jayadeva et al. [14] show that the performance of TWSVM is better than that of the conventional SVM and GEPSVM on UCI machine learning datasets.
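For illustration, the sketch below recovers the first TWSVM hyperplane by applying a simple projected-gradient iteration to the box-constrained dual (4) and then using (6). This is not the optimizer employed in [14], which relies on standard QP routines; the step size, iteration count and the value of \(\delta\) are illustrative choices.

```python
import numpy as np

def twsvm_plane1(A, B, C1, delta=1e-4, iters=1000):
    """Recover (w1, b1) of TWSVM by (approximately) solving the dual (4) with a
    projected-gradient iteration and then applying (6).  Illustrative only; [14]
    solves (4)-(5) with standard QP optimizers.  delta is the small positive
    scalar of Remark 1."""
    m1, m2 = A.shape[0], B.shape[0]
    G = np.hstack([A, np.ones((m1, 1))])                       # G = [A  e2]
    H = np.hstack([B, np.ones((m2, 1))])                       # H = [B  e1]
    GtG_inv = np.linalg.inv(G.T @ G + delta * np.eye(G.shape[1]))
    M = H @ GtG_inv @ H.T                                      # dual Hessian in (4)
    e1 = np.ones(m2)
    eta = 1.0 / np.linalg.norm(M, 2)                           # step size <= 1 / Lipschitz constant
    u1 = np.zeros(m2)
    for _ in range(iters):
        u1 = np.clip(u1 - eta * (M @ u1 - e1), 0.0, C1)        # gradient step, project onto [0, C1]
    z1 = -GtG_inv @ H.T @ u1                                   # [w1; b1] from (6)
    return z1[:-1], z1[-1]
```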

3 Proposed Newton method for implicit Lagrangian twin support vector machines (NLTSVM)

Motivated by the works of [11, 24, 37], we propose an implicit Lagrangian formulation for the TWSVM classifiers by formulating a pair of unconstrained minimization problems in dual variables, whose solutions are obtained using a finite Newton method. The idea of our NLTSVM is to reformulate TWSVM as a strongly convex problem by incorporating regularization techniques to improve the training speed and robustness. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm.

3.1 Linear case

In our NLTSVM, the primal problems (2) and (3) of TWSVM are modified as follows:

$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits _{(w_{1}, b_{1}) \in R^{n+1}} & \frac{1}{2} \, \left\| Aw_{1} +e_{2}b_{1} \right\| ^{2} +\frac{C_{1} }{2} \left\| \xi _{1} \right\| ^{2} +\frac{C _{3} }{2} \left\| \left[ \begin{array}{ll} {w_{1} } \\ {b_{1} } \end{array}\right] \right\| ^{2} \\ s.t.& -\left( Bw_{1}+e_{1} b_{1} \right) +\xi _{1} \ge e_{1} ,\end{array} \end{aligned}$$
(7)
$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits_{(w_{2},b_2)\in R^{n+1}} & \frac{1}{2} \, \left\| Bw_{2} +e_{1}b_{2} \right\| ^{2} +\frac{C_{2} }{2} \left\| \xi _{2} \right\| ^{2}+\frac{C _{4} }{2} \left\| \left[ \begin{array}{l} {w_{2} } \\{b_{2} } \end{array}\right] \right\| ^{2} \\s.t.& \left( Aw_{2} +e_{2} b_{2} \right) +\xi _{2} \ge e_{2} .\end{array} \end{aligned}$$
(8)

Note that there are three terms in the objective functions of (7) and (8). The first term in the objective function of (7) or (8) is the sum of squared distances from the hyperplane to the points of one class; minimizing it tends to keep the hyperplane close to the points of that class (say the positive class). The constraints require the hyperplane to be at a distance of at least 1 from the points of the other class (say the negative class). The second term is the sum of squared error variables associated with these constraints, which require the distance from the hyperplane to the points of the other class to be at least one. The third term is a regularization term, similar to [6, 23, 37, 40, 41], which makes the objective function strongly convex; thus each problem has a unique global optimal solution.

Our NLTSVM penalizes the slack variables with the 2-norm instead of the 1-norm used in TWSVM and TBSVM, which makes the nonnegativity constraints (\(\xi _1,\xi _2\ge 0\)) redundant. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm. Similar to TBSVM, the SRM principle is implemented in our NLTSVM by adding a regularization term with the idea of maximizing the margin, whereas only the empirical risk is minimized in the primal problems of TWSVM, STWSVM and LSTWSVM. As with the standard SVM, this strategy makes our formulation more theoretically sound than the original TWSVM, STWSVM and LSTWSVM. The computational results, given in Sect. 4, clearly show that our NLTSVM does not compromise on generalization performance.

One can obtain the dual QPPs of (7) and (8) as follows:

$$\begin{aligned} \underset{{0\le u_1 \in R^{m_2}}}{\text {min}}\;L_1(u_1)= & {} \frac{1}{2}{u_1}^tQ_1u_1-{e_1}^tu_1 \end{aligned}$$
(9)
$$\begin{aligned} \underset{{0\le u_2 \in R^{m_1}}}{\text {min}}\;L_2(u_2)= & {} \frac{1}{2}{u_2}^tQ_2u_2-{e_2}^tu_2 \end{aligned}$$
(10)

where

$$\begin{aligned} Q_1\quad= & {} \frac{I}{C_1}+N_1;\quad Q_2=\frac{I}{C_2}+N_2; \quad N_1=H(G^tG+C_3I)^{-1}H^t;\\ N_2\quad= & {} G(H^tH+C_4I)^{-1}G^t;\quad G=[A \;e_2] \quad \text{ and }\quad H=[B\; e_1]. \end{aligned}$$

The nonparallel proximal hyperplanes are obtained from the solutions \(u_1\) and \(u_2\) of (9) and (10) by

$$\begin{aligned} \left[\begin{array}{l} w_1\\b_1 \end{array}\right]=-(G^tG+C_3I)^{-1}H^tu_1\quad \text{and}\quad \left[\begin{array}{l} w_2\\b_2 \end{array}\right]=(H^tH+C_4I)^{-1}G^tu_2. \end{aligned}$$
(11)

Remark 2

Unlike TWSVM, the matrices appearing in the dual objective functions (9) and (10) of our NLTSVM are positive definite and there is no upper bound on the dual variables \(u_1\) and \(u_2\).

Note that the regularization parameter \(\delta\) used in the TWSVM formulation is just a fixed small scalar, while the penalty parameters \(C_3,C_4\) used in our NLTSVM are weighting factors that determine the trade-off between the regularization term and the empirical risk. Therefore, selecting appropriate parameters \(C_3,C_4\) reflects the SRM principle. We see in Sect. 4 that the classification accuracy improves when the values of \(C_3,C_4\) are adjusted.
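Given dual solutions \(u_1,u_2\) (computed by the Newton method of Sect. 3.3), the matrices of (9)–(10) and the hyperplanes (11) can be assembled as in the following NumPy sketch; this is an illustrative fragment, not the MATLAB code used in our experiments.

```python
import numpy as np

def nltsvm_linear_matrices(A, B, C1, C2, C3, C4):
    """Build Q1, Q2 of (9)-(10) together with the matrices needed in (11)."""
    m1, m2 = A.shape[0], B.shape[0]
    G = np.hstack([A, np.ones((m1, 1))])                       # G = [A  e2]
    H = np.hstack([B, np.ones((m2, 1))])                       # H = [B  e1]
    P1 = np.linalg.inv(G.T @ G + C3 * np.eye(G.shape[1]))      # (G'G + C3 I)^{-1}
    P2 = np.linalg.inv(H.T @ H + C4 * np.eye(H.shape[1]))      # (H'H + C4 I)^{-1}
    N1, N2 = H @ P1 @ H.T, G @ P2 @ G.T
    Q1 = np.eye(m2) / C1 + N1                                  # positive definite (Remark 2)
    Q2 = np.eye(m1) / C2 + N2
    return Q1, Q2, G, H, P1, P2

def nltsvm_linear_planes(u1, u2, G, H, P1, P2):
    """Recover [w1; b1] and [w2; b2] from the dual solutions via (11)."""
    z1 = -P1 @ H.T @ u1
    z2 = P2 @ G.T @ u2
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```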

3.2 Nonlinear case

In this subsection, we extend our NLTSVM to the nonlinear case using the kernel trick. We consider the following kernel-generated surfaces instead of hyperplanes:

$$\begin{aligned} K(x^t,C^t)w_1+b_1=0 \quad \text{and}\quad K(x^t,C^t)w_2+b_2=0, \end{aligned}$$
(12)

where \(C^t=[A\;B]^t\) and K is an appropriately chosen kernel function.

Similar to the linear case, the optimization problems for our NLTSVM in the kernel feature space can be formulated as

$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits_{(w_{1},b_1) \in R^{m+1}} & \frac{1}{2} \, \left\|K(A,C^t)w_1+e_2b_1 \right\| ^{2} +\frac{C_{1} }{2} \left\| \xi _{1}\right\| ^{2} +\frac{C _{3} }{2} \left\| \left[ \begin{array}{ll}{w_{1} } \\ {b_{1} } \end{array}\right] \right\| ^{2} \\ s.t.&-\left( K(B,C^t)w_1+e_1b_1 \right) +\xi _{1} \ge e_{1} , \end{array}\end{aligned}$$
(13)
$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits_{(w_{2},b_2) \in R^{m+1}} & \frac{1}{2} \, \left\|K(B,C^t)w_2+e_1b_2 \right\| ^{2} +\frac{C_{2} }{2} \left\| \xi _{2}\right\| ^{2} +\frac{C _{4} }{2} \left\| \left[ \begin{array}{ll}{w_{2} } \\ {b_{2} } \end{array}\right] \right\| ^{2} \\ s.t.&\left( K(A,C^t)w_2+e_2b_2 \right) +\xi _{2} \ge e_{2} .\end{array}\end{aligned}$$
(14)

Here \(K(A,C^t)\) and \(K(B,C^t)\) are kernel matrices of sizes \(m_1\times m\) and \(m_2\times m\) respectively, where \(m=m_1 +m_2\). Similar to the linear case, the nonparallel kernel-generated surfaces are obtained from

$$\begin{aligned} \left[\begin{array}{l} w_1\\b_1 \end{array}\right]=-(S^tS+C_3I)^{-1}R^tu_1 \quad \text{and}\quad \left[\begin{array}{l} w_2\\b_2 \end{array}\right]=(R^tR+C_4I)^{-1}S^tu_2, \end{aligned}$$
(15)

where \(S=[K(A,C^t)\;e_2]\) and \(R=[K(B,C^t) \;e_1].\)
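A corresponding sketch for the nonlinear case, using the Gaussian kernel of Sect. 4 to form \(S\) and \(R\) and then recovering the surfaces via (15), is given below (again an illustrative fragment under the paper's notation).

```python
import numpy as np

def gaussian_kernel(X, Y, mu):
    """K(X, Y') with K(x1, x2) = exp(-mu ||x1 - x2||^2)."""
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-mu * np.maximum(d2, 0.0))

def nltsvm_kernel_surfaces(A, B, u1, u2, C3, C4, mu):
    """Build S = [K(A, C') e2], R = [K(B, C') e1] and recover (w1, b1), (w2, b2) via (15)."""
    C = np.vstack([A, B])                                      # C' = [A  B]'
    S = np.hstack([gaussian_kernel(A, C, mu), np.ones((A.shape[0], 1))])
    R = np.hstack([gaussian_kernel(B, C, mu), np.ones((B.shape[0], 1))])
    z1 = -np.linalg.solve(S.T @ S + C3 * np.eye(S.shape[1]), R.T @ u1)
    z2 = np.linalg.solve(R.T @ R + C4 * np.eye(R.shape[1]), S.T @ u2)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```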

3.3 Method of solution

In this subsection, we discuss a fast and effective Newton algorithm for solving the dual problems (9) and (10), recast below as unconstrained minimization problems (UMPs), by solving a system of linear equations a finite number of times.

Our NLTSVM algorithm is based directly on applying the Karush–Kuhn–Tucker (KKT) necessary and sufficient optimality conditions [21, 24] for the dual problems (9) and (10):

$$\begin{aligned} 0\le u_1\perp (Q_1u_1-e_1)\ge 0\quad \text{and}\quad 0\le u_2\perp (Q_2u_2-e_2)\ge 0. \end{aligned}$$
(16)

We use the well-known identity for two vectors (or real numbers) a and b:

$$\begin{aligned} 0\le a \perp b \ge 0 \quad \text{ if } \text{ and } \text{ only } \text{ if }\quad a=(a-\alpha b)_+ \quad \text{ for } \text{ any }\quad \alpha \ge 0. \end{aligned}$$

Accordingly, the solutions of the following equivalent pair of problems are considered [24]: for any \(\alpha _1,\alpha _2 >0\),

$$\begin{aligned} (Q_1u_1-e_1)=(Q_1u_1-\alpha _1u_1-e_1)_+ \quad\text{ and}\quad (Q_2u_2-e_2)=(Q_2u_2-\alpha _2u_2-e_2)_+. \end{aligned}$$
(17)

The optimality condition (17) is the necessary and sufficient condition satisfied by the unconstrained minima of the following pair of implicit Lagrangians [25] associated with the pair of dual problems (9) and (10): for any \(\alpha _1,\alpha _2 >0\),

$$\begin{aligned}\mathop {\min _{u_1\in R^{m_2}}} L_1(u_1)&=\frac{1}{2}{u_1}^tQ_1u_1-{e_1}^tu_1\\ &\quad+ \frac{1}{2\alpha _1}\left( \Vert (Q_1u_1-\alpha _1u_1-e_1)_+\Vert ^2-\Vert Q_1u_1-e_1\Vert ^2\right) \end{aligned}$$
(18)
$$\begin{aligned}\mathop {\min _{u_2\in R^{m_1}}}L_2(u_2)&=\frac{1}{2}{u_2}^tQ_2u_2-{e_2}^tu_2\\ &\quad+ \frac{1}{2\alpha _2}\left( \Vert (Q_2u_2-\alpha _2u_2-e_2)_+\Vert ^2-\Vert Q_2u_2-e_2\Vert ^2\right) \end{aligned}$$
(19)

where \(\alpha _k\), \(k=1,2\), is a sufficiently large but finite positive parameter. The implicit Lagrangian formulations [11, 24, 36] consist of replacing the non-negativity constrained quadratic minimization problems (9) and (10) by the equivalent unconstrained piecewise quadratic minimization problems (18) and (19).

Our finite Newton method consists of applying Newton's method to the UMPs (18) and (19); it can be shown to terminate in a finite number of steps at the global minimum. The gradient of \(L_k(u_k)\) is:

$$\begin{aligned} \nabla L_k(u_k)=\left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) \left[ \left( Q_ku_k-e\right) - \left( Q_ku_k-\alpha _ku_k-e\right) _{+}\right] . \end{aligned}$$
(20)

Notice that for \(k=1,2,\) the gradient \(\nabla L_k(u_k)\) is not differentiable and therefore the Hessian matrix of second order partial derivatives of \(L_k(u_k)\) is not defined in the usual sense. The generalized Hessian of \(L_k(u_k)\) in the sense of Hiriart-Urruty et al. [13] exists and is defined as follows:

$$\begin{aligned} {\partial }^2 L_k(u_k)=\left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) \left[ Q_k + \hbox {diag} \left( {Q_k} {u_k}-{\alpha _k} {u_k}- e\right) _{*}({\alpha _k} I - {Q_k})\right] , \end{aligned}$$
(21)

where \(\text{ diag }(\cdot )\) denotes a diagonal matrix and \((\cdot )_*\) denotes the componentwise step function. For \(k=1,2,\) the basic Newton step of the iterative algorithm determines the unknown \(u_k^{i+1}\) at the \((i+1)\)th iteration from the current \(i\)th iterate \(u_k^i\) via

$$\begin{aligned} \nabla L_k(u_k^i)+ \partial ^2 L_k(u_k^i)(u_k^{i+1}-u_k^i) =0 \quad \text{ where }\quad i=0,1,2,\ldots . \end{aligned}$$
(22)

Using the property that the matrices \(Q _k\), \(k=1,2\), defined in (9) and (10) are positive definite, and choosing the parameter \(\alpha _k\) to satisfy the condition [11] \(\alpha _k > \Vert Q_k\Vert\) for \(k=1,2,\) the Newton iteration can be rewritten in the following simpler form: for \(i=0,1,2,\ldots\)

$$\begin{aligned} u_k^{i+1}=u_k^{i}- \partial h_{k}(u_k^{i})^{-1}h_k(u_k^i). \end{aligned}$$
(23)

For \(k=1,2\), \(h_{k}(u_k)\) and \(\partial h_{k}(u_k)\) are defined as:

$$\begin{aligned} h_k(u_k)&:=(Q_ku_k-e)-(Q_ku_k-\alpha _ku_k-e)_{+} \\ &=\left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) ^{-1} \nabla L(u_k),\nonumber \\ \partial h_k(u_k)&:=Q_k + diag (Q_ku_k-\alpha _ku_k-e)_{*}(\alpha _{k}I-Q_{k})\\ &= \left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) ^{-1} \partial ^2 L(u_k). \end{aligned}$$
(24)

We use the simpler iteration (23), with \(h_k\) and \(\partial h_k\) given by (24), in our implementation instead of the equivalent iteration based on (20). Define the following matrices: for \(k=1,2,\)

$$\begin{aligned} E_k=\text{ diag }\left( Q_k u_k - \alpha _k u_k - e\right) _*, Z_k={F_k}^{-1} (I-E_k)\quad \text{ and }\quad F_k=\alpha _k E_k + \frac{I-E_k}{C_k}. \end{aligned}$$

One can obtain

$$\begin{aligned} \partial h_k(u_k)^{-1}=(I + Z_kN_k)^{-1}{F_k}^{-1}. \end{aligned}$$
(25)
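In code, (25) can be applied directly; the following sketch applies \(\partial h_k(u_k)^{-1}\) to a vector, assuming \(Q_k = I/C_k + N_k\) as in (9)–(10) (an illustrative dense-linear-algebra fragment).

```python
import numpy as np

def apply_dh_inverse(v, u, N, C, alpha, e):
    """Apply the inverse of dh(u) = Q + diag(Qu - alpha u - e)_* (alpha I - Q) to a
    vector v via the factorization (25), assuming Q = I/C + N as in (9)-(10)."""
    m = N.shape[0]
    Q = np.eye(m) / C + N
    E = np.diag((Q @ u - alpha * u - e > 0).astype(float))     # E_k = diag(.)_*
    F = alpha * E + (np.eye(m) - E) / C                        # F_k (diagonal, positive entries)
    Z = np.linalg.solve(F, np.eye(m) - E)                      # Z_k = F_k^{-1}(I - E_k)
    return np.linalg.solve(np.eye(m) + Z @ N,                  # (I + Z_k N_k)^{-1} F_k^{-1} v
                           np.linalg.solve(F, v))
```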

Now we state the Newton algorithm for solving the unconstrained minimization problems (18) and (19) with an arbitrary positive definite matrix \(Q_k\), using the simplified iteration (23) together with an Armijo stepsize [1, 20] in order to guarantee finite termination from any starting point.

Algorithm (figure a): Newton method with Armijo stepsize for solving (18) and (19).

The convergence and finite termination of the above Newton algorithm with Armijo stepsize follow as a simple extension of [22, 24].
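A minimal NumPy sketch of this Newton–Armijo iteration for a generic positive definite \(Q\) is given below; the stopping tolerance, iteration cap and Armijo constant are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def newton_armijo(Q, e, alpha=None, tol=1e-5, max_iter=50, rho=0.25):
    """Minimize the implicit Lagrangian (18)/(19),
        L(u) = 0.5 u'Qu - e'u + (1/(2 alpha)) (||(Qu - alpha u - e)_+||^2 - ||Qu - e||^2),
    by the Newton iteration (23) with an Armijo backtracking stepsize."""
    m = Q.shape[0]
    if alpha is None:
        alpha = 1.1 * np.linalg.norm(Q, 2)              # enforce alpha > ||Q||

    def L(u):
        r = Q @ u - e
        p = np.maximum(r - alpha * u, 0.0)              # (Qu - alpha u - e)_+
        return 0.5 * u @ Q @ u - e @ u + (p @ p - r @ r) / (2.0 * alpha)

    u = np.zeros(m)
    for _ in range(max_iter):
        r = Q @ u - e
        h = r - np.maximum(r - alpha * u, 0.0)          # h(u) of (24)
        if np.linalg.norm(h) < tol:                     # approximate optimality in (16)
            break
        E = (r - alpha * u > 0).astype(float)           # diagonal of diag(Qu - alpha u - e)_*
        dh = Q + E[:, None] * (alpha * np.eye(m) - Q)   # generalized Jacobian of (24)
        d = np.linalg.solve(dh, -h)                     # Newton direction
        grad = ((alpha * np.eye(m) - Q) / alpha) @ h    # gradient (20)
        lam, Lu = 1.0, L(u)
        while L(u + lam * d) > Lu + rho * lam * (grad @ d) and lam > 1e-8:
            lam *= 0.5                                  # Armijo backtracking
        u = u + lam * d
    return u
```

For the linear NLTSVM, this routine would be called once with \(Q_1, e_1\) and once with \(Q_2, e_2\) from (9)–(10), after which the hyperplanes are recovered via (11).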

Note that the computational complexities of SVM and TWSVM are \(O(m^3)\) and \(O(2\times (m/2)^3)\) respectively, where m is the total number of training points; thus TWSVM is approximately four times faster than SVM. The linear NLTSVM requires just two matrix inversions of order \((n+1)\times (n+1)\), where \(n\ll m\). For the nonlinear NLTSVM, inverses of matrices of order \((m+1)\times (m+1)\) are required. We utilize the Sherman–Morrison–Woodbury (SMW) formula to reduce the computational cost, so that only inverses of the smaller dimensions \(m_1\times m_1\) and \(m_2\times m_2\) are needed to solve (15).
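The SMW reduction can be sketched as follows for \((S^tS+C_3I)^{-1}\); this is our reading of the standard identity, and the analogous expression applies to \((R^tR+C_4I)^{-1}\).

```python
import numpy as np

def smw_inverse(S, C3):
    """(S'S + C3 I)^{-1} via the Sherman-Morrison-Woodbury identity
       (C3 I + S'S)^{-1} = (1/C3) (I - S'(C3 I + S S')^{-1} S),
    so that only the smaller m1 x m1 matrix C3 I + S S' has to be inverted."""
    m1, p = S.shape                                            # p = m + 1 in (15)
    small = np.linalg.inv(C3 * np.eye(m1) + S @ S.T)           # m1 x m1 inverse
    return (np.eye(p) - S.T @ small @ S) / C3
```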

4 Experimental results

In order to evaluate the efficiency of our NLTSVM, numerical experiments were performed on the 'Cross-Planes' and 'Ripley' datasets as examples of synthetic datasets and on several well-known, publicly available benchmark datasets, and the results were compared with GEPSVM, TWSVM, STWSVM and LSTWSVM. All the experiments were performed in the MATLAB R2010a environment on a PC running Windows XP with a 3.30 GHz Intel (R) Core (TM) i3-2120 processor and 4 GB of RAM. In our experiments, we employ the optimization toolbox of MATLAB for GEPSVM and TWSVM. In all the examples considered, the Gaussian kernel function with parameter \(\mu >0\), defined for \(x_{1},x_{2} \in R^{m}\) by

$$\begin{aligned} K(x_{1},x_{2})=\exp (-\mu \Vert x_{1}-x_{2}\Vert ^{2}), \end{aligned}$$

is taken. The classification accuracy of each algorithm was computed using the well-known ten-fold cross-validation methodology [9]. For brevity's sake, we set \(C_{1}=C_{2}\) for TWSVM, STWSVM and LSTWSVM and \(C_{1}=C_{2}\), \(C_{3}=C_{4}\) for our NLTSVM; the penalty parameters and the kernel parameter \(\mu\) were allowed to vary over the sets \(\{10^{-5}, 10^{-4},\ldots , 10^{5}\}\) and \(\{2^{-10}, 2^{-9},\ldots , 2^{10}\}\) respectively. For GEPSVM, the parameter \(\delta\) was allowed to vary over the set \(\{2^{-7}, 2^{-6},\ldots , 2^{7}\}\). Finally, with the optimal values so chosen, the classification accuracy and computational time on the test dataset were calculated.
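This model-selection protocol can be summarized by the following sketch; `train_and_score` is a hypothetical helper that trains one of the compared classifiers on a given split and returns its test accuracy, and the random fold assignment is an illustrative choice.

```python
import numpy as np

def ten_fold_grid_search(X, y, train_and_score):
    """Select (C, mu) by ten-fold cross-validation over the grids of Sect. 4,
    with C1 = C2 (and C3 = C4 handled analogously)."""
    C_grid = [10.0**k for k in range(-5, 6)]                   # {1e-5, ..., 1e5}
    mu_grid = [2.0**k for k in range(-10, 11)]                 # {2^-10, ..., 2^10}
    folds = np.array_split(np.random.permutation(len(y)), 10)
    best = (None, None, -np.inf)
    for C in C_grid:
        for mu in mu_grid:
            accs = []
            for i in range(10):
                tr = np.hstack([folds[j] for j in range(10) if j != i])
                te = folds[i]
                accs.append(train_and_score(X[tr], y[tr], X[te], y[te], C, mu))
            if np.mean(accs) > best[2]:
                best = (C, mu, np.mean(accs))
    return best
```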

The accuracy used to evaluate methods is defined as follows:

$$\begin{aligned} \text{ Accuracy } = (TP+TN)/(TP+TN+FP+FN), \end{aligned}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
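In code, this is simply (a hypothetical helper for labels in \(\{+1,-1\}\)):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN) for labels in {+1, -1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return (tp + tn) / (tp + tn + fp + fn)
```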

We conducted experiments on five synthetic datasets and fifteen real-world benchmark datasets to investigate the performance of our NLTSVM. The performance comparisons of the five algorithms are summarized in Table 1, where "Accuracy" denotes the mean value of the ten testing results and "Time" denotes the mean time taken over the ten experiments. Several conclusions can be drawn from these experiments on twenty datasets. In terms of prediction accuracy, our proposed NLTSVM yields the highest accuracy among the five algorithms on most of the datasets considered. The next best algorithm is LSTWSVM, which yields slightly lower testing accuracy than TWSVM, STWSVM and our NLTSVM, but higher than GEPSVM on most of the datasets. Among the five algorithms, GEPSVM yields the lowest testing accuracy on most of the datasets considered; moreover, on the Heart-Statlog, Ionosphere, Bupa, Transfusion, Haberman, Sonar, Wpbc, Cleve, Monks2, Monks3 and Splice datasets, GEPSVM produces extremely low testing accuracy compared with the four other algorithms. Table 1 also compares the computational time of all five algorithms with the Gaussian kernel. It is evident that our NLTSVM runs several orders of magnitude faster than GEPSVM and TWSVM and at nearly the same speed as STWSVM and LSTWSVM; the main reason is that NLTSVM solves two systems of linear equations instead of two QPPs. Our NLTSVM outperforms GEPSVM and TWSVM on most of the datasets considered, which clearly indicates its overall superiority. It is also worth mentioning that NLTSVM does not require any special optimizers.

Table 1 Performance comparisons of GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM on synthetic and real world datasets using Gaussian kernel

4.1 Synthetic datasets

In this subsection, we consider five examples to compare our NLTSVM with the other SVMs. First we take the simple two-dimensional "Cross Planes" dataset as an example of a synthetic dataset, which was also tested in [26, 37, 39–41]. It was generated by perturbing points lying on two intersecting lines, with the intersection point not at the center. The linear classifiers obtained by TWSVM and our NLTSVM, along with the input data, are shown in Fig. 1a, b respectively. In the figures, positive points are plotted as "\(+\)" and negative ones as "\(\circ\)", and the corresponding hyperplanes are also plotted. It is easy to see that the result of the proposed NLTSVM is more reasonable than that of TWSVM, which indicates that our NLTSVM can handle the "Cross Planes" dataset much better than TWSVM. The classification accuracy and central processing unit (CPU) time of each algorithm are summarized in Table 1; the results clearly demonstrate the advantage of our multi-plane classifier over GEPSVM, TWSVM, STWSVM and LSTWSVM. The second example is the artificially generated Ripley synthetic dataset [35]. It is also a two-dimensional dataset, comprising 250 training points and 1000 test points. Table 1 shows the learning results of the nonlinear GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM. It can be seen that our NLTSVM obtains better classification accuracy with less training time than GEPSVM, TWSVM, STWSVM and LSTWSVM, which indicates the suitability of our NLTSVM for this kind of problem, since its non-parallel hyperplanes successfully describe the two classes of points.

Fig. 1
figure 1

Classification results of a TWSVM b NLTSVM for Cross planes dataset

To further show the advantage of our NLTSVM in training speed, we compared its computational time with that of GEPSVM, TWSVM, STWSVM and LSTWSVM on three NDC [28] datasets. The parameters of all the algorithms were fixed, i.e. \(C_{1,2,3,4}=1\) and \(\mu =2^{-8}\), and 10 % of the data were selected for testing and the rest for training. The central processing unit (CPU) time of each algorithm is summarized in Table 2, from which one can see that our NLTSVM compares favourably with GEPSVM, TWSVM, STWSVM and LSTWSVM.

Table 2 Comparison on NDC datasets with Gaussian kernel

4.2 UCI datasets

To further evaluate the classification ability of our NLTSVM, we compare its behavior with that of GEPSVM, TWSVM, STWSVM and LSTWSVM on several publicly available benchmark datasets [27]. In all the real-world examples considered, each attribute of the original data is normalized as follows:

$$\begin{aligned} \bar{x_{ij}}=\frac{x_{ij}-x_{j}^{{\min }}}{x_{j}^{{\max }}-x_{j}^{{\min }}}, \end{aligned}$$

where \(x_{ij}\) is the (i, j)-th element of the input matrix A, \(\bar{x_{ij}}\) is its corresponding normalized value and \(x_{j}^{{\min }}={\min }_{i=1}^{m}(x_{ij})\) and \(x_{j}^{{\max }}={\max }_{i=1}^{m}(x_{ij})\) denote the minimum and maximum values, respectively, of the j-th column of A.
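Equivalently, in NumPy (an illustrative fragment; constant columns would need special handling to avoid division by zero):

```python
import numpy as np

def min_max_normalize(A):
    """Scale each attribute (column) of A to [0, 1] as in the formula above."""
    col_min, col_max = A.min(axis=0), A.max(axis=0)
    return (A - col_min) / (col_max - col_min)
```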

The sizes of the training and test data, the number of attributes, the training time and the accuracy of each algorithm for the nonlinear classifiers are summarized in Table 1, with the best accuracy shown in bold. One can observe from Table 1 that, in comparison with GEPSVM, TWSVM, STWSVM and LSTWSVM, our NLTSVM shows better generalization performance. In detail, for the Heart-Statlog dataset, the result of our NLTSVM (accuracy 85.71 %, time 0.0109 s) is better than those of the other four algorithms in both computational time and classification accuracy, i.e., GEPSVM (accuracy 75.71 %, 0.8125 s), TWSVM (accuracy 81.43 %, 0.4404 s), STWSVM (accuracy 80.00 %, 0.0145 s) and LSTWSVM (accuracy 80.00 %, 0.0797 s). Similar conclusions hold for the Bupa, WPBC, Cleve, Monks-2, Monks-3, Australian and Haberman datasets. For the Votes dataset, the classification accuracy obtained by our NLTSVM (95.35 %) is slightly lower than that of TWSVM (96.90 %) and LSTWSVM (96.12 %), but higher than that of GEPSVM (93.02 %). The empirical results further reveal that our NLTSVM, whose solutions are obtained by solving systems of linear equations, is faster than TWSVM on all the datasets. It is worth noticing that the choice of the parameters \(C_3\) and \(C_4\) affects the results significantly; these values vary in our NLTSVM rather than being a small fixed positive scalar as in TWSVM. The details of the optimal parameters for the Gaussian kernel are listed in Table 3. This clearly indicates that adding the regularization terms in our formulation is useful.

Table 3 Optimal parameters of GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM for Gaussian kernel

To further compare our NLTSVM with TWSVM, we also examine two-dimensional scatter plots obtained from part of the test points of the WDBC and Heart-Statlog datasets. The plots were obtained by plotting each test point x with coordinates given by its perpendicular distance from the positive hyperplane 1 and its distance from the negative hyperplane 2. In the figures, positive points are plotted as "\(+\)" and negative points as "\(\circ\)". Hence, the clusters of points indicate how well the classification criterion is able to discriminate between the two classes. From Figs. 2a, b and 3a, b, it can be seen that our NLTSVM obtains large distances from the test samples to the opposite hyperplanes. In contrast, TWSVM obtains small distances from the test points to the hyperplane pair. This means that our NLTSVM is much more robust than TWSVM.

Fig. 2
figure 2

2-D projections of a TWSVM, b NLTSVM from WDBC dataset. Plus scatter plot of the positive points. Circle scatter plot of the negative points

Fig. 3
figure 3

2-D projections of a TWSVM, b NLTSVM from Heart-Statlog dataset. Plus scatter plot of the positive points. Circle scatter plot of the negative points

For a further fair comparison, we use a statistical test to provide a correct analysis when comparing the performance of multiple algorithms over multiple datasets, an issue that has been widely discussed. Since the Friedman test with the corresponding post hoc tests has been identified as a simple, safe and robust nonparametric test for the comparison of several classifiers over multiple datasets [7], we use it to compare the generalization ability of the five algorithms. The average ranks of all algorithms on accuracy were computed and are listed in Table 4. We employ the Friedman test to check whether the measured average ranks are significantly different from the mean rank \(R_j =3\) expected under the null hypothesis: the statistic

$$\begin{aligned} \chi _{F}^2= \frac{12N}{k\left( k+1\right) }\left[ \sum \limits _{j=1}^k R_j^2 - \frac{k\left( k+1\right) ^2 }{4}\right] \end{aligned}$$

is distributed according to the \(\chi ^2\) distribution with \(k-1\) degrees of freedom, where k is the number of methods and N is the number of datasets. In our case,

$$\begin{aligned} \chi _{F}^2= & {} \frac{12\times 17}{5\left( 5+1\right) }\\ &\times\left[ 4.56^2 + 2.88^2 + 3.03^2 + 2.79^2+1.74^2 -\frac{5\left( 6\right) ^2}{4}\right] =27.7481.\\ F_F= & {} \frac{\left( N-1\right) \chi _{F}^2}{N\left( k-1\right) -\chi _{F}^2}= \frac{\left( 17-1\right) \times 27.7481}{17\left( 5-1\right) -27.7481}=11.0298. \end{aligned}$$
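This computation can be reproduced from the average ranks of Table 4 as follows (an illustrative check of the arithmetic above):

```python
import numpy as np

# Average ranks from Table 4 (GEPSVM, TWSVM, STWSVM, LSTWSVM, NLTSVM); N = 17 datasets, k = 5 methods.
ranks = np.array([4.56, 2.88, 3.03, 2.79, 1.74])
N, k = 17, 5

chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(ranks**2) - k * (k + 1)**2 / 4.0)  # approx. 27.75
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                                # approx. 11.03
CD = 2.459 * np.sqrt(k * (k + 1) / (6.0 * N))                                  # Nemenyi CD at p = 0.10, approx. 1.33
print(chi2_F, F_F, CD)
```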
Table 4 Average ranks of GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM with Gaussian kernel

With five algorithms and seventeen datasets, \(F_F\) is distributed according to the F-distribution with \(\left( k-1\right)\) and \(\left( k-1\right) \left( N-1\right) =\left( 4,64\right)\) degrees of freedom. The critical value of \(F\left( 4,64\right)\) for \(\alpha =0.05\) is 2.52, so we reject the null hypothesis \(\left( F_F >F(4,64) \right) .\) We use the Nemenyi test for further pairwise comparison. According to [7], at \(p=0.10,\) the critical difference is (CD) \(=q_\alpha \sqrt{\frac{k\left( k+1\right) }{6N}}=2.459\sqrt{\frac{5\times 6}{6\times 17}}=1.336.\) Since the difference between GEPSVM and our NLTSVM is larger than the critical difference \((4.56-1.74 =2.82>1.336),\) we can conclude that the performance of NLTSVM is significantly better than that of GEPSVM. Similarly, the performances of TWSVM, STWSVM and LSTWSVM are significantly better than that of GEPSVM. Further, since the differences between TWSVM, STWSVM, LSTWSVM and our NLTSVM are below the critical difference, the post hoc test is not powerful enough to detect any significant difference between these algorithms. However, the average rank of our NLTSVM is always lower than those of GEPSVM, TWSVM, STWSVM and LSTWSVM, so one can conclude that our NLTSVM is better than GEPSVM, TWSVM, STWSVM and LSTWSVM.

The computational times of the five algorithms shown in Table 1 reveal that the proposed NLTSVM is nearly seven times faster than GEPSVM and TWSVM. The proposed NLTSVM and LSTWSVM cost nearly the same time; one reason is that both solve systems of linear equations.

5 Conclusions and future work

In this paper, we proposed an implicit Lagrangian twin support vector machine classifier by formulating a pair of unconstrained minimization problems in dual variables whose solutions are obtained using a finite Newton method. The idea of our formulation is to reformulate TWSVM as a strongly convex problem by incorporating regularization techniques to improve the training speed and robustness. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm. To demonstrate the effectiveness of the proposed method, we performed numerical experiments on a number of interesting real-world datasets and compared the results with those of other SVMs. The comparison with GEPSVM, TWSVM, STWSVM and LSTWSVM clearly demonstrates the effectiveness and suitability of the proposed method. Similar to TBSVM, there are four parameters in our NLTSVM, so parameter selection is a practical issue that should be addressed in the future. Our future work will also include the extension of the proposed method to learning using privileged information.