1 Introduction

Support vector machine (SVM), introduced by Vapnik and coworkers [6, 44], is an excellent kernel-based tool for binary data classification. Over the past decades, SVM has played an important role in solving problems arising in the pattern recognition and machine learning communities. Its applications cover a wide spectrum of research areas, ranging from pattern recognition [30] and text categorization [16] to biomedicine [5]. SVMs have shown excellent performance on a wide variety of problems [15, 30] owing to the way they construct a separating hyperplane: the band between two parallel supporting hyperplanes separates the two classes and the distance between them is maximized, which leads to the introduction of a regularization term. Unlike other machine learning methods such as artificial neural networks (ANNs), training an SVM amounts to solving a linearly constrained quadratic programming problem (QPP) with a unique optimal solution. Combining this uniqueness with good generalization performance, SVM has become one of the most popular methods for solving classification problems. One of the main challenges for SVM is the large computational cost associated with the QPPs. To reduce the learning complexity of SVM, various algorithms with comparable classification ability have been reported, see [2, 6, 15, 19, 20, 24, 33, 43]. In particular, Fung and Mangasarian [11] proposed an implicit Lagrangian formulation for SVM and solved it using a Newton method; their method is simple, fast and applicable to problems with large-scale data.

To improve the computational speed, Jayadeva et al. [14] proposed the twin support vector machine (TWSVM) for binary classification in the spirit of the proximal SVM [10] and the generalized eigenvalue proximal support vector machine (GEPSVM) [26]. TWSVM generates two nonparallel hyperplanes by solving a pair of smaller-sized QPPs such that each hyperplane is closest to one of the classes and as far as possible from the other class. A fundamental difference between TWSVM and SVM is that TWSVM solves two small QPPs rather than one large QPP, which makes the learning speed of TWSVM approximately four times faster than that of the standard SVM. TWSVM has therefore become popular and widely used because of its low computational complexity. Recently, several variants of TWSVM have been proposed to reduce its time complexity while keeping its effectiveness, see [3, 4, 12, 17, 29, 31, 32, 34, 37–39, 41, 42, 45, 46]. In particular, the least squares twin SVM (LSTWSVM) [18] replaces the hinge loss of TWSVM with the squared loss, leading to very fast training since the two QPPs are replaced by two systems of linear equations.

One of the principal advantages of SVM is the implementation of the structural risk minimization (SRM) principle [36]. However, in the primal formulations of TWSVM and LSTWSVM [18], only the empirical risk is minimized. Moreover, the inverse of \(G^t G\) appears in the dual formulation of TWSVM, and an extra regularization term \(\delta I\) has to be added so that \(G^t G+\delta I\) is nonsingular. This is not entirely satisfactory from the theoretical point of view, although it has been handled by modifying the dual problems technically and elegantly [37]. Recently, Shao et al. [37] proposed the twin bounded support vector machine (TBSVM) based on TWSVM. Unlike TWSVM, the SRM principle is implemented in TBSVM, and a successive overrelaxation (SOR) technique is applied to speed up the computation.

Motivated by the works of [11, 24, 37], we propose in this paper an implicit Lagrangian formulation for twin support vector machine (TWSVM) classifiers, formulated as a pair of unconstrained minimization problems in dual variables whose solutions are obtained by a finite Newton method. There are some differences between our formulation and TWSVM. The idea of our formulation is to reformulate TWSVM as a strongly convex problem by incorporating regularization techniques, in order to improve the training speed and robustness. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm. It is also worth mentioning that the proposed formulation does not need any specialized optimization packages. The results of experiments conducted on both artificial and publicly available benchmark datasets confirm its efficacy and feasibility.

In this work, all vectors are taken as column vectors. The inner product of two vectors x, y in the n-dimensional real space \(R^n\) will be denoted by \(x^t y,\) where \(x^t\) is the transpose of x. Whenever x is orthogonal to y, we write \(x\perp y\). For \(x=(x_1,x_2,\ldots ,x_n)^t \in R^n\), the plus function \(x_+\) is defined as \((x_+)_i = \max \{0,x_i\},\) where \(i=1,2,\ldots ,n\). The 2-norm of a vector x and a matrix Q will be denoted by \(\Vert x \Vert\) and \(\Vert Q \Vert\), respectively; for simplicity, we drop the 2 from \(\Vert x \Vert _2\). We denote the vector of ones of dimension m by e and the identity matrix of appropriate size by I. If f is a real-valued function of the variable \(x=(x_1,x_2,\ldots ,x_n)^t \in R^n\), then the gradient of f is denoted by \(\nabla f = (\partial f/ \partial x_1,\ldots ,\partial f/ \partial x_n)^t\) and the Hessian matrix of f by \(\nabla ^2 f=(\partial ^2 f/ \partial x_i \partial x_j)_{i,j=1,2,\ldots ,n}.\)
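For concreteness, the plus function (and the componentwise step function \((\cdot )_*\) used later in Sect. 3.3) can be realized as in the following NumPy sketch. The experiments in this paper were carried out in MATLAB, so this fragment is purely illustrative.

```python
import numpy as np

def plus(x):
    """Componentwise plus function: (x_+)_i = max(0, x_i)."""
    return np.maximum(x, 0.0)

def step(x):
    """Componentwise step function (.)_* used in Sect. 3.3: 1 where x_i > 0, else 0
    (the value at x_i = 0 is a convention; here it is taken as 0)."""
    return (x > 0).astype(float)
```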

The remainder of this paper is organized as follows. Section 2 reviews the TWSVM formulation. Our proposed formulation and the related theoretical analysis are presented in Sect. 3. Numerical experiments performed on a number of interesting synthetic and real-world benchmark datasets, with results compared against other SVMs, are reported in Sect. 4. Finally, concluding remarks and future work are given in Sect. 5.

2 Twin support vector machine for classification

In this section, we give a brief description of the TWSVM formulation. For details, the interested reader is referred to [8, 14].

In 2007, Jayadeva et al. [14] proposed a new non-parallel support vector machine for binary classification, termed TWSVM, in the spirit of GEPSVM. The formulation of TWSVM is similar to that of the standard SVM [6, 44], except that not all the patterns appear in the constraints of either problem at the same time. This makes TWSVM [14] faster than the standard SVM.

Suppose that all the data points in class +1 are denoted by a matrix \(A\in R^{m_{1} \times n},\) whose ith row \(A_{i} \in R^{n}\) is a data point, and that the matrix \(B\in R^{m_{2} \times n}\) represents the data points of class \(-1\). Unlike SVM, the linear TWSVM [14] seeks a pair of non-parallel hyperplanes

$$\begin{aligned} f_{1} (x)=w_{1} ^{t} x+b_{1} \quad \text{ and } \quad f_{2} (x)=w_{2} ^{t} x+b_{2}. \end{aligned}$$
(1)

The idea in the linear TWSVM is to solve the following two QPPs with objective functions corresponding to one of the two classes and constraints corresponding to the other class:

$$\begin{aligned} \mathop {\min _{\left( w_1,b_1\right) \in R^{n+1}}}&\quad \frac{1}{2} \left\| Aw_{1} +e_{2} b_{1} \right\| ^{2} +C_{1} e_{1}^{t} \xi _{1} \nonumber \\ s.t.&\quad -\left( Bw_{1} +e_{1} b_{1} \right) +\xi _{1} \ge e_{1} ,\quad \xi _{1} \ge 0,\end{aligned}$$
(2)
$$\begin{aligned} \mathop {\min _{\left( w_2,b_2\right) \in R^{n+1}}}&\quad \frac{1}{2} \left\| Bw_{2} +e_{1} b_{2} \right\| ^{2} +C_{2} e_{2}^{t} \xi _{2} \nonumber \\ s.t.&\quad \left( Aw_{2} +e_{2} b_{2} \right) +\xi _{2} \ge e_{2} ,\quad \xi _{2} \ge 0, \end{aligned}$$
(3)

where \(C_{1} ,C_{2} >0\) are parameters and \(e_{1} ,e_{2}\) are vectors of ones of appropriate dimensions. The idea in TWSVM is thus to solve the two QPPs (2) and (3); each of them is a typical SVM formulation, except that not all data points appear in the constraints of either problem [14].

In order to derive the corresponding dual formulation, TWSVM assumes that the matrices \(G^{t} G\) and \(H^{t} H\), where \(G=[A\, \, \, e_{2} ]\) and \(H=[B\, \, \, e_{1} ],\) are non-singular. The dual QPPs are

$$\begin{aligned} \mathop {\min _{u_1 \in R^{m_2}}}&\quad \frac{1}{2} u_1 ^{t} H\left( G^{t} G\right) ^{-1} H^{t} u_1 - e_{1} ^{t} u_1 \nonumber \\ s.t.&\quad 0\le u_1 \le C_{1}, \end{aligned}$$
(4)
$$\begin{aligned} \mathop {\min _{u_2 \in R^{m_1}}}&\quad \frac{1}{2} u_2 ^{t} G\left( H^{t} H\right) ^{-1} G^{t} u_2 - e_{2} ^{t} u_2\nonumber \\ s.t.&\quad 0\le u_2 \le C_{2}. \end{aligned}$$
(5)

Remark 1

The matrices \(G^{t} G\) and \(H^{t} H\) appearing in the dual formulation (4) and (5) may be singular. To avoid the possible ill conditioning, the inverse matrices \((G^{t} G)^{-1}\) and \((H^{t} H)^{-1}\) are approximately replaced by \((G^{t} G+\delta I)^{-1}\) and \((H^{t} H+\delta I)^{-1}\), where \(\delta\) is a very small positive scalar and I is an identity matrix of appropriate dimensions.

Thus the nonparallel proximal hyperplanes are obtained from the solutions \(u_1\) and \(u_2\) of (4) and (5) by

$$\begin{aligned} \left[ \begin{array}{l} {w_1}\\ {b_1} \end{array} \right] = - {\left( {{G^t}G + \delta I} \right) ^{ - 1}}{H^t}u_1 \quad \text{ and } \quad \left[ \begin{array}{l} {w_2}\\ {b_2} \end{array} \right] = {\left( {{H^t}H + \delta I} \right) ^{ - 1}}{G^t}u_2. \end{aligned}$$
(6)

The dual problems (4) and (5) are derived and solved in [14]. Experimental results of Jayadeva et al. [14] show that the performance of TWSVM is better than that of the conventional SVM and GEPSVM on UCI machine learning datasets.
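For illustration, the sketch below recovers the first TWSVM hyperplane by applying a simple projected-gradient iteration to the box-constrained dual (4) and then using (6). This is not the optimizer employed in [14], which relies on standard QP routines; the step size, iteration count and the value of \(\delta\) are illustrative choices.

```python
import numpy as np

def twsvm_plane1(A, B, C1, delta=1e-4, iters=1000):
    """Recover (w1, b1) of TWSVM by (approximately) solving the dual (4) with a
    projected-gradient iteration and then applying (6).  Illustrative only; [14]
    solves (4)-(5) with standard QP optimizers.  delta is the small positive
    scalar of Remark 1."""
    m1, m2 = A.shape[0], B.shape[0]
    G = np.hstack([A, np.ones((m1, 1))])                       # G = [A  e2]
    H = np.hstack([B, np.ones((m2, 1))])                       # H = [B  e1]
    GtG_inv = np.linalg.inv(G.T @ G + delta * np.eye(G.shape[1]))
    M = H @ GtG_inv @ H.T                                      # dual Hessian in (4)
    e1 = np.ones(m2)
    eta = 1.0 / np.linalg.norm(M, 2)                           # step size <= 1 / Lipschitz constant
    u1 = np.zeros(m2)
    for _ in range(iters):
        u1 = np.clip(u1 - eta * (M @ u1 - e1), 0.0, C1)        # gradient step, project onto [0, C1]
    z1 = -GtG_inv @ H.T @ u1                                   # [w1; b1] from (6)
    return z1[:-1], z1[-1]
```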

3 Proposed Newton method for implicit Lagrangian twin support vector machines (NLTSVM)

Motivated by the works of [11, 24, 37], we propose an implicit Lagrangian formulation for the TWSVM classifiers by formulating a pair of unconstrained minimization problems in dual variables, whose solutions are obtained using a finite Newton method. The idea of our NLTSVM is to reformulate TWSVM as a strongly convex problem by incorporating regularization techniques to improve the training speed and robustness. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm.

3.1 Linear case

In our NLTSVM, the primal problems (2) and (3) of TWSVM are modified as follows:

$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits _{(w_{1}, b_{1}) \in R^{n+1}} & \frac{1}{2} \, \left\| Aw_{1} +e_{2}b_{1} \right\| ^{2} +\frac{C_{1} }{2} \left\| \xi _{1} \right\| ^{2} +\frac{C _{3} }{2} \left\| \left[ \begin{array}{ll} {w_{1} } \\ {b_{1} } \end{array}\right] \right\| ^{2} \\ s.t.& -\left( Bw_{1}+e_{1} b_{1} \right) +\xi _{1} \ge e_{1} ,\end{array} \end{aligned}$$
(7)
$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits_{(w_{2},b_2)\in R^{n+1}} & \frac{1}{2} \, \left\| Bw_{2} +e_{1}b_{2} \right\| ^{2} +\frac{C_{2} }{2} \left\| \xi _{2} \right\| ^{2}+\frac{C _{4} }{2} \left\| \left[ \begin{array}{l} {w_{2} } \\{b_{2} } \end{array}\right] \right\| ^{2} \\s.t.& \left( Aw_{2} +e_{2} b_{2} \right) +\xi _{2} \ge e_{2} .\end{array} \end{aligned}$$
(8)

Note that there are three terms in the objective functions of (7) and (8). The first term in the objective function of (7) or (8) is the sum of squared distances from the hyperplane to the points of one class; minimizing it tends to keep the hyperplane close to the points of that class (say the positive class). The constraints require the hyperplane to be at a distance of at least 1 from the points of the other class (say the negative class). The second term is the sum of squared error variables associated with these constraints, which require the distance from the hyperplane to the points of the other class to be at least one. The third term is a regularization term, similar to [6, 23, 37, 40, 41], which makes the objective function strongly convex; thus each problem has a unique global optimal solution.

Our NLTSVM penalizes the slack variables with the 2-norm instead of the 1-norm used in TWSVM and TBSVM, which makes the nonnegativity constraints (\(\xi _1,\xi _2\ge 0\)) redundant. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm. Similar to TBSVM, the SRM principle is implemented in our NLTSVM by adding a regularization term with the idea of maximizing the margin, whereas only the empirical risk is minimized in the primal problems of TWSVM, STWSVM and LSTWSVM. As with the standard SVM, this strategy makes our formulation more theoretically sound than the original TWSVM, STWSVM and LSTWSVM. The computational results, given in Sect. 4, clearly show that our NLTSVM does not compromise on generalization performance.

One can obtain the dual QPPs of (7) and (8) as follows:

$$\begin{aligned} \underset{{0\le u_1 \in R^{m_2}}}{\text {min}}\;L_1(u_1)= & {} \frac{1}{2}{u_1}^tQ_1u_1-{e_1}^tu_1 \end{aligned}$$
(9)
$$\begin{aligned} \underset{{0\le u_2 \in R^{m_1}}}{\text {min}}\;L_2(u_2)= & {} \frac{1}{2}{u_2}^tQ_2u_2-{e_2}^tu_2 \end{aligned}$$
(10)

where

$$\begin{aligned} Q_1\quad= & {} \frac{I}{C_1}+N_1;\quad Q_2=\frac{I}{C_2}+N_2; \quad N_1=H(G^tG+C_3I)^{-1}H^t;\\ N_2\quad= & {} G(H^tH+C_4I)^{-1}G^t;\quad G=[A \;e_2] \quad \text{ and }\quad H=[B\; e_1]. \end{aligned}$$

The nonparallel proximal hyperplanes are obtained from the solutions \(u_1\) and \(u_2\) of (9) and (10) by

$$\begin{aligned} \left[\begin{array}{l} w_1\\b_1 \end{array}\right]=-(G^tG+C_3I)^{-1}H^tu_1\quad \text{and}\quad \left[\begin{array}{l} w_2\\b_2 \end{array}\right]=(H^tH+C_4I)^{-1}G^tu_2. \end{aligned}$$
(11)

Remark 2

Unlike TWSVM, the matrices appearing in the dual objective functions (9) and (10) of our NLTSVM are positive definite and there is no upper bound on the dual variables \(u_1\) and \(u_2\).

Note that the regularization parameter \(\delta\) used in the TWSVM formulation is just a fixed small scalar, while the penalty parameters \(C_3,C_4\) used in our NLTSVM are weighting factors that determine the trade-off between the regularization term and the empirical risk. Therefore, selecting appropriate parameters \(C_3,C_4\) reflects the SRM principle. We see in Sect. 4 that the classification accuracy improves when the values of \(C_3,C_4\) are adjusted.
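Given dual solutions \(u_1,u_2\) (computed by the Newton method of Sect. 3.3), the matrices of (9)–(10) and the hyperplanes (11) can be assembled as in the following NumPy sketch; this is an illustrative fragment, not the MATLAB code used in our experiments.

```python
import numpy as np

def nltsvm_linear_matrices(A, B, C1, C2, C3, C4):
    """Build Q1, Q2 of (9)-(10) together with the matrices needed in (11)."""
    m1, m2 = A.shape[0], B.shape[0]
    G = np.hstack([A, np.ones((m1, 1))])                       # G = [A  e2]
    H = np.hstack([B, np.ones((m2, 1))])                       # H = [B  e1]
    P1 = np.linalg.inv(G.T @ G + C3 * np.eye(G.shape[1]))      # (G'G + C3 I)^{-1}
    P2 = np.linalg.inv(H.T @ H + C4 * np.eye(H.shape[1]))      # (H'H + C4 I)^{-1}
    N1, N2 = H @ P1 @ H.T, G @ P2 @ G.T
    Q1 = np.eye(m2) / C1 + N1                                  # positive definite (Remark 2)
    Q2 = np.eye(m1) / C2 + N2
    return Q1, Q2, G, H, P1, P2

def nltsvm_linear_planes(u1, u2, G, H, P1, P2):
    """Recover [w1; b1] and [w2; b2] from the dual solutions via (11)."""
    z1 = -P1 @ H.T @ u1
    z2 = P2 @ G.T @ u2
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```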

3.2 Nonlinear case

In this subsection, we extend our NLTSVM to the nonlinear case using the kernel trick. We consider the following kernel-generated surfaces instead of hyperplanes:

$$\begin{aligned} K(x^t,C^t)w_1+b_1=0 \quad \text{and}\quad K(x^t,C^t)w_2+b_2=0, \end{aligned}$$
(12)

where \(C^t=[A\;B]^t\) and K is an appropriately chosen kernel function.

Similar to the linear case, the optimization problems for our NLTSVM in the kernel feature space can be formulated as

$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits_{(w_{1},b_1) \in R^{m+1}} & \frac{1}{2} \, \left\|K(A,C^t)w_1+e_2b_1 \right\| ^{2} +\frac{C_{1} }{2} \left\| \xi _{1}\right\| ^{2} +\frac{C _{3} }{2} \left\| \left[ \begin{array}{ll}{w_{1} } \\ {b_{1} } \end{array}\right] \right\| ^{2} \\ s.t.&-\left( K(B,C^t)w_1+e_1b_1 \right) +\xi _{1} \ge e_{1} , \end{array}\end{aligned}$$
(13)
$$\begin{aligned}\begin{array}{ll} \mathop {\min }\limits_{(w_{2},b_2) \in R^{m+1}} & \frac{1}{2} \, \left\|K(B,C^t)w_2+e_1b_2 \right\| ^{2} +\frac{C_{2} }{2} \left\| \xi _{2}\right\| ^{2} +\frac{C _{4} }{2} \left\| \left[ \begin{array}{ll}{w_{2} } \\ {b_{2} } \end{array}\right] \right\| ^{2} \\ s.t.&\left( K(A,C^t)w_2+e_2b_2 \right) +\xi _{2} \ge e_{2} .\end{array}\end{aligned}$$
(14)

Here \(K(A,C^t)\) and \(K(B,C^t)\) are kernel matrices of sizes \(m_1\times m\) and \(m_2\times m\) respectively, where \(m=m_1 +m_2\). Similar to the linear case, the nonparallel kernel-generated surfaces are obtained from

$$\begin{aligned} \left[\begin{array}{l} w_1\\b_1 \end{array}\right]=-(S^tS+C_3I)^{-1}R^tu_1 \quad \text{and}\quad \left[\begin{array}{l} w_2\\b_2 \end{array}\right]=(R^tR+C_4I)^{-1}S^tu_2, \end{aligned}$$
(15)

where \(S=[K(A,C^t)\;e_2]\) and \(R=[K(B,C^t) \;e_1].\)
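A corresponding sketch for the nonlinear case, using the Gaussian kernel of Sect. 4 to form \(S\) and \(R\) and then recovering the surfaces via (15), is given below (again an illustrative fragment under the paper's notation).

```python
import numpy as np

def gaussian_kernel(X, Y, mu):
    """K(X, Y') with K(x1, x2) = exp(-mu ||x1 - x2||^2)."""
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-mu * np.maximum(d2, 0.0))

def nltsvm_kernel_surfaces(A, B, u1, u2, C3, C4, mu):
    """Build S = [K(A, C') e2], R = [K(B, C') e1] and recover (w1, b1), (w2, b2) via (15)."""
    C = np.vstack([A, B])                                      # C' = [A  B]'
    S = np.hstack([gaussian_kernel(A, C, mu), np.ones((A.shape[0], 1))])
    R = np.hstack([gaussian_kernel(B, C, mu), np.ones((B.shape[0], 1))])
    z1 = -np.linalg.solve(S.T @ S + C3 * np.eye(S.shape[1]), R.T @ u1)
    z2 = np.linalg.solve(R.T @ R + C4 * np.eye(R.shape[1]), S.T @ u2)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```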

3.3 Method of solution

In this subsection, we discuss a fast and effective Newton algorithm for solving the dual problems (9) and (10), recast below as unconstrained minimization problems (UMPs), by solving a system of linear equations a finite number of times.

Our NLTSVM algorithm is based directly on applying the Karush–Kuhn–Tucker (KKT) necessary and sufficient optimality conditions [21, 24] for the dual problems (9) and (10):

$$\begin{aligned} 0\le u_1\perp (Q_1u_1-e_1)\ge 0\quad \text{and}\quad 0\le u_2\perp (Q_2u_2-e_2)\ge 0. \end{aligned}$$
(16)

We use the well-known identity for two vectors (or real numbers) a and b:

$$\begin{aligned} 0\le a \perp b \ge 0 \quad \text{ if } \text{ and } \text{ only } \text{ if }\quad a=(a-\alpha b)_+ \quad \text{ for } \text{ any }\quad \alpha \ge 0. \end{aligned}$$

Accordingly, the solutions of the following equivalent pair of problems are considered [24]: for any \(\alpha _1,\alpha _2 >0\),

$$\begin{aligned} (Q_1u_1-e_1)=(Q_1u_1-\alpha _1u_1-e_1)_+ \quad\text{ and}\quad (Q_2u_2-e_2)=(Q_2u_2-\alpha _2u_2-e_2)_+. \end{aligned}$$
(17)

The optimality condition (17) is the necessary and sufficient condition satisfied by the unconstrained minima of the following pair of implicit Lagrangians [25] associated with the pair of dual problems (9) and (10): for any \(\alpha _1,\alpha _2 >0\),

$$\begin{aligned}\mathop {\min _{u_1\in R^{m_2}}} L_1(u_1)&=\frac{1}{2}{u_1}^tQ_1u_1-{e_1}^tu_1\\ &\quad+ \frac{1}{2\alpha _1}\left( \Vert (Q_1u_1-\alpha _1u_1-e_1)_+\Vert ^2-\Vert Q_1u_1-e_1\Vert ^2\right) \end{aligned}$$
(18)
$$\begin{aligned}\mathop {\min _{u_2\in R^{m_1}}}L_2(u_2)&=\frac{1}{2}{u_2}^tQ_2u_2-{e_2}^tu_2\\ &\quad+ \frac{1}{2\alpha _2}\left( \Vert (Q_2u_2-\alpha _2u_2-e_2)_+\Vert ^2-\Vert Q_2u_2-e_2\Vert ^2\right) \end{aligned}$$
(19)

where \(\alpha _k\), \(k=1,2\), is a sufficiently large but finite positive parameter. The implicit Lagrangian formulations [11, 24, 36] consist of replacing the non-negativity constrained quadratic minimization problems (9) and (10) by the equivalent unconstrained piecewise quadratic minimization problems (18) and (19).

Our finite Newton method consists of applying Newton's method to the UMPs (18) and (19); it can be shown to terminate in a finite number of steps at the global minimum. The gradient of \(L_k(u_k)\) is:

$$\begin{aligned} \nabla L_k(u_k)=\left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) \left[ \left( Q_ku_k-e\right) - \left( Q_ku_k-\alpha _ku_k-e\right) _{+}\right] . \end{aligned}$$
(20)

Notice that for \(k=1,2,\) the gradient \(\nabla L_k(u_k)\) is not differentiable and therefore the Hessian matrix of second order partial derivatives of \(L_k(u_k)\) is not defined in the usual sense. The generalized Hessian of \(L_k(u_k)\) in the sense of Hiriart-Urruty et al. [13] exists and is defined as follows:

$$\begin{aligned} {\partial }^2 L_k(u_k)=\left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) \left[ Q_k + \hbox {diag} \left( {Q_k} {u_k}-{\alpha _k} {u_k}- e\right) _{*}({\alpha _k} I - {Q_k})\right] , \end{aligned}$$
(21)

where \(\text{ diag }(\cdot )\) denotes a diagonal matrix and \((\cdot )_*\) denotes the componentwise step function. For \(k=1,2,\) the basic Newton step of the iterative algorithm determines the unknown \(u_k^{i+1}\) at the \((i+1)\)th iteration from the current \(i\)th iterate \(u_k^i\) via

$$\begin{aligned} \nabla L_k(u_k^i)+ \partial ^2 L_k(u_k^i)(u_k^{i+1}-u_k^i) =0 \quad \text{ where }\quad i=0,1,2,\ldots . \end{aligned}$$
(22)

Using the property that the matrices \(Q _k\), \(k=1,2\), defined in (9) and (10) are positive definite, and choosing the parameter \(\alpha _k\) to satisfy the condition [11] \(\alpha _k > \Vert Q_k\Vert\) for \(k=1,2,\) the Newton iteration can be rewritten in the following simpler form: for \(i=0,1,2,\ldots\)

$$\begin{aligned} u_k^{i+1}=u_k^{i}- \partial h_{k}(u_k^{i})^{-1}h_k(u_k^i). \end{aligned}$$
(23)

For \(k=1,2\), \(h_{k}(u_k)\) and \(\partial h_{k}(u_k)\) are defined as:

$$\begin{aligned} h_k(u_k)&:=(Q_ku_k-e)-(Q_ku_k-\alpha _ku_k-e)_{+} \\ &=\left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) ^{-1} \nabla L(u_k),\nonumber \\ \partial h_k(u_k)&:=Q_k + diag (Q_ku_k-\alpha _ku_k-e)_{*}(\alpha _{k}I-Q_{k})\\ &= \left( \frac{\alpha _k I-Q_k}{\alpha _k}\right) ^{-1} \partial ^2 L(u_k). \end{aligned}$$
(24)

We use the simpler iteration (23), with \(h_k\) and \(\partial h_k\) given by (24), in our implementation instead of the equivalent iteration based on (20). Define the following matrices: for \(k=1,2,\)

$$\begin{aligned} E_k=\text{ diag }\left( Q_k u_k - \alpha _k u_k - e\right) _*, Z_k={F_k}^{-1} (I-E_k)\quad \text{ and }\quad F_k=\alpha _k E_k + \frac{I-E_k}{C_k}. \end{aligned}$$

One can obtain

$$\begin{aligned} \partial h_k(u_k)^{-1}=(I + Z_kN_k)^{-1}{F_k}^{-1}. \end{aligned}$$
(25)
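In code, (25) can be applied directly; the following sketch applies \(\partial h_k(u_k)^{-1}\) to a vector, assuming \(Q_k = I/C_k + N_k\) as in (9)–(10) (an illustrative dense-linear-algebra fragment).

```python
import numpy as np

def apply_dh_inverse(v, u, N, C, alpha, e):
    """Apply the inverse of dh(u) = Q + diag(Qu - alpha u - e)_* (alpha I - Q) to a
    vector v via the factorization (25), assuming Q = I/C + N as in (9)-(10)."""
    m = N.shape[0]
    Q = np.eye(m) / C + N
    E = np.diag((Q @ u - alpha * u - e > 0).astype(float))     # E_k = diag(.)_*
    F = alpha * E + (np.eye(m) - E) / C                        # F_k (diagonal, positive entries)
    Z = np.linalg.solve(F, np.eye(m) - E)                      # Z_k = F_k^{-1}(I - E_k)
    return np.linalg.solve(np.eye(m) + Z @ N,                  # (I + Z_k N_k)^{-1} F_k^{-1} v
                           np.linalg.solve(F, v))
```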

Now we state the Newton algorithm for solving the unconstrained minimization problems (18) and (19) with an arbitrary positive definite matrix \(Q_k\), using the simplified iteration (23) together with an Armijo stepsize [1, 20] in order to guarantee finite termination from any starting point.

Algorithm (figure a): Newton method with Armijo stepsize for solving (18) and (19).

The convergence and finite termination of the above Newton algorithm with Armijo stepsize follow as a simple extension of [22, 24].
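A minimal NumPy sketch of this Newton–Armijo iteration for a generic positive definite \(Q\) is given below; the stopping tolerance, iteration cap and Armijo constant are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def newton_armijo(Q, e, alpha=None, tol=1e-5, max_iter=50, rho=0.25):
    """Minimize the implicit Lagrangian (18)/(19),
        L(u) = 0.5 u'Qu - e'u + (1/(2 alpha)) (||(Qu - alpha u - e)_+||^2 - ||Qu - e||^2),
    by the Newton iteration (23) with an Armijo backtracking stepsize."""
    m = Q.shape[0]
    if alpha is None:
        alpha = 1.1 * np.linalg.norm(Q, 2)              # enforce alpha > ||Q||

    def L(u):
        r = Q @ u - e
        p = np.maximum(r - alpha * u, 0.0)              # (Qu - alpha u - e)_+
        return 0.5 * u @ Q @ u - e @ u + (p @ p - r @ r) / (2.0 * alpha)

    u = np.zeros(m)
    for _ in range(max_iter):
        r = Q @ u - e
        h = r - np.maximum(r - alpha * u, 0.0)          # h(u) of (24)
        if np.linalg.norm(h) < tol:                     # approximate optimality in (16)
            break
        E = (r - alpha * u > 0).astype(float)           # diagonal of diag(Qu - alpha u - e)_*
        dh = Q + E[:, None] * (alpha * np.eye(m) - Q)   # generalized Jacobian of (24)
        d = np.linalg.solve(dh, -h)                     # Newton direction
        grad = ((alpha * np.eye(m) - Q) / alpha) @ h    # gradient (20)
        lam, Lu = 1.0, L(u)
        while L(u + lam * d) > Lu + rho * lam * (grad @ d) and lam > 1e-8:
            lam *= 0.5                                  # Armijo backtracking
        u = u + lam * d
    return u
```

For the linear NLTSVM, this routine would be called once with \(Q_1, e_1\) and once with \(Q_2, e_2\) from (9)–(10), after which the hyperplanes are recovered via (11).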

Note that the computational complexities of SVM and TWSVM are \(O(m^3)\) and \(O(2\times (m/2)^3)\) respectively, where m is the total number of training points; thus TWSVM is approximately four times faster than SVM. The linear NLTSVM requires just two matrix inversions of order \((n+1)\times (n+1)\), where \(n\ll m\). For the nonlinear NLTSVM, inverses of matrices of order \((m+1)\times (m+1)\) are required. We utilize the Sherman–Morrison–Woodbury (SMW) formula to reduce the computational cost, so that only inverses of the smaller dimensions \(m_1\times m_1\) and \(m_2\times m_2\) are needed to solve (15).
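The SMW reduction can be sketched as follows for \((S^tS+C_3I)^{-1}\); this is our reading of the standard identity, and the analogous expression applies to \((R^tR+C_4I)^{-1}\).

```python
import numpy as np

def smw_inverse(S, C3):
    """(S'S + C3 I)^{-1} via the Sherman-Morrison-Woodbury identity
       (C3 I + S'S)^{-1} = (1/C3) (I - S'(C3 I + S S')^{-1} S),
    so that only the smaller m1 x m1 matrix C3 I + S S' has to be inverted."""
    m1, p = S.shape                                            # p = m + 1 in (15)
    small = np.linalg.inv(C3 * np.eye(m1) + S @ S.T)           # m1 x m1 inverse
    return (np.eye(p) - S.T @ small @ S) / C3
```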

4 Experimental results

In order to evaluate the efficiency of our NLTSVM, numerical experiments were performed on the 'Cross-Planes' and 'Ripley' datasets as examples of synthetic datasets and on several well-known, publicly available benchmark datasets, and the results were compared with GEPSVM, TWSVM, STWSVM and LSTWSVM. All the experiments were performed in the MATLAB R2010a environment on a PC running Windows XP with a 3.30 GHz Intel (R) Core (TM) i3-2120 processor and 4 GB of RAM. In our experiments, we employ the optimization toolbox of MATLAB for GEPSVM and TWSVM. In all the examples considered, the Gaussian kernel function with parameter \(\mu >0\), defined for \(x_{1},x_{2} \in R^{m}\) by

$$\begin{aligned} K(x_{1},x_{2})=\exp (-\mu \Vert x_{1}-x_{2}\Vert ^{2}), \end{aligned}$$

is taken. The classification accuracy of each algorithm was computed using the well-known ten-fold cross-validation methodology [9]. For brevity's sake, we set \(C_{1}=C_{2}\) for TWSVM, STWSVM and LSTWSVM and \(C_{1}=C_{2}\), \(C_{3}=C_{4}\) for our NLTSVM; the penalty parameters and the kernel parameter \(\mu\) were allowed to vary over the sets \(\{10^{-5}, 10^{-4},\ldots , 10^{5}\}\) and \(\{2^{-10}, 2^{-9},\ldots , 2^{10}\}\) respectively. For GEPSVM, the parameter \(\delta\) was allowed to vary over the set \(\{2^{-7}, 2^{-6},\ldots , 2^{7}\}\). Finally, with the optimal values so chosen, the classification accuracy and computational time on the test dataset were calculated.
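This model-selection protocol can be summarized by the following sketch; `train_and_score` is a hypothetical helper that trains one of the compared classifiers on a given split and returns its test accuracy, and the random fold assignment is an illustrative choice.

```python
import numpy as np

def ten_fold_grid_search(X, y, train_and_score):
    """Select (C, mu) by ten-fold cross-validation over the grids of Sect. 4,
    with C1 = C2 (and C3 = C4 handled analogously)."""
    C_grid = [10.0**k for k in range(-5, 6)]                   # {1e-5, ..., 1e5}
    mu_grid = [2.0**k for k in range(-10, 11)]                 # {2^-10, ..., 2^10}
    folds = np.array_split(np.random.permutation(len(y)), 10)
    best = (None, None, -np.inf)
    for C in C_grid:
        for mu in mu_grid:
            accs = []
            for i in range(10):
                tr = np.hstack([folds[j] for j in range(10) if j != i])
                te = folds[i]
                accs.append(train_and_score(X[tr], y[tr], X[te], y[te], C, mu))
            if np.mean(accs) > best[2]:
                best = (C, mu, np.mean(accs))
    return best
```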

The accuracy used to evaluate methods is defined as follows:

$$\begin{aligned} \text{ Accuracy } = (TP+TN)/(TP+TN+FP+FN), \end{aligned}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
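In code, this is simply (a hypothetical helper for labels in \(\{+1,-1\}\)):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN) for labels in {+1, -1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return (tp + tn) / (tp + tn + fp + fn)
```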

We conducted experiments on five synthetic datasets and fifteen real-world benchmark datasets to investigate the performance of our NLTSVM. The performance comparisons of the five algorithms are summarized in Table 1, where "Accuracy" denotes the mean value of the ten testing results and "Time" denotes the mean time taken over the ten experiments. Several conclusions can be drawn from these experiments on twenty datasets. In terms of prediction accuracy, our proposed NLTSVM yields the highest accuracy among the five algorithms on most of the datasets considered. The next best algorithm is LSTWSVM, which yields slightly lower testing accuracy than TWSVM, STWSVM and our NLTSVM, but higher than GEPSVM on most of the datasets. Among the five algorithms, GEPSVM yields the lowest testing accuracy on most of the datasets considered; moreover, on the Heart-Statlog, Ionosphere, Bupa, Transfusion, Haberman, Sonar, Wpbc, Cleve, Monks2, Monks3 and Splice datasets, GEPSVM produces extremely low testing accuracy compared with the four other algorithms. Table 1 also compares the computational time of all five algorithms with the Gaussian kernel. It is evident that our NLTSVM runs several orders of magnitude faster than GEPSVM and TWSVM and at nearly the same speed as STWSVM and LSTWSVM; the main reason is that NLTSVM solves two systems of linear equations instead of two QPPs. Our NLTSVM outperforms GEPSVM and TWSVM on most of the datasets considered, which clearly indicates its overall superiority. It is also worth mentioning that NLTSVM does not require any special optimizers.

Table 1 Performance comparisons of GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM on synthetic and real world datasets using Gaussian kernel

4.1 Synthetic datasets

In this subsection, we consider five examples to compare our NLTSVM with the other SVMs. First we take the simple two-dimensional "Cross Planes" dataset as an example of a synthetic dataset, which was also tested in [26, 37, 39–41]. It was generated by perturbing points lying on two intersecting lines, with the intersection point not at the center. The linear classifiers obtained by TWSVM and our NLTSVM, along with the input data, are shown in Fig. 1a, b respectively. In the figures, positive points are plotted as "\(+\)" and negative ones as "\(\circ\)", and the corresponding hyperplanes are also plotted. It is easy to see that the result of the proposed NLTSVM is more reasonable than that of TWSVM, which indicates that our NLTSVM can handle the "Cross Planes" dataset much better than TWSVM. The classification accuracy and central processing unit (CPU) time of each algorithm are summarized in Table 1; the results clearly demonstrate the advantage of our multi-plane classifier over GEPSVM, TWSVM, STWSVM and LSTWSVM. The second example is the artificially generated Ripley synthetic dataset [35]. It is also a two-dimensional dataset, comprising 250 training points and 1000 test points. Table 1 shows the learning results of the nonlinear GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM. It can be seen that our NLTSVM obtains better classification accuracy with less training time than GEPSVM, TWSVM, STWSVM and LSTWSVM, which indicates the suitability of our NLTSVM for this kind of problem, since its non-parallel hyperplanes successfully describe the two classes of points.

Fig. 1
figure 1

Classification results of a TWSVM b NLTSVM for Cross planes dataset

To further show the advantage of our NLTSVM in training speed, we compared its computational time with that of GEPSVM, TWSVM, STWSVM and LSTWSVM on three NDC [28] datasets. The parameters of all the algorithms were fixed, i.e. \(C_{1,2,3,4}=1\) and \(\mu =2^{-8}\), and 10 % of the data were selected for testing and the rest for training. The central processing unit (CPU) time of each algorithm is summarized in Table 2, from which one can see that our NLTSVM compares favourably with GEPSVM, TWSVM, STWSVM and LSTWSVM.

Table 2 Comparison on NDC datasets with Gaussian kernel

4.2 UCI datasets

To further evaluate the classification ability of our NLTSVM, we compare its behavior with that of GEPSVM, TWSVM, STWSVM and LSTWSVM on several publicly available benchmark datasets [27]. In all the real-world examples considered, each attribute of the original data is normalized as follows:

$$\begin{aligned} \bar{x_{ij}}=\frac{x_{ij}-x_{j}^{{\min }}}{x_{j}^{{\max }}-x_{j}^{{\min }}}, \end{aligned}$$

where \(x_{ij}\) is the (i, j)-th element of the input matrix A, \(\bar{x_{ij}}\) is its corresponding normalized value and \(x_{j}^{{\min }}={\min }_{i=1}^{m}(x_{ij})\) and \(x_{j}^{{\max }}={\max }_{i=1}^{m}(x_{ij})\) denote the minimum and maximum values, respectively, of the j-th column of A.
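Equivalently, in NumPy (an illustrative fragment; constant columns would need special handling to avoid division by zero):

```python
import numpy as np

def min_max_normalize(A):
    """Scale each attribute (column) of A to [0, 1] as in the formula above."""
    col_min, col_max = A.min(axis=0), A.max(axis=0)
    return (A - col_min) / (col_max - col_min)
```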

The sizes of the training and test data, the number of attributes, the training time and the accuracy of each algorithm for the nonlinear classifiers are summarized in Table 1, with the best accuracy shown in bold. One can observe from Table 1 that, in comparison with GEPSVM, TWSVM, STWSVM and LSTWSVM, our NLTSVM shows better generalization performance. In detail, for the Heart-Statlog dataset, the result of our NLTSVM (accuracy 85.71 %, time 0.0109 s) is better than those of the other four algorithms in both computational time and classification accuracy, i.e., GEPSVM (accuracy 75.71 %, 0.8125 s), TWSVM (accuracy 81.43 %, 0.4404 s), STWSVM (accuracy 80.00 %, 0.0145 s) and LSTWSVM (accuracy 80.00 %, 0.0797 s). Similar conclusions hold for the Bupa, WPBC, Cleve, Monks-2, Monks-3, Australian and Haberman datasets. For the Votes dataset, the classification accuracy obtained by our NLTSVM (95.35 %) is slightly lower than that of TWSVM (96.90 %) and LSTWSVM (96.12 %), but higher than that of GEPSVM (93.02 %). The empirical results further reveal that our NLTSVM, whose solutions are obtained by solving systems of linear equations, is faster than TWSVM on all the datasets. It is worth noticing that the choice of the parameters \(C_3\) and \(C_4\) affects the results significantly; these values vary in our NLTSVM rather than being a small fixed positive scalar as in TWSVM. The details of the optimal parameters for the Gaussian kernel are listed in Table 3. This clearly indicates that adding the regularization terms in our formulation is useful.

Table 3 Optimal parameters of GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM for Gaussian kernel

To further compare our NLTSVM with TWSVM, we also examine two-dimensional scatter plots obtained from part of the test points of the WDBC and Heart-Statlog datasets. The plots were obtained by plotting each test point x with coordinates given by its perpendicular distance from the positive hyperplane 1 and its distance from the negative hyperplane 2. In the figures, positive points are plotted as "\(+\)" and negative points as "\(\circ\)". Hence, the clusters of points indicate how well the classification criterion is able to discriminate between the two classes. From Figs. 2a, b and 3a, b, it can be seen that our NLTSVM obtains large distances from the test samples to the opposite hyperplanes. In contrast, TWSVM obtains small distances from the test points to the hyperplane pair. This means that our NLTSVM is much more robust than TWSVM.

Fig. 2
figure 2

2-D projections of a TWSVM, b NLTSVM from WDBC dataset. Plus scatter plot of the positive points. Circle scatter plot of the negative points

Fig. 3
figure 3

2-D projections of a TWSVM, b NLTSVM from Heart-Statlog dataset. Plus scatter plot of the positive points. Circle scatter plot of the negative points

For a further fair comparison, we use a statistical test to provide a correct analysis when comparing the performance of multiple algorithms over multiple datasets, an issue that has been widely discussed. Since the Friedman test with the corresponding post hoc tests has been identified as a simple, safe and robust nonparametric test for the comparison of several classifiers over multiple datasets [7], we use it to compare the generalization ability of the five algorithms. The average ranks of all algorithms on accuracy were computed and are listed in Table 4. We employ the Friedman test to check whether the measured average ranks are significantly different from the mean rank \(R_j =3\) expected under the null hypothesis: the statistic

$$\begin{aligned} \chi _{F}^2= \frac{12N}{k\left( k+1\right) }\left[ \sum \limits _{j=1}^k R_j^2 - \frac{k\left( k+1\right) ^2 }{4}\right] \end{aligned}$$

is distributed according to the \(\chi ^2\) distribution with \(k-1\) degrees of freedom, where k is the number of methods and N is the number of datasets. In our case,

$$\begin{aligned} \chi _{F}^2= & {} \frac{12\times 17}{5\left( 5+1\right) }\\ &\times\left[ 4.56^2 + 2.88^2 + 3.03^2 + 2.79^2+1.74^2 -\frac{5\left( 6\right) ^2}{4}\right] =27.7481.\\ F_F= & {} \frac{\left( N-1\right) \chi _{F}^2}{N\left( k-1\right) -\chi _{F}^2}= \frac{\left( 17-1\right) \times 27.7481}{17\left( 5-1\right) -27.7481}=11.0298. \end{aligned}$$
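This computation can be reproduced from the average ranks of Table 4 as follows (an illustrative check of the arithmetic above):

```python
import numpy as np

# Average ranks from Table 4 (GEPSVM, TWSVM, STWSVM, LSTWSVM, NLTSVM); N = 17 datasets, k = 5 methods.
ranks = np.array([4.56, 2.88, 3.03, 2.79, 1.74])
N, k = 17, 5

chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(ranks**2) - k * (k + 1)**2 / 4.0)  # approx. 27.75
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                                # approx. 11.03
CD = 2.459 * np.sqrt(k * (k + 1) / (6.0 * N))                                  # Nemenyi CD at p = 0.10, approx. 1.33
print(chi2_F, F_F, CD)
```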
Table 4 Average ranks of GEPSVM, TWSVM, STWSVM, LSTWSVM and NLTSVM with Gaussian kernel

With five algorithms and seventeen datasets, \(F_F\) is distributed according to the F-distribution with \(\left( k-1\right)\) and \(\left( k-1\right) \left( N-1\right) =\left( 4,64\right)\) degrees of freedom. The critical value of \(F\left( 4,64\right)\) for \(\alpha =0.05\) is 2.52, so we reject the null hypothesis \(\left( F_F >F(4,64) \right) .\) We use the Nemenyi test for further pairwise comparison. According to [7], at \(p=0.10,\) the critical difference is (CD) \(=q_\alpha \sqrt{\frac{k\left( k+1\right) }{6N}}=2.459\sqrt{\frac{5\times 6}{6\times 17}}=1.336.\) Since the difference between GEPSVM and our NLTSVM is larger than the critical difference \((4.56-1.74 =2.82>1.336),\) we can conclude that the performance of NLTSVM is significantly better than that of GEPSVM. Similarly, the performances of TWSVM, STWSVM and LSTWSVM are significantly better than that of GEPSVM. Further, since the differences between TWSVM, STWSVM, LSTWSVM and our NLTSVM are below the critical difference, the post hoc test is not powerful enough to detect any significant difference between these algorithms. However, the average rank of our NLTSVM is always lower than those of GEPSVM, TWSVM, STWSVM and LSTWSVM, so one can conclude that our NLTSVM is better than GEPSVM, TWSVM, STWSVM and LSTWSVM.

The computational times of the five algorithms shown in Table 1 reveal that the proposed NLTSVM is nearly seven times faster than GEPSVM and TWSVM. The proposed NLTSVM and LSTWSVM cost nearly the same time; one reason is that both solve systems of linear equations.

5 Conclusions and future work

In this paper, we proposed an implicit Lagrangian twin support vector machine classifier by formulating a pair of unconstrained minimization problems in dual variables whose solutions are obtained using a finite Newton method. The idea of our formulation is to reformulate TWSVM as a strongly convex problem by incorporating regularization techniques to improve the training speed and robustness. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to the two quadratic programming problems of TWSVM and TBSVM, which leads to an extremely simple and fast algorithm. To demonstrate the effectiveness of the proposed method, we performed numerical experiments on a number of interesting real-world datasets and compared the results with those of other SVMs. The comparison with GEPSVM, TWSVM, STWSVM and LSTWSVM clearly demonstrates the effectiveness and suitability of the proposed method. Similar to TBSVM, there are four parameters in our NLTSVM, so parameter selection is a practical issue that should be addressed in the future. Our future work will also include the extension of the proposed method to learning using privileged information.