1 Introduction

Support vector machines (SVMs), proposed by Vapnik [25], are computationally powerful machine learning tools for classification. They have been successfully applied to problems of practical importance in diverse areas such as face detection [18], gene prediction [7], and text categorization [10].

The objective of SVM is to determine a separating hyperplane that maximizes the margin between the samples of the positive and negative classes, and to assign class labels to test samples according to the half-space in which they lie. It is well known that SVM determines the optimal hyperplane as the solution of a quadratic programming problem (QPP) with linear inequality constraints [3, 25].

Since SVM yields a robust, sparse, global solution with better generalization ability than other machine learning approaches such as artificial neural networks, it has become a state-of-the-art method for classification. However, one of the major challenges of SVM is the high computational cost of training, which restricts its application to problems with large datasets. To overcome this problem, efficient learning algorithms and models have been proposed in the literature over the past decade [2, 9, 11, 16, 19, 24]. Recently, Mangasarian and Wild [16] proposed a new SVM model called the generalized eigenvalue proximal SVM (GEPSVM), in which binary classification is obtained by constructing two non-parallel hyperplanes such that the samples of each class cluster around the corresponding hyperplane; this amounts to solving two generalized eigenvalue problems. Similar in spirit to GEPSVM, a formulation called the twin SVM (TWSVM) was proposed in [9], in which two non-parallel hyperplanes are constructed by solving two QPPs of smaller size rather than a single large QPP as in the standard SVM. This strategy makes TWSVM roughly four times faster than the standard SVM while retaining good generalization ability [9, 12]. Due to these advantages, TWSVM has become one of the most popular methods for classification. For interesting work on least squares TWSVM (LS-TWSVM) and smooth TWSVM, see [11, 12]. For other extensions of TWSVM, the interested reader is referred to [20, 21, 23].

With the aim of obtaining an efficient TWSVM model, and following the novel approach used to solve the dual SVM in [1, 26], a new unconstrained Lagrangian twin SVM (ULTWSVM) formulation is proposed in this paper. Since the objective functions contain the non-smooth ‘plus’ function, the proposed minimization problems are solved either by considering its generalized Hessian [5, 8] or by replacing the non-smooth ‘plus’ function with the smooth approximation function of [13] and then applying the Newton-Armijo algorithm [13]. Convergence and finite termination follow directly from the results of [13, 14]. Finally, the effectiveness of the proposed ULTWSVM formulation is demonstrated by performing experiments on a number of interesting real-world datasets and comparing the results with those of SVM, LS-TWSVM and TWSVM.

Throughout this work, all vectors are assumed to be column vectors. The inner product of two vectors \(\mathbf{x}, \mathbf{y}\) in the n-dimensional real space \(\Re^{n}\) is denoted by \(\mathbf{x}^{t}\mathbf{y}\), where \(\mathbf{x}^{t}\) is the transpose of \(\mathbf{x}\). For any vector \(\mathbf{x}=(x_{1},\ldots,x_{n})^{t}\in \Re^{n}\), the plus function \(\mathbf{x}_{+}\) is defined componentwise as \((\mathbf{x}_{+})_{i}=\max\{0,x_{i}\}\) for \(i=1,\ldots,n\). The 2-norm of a vector \(\mathbf{x}\) is denoted by \(\|\mathbf{x}\|\). We denote the vector of ones of dimension m by \(\mathbf{e}\) and the identity matrix of appropriate size by I. If f is a real-valued function of the variable \(\mathbf{x}=(x_{1},\ldots,x_{n})^{t}\in \Re^{n}\), then its gradient vector and Hessian matrix are denoted by \(\nabla f=(\partial f/\partial x_{1},\ldots,\partial f/\partial x_{n})^{t}\) and \(\nabla^{2} f=(\partial^{2} f/\partial x_{i}\partial x_{j})_{i,j=1,\ldots,n}\) respectively.

The paper is organized as follows. In Section 2, the standard SVM, TWSVM and LS-TWSVM are reviewed. The proposed unconstrained TWSVM formulation in its dual form and the Newton iterative method for solving it are described in Section 3. In Section 4, numerical experiments are performed on a number of real-world datasets and their results are compared with those of SVM, LS-TWSVM and TWSVM. Finally, conclusions and future work are given in Section 5.

2 Related work

In this section, we briefly describe the standard SVM for binary classification problems and two of its important variants, the twin support vector machine (TWSVM) and the least squares TWSVM (LS-TWSVM).

Consider the binary classification problem and assume that a training set \(\{(\textbf {x}_{i} ,y_{i})\}_{i=1}^{m}\) is given, where each input sample \(\mathbf{x}_{i}\in\Re^{n}\) has the corresponding class label \(y_{i}\in\{-1,+1\}\).

2.1 Support vector machine (SVM)

Assume that the input samples are mapped into a higher dimensional feature space via a nonlinear map φ(.). Then an SVM classifier seeks an optimal hyperplane of the form \(\mathbf{w}^{t}\varphi(\mathbf{x})+b=0\) in the feature space, where the bias term \(b\in\Re\) and the vector \(\mathbf{w}\) normal to the hyperplane are the unknowns, determined by solving the following QPP [3, 25]

$$\begin{array}{cc} &{\underset{\textbf{w},b,\boldsymbol{\xi}}{ {\min}}} \frac{1}{2}\textbf{w}^{t}\textbf{w}+C\textbf{e}^{t}\boldsymbol{\xi}\\ \text{subject to:}\\ &y_{i} \,\,(\,\textbf{w}^{t}\varphi (\textbf{x}_{i} )+b\,)\,\,\ge \,\,1\,-\,\xi_{i} \end{array} $$

and

$$ \xi_{i} \ge 0 \,\, \text{for} \,\, i= 1,2,\ldots,m. $$
(1)

Here, \(\boldsymbol{\xi}=(\xi_{1},\ldots,\xi_{m})^{t}\) is the vector of slack variables and C > 0 is a parameter.

Usually, problem (1) is solved through its Wolfe dual, which is obtained as

$$\begin{array}{cc} &{\underset{\mathbf{u}}{{\min}}} \,\,\frac{1}{2}\sum\limits_{i,j=1}^{m} {y_{i} y_{j} \varphi (\mathbf{x}_{i} )^{t}\varphi (\mathbf{x}_{j} )u_{i} u_{j} -} \sum\limits_{i=1}^{m} {u_{i} } \end{array} $$

subject to

$$ \sum\limits_{i=1}^{m} {u_{i} y_{i} =0} \text{ and } 0\le u_{i} \le C \text{ for } i= 1,2,\ldots,m, $$
(2)

where \(\mathbf{u}=(u_{1},\ldots,u_{m})^{t}\in\Re^{m}\) is the vector of Lagrange multipliers. In this case, the decision function f(.) is taken as

$$ f(\mathbf{x})=\textit{sign}\,\left({\sum\limits_{s=1}^{N_{SV} } {u_{s} y_{s} \varphi (\mathbf{x})^{t}\varphi (\mathbf{x}_{s} )+b} } \right), $$
(3)

where \(N_{SV}\) is the number of support vectors \(\mathbf{x}_{s}\in\Re^{n}\) for which \(0<u_{s}<C\).

By applying the kernel trick, i.e. by taking \(k(\mathbf{x},\mathbf{z})=\varphi(\mathbf{x})^{t}\varphi(\mathbf{z})\) for \(\mathbf{x},\mathbf{z}\in\Re^{n}\) in (2) and (3), where k(.,.) is a given kernel function, the explicit construction of the nonlinear mapping φ(.) is avoided. In this work, the Gaussian kernel function of the form

$$k(\mathbf{x},\mathbf{z})= \exp (\,-\mu \,\| \mathbf{x}-\mathbf{z}\|^{2}\,) $$

is considered, where μ > 0 is a parameter.
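For concreteness, a minimal NumPy sketch (not taken from the paper) of computing such a Gaussian kernel matrix; the array names and the value of mu are illustrative only:

```python
import numpy as np

def gaussian_kernel(X, Z, mu):
    """Kernel matrix with entries k(x_i, z_j) = exp(-mu * ||x_i - z_j||^2)."""
    # Pairwise squared Euclidean distances between rows of X and rows of Z.
    sq_dist = (np.sum(X ** 2, axis=1)[:, None]
               + np.sum(Z ** 2, axis=1)[None, :]
               - 2.0 * X @ Z.T)
    return np.exp(-mu * np.maximum(sq_dist, 0.0))

# Example: K(A, C^t) for a toy augmented matrix C = [A; B].
A = np.random.rand(5, 3)              # 5 class (+1) samples, 3 features
B = np.random.rand(4, 3)              # 4 class (-1) samples
C = np.vstack([A, B])
K_A = gaussian_kernel(A, C, mu=0.5)   # shape (5, 9); rows are K(x^t, C^t)
```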

2.2 Twin support vector machine (TWSVM)

Assume that the training set consists of \(m_{1}\) samples of class (+1) and \(m_{2}\) samples of class (−1), so that \(m=m_{1}+m_{2}\). Further, let the samples of class (+1) and class (−1) be represented by the matrices \(A\in \boldsymbol {\Re }^{{m}_{1} \times n}\) and \(B\in \boldsymbol {\Re }^{{m}_{2} \times n}\) respectively.

Similar to SVM, the nonlinear TWSVM problem can be formulated by mapping the training samples into a higher dimensional feature space via a kernel function k(.,.) and performing the linear TWSVM classification in the feature space. More precisely, TWSVM determines two kernel generated surfaces of the form [9]

$$ K(\mathbf{x}^{t},C^{t})\,\mathbf{w}_{1} +b_{1} =0\,\text{ and }\,K(\mathbf{x}^{t},C^{t})\,\mathbf{w}_{2} +b_{2} =0 $$
(4)

such that each surface is as close as possible to the samples of one class and at a distance of at least one unit from the samples of the other class, where \(C=[A;B]\) is the augmented matrix of size \(m\times n\) and \(K(\mathbf{x}^{t},C^{t})=(k(\mathbf{x},\mathbf{x}_{1}),\ldots,k(\mathbf{x},\mathbf{x}_{m}))\) is a row vector in \(\Re^{m}\). In fact, instead of solving two eigenvalue problems as in GEPSVM [16], the non-parallel kernel surfaces (4) are obtained by solving the following pair of QPPs [9]

$$ \begin{array}{cc} &{\underset{(\mathbf{w}_{1} ,\,b_{1} ,\,\boldsymbol{\xi}_{2} )\,\in\,\boldsymbol{\Re}^{m+1+m_{2}}}{{\min}}} \frac{1}{2}\left\| K(A,C^{t})\,{\mathbf{w}}_{1} + \mathbf{e}_{1} b_{1} \right\|^{2}+ C_{2} \mathbf{e}_{2}^{t} \boldsymbol{\xi}_{2}\\ \text{subject to}\\ &-(K(B,C^{t})\mathbf{w}_{1} + \mathbf{e}_{2}b_{1})+ \boldsymbol{\xi}_{2}\ge \mathbf{e}_{2}, \boldsymbol{\xi}_{2} \ge 0 \end{array} $$
(5a)

and

$$ \begin{array}{cc} &{\underset{(\mathbf{w}_{2} ,\,b_{2} ,\,\boldsymbol{\xi}_{1} )\,\in \,\boldsymbol{\Re}^{m+1+m_{1}}}{{\min}}}\frac{1}{2}\left\| K(B,C^{t})\mathbf{w}_{2} + \mathbf{e}_{2}b_{2}\right\|^{2} +{C}_{1}\,\, \mathbf{e}^{t}_{1}\boldsymbol{\xi}_{1}\\ \text{subject to}\\ &(K(A,C^{t})\mathbf{w}_{2} + \mathbf{e}_{1}b_{2})+ \boldsymbol{\xi}_{1}\ge \mathbf{e}_{1}, \boldsymbol{\xi}_{1} \ge 0 \end{array} $$
(5b)

where \(\boldsymbol {\xi }_{1} \in \boldsymbol {\Re }^{{m}_{1} }\), \(\boldsymbol {\xi }_{2} \in \boldsymbol {\Re }^{{m}_{2}}\) are vectors of slack variables, \(C_{1},C_{2}>0\) are the regularization parameters, and the unknowns are \(\mathbf{w}_{1},\mathbf{w}_{2}\in\Re^{m}\) and \(b_{1},b_{2}\in\Re\).

In practice, the solution of the above pair of primal problems (5a) and (5b) is obtained by constructing their Wolfe duals and solving them. Finally, for any test sample xR n, its class label is assigned according to its proximity to the non-parallel surfaces, i.e.

$$ \text{class } i=\arg \,\underset{k=1,2}{\min} \,\vert K({\mathbf{ x}}^{t},C^{t})\,{\mathbf{w}}_{k} +b_{k} \vert, $$
(6)

where \(\vert K(\mathbf{x}^{t},C^{t})\,\mathbf{w}_{k}+b_{k}\vert\) is the perpendicular distance from \(\mathbf{x}\in\Re^{n}\) to the surface \(K(\mathbf{x}^{t},C^{t})\,\mathbf{w}_{k}+b_{k}=0\). For more details on TWSVM, see [9].
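The decision rule (6) amounts to comparing two absolute values per test sample. A minimal sketch, assuming the surface parameters (w1, b1), (w2, b2) and the kernel rows K(x^t, C^t) of the test samples have already been computed (the function and argument names are illustrative):

```python
import numpy as np

def twsvm_predict(K_test, w1, b1, w2, b2):
    """Assign labels by rule (6): K_test holds the rows K(x^t, C^t) of the test samples."""
    d1 = np.abs(K_test @ w1 + b1)     # proximity to the surface of class (+1)
    d2 = np.abs(K_test @ w2 + b2)     # proximity to the surface of class (-1)
    return np.where(d1 <= d2, 1, -1)  # ties broken in favour of class (+1)
```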

2.3 Least squares twin support vector machine (LS-TWSVM)

Similar to the least squares SVM (LS-SVM) [24], the extension of TWSVM to least squares TWSVM (LS-TWSVM) was proposed in [11], leading to a pair of QPPs with equality constraints. More precisely, the nonlinear LS-TWSVM is defined as

$$ \begin{array}{cc} &{\underset{(\mathbf{w}_{1} ,\,b_{1} ,\,\boldsymbol{\xi}_{2} )\,\,\in \,\boldsymbol{\Re}^{m+1+m_{2}}}{ {\min}}}\frac{1}{2}\vert\vert K(A, C^{t})\mathbf{w}_{1} + \mathbf{e}_{1}b_{1}\vert\vert^{2} +\frac{C_{2}}{2}\boldsymbol{\xi}^{t}_{2}\boldsymbol{\xi}_{2}\\ \text{subject to}\\ &-(K(B,C^{t})\mathbf{w}_{1} + \mathbf{e}_{2}b_{1})+ \boldsymbol{\xi}_{2}= \mathbf{e}_{2} \end{array} $$
(7a)

and

$$ \begin{array}{cc} &{\underset{(\mathbf{w}_{2} ,\,b_{2} ,\,\boldsymbol{\xi}_{1} )\,\in \,\boldsymbol{\Re}^{m+1+m_{1}}}{{\min}}} \frac{1}{2} \vert \vert K(B,C^{t})\,{\mathbf{w}}_{2} +\mathbf{e}_{2} b_{2} \vert \vert^{2}+ \frac{C_{1} }{2}{\boldsymbol{\xi}}_{1}^{t} {\boldsymbol{\xi }}_{1}\\ \text{subject to}\\ &(K(A,C^{t})\,{\mathbf{w}}_{2} +{\mathbf{e}}_{1} b_{2} )+ {\boldsymbol{\xi}}_{1} ={\mathbf{e}}_{1} . \end{array} $$
(7b)

In fact, on substituting the equality constraints into the objective functions, problems (7a) and (7b) become

$$ {\underset{({\mathbf{w}}_{1} ,\,b_{1} )\,\in\,\boldsymbol{\Re}^{m+1}}{{\min}}} \frac{1}{2} \vert \vert K(A,C^{t})\,{\mathbf{w}}_{1} +{\mathbf{e}}_{1} b_{1} \vert \vert^{2}+ \frac{C_{2}}{2}\vert \vert K(B,C^{t})\,{\mathbf{w}}_{1} +{\mathbf{ e}}_{2} b_{1} +{\mathbf{e}}_{2} \vert \vert^{2} $$
(8a)

and

$$ {\underset{({\mathbf{w}}_{2} ,\,b_{2} )\,\in \,\boldsymbol{\Re}^{m+1}}{{\min}}} \frac{1}{2} \vert \vert K(B,C^{t})\,{\mathbf{w}}_{2} +{\mathbf{e}}_{2} b_{2} \vert\vert^{2}+ \frac{C_{1}}{2}\vert \vert \,\,K(A,C^{t})\,{\mathbf{w}}_{2} +{\mathbf{e}}_{1} b_{2}\,-\,{\mathbf{e}}_{1}\,\vert \vert^{2}. $$
(8b)

In this case, their solutions become

$$\begin{array}{@{}rcl@{}}\left[ {{\begin{array}{*{20}c} {\mathbf{w}}_{1} \hfill \\ {b_{1} } \hfill \\ \end{array}}} \right]\,&=&\,-(H^{t}H+\frac{1}{C_{2} }G^{t}G)^{-1}H^{t}\,{\mathbf{e}}_{2} \,\, \text{and} \\\, \left[{{\begin{array}{l} {\mathbf{w}}_{2} \hfill \\ {b_{2} } \hfill \\ \end{array}}} \right]\,&=&\,(G^{t}G+\frac{1}{C_{1} }H^{t}H)^{-1}G^{t}\,{\mathbf{e}}_{1} , \end{array} $$

where

$$ G=[K(A,C^{t})\,{\mathbf{e}}_{1}] \, \text{and} \, H=[K(B,C^{t})\,{\mathbf{e}}_{2} ] $$
(9)

are augmented matrices of sizes \(m_{1}\times(m+1)\) and \(m_{2}\times(m+1)\) respectively. Since LS-TWSVM requires only the solution of two systems of linear equations, it is faster in training than TWSVM, while its classification accuracy is comparable to that of TWSVM [11].
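Because the LS-TWSVM solution reduces to two linear solves, it can be sketched in a few lines of NumPy. This is an illustration only; the small ridge term eps is an assumption added for numerical safety and is not part of (8a)–(8b):

```python
import numpy as np

def lstwsvm_train(K_A, K_B, C1, C2, eps=1e-8):
    """Closed-form LS-TWSVM solution of (8a)-(8b).
    K_A = K(A, C^t), shape (m1, m); K_B = K(B, C^t), shape (m2, m)."""
    m1, m2 = K_A.shape[0], K_B.shape[0]
    G = np.hstack([K_A, np.ones((m1, 1))])    # G = [K(A, C^t)  e_1]
    H = np.hstack([K_B, np.ones((m2, 1))])    # H = [K(B, C^t)  e_2]
    R = eps * np.eye(G.shape[1])              # small ridge for numerical safety (assumption)
    z1 = -np.linalg.solve(H.T @ H + (1.0 / C2) * (G.T @ G) + R, H.T @ np.ones(m2))
    z2 = np.linalg.solve(G.T @ G + (1.0 / C1) * (H.T @ H) + R, G.T @ np.ones(m1))
    return z1[:-1], z1[-1], z2[:-1], z2[-1]   # (w1, b1, w2, b2)
```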

3 Proposed unconstrained Lagrangian twin SVM (ULTWSVM)

Following the work of [1, 26], a new variant of TWSVM in its dual form is proposed in this section as a pair of unconstrained minimization problems whose solutions are obtained by the Newton iterative method. We discuss the proposed model for both the linear and nonlinear cases.

Instead of taking the 1-norm of the vector of slack variables \(\boldsymbol{\xi}_{k}\) with weight \(C_{k}>0\) in (5a) and (5b), where k=1,2, the square of the 2-norm of \(\boldsymbol{\xi}_{k}\) with weight \(\frac {C_{k} }{2}\) is minimized in our modified TWSVM formulation. In fact, the linear TWSVM in 2-norm solves the pair of QPPs [9]

$$ \begin{array}{cc} &{\underset{({\mathbf{w}}_{1} ,b_{1} ,\boldsymbol{\xi}_{2} )\,\in \,\boldsymbol{\Re}^{n+1+m_{2}}}{{\min}}} \frac{1}{2} \vert \vert A{\mathbf{w}}_{1} +{\mathbf{e}}_{1} b_{1} \vert \vert^{2}+ \, \frac{C_{2} }{2}{\boldsymbol{\xi}}_{2}^{t} {\boldsymbol{\xi}}_{2}\\ \text{subject to}\\ &-(B{\mathbf{w}}_{1} +{\mathbf{e}}_{2} b_{1} )+\boldsymbol{\xi}_{2} \ge {\mathbf{e}}_{2} \end{array} $$
(10a)

and

$$ \begin{array}{cc} &{\underset{({\mathbf{w}}_{2} ,b_{2} ,\boldsymbol{\xi}_{1} )\,\,\in \,\boldsymbol{\Re}^{n+1+m_{1} }}{{\min}}} \frac{1}{2} \vert \vert B{\mathbf{w}}_{2} +{\mathbf{e}}_{2} b_{2} \vert \vert^{2}+ \, \frac{C_{1}}{2}\boldsymbol{\xi}_{1}^{t} \boldsymbol{\xi}_{1} \\ \text{subject to}\\ &(A{\mathbf{w}}_{2} +{\mathbf{e}}_{1} b_{2} )+\boldsymbol{\xi}_{1} \ge { \mathbf{e}}_{1} , \end{array} $$
(10b)

where \(\mathbf{w}_{1},\mathbf{w}_{2}\in\Re^{n}\) and \(b_{1},b_{2}\in\Re\) are the unknowns.

Note that since the non-negativity constraints of the slack variables will be automatically satisfied at optimality [15], they have been dropped in (10a) and (10b).

By considering Lagrangian functions and using Karush-Kuhn-Tucker (KKT) conditions, the Wolfe duals of (10a) and (10b) can be formulated as QPPs of the following form

$$ {\underset{0\,\le \,{\mathbf{u}}_{1} \,\in \,\boldsymbol{\Re}^{{m}_{1}}}{{\min}}} \,L_{1} ({\mathbf{u}}_{1} )\,=\,\frac{1}{2}{\mathbf{u}}_{1}^{t} Q_{1} { \mathbf{u}}_{1} -{\mathbf{e}}_{1}^{t} {\mathbf{u}}_{1} $$
(11a)

and

$$ {\underset{0\,\le \,{\mathbf{u}}_{2} \,\in \,\boldsymbol{\Re}^{{m}_{2}}}{{\min}}} \,L_{2} ({\mathbf{u}}_{2} )\,=\,\frac{1}{2}{\mathbf{u}}_{2}^{t} Q_{2} {\mathbf{u}}_{2} -{\mathbf{e}}_{2}^{t} {\mathbf{u}}_{2} $$
(11b)

where

$$ Q_{1} =\left({\frac{I}{C_{1}}+G(H^{t}H)^{-1}G^{t}} \right) \, \text{and} \, Q_{2} =\left({\frac{I}{C_{2}}+H(G^{t}G)^{-1}H^{t}} \right); $$
(12)

and \(G=[A\;\;\mathbf{e}_{1}]\) and \(H=[B\;\;\mathbf{e}_{2}]\) are augmented matrices of sizes \(m_{1}\times(n+1)\) and \(m_{2}\times(n+1)\) respectively. Here \({\mathbf {u}}_{1} \in \boldsymbol {\Re }^{{m}_{1} }\) and \({\mathbf {u}}_{2} \in \boldsymbol {\Re }^{{m}_{2}}\) are Lagrange multipliers satisfying the following

$$ \left[ {{\begin{array}{*{20}c} {{\mathbf{w}}_{1}} \hfill \\ {b_{1} } \hfill \\ \end{array} }} \right]\,=\,-(G^{t}G)^{-1}H^{t}\,{\mathbf{u}}_{2}~\text{and}~\left[ {{\begin{array}{*{20}c} {{\mathbf{w}}_{2}} \hfill \\ {b_{2}} \hfill \\ \end{array}}} \right]\,=\,(H^{t}H)^{-1}G^{t}\,{\mathbf{u}}_{1} . $$
(13)

Finally, using the solutions of (11a) and (11b) together with (13), one can determine the final classifier (6).

The nonlinear TWSVM in 2-norm determines two non-parallel hyperplanes in the feature space of the form (4) by solving the following pair of QPPs:

$$\begin{array}{cc} &{\underset{({\mathbf{w}}_{1} ,b_{1} ,\boldsymbol{\xi}_{2} )\,\in \,\boldsymbol{\Re}^{m+1+m_{2}}}{{\min}}} \frac{1}{2} \vert \vert K(A,C^{t}){\mathbf{w}}_{1} +{\mathbf{e}}_{1} b_{1} \vert \vert^{2}+ \, \frac{C_{2} }{2}\boldsymbol{\xi}_{2}^{t} \boldsymbol{\xi}_{2}\\ \text{subject to:}\\ &-(K(B,C^{t}){\mathbf{w}}_{1} +{\mathbf{e}}_{2} b_{1} )+ \boldsymbol{\xi}_{2} \ge {\mathbf{e}}_{2} \end{array} $$

and

$$\begin{array}{cc} &{\underset{({\mathbf{w}}_{2} ,b_{2} ,\boldsymbol{\xi}_{1} )\,\in \,\boldsymbol{\Re}^{m+1+m_{1}}}{{\min}}} \frac{1}{2} \vert \vert K(B,C^{t}){\mathbf{w}}_{2} +{\mathbf{e}}_{2} b_{2} \vert \vert^{2}+ \, \frac{C_{1} }{2}\boldsymbol{\xi}_{1}^{t} \boldsymbol{\xi}_{1}\\ \text{subject to:} \\ &(K(A,C^{t}){\mathbf{w}}_{2} +{\mathbf{e}}_{1} b_{2} )+ \boldsymbol{\xi}_{1} \ge {\mathbf{e}}_{1} \end{array} $$

where \(\boldsymbol {\xi }_{1} \in \boldsymbol {\Re }^{{m}_{1} }\), \(\boldsymbol {\xi }_{2} \in \boldsymbol {\Re }^{{m}_{2}}\) are slack variable vectors and C 1,C 2>0 are input parameters.

With the introduction of the Lagrange multipliers \({\mathbf {u}}_{1} \in \boldsymbol {\Re }^{m_{1} }\) and \({\mathbf {u}}_{2} \in \boldsymbol {\Re }^{{m}_{2}}\), the duals of the above problems can again be derived in the same form as (11a) and (11b), where the matrices \(Q_{1},Q_{2}\) are defined by (12); however, in this case, the augmented matrices G and H are given by (9). Using their solutions and (13), the kernel-generated functions (4) are easily obtained.

To reformulate the pair of duals (11a) and (11b) into an equivalent pair of unconstrained minimization problems, which constitutes our proposed formulation, and to obtain their solutions by iterative methods, one can rewrite them, for either the linear or the nonlinear case, as

$$ {\underset{0\,\le \,{\mathbf{u}}_{1} \,\in \,\boldsymbol{\Re}^{{m}_{1} }}{{\min}}} L_{1} ({\mathbf{u}}_{1} )=\frac{1}{2}{\mathbf{u}}_{1}^{t} \left({\frac{I}{C_{1}}+ \boldsymbol{\mathcal{G}}\,\boldsymbol{\mathcal{G}}^{t}} \right){\mathbf{u}}_{1} -{\mathbf{e}}_{1}^{t} {\mathbf{u}}_{1} $$
(14)

and

$$ {\underset{0\,\,\le \,\,{\mathbf{u}}_{2} \,\in \,\,\boldsymbol{\Re}^{{m}_{2}}}{{\min}}} \, L_{2} ({\mathbf{u}}_{2} )=\frac{1}{2}{\mathbf{u}}_{2}^{t} \left({\frac{I}{C_{2} }+\boldsymbol{\mathcal{H}}\,\boldsymbol{\mathcal{H}}^{t}} \right){\mathbf{u}}_{2} -{\mathbf{e}}_{2}^{t} {\mathbf{u}}_{2} , $$
(15)

where \(\boldsymbol {\mathcal {G}}=G(H^{t}H)^{-1}H^{t}\) and \(\boldsymbol {\mathcal {H}}=H(G^{t}G)^{-1}G^{t}\) are matrices of sizes \(m_{1}\times m_{2}\) and \(m_{2}\times m_{1}\) respectively.

Now, let us consider the minimization problem (14). It can be equivalently written as

$$\begin{array}{cc} &{\underset{{\begin{array}{l} 0\,\le\,{\mathbf{u}}_{1} \,\in\,\boldsymbol{\Re}^{{m}_{1} } \\ {\mathbf{v}}_{2} \,\in \,\boldsymbol{\Re}^{{m}_{2}} \\ \end{array}}}{\min}} \left({\frac{1}{2C_{1} }{\mathbf{u}}_{1}^{t} {\mathbf{u}}_{1} -{\mathbf{ e}}_{1}^{t} {\mathbf{u}}_{1} } \right)+\frac{1}{2}{\mathbf{v}}_{2}^{t} {\mathbf{v}}_{2}\\ \text{subject to}\\ &{\mathbf{v}}_{2} =\boldsymbol{\mathcal{G}}^{t}\,{\mathbf{u}}_{1} \in \boldsymbol{\Re}^{{m}_{2}}. \end{array} $$

By introducing the Lagrange multiplier \(\mathbf{z}_{2} \in \boldsymbol {\Re }^{{m}_{2} }\), the dual of the above problem can be expressed as

$$\begin{array}{@{}rcl@{}} &&{\underset{{\mathbf{z}}_{2} \,\,\in \,\,\boldsymbol{\Re}^{{m}_{2}}}{\max}} \,\,\{{\underset{{\begin{array}{l} 0 \le{\mathbf{u}}_{1} \in \boldsymbol{\Re}^{{m}_{1} } \\ {\mathbf{v}}_{2} \in \boldsymbol{\Re}^{{m}_{2}} \\ \end{array}}}{\min}} [\,\,\left({\frac{1}{2C_{1} }{\mathbf{u}}_{1}^{t} {\mathbf{u}}_{1} -{\mathbf{e}}_{1}^{t} {\mathbf{u}}_{1} } \right)\\&&{\kern5pt}+\frac{1}{2}{\mathbf{ v}}_{2}^{t} {\mathbf{v}}_{2} +{\mathbf{z}}_{2}^{t} (\boldsymbol{\mathcal{G}}^{t}{\mathbf{u}}_{1} -{\mathbf{v}}_{2} )\,\,]\,\,\}\\ &&= {\underset{{\mathbf{z}}_{2} \,\,\in \,\,\boldsymbol{\Re}^{{m}_{2}}}{\max}} \,\,\{ {\underset{0\,\,\le \,\,{\mathbf{u}}_{1} \,\in \,\,\boldsymbol{\Re}^{{m}_{1}}}{\min}} [\,\,\left({\frac{1}{2C_{1}}{\mathbf{u}}_{1}^{t} {\mathbf{u}}_{1} -{\mathbf{e}}_{1}^{t} {\mathbf{u}}_{1} +{\mathbf{z}}_{2}^{t} \boldsymbol{\mathcal{G}}^{t}{\mathbf{ u}}_{1} } \right)\,\,]\,\,\\&&{\kern5pt}+{\underset{{\mathbf{v}}_{2} \,\in \,\,\boldsymbol{\Re}^{{m}_{2}}}{\min}} \,(\,\frac{1}{2}{\mathbf{v}}_{2}^{t} {\mathbf{v}}_{2} -{\mathbf{z}}_{2}^{t} {\mathbf{v}}_{2} )\,\,\}. \end{array} $$
(16)

However, it can be easily verified analytically that [22]

$$ {\underset{0 \le {\mathbf{u}}_{1}\, \in\, \boldsymbol{\Re}^{{m}_{1}}}{\min}} [\,\left({\frac{1}{2C_{1} }{\mathbf{u}}_{1}^{t} {\mathbf{u}}_{1} -{\mathbf{e}}_{1}^{t} {\mathbf{u}}_{1} +{\mathbf{z}}_{2}^{t} \boldsymbol{\mathcal{G}}^{t}{\mathbf{ u}}_{1} } \right)\,]\,= -\frac{C_{1} }{2}\vert \vert \,({\mathbf{e}}_{1} -\boldsymbol{\mathcal{G}}\,{\mathbf{z}}_{2} )_{+} \vert \vert^{2}\,\, $$
(17)

and is attained at

$${\mathbf{u}}_{1} =C_{1} \,({\mathbf{e}}_{1} -\boldsymbol{\mathcal{G}}\,{\mathbf{z}}_{2} )_{+} \,. $$

Similarly,

$$ {\underset{{\mathbf{v}}_{2} \,\in \,\boldsymbol{\Re}^{{m}_{2}}}{\min}} \,(\,\frac{1}{2}{\mathbf{v}}_{2}^{t} {\mathbf{v}}_{2} -{\mathbf{ z}}_{2}^{t} {\mathbf{v}}_{2} )\,=\,-\,\frac{1}{2}{\mathbf{ z}}_{2}^{t} {\mathbf{z}}_{2} $$
(18)

when \(\mathbf{v}_{2}=\mathbf{z}_{2}\).

Applying the results of (17) and (18) in (16), the dual problem (14) can be written as an unconstrained, strongly convex minimization problem in its simpler form

$$ {\underset{{\mathbf{z}}_{2} \,\in\,\boldsymbol{\Re}^{{m}_{2}}}{\min}} \,\tilde{{L}}_{1} ({\mathbf{z}}_{2} )\,=\,\frac{1}{2}{\mathbf{ z}}_{2}^{t} {\mathbf{z}}_{2} +\frac{C_{1} }{2}\vert \vert \,({\mathbf{e}}_{1} -\boldsymbol{\mathcal{G}}\,{\mathbf{z}}_{2} )_{+} \vert \vert^{2}. $$
(19)

By following the above procedure, the dual problem (15) can be equivalently written as a strongly convex minimization problem of the form

$$ {\underset{{\mathbf{z}}_{1} \,\in \,\boldsymbol{\Re}^{{m}_{1}}}{\min}} \,\tilde{{L}}_{2} ({\mathbf{z}}_{1} )\,=\,\frac{1}{2}{\mathbf{ z}}_{1}^{t} {\mathbf{z}}_{1} +\frac{C_{2} }{2}\vert \vert \,({\mathbf{e}}_{2} -\boldsymbol{\mathcal{H}}\,{\mathbf{z}}_{1})_{+} \vert \vert^{2}. $$
(20)

Since the objective functions \(\tilde {{L}}_{1} (.),\,\tilde {{L}}_{2} (.)\) of the proposed ULTWSVM are strongly convex, each of the problems (19) and (20) has a unique solution. From the above discussion one can easily conclude that if \({\tilde {\mathbf {z}}}_{1} \) and \({\tilde {\mathbf {z}}}_{2} \) are the unique solutions of (20) and (19) respectively, and if we set \({\tilde {\mathbf {u}}}_{1} =C_{1} \,({\mathbf {e}}_{1} -\boldsymbol {\mathcal {G}}\,{\tilde {\mathbf {z}}}_{2} )_{+}\) and \({\tilde {\mathbf {u}}}_{2} =C_{2} \,({\mathbf {e}}_{2} -\boldsymbol {\mathcal {H}}\,{\tilde {\mathbf {z}}}_{1} )_{+}\), then \({\tilde {\mathbf {u}}}_{1} \) and \({\tilde {\mathbf {u}}}_{2} \) are solutions of the dual problems (14) and (15) respectively.

The above approach thus replaces the pair of constrained optimization problems (14) and (15) by an equivalent pair of unconstrained optimization problems.

Remark 1

For the derivation of the proposed problem formulations (19) and (20), it is required that the inverses of the matrices \(H^{t}H\) and \(G^{t}G\) of order (m+1) exist. However, since the matrix \(H^{t}H\) (and similarly \(G^{t}G\)) is only positive semi-definite, its inverse may not exist. Following [9], a regularization term \(\sigma_{H} I\) may be introduced, where \(\sigma_{H}>0\) is a very small number, so that the matrix \((\sigma_{H} I + H^{t}H)\) becomes positive definite; its inverse can then be computed using the Sherman-Morrison-Woodbury (SMW) identity [6], i.e.

$$(\sigma_{H}\,I\,+\,H^{t}H)^{-1}=\frac{1}{\sigma_{H} }\,(I-H^{t}(\sigma_{H} \,I\,+\,HH^{t})^{-1}H), $$

which has the advantage that only \((\sigma_{H}\,I+HH^{t})^{-1}\), a matrix of order \(m_{2}\), needs to be computed. Similarly, introducing a regularization term \(\sigma_{G} I\) with a very small \(\sigma_{G}>0\), one can compute \((\sigma _{G} \,I\,+\,G^{t}G)^{-1}=\frac {1}{\sigma _{G} }\,(I-G^{t}(\sigma _{G} \,I\,\,+\,GG^{t})^{-1}G)\), in which the matrix \((\sigma_{G}\,I+GG^{t})^{-1}\) is of order \(m_{1}\) only.
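A small NumPy sketch of the SMW-based inverse from Remark 1; the value of sigma and the toy dimensions are assumptions used only to check the identity numerically:

```python
import numpy as np

def smw_inverse(H, sigma=1e-5):
    """(sigma*I + H^t H)^{-1} via the SMW identity: only the smaller matrix
    (sigma*I + H H^t), of order m2, is inverted directly."""
    m2, p = H.shape
    small_inv = np.linalg.inv(sigma * np.eye(m2) + H @ H.T)
    return (np.eye(p) - H.T @ small_inv @ H) / sigma

# Sanity check against the direct inverse on a small random matrix.
H = np.random.rand(6, 10)
direct = np.linalg.inv(1e-5 * np.eye(10) + H.T @ H)
assert np.allclose(smw_inverse(H), direct, rtol=1e-4, atol=1e-4)
```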

Finally, to solve (19) and (20), it is proposed to find their critical points, i.e. the roots of the following system of nonlinear equations, by the Newton iterative method:

$$\begin{array}{@{}rcl@{}} &&\nabla \tilde{{L}}_{1} ({\mathbf{z}}_{2} )={\mathbf{z}}_{2} -C_{1} \,\boldsymbol{\mathcal{G}}^{t}({\mathbf{e}}_{1} -\boldsymbol{\mathcal{G}}\,{\mathbf{z}}_{2} )_{+} =0 \,\text{and} \end{array} $$
$$\begin{array}{@{}rcl@{}} &&\nabla \tilde{{L}}_{2} ({\mathbf{z}}_{1} )={\mathbf{z}}_{1} -C_{2} \boldsymbol{\mathcal{H}}^{t}({ \mathbf{e}}_{2} -\boldsymbol{\mathcal{H}}\,{\mathbf{z}}_{1} )_{+} =0. \end{array} $$
(21)

Since each of the equations in (21) contains the non-differentiable ‘plus’ function, it is proposed to solve them by the Newton iterative method with Armijo step size, using either the generalized Hessian approach of [5] or the smoothing technique detailed in [13].

For k=1,2, using the generalized derivative, a generalized Hessian matrix of \(\tilde {{L}}_{k} (.)\) can be obtained [8] as

$$\begin{array}{@{}rcl@{}} \nabla^{2}\tilde{{L}}_{1} ({\mathbf{z}}_{2} )&=&I+C_{1} \boldsymbol{\mathcal{G}}^{t}\,diag(\,({\mathbf{e}}_{1} -\boldsymbol{\mathcal{G}}\,{\mathbf{z}}_{2} )_{\ast } )\,\boldsymbol{\mathcal{G}} \, \text{and} \\\nabla^{2}\tilde{{L}}_{2} ({\mathbf{z}}_{1} )&=&I+C_{2} \boldsymbol{\mathcal{H}}^{t}\,diag(({ \mathbf{e}}_{2} -\boldsymbol{\mathcal{H}}\,{\mathbf{z}}_{1} )_{\ast } )\,\boldsymbol{\mathcal{H}}. \end{array} $$

where \((\cdot)_{\ast}\) denotes the step function applied componentwise, taking the value 1 for positive entries and 0 for negative entries. Clearly \(\nabla ^{2}\tilde {{L}}_{k} (.)\) is symmetric and positive definite, and therefore the solution of each system of nonlinear equations in (21) can be computed using the fast Newton iterative algorithm with Armijo step size, whose proof of convergence and finite termination follows from [14].
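A minimal NumPy sketch of this generalized-Hessian Newton iteration applied to (19); a plain Newton step is used (no Armijo line search, cf. Remark 3 below), the tolerance and iteration limit mirror the values reported later in Section 4, and the matrix Gc standing for \(\boldsymbol{\mathcal{G}}\) is generated randomly purely for illustration:

```python
import numpy as np

def newton_generalized_hessian(Gc, C1, tol=1e-3, itmax=10):
    """Solve (19), min_z 0.5*z'z + (C1/2)*||(e1 - Gc z)_+||^2,
    by the generalized-Hessian Newton iteration."""
    m1, m2 = Gc.shape
    e1 = np.ones(m1)
    z = np.zeros(m2)
    for _ in range(itmax):
        r = e1 - Gc @ z
        grad = z - C1 * Gc.T @ np.maximum(r, 0.0)            # gradient in (21)
        if np.linalg.norm(grad) < tol:
            break
        D = (r > 0.0).astype(float)                          # entries of diag((e1 - Gc z)_*)
        hess = np.eye(m2) + C1 * Gc.T @ (D[:, None] * Gc)    # generalized Hessian
        z = z - np.linalg.solve(hess, grad)                  # plain Newton step
    u1 = C1 * np.maximum(e1 - Gc @ z, 0.0)                   # recover the dual solution of (14)
    return z, u1

# Toy usage with an arbitrary matrix standing in for the script G of the text.
Gc = np.random.rand(8, 5)
z2, u1 = newton_generalized_hessian(Gc, C1=1.0)
```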

The smooth approximation approach [13] is a popular method for solving optimization problems, especially the QPPs of SVM and SVR, whose objective functions are not twice differentiable. In this work, as an alternative approach, we employ a smoothing technique to make the objective functions \(\tilde {{L}}_{k} (.)\) sufficiently smooth. More precisely, since the plus function \(x_{+}\) appearing in (19) and (20) is not differentiable, it is replaced by the smooth approximation function \(p(x,\alpha)\) with smoothing parameter α>0, defined as [13]

$$\quad p(x,\alpha )\, = \, x+\frac{1}{\alpha }\log (1+\exp (-\alpha x)). $$
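As a quick numerical illustration (not part of the original formulation), a NumPy sketch of p(x, α); the use of np.logaddexp is an implementation detail assumed here for numerical stability:

```python
import numpy as np

def smooth_plus(x, alpha=5.0):
    """p(x, alpha) = x + (1/alpha)*log(1 + exp(-alpha*x)), applied componentwise."""
    # np.logaddexp(0, t) = log(1 + exp(t)) computed without overflow.
    return x + np.logaddexp(0.0, -alpha * x) / alpha

x = np.linspace(-2.0, 2.0, 5)
print(np.round(smooth_plus(x), 4))      # approaches np.maximum(x, 0) away from the origin
print(np.round(np.maximum(x, 0.0), 4))  # the non-smooth plus function, for comparison
```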

With the above approximation, the smooth reformulation of (19), for example, will become

$$ {\underset{{\mathbf{z}}_{2} \,\in \,\boldsymbol{\Re}^{{m}_{2}}}{\min}} \,\frac{1}{2}{\mathbf{z}}_{2}^{t} {\mathbf{z}}_{2} +\frac{C_{1} }{2}\vert \vert \,\,p({\mathbf{e}}_{1} -\boldsymbol{\mathcal{G}}\,{\mathbf{ z}}_{2} ,\alpha )\,\,\vert \vert^{2}, $$
(22)

where \(p(\mathbf{e}_{1} -\boldsymbol {\mathcal {G}}\,\mathbf{z}_{2} ,\alpha )\) is a vector in \(\boldsymbol {\Re }^{{m}_{1} }\) whose i-th component is \((p(\mathbf{e}_{1} -\boldsymbol {\mathcal {G}}\,\mathbf{z}_{2} ,\alpha ))_{i} =p(1-\boldsymbol {\mathcal {G}}_{i} \,\mathbf{z}_{2} ,\alpha )\), and \(\boldsymbol {\mathcal {G}}_{i}\) is the i-th row of the matrix \(\boldsymbol {\mathcal {G}}\).

Since the objective function in (22) is twice differentiable, the Newton-Armijo algorithm can be applied to solve it.

Remark 2

The Hessian matrix corresponding to the smooth approximation problem (22) is of order \(m_{2}\) and can be computed as

$$I+C_{1} \,\boldsymbol{\mathcal{G}}^{t}(\,diag\left({\frac{1}{1+\exp (\alpha \,(\boldsymbol{\mathcal{G}}\,\boldsymbol{z}_{2} -\text{\textbf{ e}}_{1} ))}} \right)\,\,)\,\boldsymbol{\mathcal{G}}\,. $$

Following the work of [13] it can be easily shown that the Newton method with Armijo stepsize for (22) converges to its unique solution with quadratic convergence.
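A hedged NumPy sketch of the Newton-Armijo iteration for (22), using the Hessian form stated in Remark 2; the Armijo constants (1e-4 and step halving) and the clipping used to avoid overflow are conventional assumptions, not prescriptions from the paper:

```python
import numpy as np

def newton_armijo_smooth(Gc, C1, alpha=5.0, tol=1e-3, itmax=10):
    """Solve the smooth reformulation (22) by Newton with Armijo backtracking."""
    m1, m2 = Gc.shape
    e1 = np.ones(m1)

    def p(t):                                   # p(t, alpha), componentwise
        return t + np.logaddexp(0.0, -alpha * t) / alpha

    def obj(z):                                 # objective of (22)
        return 0.5 * z @ z + 0.5 * C1 * np.sum(p(e1 - Gc @ z) ** 2)

    z = np.zeros(m2)
    for _ in range(itmax):
        r = e1 - Gc @ z
        s = 1.0 / (1.0 + np.exp(np.clip(-alpha * r, -50.0, 50.0)))   # derivative of p at r
        grad = z - C1 * Gc.T @ (p(r) * s)                            # gradient of (22)
        if np.linalg.norm(grad) < tol:
            break
        hess = np.eye(m2) + C1 * Gc.T @ (s[:, None] * Gc)            # Hessian form of Remark 2
        d = -np.linalg.solve(hess, grad)                             # Newton direction
        step, f0 = 1.0, obj(z)
        while obj(z + step * d) > f0 + 1e-4 * step * (grad @ d):     # Armijo condition
            step *= 0.5
        z = z + step * d
    return z
```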

The smoothing technique can be extended in a similar manner to the modified dual problem (20) leading to its smooth reformulation

$${\underset{{\mathbf{z}}_{1}\,\in \,\boldsymbol{\Re}^{{m}_{1}}}{\min}} \,\frac{1}{2}{\mathbf{z}}_{1}^{t} {\mathbf{z}}_{1} +\,\frac{C_{2} }{2}\vert \vert \,p({\mathbf{e}}_{2} -\boldsymbol{\mathcal{H}}\,{\mathbf{ z}}_{1} ,\alpha )\,\,\vert \vert^{2} $$

whose solution can be obtained by applying the Newton-Armijo algorithm.

Remark 3

For simplicity, we apply the Newton algorithm without the Armijo step size to solve the pair of problems (19) and (20) numerically, with either the generalized derivative or the smooth approximation approach.

Remark 4

Throughout this work, solving (21) by the Newton method using the generalized Hessian and using the smooth approximation will be denoted by NLTWSVM and SLTWSVM respectively.

4 Experimental results

To analyze the generalization performance and the computational efficiency of the proposed ULTWSVM formulation solved by the NLTWSVM and SLTWSVM training algorithms, experiments were performed on 17 benchmark datasets from the UCI repository [17], and the results were compared with SVM, LS-TWSVM and TWSVM. All the classifiers were implemented on a PC running 64-bit Windows XP with a 3.20 GHz Intel Core 2 Duo processor and 8 GB of RAM, under the MATLAB R2008a environment. The standard SVM was solved using the MOSEK optimization toolbox for MATLAB, available at http://www.mosek.com, whereas no external optimizer was used for LS-TWSVM, TWSVM, NLTWSVM and SLTWSVM.

In the implementation of NLTWSVM and SLTWSVM, the termination tolerance tol and the maximum number of iterations itmax were set to 0.001 and 10 respectively. The regularization parameters \(\sigma_{H}, \sigma_{G}>0\) were chosen to be \(10^{-5}\). Since the smooth approximation with parameter α=5 has been shown to give good results [13], we set α=5 in the implementation of SLTWSVM.

All the datasets were normalized so that each feature value lies in [0, 1]. The optimal values of the parameters \(C_{1}\), \(C_{2}\) and μ were obtained by 10-fold cross validation on the training set, varying them over the sets \(\{10^{-5},\ldots,10^{5}\}\) and \(\{2^{-5},\ldots,2^{5}\}\) respectively. With these optimal values, the classification accuracy on the test set was computed by randomly dividing the whole dataset into 10 equal parts, of which one was used for testing and the remaining parts for training. Finally, the average test accuracy was taken as the measure of prediction.
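The tuning protocol just described can be sketched as a plain grid search with 10-fold cross validation; train_and_score is a hypothetical wrapper around any of the compared classifiers returning test accuracy, and searching C_1 and C_2 independently is an assumption, since the text does not state whether they were tuned jointly:

```python
import itertools
import numpy as np

def select_parameters(X, y, train_and_score, n_folds=10, seed=0):
    """10-fold cross-validation grid search over C1, C2 and mu."""
    C_grid = [10.0 ** k for k in range(-5, 6)]      # {10^-5, ..., 10^5}
    mu_grid = [2.0 ** k for k in range(-5, 6)]      # {2^-5, ..., 2^5}
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_params, best_acc = None, -np.inf
    for C1, C2, mu in itertools.product(C_grid, C_grid, mu_grid):
        accs = []
        for k in range(n_folds):
            te = folds[k]
            tr = np.hstack([folds[j] for j in range(n_folds) if j != k])
            accs.append(train_and_score(X[tr], y[tr], X[te], y[te], C1, C2, mu))
        if np.mean(accs) > best_acc:
            best_params, best_acc = (C1, C2, mu), np.mean(accs)
    return best_params, best_acc
```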

Tables 1, 2, 3 and 4 report the accuracy results, along with the optimal parameter values and training times in seconds, for all the classifiers, together with their average ranks on accuracy for the linear and Gaussian kernels. We notice immediately from Tables 1 and 3 that the nonlinear classifiers outperform their linear counterparts in terms of accuracy, but not in terms of learning time. From Table 1, one can observe that the best accuracy was achieved most often by SVM among all the classifiers. However, for the Gaussian kernel, we observe from Table 3 that NLTWSVM shows the best performance in comparison with the rest of the learning algorithms considered.

Table 1 Performance comparison of our proposed methods NLTWSVM and SLTWSVM with LS-TWSVM, TWSVM and SVM on real world datasets. Linear kernel was employed. Time is for training in seconds
Table 2 Average ranks of SVM, LS-TWSVM, TWSVM, NLTWSVM and SLTWSVM with linear kernel on accuracy values
Table 3 Performance comparison of our proposed methods NLTWSVM and SLTWSVM with LS-TWSVM, TWSVM and SVM on real world datasets. Gaussian kernel was employed. Time is for training in seconds
Table 4 Average ranks of SVM, LS-TWSVM, TWSVM, NLTWSVM and SLTWSVM with Gaussian kernel on accuracy values

To further analyze statistically the performance of the proposed NLTWSVM and SLTWSVM classifiers against SVM, LS-TWSVM and TWSVM, we perform, as suggested by Demšar [4], a non-parametric Friedman test with the corresponding post hoc tests. For this purpose, the average ranks of all the classifiers in terms of prediction accuracy for the linear and Gaussian kernels were computed and are listed in Tables 2 and 4 respectively. From the tables we notice that SVM and TWSVM have the lowest average ranks for the linear and Gaussian kernels respectively.

Under the null hypothesis that all five algorithms are equivalent on the seventeen datasets considered, we compute the Friedman statistic [4] for the linear kernel as follows

$$\begin{array}{@{}rcl@{}} {\chi_{F}^{2}} &&=\frac{12\times 17}{5\times 6}\left[ \,(2.4118^{2}+2.8824^{2}+3.5294^{2}+2.9706^{2}+3.2059^{2})\right.\\&&\left.{\kern5pt}-\frac{5\times 6^{2}}{4}\, \right]\approx 4.6512, \end{array} $$
$$F_{F} =\frac{16\times 4.6512}{17\times 4-4.6512}\approx 1.1748, $$

where \(F_{F}\) is distributed according to the F-distribution with \((4, 4\times 16)=(4,64)\) degrees of freedom. The critical value of F(4,64) at the significance level α=0.05 is 2.5153. Since the computed \(F_{F}\) value of 1.1748 is smaller than the critical value 2.5153, there is no significant difference between the five algorithms. Also, from Table 1 we can observe that LS-TWSVM takes the least training time, followed in general by ULTWSVM solved using the NLTWSVM and SLTWSVM algorithms.

For the Gaussian kernel, we notice from Table 3 that the number of times the best accuracy is obtained by SVM, LS-TWSVM, TWSVM, NLTWSVM and SLTWSVM is 3, 4, 3, 5 and 3 respectively. This indicates the effectiveness of the proposed formulation solved by NLTWSVM. To analyze statistically the comparative performance of all the algorithms, the Friedman statistic under the null hypothesis that all the algorithms are equivalent is computed as

$$\begin{array}{@{}rcl@{}} {\chi_{F}^{2}} && =\frac{12\times 17}{5\times 6}\left[ \,(3.4706^{2}+3.2059^{2}+2.7353^{2}+2.7647^{2}+2.8235^{2})\right.\\&&\left.{\kern5pt}-\frac{5\times 6^{2}}{4}\, \right]\approx 2.8601, \end{array} $$
$$F_{F} =\frac{16\times 2.8601}{17\times 4-2.8601}\approx 0.7025. $$

Since the computed \(F_{F}\) value of 0.7025 is again smaller than the critical value 2.5153 for α=0.05, there is no significant difference between the five algorithms, i.e. none of the methods is statistically better than the rest. Finally, regarding computational learning speed, NLTWSVM and SLTWSVM learn faster than SVM and TWSVM on all but one dataset. The overall performance of the proposed formulation solved by the two iterative algorithms clearly illustrates its effectiveness and applicability.

5 Conclusion and future work

By reformulating the pair of Lagrangian dual problems of the twin support vector machine, a novel equivalent formulation was proposed in this work as a pair of unconstrained minimization problems. Since the objective functions contain the non-smooth ‘plus’ function, their solutions were obtained by the Newton iterative method using the well-known generalized Hessian and smooth approximation approaches. The efficiency of the proposed model in terms of classification accuracy and learning time was demonstrated by numerical experiments whose results were compared with SVM, least squares twin SVM and twin SVM. In summary, the simple problem formulation, very simple MATLAB coding and computational efficiency clearly illustrate the effectiveness and applicability of the proposed model. Future work will consider the application of the semi-smooth approach of [26] to this problem formulation.