1 Introduction

For more than two decades, the support vector machine (SVM) classifier (Cortes and Vapnik 1995) and support vector data description (SVDD) (Tax and Duin 1999, 2004) have attracted much research attention and have been successfully applied to a variety of scenarios. In the training phase, the formulations of SVM and SVDD lead to a quadratic programming problem (Cortes and Vapnik 1995; Tax and Duin 1999, 2004). Although decomposition techniques (Osuna et al. 1997a, b) or sequential minimal optimization (Platt 1998) could be employed to solve the quadratic programming, the training of SVM/SVDD has time complexity about \(O(n^3)\), where n is the training set size. Therefore, training an SVM/SVDD model is time consuming, especially for large training sets. As such, given their wide applications, it is highly desirable to develop a time-efficient yet sufficiently accurate training algorithm for SVM/SVDD. Furthermore, since the support vectors carry important information about SVM/SVDD models, we also want the fast algorithm to preserve the support vector information.

Instead of relying on a special quadratic programming solver for SVM and SVDD, we apply the idea of the quadratic penalty function method from the optimization literature (Ruszczyński 2006), converting the constrained quadratic programming into an unconstrained optimization problem. A generalized Newton method, which is known to converge quickly, is then applied to solve the resulting problem, so that an approximate SVM or SVDD model can be obtained. The proposed algorithms for SVM and SVDD are easy to implement, requiring no particular optimization toolbox other than a standard linear system solver.

We tested the proposed algorithms on several pattern classification problems, and detailed performance comparison demonstrates that the proposed Newton algorithm-based SVM (N-SVM) and SVDD (N-SVDD) often yield similar performances to those of the quadratic programming based SVM (QP-SVM) and SVDD (QP-SVDD), in terms of area under the receiver operating characteristic (ROC) curve and the support vectors extracted. However, N-SVM and N-SVDD are much more computationally efficient (often tens to hundreds of times faster) in training than their quadratic programming-based counterparts.

In the literature, gradient-based optimization methods have been considered to avoid the expensive quadratic programming in training SVM-type models. For example, in Lee and Mangasarian (2001) and Zheng (2016), a smooth approximation of the loss functions for SVM and SVDD was applied so that gradient descent could be applied to the approximated primal objective function, resulting in computationally efficient algorithms. Newton’s method was applied to minimize the primal objective function of SVM with \(L_2\) and Huber loss functions in Chapelle (2007). Stochastic gradient descent was used directly in Shalev-Shwartz et al. (2011) and Wang et al. (2012). However, these methods optimize the primal objective function (or a modified version of it) of SVM or SVDD, and hence cannot identify the support vectors. In contrast, the proposed idea works on the dual problem over the Lagrangian multipliers \(\varvec{\alpha }\), so that the support vectors can be identified; these are important for saving prediction time with nonlinear classifiers and for estimating the generalization error of SVM (Opper and Winther 2000; Vapnik and Chapelle 2000).

The rest of this paper is organized as follows: Sect. 1.1 introduces the notation and mathematical tools used in this paper; Sect. 2 briefly reviews the formulations of the SVM and SVDD models; Sect. 3 applies the quadratic penalty function method to formulate the quadratic programming as an unconstrained optimization problem and introduces a generalized Newton algorithm to solve it; Sect. 4 compares the proposed Newton algorithms to QP-SVM/QP-SVDD on four real-world datasets in terms of ROC curve analysis and training time, and also compares the support vectors extracted by the two methods; Sect. 5 summarizes this paper and discusses some future research directions.

1.1 Notations

All scalars are represented by lower case symbols. All vectors will be denoted by bold lower case symbols, and all are column vectors unless transposed to a row vector by a prime superscript \('\). All matrices will be denoted by bold upper case symbols. For vectors \(\mathbf{a}\) and \(\mathbf{b}\) in \({\mathbb {R}}^n\), \(\mathbf{a}\ge \mathbf{b}\) means \(a_i\ge b_i\) for each \(i=1,\ldots , n\). For vector \(\mathbf{x}\in {\mathbb {R}}^n\), \(\Vert \mathbf{x}\Vert \) stands for the 2-norm of \(\mathbf{x}\), that is, \(\Vert \mathbf{x}\Vert =\sqrt{x_1^2+\cdots +x_n^2}\). The plus function \(\mathbf{x}_+\) is defined as \((\mathbf{x}_+)_i =\max \{0,x_i\}\), for \(i=1,\ldots ,n\). The subgradient of \(\mathbf{x}_+\) is denoted by \(\mathbf{x}_*\), which is a step function defined as \((\mathbf{x}_*)_i=1\) if \(x_i>0\), \((\mathbf{x}_*)_i=0\) if \(x_i<0\), and \((\mathbf{x}_*)_i\in [0,1]\) if \(x_i=0\), for \(i=1,\ldots , n\). If \(x_i=0\), we typically take \((\mathbf{x}_*)_i = 0.5\). A column vector of ones (zeros) in \({\mathbb {R}}^n\) will be denoted by \(\mathbf{1}_n\) (\(\mathbf{0}_n\)), and the identity matrix of n-th order will be denoted by \(\mathbf{I}_n\).

If f is a real-valued function defined on \({\mathbb {R}}^n\), the gradient of f at \(\mathbf{x}\) is denoted by \(\nabla f(\mathbf{x})\), which is a column vector in \({\mathbb {R}}^n\), and the Hessian of f at \(\mathbf{x}\) is denoted by \(\nabla ^2 f(\mathbf{x})\), which is an \(n\times n\) matrix. For the piecewise quadratic function \(f(\mathbf{x}) = \frac{1}{2}\Vert (\mathbf{A}\mathbf{x}-\mathbf {b})_+\Vert ^2\), where \(\mathbf{A}\in {\mathbb {R}}^{m\times n}\) and \(\mathbf{b}\in {\mathbb {R}}^{m}\), the gradient vector is \(\nabla f(\mathbf{x}) = \mathbf{A}'(\mathbf{A}\mathbf{x}-\mathbf {b})_+\), which is not differentiable everywhere, so the ordinary Hessian of f does not exist. However, we can define its generalized Hessian (Hiriart-Urruty et al. 1984), which is the \(n\times n\) symmetric positive semi-definite matrix

$$\begin{aligned} \partial ^2 f(\mathbf{x}) = \mathbf{A}'\text {diag}(\mathbf{A}\mathbf{x}-\mathbf {b})_*\mathbf{A}, \end{aligned}$$

where \(\text {diag}(\mathbf{A}\mathbf{x}-\mathbf {b})_*\) denotes an \(m\times m\) diagonal matrix with diagonal elements \((\mathbf{A}_i\mathbf{x}-b_i)_*\), for \(i=1,\ldots , m\), with \(\mathbf{A}_i\) being the i-th row of matrix \(\mathbf{A}\).
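To make these definitions concrete, the short MATLAB sketch below (our own illustration, with made-up variable names) evaluates the plus function, the gradient \(\nabla f(\mathbf{x}) = \mathbf{A}'(\mathbf{A}\mathbf{x}-\mathbf {b})_+\), and the generalized Hessian \(\partial ^2 f(\mathbf{x})\) on a small random instance, taking \((\mathbf{x}_*)_i = 0.5\) at \(x_i=0\) as stated above.

```matlab
% Sketch: gradient and generalized Hessian of f(x) = 0.5*||(A*x - b)_+||^2.
rng(0);
m = 5; n = 3;
A = randn(m, n);  b = randn(m, 1);  x = randn(n, 1);

plus_fun = @(v) max(v, 0);                 % the plus function v_+
step_fun = @(v) (v > 0) + 0.5*(v == 0);    % the step function v_*

r    = A*x - b;
fval = 0.5 * norm(plus_fun(r))^2;          % f(x)
grad = A' * plus_fun(r);                   % gradient of f at x
Hgen = A' * diag(step_fun(r)) * A;         % generalized Hessian of f at x
```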

2 Support vector machine and support vector data description

In this section, we briefly review the formulations of support vector machine and support vector data description.

2.1 Support vector machine

For a two-class classification problem, assume that the given training dataset is \(\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2),\ldots , (\mathbf{x}_n, y_n)\}\) with \(\mathbf{x}_i\in {\mathbb {R}}^p\) as the feature vector and \(y_i\in \{-1, 1\}\) as the class label. The idea of the support vector machine (SVM) is to first map the feature vector into some high-dimensional reproducing kernel Hilbert space \({\mathcal {H}}\) by a mapping \(\phi (\mathbf{x})\), and then construct a linear classifier in \({\mathcal {H}}\) of the form \(\mathbf{w}'\phi (\mathbf{x}) +b\). The SVM is constructed so that the margin of the classifier, measured by \(1/\Vert \mathbf{w}\Vert \), is large. To allow for possible mistakes, we also introduce a set of slack variables \(\xi _i\ge 0\), where \(\xi _i\) represents the penalty to the classifier for making a mistake at \((\mathbf{x}_i, y_i)\), for \(i=1,\ldots , n\).

The SVM model can be fitted by solving the following optimization problem

$$\begin{aligned} {\left\{ \begin{array}{ll} \min _{\mathbf{w}, \varvec{\xi }, b} &{} \frac{1}{2}\Vert \mathbf{w}\Vert ^2+\frac{C}{n} \sum _{i=1}^n \xi _i \\ \text {s.t.} &{} y_i (\mathbf{w}'\phi (\mathbf{x}_i) +b) \ge 1- \xi _i, \; \xi _i\ge 0 \; \text { for } \; i=1,\ldots , n, \end{array}\right. } \end{aligned}$$
(1)

where \(\varvec{\xi }=(\xi _1,\ldots ,\xi _n)'\), and \(C>0\) controls the tradeoff between the margin of the classifier and the total penalty. The Lagrangian dual of problem (1) is

$$\begin{aligned} {\left\{ \begin{array}{ll} \min _{\varvec{\alpha }} &{} \frac{1}{2}\sum _{i=1}^n\sum _{j=1}^n\alpha _i\alpha _j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)-\sum _{i=1}^n \alpha _i \\ \text {s.t.} &{} \sum _{i=1}^n\alpha _iy_i =0, \quad 0\le \alpha _i\le C/n \quad \text {for} \quad i=1,\ldots , n, \end{array}\right. } \end{aligned}$$
(2)

where \(K(\mathbf{x}_i,\mathbf{x}_j) = \phi (\mathbf{x}_i)'\phi (\mathbf{x}_j)\) is the kernel function, and \(\varvec{\alpha }=(\alpha _1,\ldots ,\alpha _n)'\) with \(\alpha _i\) being the Lagrangian multiplier for the i-th constraint in Eq. (1). The training examples with nonzero \(\alpha _i\) are called support vectors.

The coefficient vector of the classifier in space \({\mathcal {H}}\) is calculated as

$$\begin{aligned} \mathbf{w}= \sum _{i=1}^n \alpha _iy_i\phi (\mathbf{x}_i), \end{aligned}$$

and the intercept b could be calculated from the support vectors. The obtained classifier is

$$\begin{aligned} {\hat{y}} = \text {sign}\left( \sum _{i=1}^n \alpha _iy_iK(\mathbf{x}_i, \mathbf{x}) +b \right) , \end{aligned}$$

where \({{\hat{y}}}\) is the predicted class label for a new feature vector \(\mathbf{x}\).
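For reference, the sketch below evaluates this decision rule in MATLAB with a Gaussian kernel, assuming the multipliers \(\varvec{\alpha }\) and the intercept b are already available (the function name and variable names are ours).

```matlab
% Sketch: evaluate the SVM decision rule for a new feature vector x, given
% the n-by-p training matrix X, labels y, multipliers alpha, intercept b,
% and Gaussian kernel width sigma.
function yhat = svm_predict(x, X, y, alpha, b, sigma)
    d2   = sum((X - x').^2, 2);          % squared distances ||x_i - x||^2
    k    = exp(-d2 / (2*sigma^2));       % kernel values K(x_i, x)
    yhat = sign(sum(alpha .* y .* k) + b);
end
```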

2.2 Support vector data description

In some practical problems, positive examples are easier to obtain and more reliable than negative examples. As such, instead of fitting a two-class classifier, we can alternatively describe the distribution of the positive examples. Toward this end, Tax and Duin (1999, 2004) proposed the support vector data description (SVDD) method, which, in the training stage, fits a tight hypersphere in the nonlinear high-dimensional feature space to include most of the training positive examples.

Let the given training dataset be \(\{\mathbf{x}_i, i=1,\ldots , n\}\) with \(\mathbf{x}_i\in {\mathbb {R}}^p\). We assume there is a nonlinear transformation \(\phi \) to transform the feature vector \(\mathbf{x}\) to \(\phi (\mathbf{x})\), which is in a high-dimensional space \({\mathcal {H}}\). In this space \({\mathcal {H}}\), SVDD tries to construct a hypersphere with center \({\mathbf {c}}\in {\mathcal {H}}\) and radius \(R>0\) such that the hypersphere contains most of the data and the volume is as small as possible. In other words, the desired hypersphere has the minimum \(R^2\), and at the same time \(\Vert \phi (\mathbf{x}_i)-{\mathbf {c}}\Vert ^2\le R^2\), for \(i=1,\ldots , n\). In addition, as in the SVM formulation, we introduce a set of slack variables \(\xi _i\ge 0\), since the training sample might contain outliers. In mathematical form, the problem could be summarized as

$$\begin{aligned} \min _{R,{\mathbf {c}},\varvec{\xi }}\quad R^2 + \frac{C}{n}\sum _{i=1}^n \xi _i, \end{aligned}$$
(3)

such that

$$\begin{aligned} \Vert \phi (\mathbf{x}_i)-{\mathbf {c}}\Vert ^2\le R^2+\xi _i \; \text { and }\; \xi _i\ge 0, \quad \text {for}\quad i=1,\ldots ,n, \end{aligned}$$
(4)

where \(\varvec{\xi }=(\xi _1,\ldots ,\xi _n)'\) is the vector of slack variables, and the parameter \(C>0\) controls the tradeoff between the two terms in (3).

The Lagrangian dual of the above optimization problem is

$$\begin{aligned} {\left\{ \begin{array}{ll} \min _{\varvec{\alpha }}&{} \sum _{i=1}^n\sum _{j=1}^n \alpha _i\alpha _j K(\mathbf{x}_i,\mathbf{x}_j) - \sum _{i=1}^n \alpha _i K(\mathbf{x}_i,\mathbf{x}_i) \\ \text {s.t.} &{} \sum _{i=1}^n \alpha _i=1, \quad 0\le \alpha _i\le \frac{C}{n} \quad \text {for} \quad i=1,\ldots , n, \end{array}\right. } \end{aligned}$$
(5)

where \(\varvec{\alpha }=(\alpha _1,\ldots ,\alpha _n)'\) with \(\alpha _i\) being the Lagrangian multiplier for the i-th constraint in Eq. (4), and \(K(\mathbf{x}_i,\mathbf{x}_j)\) is the kernel function. Once problem (5) is solved, the center of the hypersphere is represented as

$$\begin{aligned} {\mathbf {c}} = \sum _{i=1}^n \alpha _i \phi (\mathbf{x}_i), \end{aligned}$$
(6)

and the radius R can be computed from the set of support vectors, i.e., the data points with \(\alpha _i\ne 0\).

If the distance from a new example \(\mathbf{x}\) to the center \({\mathbf {c}}\) is less than the radius R, it is classified as a positive example; otherwise, it is classified as a negative example. Thus, the class label for \(\mathbf{x}\) is

$$\begin{aligned} {{\hat{y}}}&= \text {sign}\left( R^2-\left\| \phi (\mathbf{x})-\sum _{i=1}^n\alpha _i\phi (\mathbf{x}_i)\right\| ^2\right) . \end{aligned}$$

We note that in support vector clustering (Ben-Hur et al. 2001; Lee and Lee 2005, 2006) algorithms, the same optimization problem as SVDD is first solved and then cluster labels are assigned.
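As an illustration of how the fitted model is used, the MATLAB sketch below computes \(R^2\) from one unbounded support vector (one with \(0<\alpha _s<C/n\)) and then scores a new point with a Gaussian kernel; the function name, the tolerance, and the kernel choice are our own assumptions.

```matlab
% Sketch: SVDD scoring with a Gaussian kernel.  X is the n-by-p training
% matrix, alpha the multipliers from problem (5), C and sigma the model
% parameters, and x a new p-by-1 feature vector.
function yhat = svdd_predict(x, X, alpha, C, sigma)
    n   = size(X, 1);
    ker = @(U, V) exp(-(sum(U.^2, 2) + sum(V.^2, 2)' - 2*(U*V')) / (2*sigma^2));
    K   = ker(X, X);
    aKa = alpha' * K * alpha;                           % squared norm of the center c

    % R^2 = ||phi(x_s) - c||^2 for an unbounded support vector x_s
    s  = find(alpha > 1e-8 & alpha < C/n - 1e-8, 1);
    R2 = K(s, s) - 2*K(s, :)*alpha + aKa;

    % squared distance from phi(x) to the center c
    kx    = ker(X, x');                                 % n-by-1 vector of K(x_i, x)
    dist2 = ker(x', x') - 2*kx'*alpha + aKa;

    yhat = sign(R2 - dist2);
end
```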

As implemented in popular toolboxes (Chang and Lin 2011; Joachims 1999), the quadratic programming in Eq. (2) and Eq. (5) can be solved by decomposition methods (Osuna et al. 1997a, b) or sequential minimal optimization (Platt 1998). However, these algorithms are computationally expensive, with time complexity about \(O(n^3)\), where n is the size of the training set. Moreover, the set of support vectors extracted by the quadratic programming is an indicator of the complexity of the obtained model and is important for estimating the generalization error (Opper and Winther 2000; Vapnik and Chapelle 2000). As such, given the wide applications of SVM/SVDD, we seek a training algorithm that is faster than the expensive quadratic programming yet achieves similar accuracy, while at the same time preserving the support vector information. The current paper is an attempt in this direction.

3 The proposed approach

This section first briefly reviews the penalty function method, based on which an approximate formulation of the quadratic programming in Eqs. (2) and (5) is developed, and then a generalized Newton algorithm is proposed for SVM and SVDD.

3.1 Penalty function method for optimization problems

In optimization practice, unconstrained problems are often considered easier to solve than constrained ones. The idea of the penalty function method is to approximate a constrained optimization problem by an unconstrained problem, or by one with simpler constraints, so that an approximate solution can be obtained more easily (Ruszczyński 2006, chap. 6). For the nonlinear optimization problem

$$\begin{aligned} {\left\{ \begin{array}{ll} \min _{\mathbf{u}} &{} f(\mathbf{u}) \\ \text {s.t.} \quad &{} g_i(\mathbf{u})\le 0, \quad \text {for} \quad i=1,\ldots , k; \\ \quad &{} h_i(\mathbf{u}) = 0, \quad \text {for} \quad i=1,\ldots , l; \end{array}\right. } \end{aligned}$$
(7)

the following function is called its quadratic penalty function (Ruszczyński 2006, chap. 6)

$$\begin{aligned} P_2(\mathbf{u}) = \frac{1}{2} \sum _{i=1}^k\left( g_i(\mathbf{u})_+\right) ^2 +\frac{1}{2} \sum _{i=1}^l\left( h_i(\mathbf{u})\right) ^2. \end{aligned}$$

Clearly, \(\mathbf{u}\) satisfies the constraints in problem (7), if and only if \(P_2(\mathbf{u})=0\).

Consider the unconstrained optimization problem

$$\begin{aligned} \min _{\mathbf{u}}\; \Phi _{\rho }(\mathbf{u}) = f(\mathbf{u}) + \rho P_2(\mathbf{u}), \end{aligned}$$
(8)

where \(\rho >0\). Let the solution to problem (8) be \(\mathbf{u}_\rho \); it can be proved that \(\mathbf{u}_\rho \rightarrow \mathbf{u}^*\) as \(\rho \rightarrow \infty \), where \(\mathbf{u}^*\) is the solution to problem (7). The intuition is that, for \(\rho \) sufficiently large, in order to make \( \Phi _{\rho }(\mathbf{u})\) small, \(P_2(\mathbf{u})\) must be very close to 0 while, at the same time, \(f(\mathbf{u})\) should be as small as possible. In other words, at \(\mathbf{u}_\rho \), the constraints are approximately satisfied and the original objective function is small. Thus, as \(\rho \) grows, we expect \(\mathbf{u}_\rho \) to get closer and closer to \(\mathbf{u}^*\). Please see Ruszczyński (2006, chap. 6) for rigorous proofs of the related theoretical results and for illustrative examples.
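As a one-dimensional illustration (our own toy example), consider minimizing \(f(u)=u^2\) subject to \(1-u\le 0\), whose solution is \(u^*=1\). The penalized problem and its minimizer are

$$\begin{aligned} \min _{u}\; \Phi _{\rho }(u) = u^2 + \frac{\rho }{2}\left( (1-u)_+\right) ^2, \qquad u_\rho = \frac{\rho }{2+\rho }, \end{aligned}$$

so \(u_\rho \rightarrow 1 = u^*\) as \(\rho \rightarrow \infty \), while for any finite \(\rho \) the constraint is slightly violated.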

3.2 Penalty function method for support vector machine

Let \({\mathbf {K}}\) be the kernel matrix, that is, \({\mathbf {K}}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)\) for \(i = 1,2, \ldots , n\) and \(j = 1, 2,\ldots , n\). Let \({\mathbf {H}}\) be the \(n\times n\) matrix with \({\mathbf {H}}_{ij} = y_iy_j{\mathbf {K}}_{ij}\). Then, the dual problem for SVM in Eq. (2) can be written compactly in matrix form as

$$\begin{aligned} {\left\{ \begin{array}{ll} \min _{\varvec{\alpha }} &{} \frac{1}{2}{\varvec{\alpha }}'{\mathbf {H}}{\varvec{\alpha }}-{\mathbf {1}}'_n{\varvec{\alpha }} \\ \text {s.t.} &{} {\mathbf {y}}'{\varvec{\alpha }} = 0 \quad \text { and } \quad {\mathbf {0}}_n \le {\varvec{\alpha }}\le \frac{C}{n}{\mathbf {1}}_n, \end{array}\right. } \end{aligned}$$
(9)

where \({\mathbf {y}} = (y_1, y_2, \ldots , y_n)'\). By the penalty function method, problem (9) could be solved approximately via minimizing the following function with respect to \({\varvec{\alpha }}\),

$$\begin{aligned} F_{\rho }({\varvec{\alpha }})&= \frac{1}{2}{\varvec{\alpha }}'{\mathbf {H}}{\varvec{\alpha }}-{\mathbf {1}}'_n{\varvec{\alpha }} \nonumber \\&\quad + \frac{\rho }{2}\left[ ({\mathbf {y}}'{\varvec{\alpha }})^2 + \Vert (-{\varvec{\alpha }})_+\Vert ^2+ \left\| \left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _+\right\| ^2\right] \nonumber \\&= \frac{1}{2}{\varvec{\alpha }}'{\mathbf {H}}_\rho {\varvec{\alpha }}-{\mathbf {1}}'_n{\varvec{\alpha }} \nonumber \\&\quad + \frac{\rho }{2}\left[ \Vert (-{\varvec{\alpha }})_+\Vert ^2+ \left\| \left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _+\right\| ^2\right] , \end{aligned}$$
(10)

where \({\mathbf {H}}_\rho = {\mathbf {H}}+\rho {\mathbf {y}}{\mathbf {y}}'\).

For any vector \({\mathbf {a}}\in {\mathbb {R}}^n\), there is \({\mathbf {a}}'{\mathbf {y}}{\mathbf {y}}'{\mathbf {a}}=({\mathbf {a}}'{\mathbf {y}})^2\ge 0\), that is, the matrix \({\mathbf {y}}{\mathbf {y}}'\) is positive semi-definite. Furthermore,

$$\begin{aligned} {\mathbf {a}}'{\mathbf {H}}{\mathbf {a}}= \sum _{i=1}^n\sum _{j=1}^n a_i {\mathbf {H}}_{ij}a_j&=\sum _{i=1}^n\sum _{j=1}^n a_i y_i {\mathbf {K}}_{ij}y_ja_j \nonumber \\&=(\mathbf {a.*y})'{\mathbf {K}}(\mathbf {a.*y}) \ge 0, \end{aligned}$$
(11)

where \(\mathbf {a.*y}\) is the vector obtained by component-wise multiplication of \({\mathbf {a}}\) and \({\mathbf {y}}\), and the inequality in Eq. (11) follows from the positive semi-definiteness of the kernel matrix \({\mathbf {K}}\) (Cortes and Vapnik 1995). Equation (11) shows that \({\mathbf {H}}\) is a positive semi-definite matrix. Hence, as the sum of two positive semi-definite matrices, \({\mathbf {H}}_\rho \) is also positive semi-definite.

The gradient vector of \(F_{\rho }({\varvec{\alpha }})\) is

$$\begin{aligned} \nabla F_{\rho }({\varvec{\alpha }}) = {\mathbf {H}}_\rho {\varvec{\alpha }}-{\mathbf {1}}_n - \rho (-{\varvec{\alpha }})_+ + \rho \left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _+, \end{aligned}$$
(12)

and the generalized Hessian of \(F_{\rho }({\varvec{\alpha }})\) is

$$\begin{aligned} \partial ^2 F_{\rho }({\varvec{\alpha }}) = {\mathbf {H}}_\rho + \rho \text {diag}(-{\varvec{\alpha }})_* + \rho \text {diag}\left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _*. \end{aligned}$$
(13)

Because \(\text {diag}(-{\varvec{\alpha }})_*\) and \(\text {diag}\left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _*\) are diagonal matrices with nonnegative diagonal elements, they are positive semi-definite. Since \({\mathbf {H}}_\rho \) is positive semi-definite, the generalized Hessian \(\partial ^2 F_{\rho }({\varvec{\alpha }})\) is a positive semi-definite matrix. This indicates that function \(F_{\rho }({\varvec{\alpha }})\) is convex and consequently, it has a minimum point.

To minimize \(F_{\rho }({\varvec{\alpha }})\), we choose Newton’s method (Boyd and Vandenberghe 2004) for its simplicity. In each iteration, the Newton algorithm searches for the optimal point in the direction of \(-(\nabla ^2 F_{\rho }({\varvec{\alpha }}))^{-1} \nabla F_{\rho }({\varvec{\alpha }})\). The ordinary Hessian of \(F_{\rho }({\varvec{\alpha }})\) does not exist because the plus function in \(\nabla F_{\rho }({\varvec{\alpha }})\) is not differentiable. Therefore, we use the generalized Hessian of \(F_{\rho }({\varvec{\alpha }})\) in the Newton algorithm, since the generalized Hessian has properties similar to those of the ordinary Hessian (Hiriart-Urruty et al. 1984). Thus, in the proposed algorithm, we update the solution in the direction of \(-(\partial ^2 F_{\rho }({\varvec{\alpha }})+\delta \mathbf{I}_n)^{-1} \nabla F_{\rho }({\varvec{\alpha }})\), and we call the resulting algorithm the generalized Newton algorithm. Here, \(\delta \) is a small positive number, and the term \(\delta \mathbf{I}_n\) is added to avoid possible singularity of \(\partial ^2 F_{\rho }({\varvec{\alpha }})\). The convergence of the generalized Newton algorithm was studied in Mangasarian (2002).

Algorithm 1 summarizes the proposed generalized Newton algorithm for support vector machine (N-SVM).

[Algorithm 1: The generalized Newton algorithm for support vector machine (N-SVM)]
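To make the steps concrete, the MATLAB sketch below shows one possible realization of Algorithm 1 under our notation (function and variable names are ours). For brevity it takes a unit Newton step; in Algorithm 1 the step size \(\gamma _t\) is chosen by the Armijo rule, for which a sketch is given after Algorithm 2.

```matlab
% Sketch of Algorithm 1 (N-SVM): minimize F_rho(alpha) of Eq. (10) by the
% generalized Newton method.  H is the n-by-n matrix with H_ij = y_i*y_j*K_ij.
% A unit step is used here; replace it by alpha + gamma_t*d with gamma_t from
% the Armijo rule for the full algorithm.
function alpha = nsvm_newton(H, y, C, rho, delta, eps_tol, max_iter)
    n        = size(H, 1);
    Hrho     = H + rho * (y * y');           % H_rho = H + rho*y*y'
    alpha    = zeros(n, 1);                  % initialization
    plus_fun = @(v) max(v, 0);
    step_fun = @(v) (v > 0) + 0.5*(v == 0);
    for t = 1:max_iter
        % gradient of F_rho, Eq. (12)
        g = Hrho*alpha - ones(n, 1) - rho*plus_fun(-alpha) ...
            + rho*plus_fun(alpha - (C/n)*ones(n, 1));
        if norm(g) < eps_tol
            break;                           % convergence reached
        end
        % generalized Hessian of F_rho, Eq. (13)
        Hgen = Hrho + rho*diag(step_fun(-alpha)) ...
                    + rho*diag(step_fun(alpha - (C/n)*ones(n, 1)));
        d     = -(Hgen + delta*eye(n)) \ g;  % damped generalized Newton direction
        alpha = alpha + d;                   % unit step for simplicity
    end
end
```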

In the second step, the step-size \(\gamma _t\) could be chosen as

$$\begin{aligned} \gamma _t = \arg \min _{\gamma >0}\; F_{\rho }({\varvec{\alpha }}^t + \gamma {\mathbf {d}}_t). \end{aligned}$$
(14)

The minimization problem in Eq. (14) can be solved by backtracking line search algorithms (Boyd and Vandenberghe 2004, chap. 9). We choose the Armijo rule (Armijo 1966) for its simplicity; it is given in Algorithm 2 for completeness.

[Algorithm 2: The Armijo rule for choosing the step size]
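Since the listing of Algorithm 2 is not reproduced here, the sketch below shows a standard Armijo backtracking search for Eq. (14); the sufficient-decrease constant and the halving factor are common defaults and are our own assumptions, not necessarily the values used in Algorithm 2.

```matlab
% Sketch: Armijo backtracking search for the step size in Eq. (14).
% F is a function handle evaluating F_rho, alpha the current iterate,
% d the search direction, and g the gradient at alpha.
function gamma = armijo_step(F, alpha, d, g)
    gamma = 1;
    sigma = 1e-4;                            % sufficient-decrease parameter
    f0    = F(alpha);
    slope = g' * d;                          % directional derivative (negative)
    while F(alpha + gamma*d) > f0 + sigma*gamma*slope
        gamma = gamma / 2;                   % backtrack by halving
        if gamma < 1e-10                     % safeguard against an endless loop
            break;
        end
    end
end
```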

In each iteration of Algorithm 1, we need to calculate a matrix-vector multiplication to determine the search direction in step 2. Since the involved matrix is of size \(n\times n\) and the vector is in \({\mathbb {R}}^n\), the matrix-vector multiplication has time complexity \(O(n^2)\). In each iteration of Algorithm 1, we also need to determine the step size through Algorithm 2, which requires evaluating the function value via Eq. (10); the most expensive calculation there is \({\varvec{\alpha }}'{\mathbf {H}}_\rho {\varvec{\alpha }}\), which is again of time complexity \(O(n^2)\) since \({\mathbf {H}}_\rho \) is of size \(n\times n\) and the vector \(\varvec{\alpha }\) is in \({\mathbb {R}}^n\). Hence, each iteration of N-SVM has time complexity about \(O(n^2)\), and consequently, the total time complexity of N-SVM is about \(O(Mn^2)\), where M is the total number of iterations needed for the algorithm to converge. For large training sets, this is significantly more efficient than the quadratic programming-based SVM, which has time complexity at the level of \(O(n^3)\).

3.3 Penalty function method for support vector data description

Let \({\mathbf {k}}\) be the \(n\times 1\) vector formed by the diagonal elements of the kernel matrix \({\mathbf {K}}\). Similar to our treatment of SVM in Sect. 3.2, the dual problem for SVDD in Eq. (5) can be written compactly in matrix form as

$$\begin{aligned} {\left\{ \begin{array}{ll} \min _{\varvec{\alpha }} &{} {\varvec{\alpha }}'{\mathbf {K}}{\varvec{\alpha }}-{\mathbf {k}}'{\varvec{\alpha }} \\ \text {s.t.} &{} {\mathbf {1}}'_n{\varvec{\alpha }} = 1 \quad \text { and } \quad {\mathbf {0}}_n \le {\varvec{\alpha }}\le \frac{C}{n}{\mathbf {1}}_n. \end{array}\right. } \end{aligned}$$
(15)

As in Sect. 3.2, applying the penalty function method, we can approximately solve problem (15) by minimizing the function

$$\begin{aligned} G_{\rho }({\varvec{\alpha }})&= {\varvec{\alpha }}'{\mathbf {K}}{\varvec{\alpha }}-{\mathbf {k}}'{\varvec{\alpha }}\nonumber \\&\quad + \frac{\rho }{2}\left[ ({\mathbf {1}}'_n{\varvec{\alpha }} - 1)^2 + \Vert (-{\varvec{\alpha }})_+\Vert ^2 + \left\| \left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _+\right\| ^2\right] \nonumber \\&= \frac{1}{2}{\varvec{\alpha }}'{\mathbf {K}}_\rho {\varvec{\alpha }}-{\mathbf {k}}'_\rho {\varvec{\alpha }}\nonumber \\&\quad + \frac{\rho }{2}\left[ \Vert (-{\varvec{\alpha }})_+\Vert ^2+ \left\| \left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _+\right\| ^2\right] + \frac{\rho }{2}, \end{aligned}$$
(16)

where \({\mathbf {k}}_\rho = {\mathbf {k}}+\rho {\mathbf {1}}_n\) and \({\mathbf {K}}_\rho = 2{\mathbf {K}}+\rho {\mathbf {1}}_n{\mathbf {1}}'_n\). Similar to Sect. 3.2, we can prove that \({\mathbf {K}}_\rho \) is a positive semi-definite matrix.

The gradient vector of \(G_{\rho }({\varvec{\alpha }})\) is

$$\begin{aligned} \nabla G_{\rho }({\varvec{\alpha }}) = {\mathbf {K}}_\rho {\varvec{\alpha }}-{\mathbf {k}}_\rho - \rho (-{\varvec{\alpha }})_+ + \rho \left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _+, \end{aligned}$$

and the generalized Hessian of \(G_{\rho }({\varvec{\alpha }})\) is

$$\begin{aligned} \partial ^2 G_{\rho }({\varvec{\alpha }}) = {\mathbf {K}}_\rho + \rho \text {diag}(-{\varvec{\alpha }})_* + \rho \text {diag}\left( {\varvec{\alpha }}-\frac{C}{n}{\mathbf {1}}_n\right) _*. \end{aligned}$$

Similar to Sect. 3.2, we can verify that \(\partial ^2 G_{\rho }({\varvec{\alpha }})\) is a positive semi-definite matrix, which indicates that \(G_{\rho }({\varvec{\alpha }})\) is convex and has a minimum point.

Similar to Algorithm 1, we can develop an SVDD algorithm based on the generalized Newton method (N-SVDD), which we choose not to present because it differs only slightly from Algorithm 1. Following the analysis at the end of Sect. 3.2, the computational complexity of N-SVDD is \(O(Mn^2)\), where n is the training set size and M is the total number of iterations needed for N-SVDD to converge.
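For completeness, the sketch below shows the only quantities that change relative to the N-SVM sketch in Sect. 3.2, namely \({\mathbf {K}}_\rho \), \({\mathbf {k}}_\rho \), and the gradient and generalized Hessian of \(G_{\rho }\) (the function and variable names are ours).

```matlab
% Sketch: gradient and generalized Hessian of G_rho in Eq. (16) for N-SVDD.
% K is the n-by-n kernel matrix; the Newton loop itself is the same as in the
% N-SVM sketch, with these two expressions substituted.
function [g, Hgen] = nsvdd_grad_hess(alpha, K, C, rho)
    n        = size(K, 1);
    Krho     = 2*K + rho * ones(n);          % K_rho = 2K + rho*1*1'
    krho     = diag(K) + rho * ones(n, 1);   % k_rho = k + rho*1
    plus_fun = @(v) max(v, 0);
    step_fun = @(v) (v > 0) + 0.5*(v == 0);
    g    = Krho*alpha - krho - rho*plus_fun(-alpha) ...
           + rho*plus_fun(alpha - (C/n)*ones(n, 1));
    Hgen = Krho + rho*diag(step_fun(-alpha)) ...
           + rho*diag(step_fun(alpha - (C/n)*ones(n, 1)));
end
```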

3.4 Discussion

The original formulations of SVM and SVDD lead to a constrained quadratic programming problem. By absorbing the constraints into the objective, this paper obtains an approximate unconstrained optimization problem, which brings several advantages over the original quadratic programming. First, the approximate objective function is a convex piecewise quadratic function with a positive semi-definite generalized Hessian, which guarantees that every local minimum is a global minimum. The solution of the approximate optimization problem converges to that of the original quadratic programming. Hence, all the information we can get from the quadratic programming can also be approximately obtained from the approximate solution. For example, the sets of support vectors extracted by the proposed method and by the quadratic programming are verified to be very close; see the experimental results in Sect. 4.

Second, the approximate objective function can be minimized by a generalized Newton method, which is guaranteed to converge quickly (Mangasarian 2002). Theoretical analysis of the quadratic programming-based SVM or SVDD shows that the computational complexity is about \(O(n^3)\), which will be verified by our experimental results in Sects. 4.3 to 4.5. In contrast, our analysis at the end of Sect. 3.2 shows that the computational complexity of N-SVM is about \(O(Mn^2)\), where M is the number of iterations, which will also be verified by our experimental results. The expensive training cost makes parameter selection for QP-SVM and QP-SVDD impractical, because selecting the model parameters usually requires cross validation, which trains the model multiple times. In contrast, due to their efficient training process, it is feasible to select parameters for N-SVM and N-SVDD by techniques such as cross validation.

Finally, the generalized Newton algorithm is easy to implement. The quadratic programming-based SVM and SVDD need an external quadratic programming package; for example, in our implementation, we employed a quadratic programming solver comprising about 1000 lines of C++ code. In contrast, as the description of Algorithm 1 shows, N-SVM and N-SVDD need only basic matrix operations, which are built-in functions of almost all modern programming languages, so no specific software package is required. In our implementation, N-SVM and N-SVDD each used about 50 lines of MATLAB code. In this sense, the proposed N-SVM and N-SVDD are much easier to implement than their QP counterparts.

In summary, compared to the quadratic programming-based SVM or SVDD, the proposed methods are not only easy to implement, but also more computationally efficient. Moreover, they could extract almost the same set of support vectors as the quadratic programming-based SVM and SVDD. These advantages motivated us to develop this work.

4 Experimental results and analysis

On four pattern classification problems, we compare the performances of the proposed Newton algorithm-based SVM and SVDD (N-SVM and N-SVDD) to those of the ordinary quadratic programming (QP)-based models, i.e., QP-SVM and QP-SVDD.

4.1 Experiment setup and performance measures

All the computer code was implemented in MATLAB, and the quadratic programming-based models used the QP solver from the C++ version of LIBSVM (Chang and Lin 2011). All the experiments were conducted on a desktop computer with an Intel(R) Xeon(R) CPU @ 2.00 GHz and 8 GB of memory. During all experiments that involved measuring running time, one core was used solely for the experiments, and the number of other processes running on the system was minimized.

We used the Gaussian kernel

$$\begin{aligned} K(\mathbf{{u}},\mathbf{{v}})=\exp \left\{ -\frac{\Vert {\mathbf{{u}}}-\mathbf{{v}}\Vert ^2}{2\sigma ^2}\right\} , \end{aligned}$$

with \(\sigma =10\). We set the penalty parameter in SVM and SVDD to \(C = 2n\), where n is the training set size. Our purpose is to compare the performances of N-SVM/N-SVDD and QP-SVM/QP-SVDD, and this comparison is fair as long as the parameter settings are the same for the two algorithms, since in that case they solve exactly the same optimization problem with the same parameters. In general, one could select the optimal parameter setting \((C, \sigma )\) by cross validation, generalized approximate cross validation (Wahba et al. 2000), or other criteria mentioned in Chapelle et al. (2002) and Gold and Sollich (2003). Since parameter selection is not the focus of this paper, we choose not to pursue this issue further. In the generalized Newton algorithm, we set the maximum number of iterations to 1000 and the parameter \(\rho \) to 500; we found that the results did not differ much as long as \(\rho >100\). Both the tolerance parameter \(\epsilon \) and the perturbation parameter \(\delta \) in Algorithm 1 were set to \(10^{-5}\). In all the considered algorithms, \(\varvec{\alpha }\) was initialized as \(\mathbf{0}_n\).
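For reference, the kernel matrix \({\mathbf {K}}\) used by the sketches in Sect. 3 can be formed in MATLAB as follows (a short sketch on synthetic data; the variable names are ours).

```matlab
% Sketch: Gaussian kernel matrix with sigma = 10 for an n-by-p data matrix X.
X     = rand(200, 10);                       % placeholder for the real data
sigma = 10;
sq    = sum(X.^2, 2);                        % squared row norms
D2    = sq + sq' - 2*(X*X');                 % pairwise squared distances
K     = exp(-D2 / (2*sigma^2));              % K_ij = K(x_i, x_j)
```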

The receiver operating characteristic (ROC) curve (Fawcett 2006) is employed to illustrate the performance of the classifier. The classifier performs better if the corresponding ROC curve is higher. To numerically compare the ROC curves of different methods, we calculate the area under the curve (AUC) (Fawcett 2006). The bigger the AUC, the better the overall performance of the corresponding classifier.

Similar to the work in Zheng (2019), we use precision and recall rates to compare the support vectors found by the two algorithms. We treat the support vectors extracted by the QP-based models as the true support vectors and denote them as the set \(SV_{Q}\); we denote the set of support vectors from the Newton algorithms as \(SV_{N}\). The precision and recall (Powers 2011) rates are defined as

$$\begin{aligned} \text {precision} = \frac{|SV_Q\cap SV_N|}{|SV_N|} \quad \text { and } \quad \text {recall} = \frac{|SV_Q\cap SV_N|}{|SV_Q|}, \end{aligned}$$

where |A| represents the size of a set A. High precision means that most of the support vectors found by the Newton algorithm are true support vectors, while high recall means that the Newton algorithm extracts most of the true support vectors.
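Concretely, given the index sets of support vectors from the two methods, the rates can be computed as in the short MATLAB sketch below (the index sets shown are illustrative placeholders).

```matlab
% Sketch: precision and recall of the support vectors found by the Newton
% algorithm, taking the QP support vectors as ground truth.
sv_qp = [3 7 12 25 40];                      % indices with nonzero alpha (QP), assumed
sv_n  = [3 7 12 25 41];                      % indices with nonzero alpha (Newton), assumed
common    = intersect(sv_qp, sv_n);
precision = numel(common) / numel(sv_n);     % here 4/5
recall    = numel(common) / numel(sv_qp);    % here 4/5
```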

4.2 Face detection

In the first experiment, we use a dataset consisting of 5175 face images and 10,000 non-face images. The dataset is available at http://people.missouristate.edu/songfengzheng/FaceData.zip. Each image is of size \(16\times 16\) and is normalized so that all pixel values are between 0 and 1. We do not extract any specific features for face detection (e.g., the features used in Viola and Jones 2001); instead, we directly use the pixel values as inputs to all the considered algorithms. We randomly select 500 face images for training the SVDD models and further randomly select 500 non-face images for training the SVM models; the remaining images are used for testing.

[Fig. 1: The ROC curves on the face dataset: a Newton algorithm-based classifiers; b quadratic programming-based classifiers]

Fig. 1 gives the ROC curves of the tested algorithms, which clearly show that N-SVM and N-SVDD perform almost the same as their QP counterparts, in the sense that the ROC curves are almost indistinguishable if plotted in the same figure. Numerically, N-SVM has AUC 0.9907 and QP-SVM has AUC 0.9906; N-SVDD has AUC 0.9241 while QP-SVDD has AUC 0.9353. Thus, both the graphical and the numerical measures demonstrate that the Newton algorithms and the QP-based algorithms have almost identical classification performances on the testing set.

We also compare the final solutions from the two algorithms: \(\Vert {\varvec{\alpha }}^{\text {QP}} - {\varvec{\alpha }}^{\text {N}}\Vert = 0.0552\) for the SVDD models and \(\Vert {\varvec{\alpha }}^{\text {QP}} - {\varvec{\alpha }}^{\text {N}}\Vert = 0.8206\) for SVM, where \({\varvec{\alpha }}^{\text {QP}}\) is the final solution to Eq. (2) or Eq. (5) and \({\varvec{\alpha }}^{\text {N}}\) is the final solution of Eq. (10) or Eq. (16), respectively. These differences are reasonably small, considering the size of the problems, which means that the solutions from the Newton algorithms and the QP-based algorithms are quite close.

Training QP-SVM took 106.2333 s, while the N-SVM training algorithm converged in only 46 iterations and 1.0108 s, which is about 105 times faster. QP-SVDD took 15.7478 s in training, while N-SVDD training converged in 16 iterations and 0.2492 s, which is about 63 times faster than its QP counterpart. These numbers indicate that the Newton algorithms are much more time efficient than the QP-based models in training.

QP-SVM found 402 support vectors, while N-SVM extracted 403 support vectors, with a precision rate of 99.5% and a recall rate of 99.75%. Both QP-SVDD and N-SVDD found the same 26 support vectors, so the precision and recall rates are both 100%. The very high precision and recall rates show that the Newton algorithms correctly find almost all the support vectors with very few mistakes, and they also indicate that the Newton algorithms and the quadratic programming-based algorithms obtain models of similar complexity.

4.3 Human activity recognition

In the second experiment, we try to recognize human activity (standing, sitting, laying, walking, walking upstairs, walking downstairs) based on inertial sensors for ambient assisted living. The dataset is publicly available from UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones). The training set consists of 4252 data points while the testing set size is 1492, and each data point has 561 features. We normalize the data so that each feature is between 0 and 1.

We use each activity in turn as the positive class and all the others as the negative class, creating six binary classification problems, which are denoted as WK (walking), WU (walking upstairs), WD (walking downstairs), SIT (sitting), STAND (standing), and LAY (laying). In training the SVM models, to balance the positive and negative examples, we further randomly select 500 negative examples for training.

Table 1 compares the performances of QP-SVM and N-SVM. We first notice that the final solutions of QP-SVM (\({\varvec{\alpha }}^{\text {QP}}\)) and those of N-SVM (\({\varvec{\alpha }}^{\text {N}}\)) are close, in the sense that the norms of the difference vectors are small considering the problem size. Table 1 shows that, for each problem, the classification performances of QP-SVM and N-SVM are almost identical in terms of the area under the ROC curve (AUC). We choose not to show the ROC curves because they are almost indistinguishable for each problem. Table 1 also shows that, for each problem, N-SVM and QP-SVM find almost the same number of support vectors, yielding models of similar complexity. However, the training time differs significantly, and N-SVM often converges within 100 iterations.

Table 1 On the human activity recognition problems, the comparison between QP-SVM and N-SVM

Table 2 shows the performance comparison between QP-SVDD and N-SVDD. We observe the same pattern as for the SVM classifiers: the final solutions are very similar; the AUCs are quite close, with differences below 0.004; both algorithms find almost the same number of support vectors; and N-SVDD converges within 20 iterations and is much faster than QP-SVDD in the training phase.

Table 2 On the human activity recognition problems, the comparison between QP-SVDD and N-SVDD
Table 3 The ratio of training times of QP-SVDD vs. N-SVDD and QP-SVM vs. N-SVM on different problems

To clearly illustrate the speed advantage of N-SVDD and N-SVM over their QP counterparts, we calculate the ratio between the training times of QP-SVM (QP-SVDD) and N-SVM (N-SVDD) for different problems, and list the results in Table 3. It is clearly seen that, on this dataset, N-SVDD is around 100 times faster than QP-SVDD and N-SVM is 40–130 times faster than QP-SVM.

Table 4 The time complexity of QP-SVM and QP-SVDD on the human activity recognition problems
Table 5 The time complexity of N-SVM and N-SVDD on the human activity recognition problems
Table 6 On the human activity recognition problems, the precision (P) and recall (R) rates for the support vectors extracted by N-SVM and N-SVDD
Table 7 On the handwritten digit recognition problems, the comparison between QP-SVM and N-SVM
Table 8 On the handwritten digit recognition problems, the comparison between QP-SVDD and N-SVDD

To study the training time complexity of QP-based models, Table 4 lists \(t_\text {train}/n^3\times 10^7\) for QP-SVM and QP-SVDD on the human activity data, where \(t_\text {train}\) is the training time and n is the training set size. We see that the ratio \(t_\text {train}/n^3\) for QP-SVM is roughly \(1.0\times 10^{-7}\), and for QP-SVDD, it is close to \(1.4\times 10^{-7}\). These results indicate that the time complexity of QP-SVM and QP-SVDD is about \(O(n^3)\).

Table 9 The ratio of training times of QP-SVDD vs. N-SVDD and QP-SVM vs. N-SVM on the handwritten digit recognition problems

To verify our theoretical analysis of the time complexity of the Newton algorithm-based SVM and SVDD presented at the end of Sect. 3.2, Table 5 presents \(t_\text {train}/(Mn^2)\times 10^8\) for N-SVM and N-SVDD on the human activity recognition problems, where M is the number of iterations needed. We see that the ratio \(t_\text {train}/(Mn^2)\) for N-SVM is roughly \(2\times 10^{-8}\), and for N-SVDD, it is close to \(5\times 10^{-8}\). These results support our analysis that the time complexity of N-SVM and N-SVDD is about \(O(Mn^2)\).

To investigate the overlap between the sets of support vectors extracted by the different methods, we calculate the precision and recall rates for each problem, given in Table 6. The results show that all the precision rates are 100%, which means that every support vector extracted by N-SVM and N-SVDD is also a support vector of the corresponding QP-based model, and all but one of the recall rates are above 99.5%, which indicates that N-SVM and N-SVDD extract almost all of the support vectors found by the QP-based algorithms.

4.4 Handwritten digit recognition

In this experiment, we use a dataset consisting of normalized handwritten digits (“0” to “9”) of size \(16\times 16\). The dataset has 7291 training examples and 2007 testing examples, and is available at https://web.stanford.edu/~hastie/ElemStatLearn/. As in the experiment in Sect. 4.2, we simply use the 256 pixel values as inputs to the algorithms. We create five binary classification problems, with digits “2”, “3”, “4”, “5”, and “6” as the positive class, respectively; in each problem, we use all other nine digits as the negative class. In training the SVM models, to balance the positive and negative examples, we further randomly select 1000 negative examples.

Similar to the experiments presented in Sect. 4.3, Tables 7 and 8 compare the performances of QP-SVM vs. N-SVM and QP-SVDD vs. N-SVDD on the handwritten digit recognition problems, respectively. We can draw the same conclusion as in Sect. 4.3: the Newton algorithm-based SVM/SVDD find almost the same number of support vectors as their QP-based counterparts, and N-SVM/N-SVDD have almost identical classification performances to QP-SVM/QP-SVDD in terms of AUC. However, the Newton algorithm-based models are much faster in the training phase. Also, we see that N-SVM and N-SVDD converge within 40 iterations. Table 9 presents the ratio between the training times of QP-SVM (QP-SVDD) and N-SVM (N-SVDD) for the different problems. It is clearly seen that, on this dataset, N-SVDD is more than 120 times faster than QP-SVDD and N-SVM is 200–300 times faster than QP-SVM.

Table 10 lists \(t_\text {train}/n^3\times 10^7\) for QP-SVM and QP-SVDD on the handwritten digit recognition problems. As in Sect. 4.3, these results indicate that the time complexity of QP-SVM and QP-SVDD is about \(O(n^3)\). Table 11 presents \(t_\text {train}/(Mn^2)\times 10^8\) for N-SVM and N-SVDD on the tested tasks. We see that the ratio \(t_\text {train}/(Mn^2)\) for N-SVM is roughly \(2\times 10^{-8}\), and for N-SVDD, it is close to \(4.5\times 10^{-8}\). These results verify that the time complexity of N-SVM and N-SVDD is about \(O(Mn^2)\), consistent with the theoretical analysis at the end of Sect. 3.2.

Table 10 The time complexity of QP-SVM and QP-SVDD on the handwritten digit recognition problems
Table 11 The time complexity of N-SVM and N-SVDD on the handwritten digit recognition problems
Table 12 On the handwritten digit recognition problems, the precision (P) and recall (R) rates for the support vectors extracted by N-SVM and N-SVDD
Table 13 On the music genre classification problems, the comparison between QP-SVM and N-SVM
Table 14 On the music genre classification problems, the comparison between QP-SVDD and N-SVDD

Table 12 shows the precision and recall rates for the support vectors extracted by N-SVM and N-SVDD, compared to their QP counterparts. We can draw the same conclusion as in the experiment in Sect. 4.3, that is, N-SVM/N-SVDD extract almost the same set of support vectors as QP-SVM/QP-SVDD.

4.5 Music genre classification

To test the algorithms on a different modality of data, we ran the programs on a music dataset (Defferrard et al. 2017) to recognize different genres of music. The dataset was downloaded from https://github.com/mdeff/fma; it includes 106,574 tracks of 30-second mp3 clips spanning 161 unbalanced genres. Each data point has 518 features, which are extracted using the Python library librosa. We select “Rock”, “Experimental”, and “Electronic” as the genres to be classified, resulting in 33,665 data points, with 14,174 for “Rock”, 10,119 for “Experimental”, and 9372 for “Electronic”.

We create three classification problems, each having one genre as the positive class and the other two genres as the negative class. For each problem, we randomly select 2500 positive examples to train the QP-SVDD and N-SVDD models; to train QP-SVM and N-SVM, we randomly add 2500 negative examples to the training set to balance the positive and negative examples. Due to the limitations of the available computing resources, we did not test problems with sizes beyond 5000.

Similar to the experiments presented in Sects. 4.3 and 4.4, Tables 13 and 14 compare the performances of QP-SVM vs. N-SVM and QP-SVDD vs. N-SVDD on the music genre classification problems, respectively. We observe that, compared to their QP counterparts, N-SVM and N-SVDD have almost identical classification performances in terms of AUC, and they find almost the same number of support vectors as QP-SVM/QP-SVDD. However, the Newton algorithm-based models use significantly less time in the training phase. Table 15 presents the ratio between the training times of QP-SVM (QP-SVDD) and N-SVM (N-SVDD) for the different problems, which clearly demonstrates that, on this dataset, N-SVDD is more than 200 times faster than QP-SVDD and N-SVM is 20–60 times faster than QP-SVM.

Table 15 The ratio of training times of QP-SVDD vs. N-SVDD and QP-SVM vs. N-SVM on the music genre classification problems
Table 16 The time complexity of QP-SVM and QP-SVDD on the music genre classification problems
Table 17 The time complexity of N-SVM and N-SVDD on the music genre classification problems
Table 18 On the music genre classification problems, the precision (P) and recall (R) rates for the support vectors extracted by N-SVM and N-SVDD

Table 16 lists \(t_\text {train}/n^3\times 10^7\) for QP-SVM and QP-SVDD on the music genre classification problems, and Table 17 presents \(t_\text {train}/(Mn^2)\times 10^8\) for N-SVM and N-SVDD on the same problems. As in Sects. 4.3 and 4.4, these results indicate that the time complexity of QP-SVM and QP-SVDD is about \(O(n^3)\) and the time complexity of N-SVM and N-SVDD is about \(O(Mn^2)\), consistent with the theoretical analysis in Sect. 3.2. Table 18 shows the precision and recall rates for the support vectors extracted by N-SVM and N-SVDD, compared to their QP counterparts. As in Sects. 4.3 and 4.4, we can conclude that N-SVM/N-SVDD extract almost the same set of support vectors as QP-SVM/QP-SVDD.

In summary, all our experiments show that, compared to QP-SVM and QP-SVDD, the proposed N-SVM and N-SVDD have almost identical classification performances in terms of ROC analysis. Furthermore, N-SVM and N-SVDD extract almost the same sets of support vectors as their QP counterparts, indicating that the models obtained by both methods have similar complexity. However, our experimental results show that N-SVM and N-SVDD are much more time-efficient in training. More importantly, we should mention that in our implementation, the core quadratic programming code for QP-SVM and QP-SVDD was developed in C++, which is generally much faster than MATLAB, in which N-SVM and N-SVDD were implemented. Taking this factor into account, the speed advantage of the proposed N-SVM and N-SVDD over QP-SVM and QP-SVDD would be even larger if all methods were implemented in the same programming language and run on the same platform.

5 Conclusion and future works

The formulations of SVM and SVDD lead to a quadratic programming problem which is computationally expensive to solve. This paper proposes an alternative approach to solving the dual optimization problem. We first apply the idea of the quadratic penalty function method to incorporate the constraints into the objective function, resulting in an unconstrained minimization problem. Then, a generalized Newton algorithm with the Armijo rule for step-length selection is employed to minimize it. The resulting algorithms are referred to as Newton SVM (N-SVM) and Newton SVDD (N-SVDD); they are easy to implement, requiring no sophisticated toolbox beyond standard matrix operations.

Extensive experiments were conducted on various pattern classification problems, and we compared the performance of the proposed N-SVM/N-SVDD to that of QP-SVM/QP-SVDD in terms of ROC curve analysis, the support vectors extracted by the two methods, and the training time. All our results show that the developed N-SVM and N-SVDD have classification performances similar to their QP counterparts but are much more efficient in training. On the tested problems, N-SVM is tens to hundreds of times faster than QP-SVM, and N-SVDD is hundreds of times faster than QP-SVDD. Furthermore, the two methods extract almost identical sets of support vectors, indicating that the obtained models have similar complexity.

We note that the idea presented in this paper can be applied to other variants of support vector models, for example, the asymmetric SVM model (Bach et al. 2006), SVM with different loss functions (Huang et al. 2014), SVDD with both positive and negative examples (Tax and Duin 2004), support vector clustering (Ben-Hur et al. 2001; Lee and Lee 2005, 2006), and support vector regression (Smola and Schölkopf 2002), because all of these models are formulated as quadratic programming problems.