1 Introduction

Based on Vapnik and Chervonenkis' structural risk minimization principle [28, 29], the Support Vector Machine (SVM) has emerged as a state-of-the-art tool for classification and regression. SVM solves a convex quadratic programming problem, and its solution is not only explicitly defined but also sparse. SVM has been widely used in many application areas [28, 29], such as character recognition, disease diagnosis, and time series prediction.

Given a training dataset \(\{ (x_{i} ,y_{i}) \}_{i = 1}^{m} \subset{\mathbb{R}}^{n} \times \{ { - 1,1} \}\), SVM solves the following formulation (see [3, 13], etc.):

(1)

or the formulation with a regularization term \(\frac{1}{2}b^{2}\) of bias b added to the objective function [16, 17, 22, 23]:

(2)

with σ=1 or 2, tradeoff parameter C>0 and \({\bf{x}}_{i} := \phi(x_{i})\in{\mathbb{H}}\) for nonlinear classification problems, where ϕ(⋅) maps \(x_{i}\) to a high dimensional, even infinite dimensional, feature space. Owing to reproducing kernel theory, \({\mathbb{H}}\) is a reproducing kernel Hilbert space [13, 26] associated with a kernel function \(k:{\mathbb{R}}^{n} \times{\mathbb{R}}^{n} \to{\mathbb{R}}\) satisfying \(k(x_{i},x_{j})=\langle \phi(x_{i}),\phi(x_{j})\rangle_{{\mathbb{H}}}\). The learning result \({\bf{w}}\) can be represented by a linear combination of the kernel functions:

$$ {\bf{w}}=\sum_{i=1}^{m}\beta_i k(x_i,\cdot), $$
(3)

so the high dimensional map ϕ(⋅) need not be used explicitly. By plugging (3) into (1) or (2), the optimal coefficients \(\beta:=[\beta_{1}, \beta_{2},\ldots,\beta_{m}]^{T}\in{\mathbb{R}}^{m}\) and b can be obtained, and the classification surface \(\theta(x):=\sum_{i}\beta_{i} k(x_{i},x)+b=0\) follows.
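For concreteness, the decision rule induced by (3) can be sketched in a few lines of Python/NumPy (our own illustration, not code from the paper; the Gaussian kernel choice and all names here are assumptions):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows of X and Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * sq)

def decision_surface(X_train, beta, b, X_new, gamma=0.5):
    """theta(x) = sum_i beta_i k(x_i, x) + b, evaluated at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ beta + b

# a point x is assigned the label sign(theta(x))
```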

Many studies have been devoted to the dual of problem (1) or (2) [1, 7, 14, 22, 23, 25, 30]. For example, the well-known methods SMO [1, 14, 25] and SVMlight [11, 12] are decomposition methods for the dual of problem (1). The successive over-relaxation (SOR) [22] and Lagrangian Support Vector Machine (LSVM) [23] algorithms are simple iterative methods induced by the KKT conditions of the dual of problem (2). These algorithms work efficiently in experiments, although their theoretical convergence rate is at most linear [18, 19, 22, 23].

Lee and Mangasarian [17] proposed a quadratically convergent training algorithm called SSVM (Smooth SVM) for the linear problem, in which the objective function of problem (2) with σ=2 is smoothed by the exponential penalty function (Eq. (6) below); its nonlinear version is based on the Generalized Support Vector Machine [21], stated as Eq. (5) below. However, SSVM cannot be used to solve large-scale nonlinear classification problems because its computational complexity for solving the Newton equation is about \(O(m^{3})\) per iteration (the SMW identity (Eq. (21)) can greatly reduce the computational cost for the linear classification problem).

In [6], the kernel matrix K is precalculated and decomposed as \(K\approx UU^{T}\), where \(U\in{\mathbb{R}}^{m\times \widetilde{m}}\) is a full column rank matrix, and a semi-smooth SVM with quadratic convergence rate is proposed. Another semi-smooth Newton SVM, with improved experimental results, is proposed in [31]. These algorithms can reach \(O(m\widetilde{m}^{2})\) computational cost and are a good choice if \(\widetilde{m} \ll m\). However, they require the whole kernel matrix to fit in memory, hence they are not efficient for large-scale problems.

Some low-rank methods without the whole kernel matrix precalculated have been studied. In RSVM [16], a reduced set W is selected randomly from the index set {1,2,…,m} with \(\overline{m} :=| W | \le0.1m\) and (3) is replaced by a compact representation

$$ {\bf{w}}=\sum_{i\in W}\beta_i k(x_i,\cdot). $$
(4)

Although the reduced set W is a randomly selected small portion of the whole set, many experiments in [16] show that RSVM works extremely well and that its accuracy is quite insensitive to the choice of W. Its computational complexity is about \(O(m\overline{m}^{2})\) per iteration without the whole kernel matrix precalculated. Lin and Lin [20] discuss many implementations of RSVM in detail and conclude that the reduced methods are an appealing alternative for large-scale problems, at the price of slightly lower accuracy. Lee and Huang [15] study RSVM theoretically via the Nyström approximation and provide a good understanding of the reduced kernel techniques.

In [13], another low-rank method is proposed to reduce the training complexity in a primal algorithm called PSVM (primal SVM), where the subset W in (4) is chosen carefully and iteratively to represent \({\bf{w}}\) as in (4), and the resulting optimization problem is solved by the Newton method. Its computational complexity is at most \(O(m d^{2}_{max})\) per iteration, where d_max is a specified maximum size of the subset W. Compared to RSVM, PSVM may not be a better choice for common users because it needs a complicated sub-procedure to maintain W and the corresponding Hessian matrix iteratively.

Some efficient SVM solvers in the primal space have also been proposed recently. Fletcher and Zanghirati [8] proposed a new model for the SVM problem in the primal without introducing the penalization parameter C, and gave an efficient SQP-like algorithm; they observed that the algorithm terminated within a finite number of iterations but did not prove it. LapSVM [24] is a learning algorithm for semi-supervised classification, and Pegasos [27] is designed for large datasets using stochastic sub-gradient descent methods.

In this paper, we focus on quadratically convergent algorithms for training SVM in the primal space with the smoothing method, and the RSVM scheme is adopted to reduce the complexity for large-scale problems. We generalize the results in [16, 17] to derive a new smoothing technique and prove that the error bound between the new smooth problem and the original one is \(O(\frac{1}{p^{2}})\), which is stronger than that of SSVM in [17], where the error bound given is \(O(\frac{1}{p})\). We also tighten the error bound of SSVM to \(O(\frac{1}{p^{2}})\) in this paper. Some efficient reduced techniques, such as updating the Hessian matrix iteratively and solving the Newton equation by the SMW identity, are discussed to accelerate the algorithms for large-scale problems.

The rest of the paper is organized as follows. Lee and Mangasarian’s SSVM is restated in Sect. 2. We derive a new smoothing algorithm called NSSVM and obtain its error bound and convergence rate in Sect. 3. We further summarize the different kernel versions for these smoothing algorithms in Sect. 4, discuss some boosting techniques to reduce its computational complexity in Sect. 5, and tighten the error bound of SSVM in Sect. 6. Experimental results are presented in Sect. 7 to illustrate that the proposed algorithm is comparable to SSVM on generalization errors and training times. Section 8 concludes this paper.

2 SSVM Algorithm

In this section, we restate the Smooth Support Vector Machine (SSVM) proposed by Lee and Mangasarian [17], focusing on the nonlinear classification problem. The corresponding problem can be written as follows:

$$ \min_{z}\ f (z) := \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2}\sum_{i = 1}^m \max \{0, r_i \}^2 , $$
(5)

where \(z:=[\beta^{T},b]^{T}\), \(r_{i} :=1-H_{i} z\) with \(H_{i} := y_{i} [K_{i}\ 1]\in {\mathbb{R}}^{m+1}\) for nonlinear classification (K_i is the ith row of the kernel matrix K and H_i is the ith row of the matrix H). With the solution z^* of (5), the classification surface is \(\theta(x):=[k(x_{1},x),k(x_{2},x),\ldots,k(x_{m},x),\ 1]z^{*} =0\).

In [17], the exponential penalty function \(\psi_{p} (t):=t + \frac{1}{p}\log ( {1 + \exp ({ - p t } )} )\), p>0, is applied to smooth the second part of the objective function in (5). Here the following equivalent form of ψ_p(t) is used instead:

$$ \psi_p (t) = \max \{ t,0 \} + \frac{1}{p}\log \bigl( {1 + \exp \bigl( { - p |t|} \bigr)} \bigr), $$
(6)

and then

$$ \psi'_p (t) = \frac{\min\{1,\exp(p t)\}}{1 + \exp( - p|t|)},\qquad \psi''_p (t) = \frac{p\exp( - p|t|)}{ ( {1 + \exp( - p|t|)} )^2}. $$
(7)

Notice that log(1+exp(a)) may return an incorrect value when exp(a) overflows for a large positive number a, such as a>1000. The equivalent form in (6) and its derivatives in (7) avoid such potential overflow.
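The following NumPy sketch (ours, not the authors' released code) shows one way to evaluate the overflow-safe forms (6) and (7); note that the argument of every exponential is kept non-positive:

```python
import numpy as np

def psi(t, p):
    """psi_p(t) = max(t, 0) + log(1 + exp(-p|t|)) / p, Eq. (6); exp argument <= 0."""
    return np.maximum(t, 0.0) + np.log1p(np.exp(-p * np.abs(t))) / p

def dpsi(t, p):
    """psi_p'(t) = min(1, exp(p t)) / (1 + exp(-p|t|)), Eq. (7)."""
    return np.exp(np.minimum(p * t, 0.0)) / (1.0 + np.exp(-p * np.abs(t)))

def ddpsi(t, p):
    """psi_p''(t) = p exp(-p|t|) / (1 + exp(-p|t|))^2, Eq. (7)."""
    e = np.exp(-p * np.abs(t))
    return p * e / (1.0 + e) ** 2
```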

Let \(r:= (r_{1},r_{2},\ldots,r_{m})^{T} \in{\mathbb{R}}^{m}\) and ψ_p(r):=(ψ_p(r_1),ψ_p(r_2),…,ψ_p(r_m))^T. Then the smoothed version of problem (5) is

$$ \min_{z}\ f_p (z) := \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2} \bigl\Vert \psi_p (r) \bigr\Vert _2^2 . $$
(8)

SSVM solves (8) by the Newton method and converges quadratically; the Newton equation ∇²f_p(z)d=−∇f_p(z) is solved per iteration with

$$ \nabla f_p (z) = z - CH^T \operatorname{diag} \bigl( \psi'_p (r) \bigr) \psi_p (r), \qquad \nabla^2 f_p (z) = I + CH^T \varLambda_p ( z )H, $$
(9)
$$ \varLambda_p (z) := \operatorname{diag} \bigl( \psi'_p (r_1 )^2 + \psi_p (r_1 )\psi''_p (r_1 ), \ldots, \psi'_p (r_m )^2 + \psi_p (r_m )\psi''_p (r_m ) \bigr). $$
(10)

Its computational cost per iteration is \(O(m^{3})\). In Sect. 5, we illustrate that this cost can be reduced by the SMW identity [9]. The following lemma evaluates the approximation error bound:

Lemma 1

(Theorem 2.2 in [17])

Let \(\bar{z}\) be the unique minimizer of the smooth function f_p(z) in (8) and z^* be the unique minimizer of the original function f(z) in (5). Then we have:

$$ \bigl \Vert {\bar{z} - z^* } \bigr \Vert _2^2 \le\frac{{Cm\log^2 2}}{{2p^2 }} + \frac{{Cm\eta\log2}}{p},\qquad f(\bar{z}) - f\bigl(z^* \bigr) \le\frac{{Cm\log^2 2}}{{2p^2 }} + \frac{{Cm\eta\log2}}{p}. $$

where \(\eta=\max_{1\le i\le m} \{|1-H_{i} z^{*} |\}\).

Lemma 1 shows that the error bound between the smooth problem (8) and the original problem (5) is \(O(\frac{1}{p})\). In Sect. 3, we will propose a new smoothing method with a tighter error bound \(O(\frac{1}{p^{2}})\), and will enhance the error bound of SSVM to \(O(\frac{1}{p^{2}})\) in Sect. 6.

3 New proposed smoothing method and its properties

Let \(r_{+}\) be the vector whose ith element is max{r_i, 0}. The necessary and sufficient optimality conditions of problem (5) are

$$ \nabla f(z) = z - CH^T r_+ = 0. $$
(11)

The Newton method cannot be applied to these piecewise linear equations directly because ∇f(z) is non-differentiable at some points. Different from SSVM, which smooths the objective function of problem (5), we smooth the piecewise linear equations (11) with (6), and the resulting problem is the following system of nonlinear equations:

$$ G_p (z) := z-CH^T \psi_p ( r )= 0. $$
(12)

It can be solved by any Newton-type method. Next we will investigate the error bound and convergence rate of our new smoothing method.

3.1 Error bound of the new smoothing method

To analyze the error bound between the solution of the smooth problem and that of the original one, we introduce the following minimization problem (13), whose first-order necessary and sufficient optimality conditions are exactly the smooth nonlinear equations (12):

$$ \min_{z}\ h_p (z) := \frac{1}{2} \Vert z \Vert _2^2 + C\sum_{i = 1}^m \varphi_p (r_i ), $$
(13)

where

$$ \varphi_p (t) := \int_{ - \infty}^t {\psi _p (u)} du, $$
(14)

with ψ_p(u) defined by (6). Comparing f(z) in (5) and h_p(z) in (13), it can be observed that the term \(\frac{1}{2}\max\{0, r_{k}\}^{2}\) is approximated by φ_p(r_k) for k=1,2,…,m. The difference between them is analyzed in the following lemma.

Lemma 2

For φ p (t) as defined in (14), we have

$$ 0 \le\varphi_p (t) - \frac{1}{2} \max\{0, t\}^2 \le\frac{\pi^2 }{{6 p^2 }}. $$
(15)

Proof

First, φ_p(t) is a monotonically increasing function because ψ_p(u)≥0, and

$$\varphi_p (0)=\frac{1}{p}\int_{ - \infty}^0 {\log\bigl( {1 + \exp(pu)} \bigr)du}= \frac{\pi^2}{12 p^2}. $$

If t<0, we have

$$0\le\varphi_p (t) =\varphi_p (t) - \frac{1}{2} \max\{0, t\}^2 \le\varphi_p (0)= \frac{\pi^2}{12 p^2 } \le \frac{\pi^2 }{6 p^2}. $$

If t≥0, then \(\psi_{p} (t) = t + \frac{1}{p}\log ( {1 + \exp ( { - pt} )} )\), and we have

$$0\le\varphi_p (t) - \frac{1}{2} t^2 = \varphi_p (0)+ \int_0^t \psi_p (u)\,du - \frac{1}{2} t^2 = \frac{\pi^2}{12 p^2} + \frac{1}{p}\int_0^t \log\bigl(1+\exp(-pu)\bigr)\,du \le\frac{\pi^2}{12 p^2} + \frac{1}{p}\int_0^\infty \log\bigl(1+\exp(-pu)\bigr)\,du = \frac{\pi^2}{6 p^2}. $$

Thus (15) holds for any p and t. □
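The bound (15) is also easy to check numerically; the short script below (our own sanity check, evaluating the integral in (14) with scipy quadrature) verifies it on a grid of points:

```python
import numpy as np
from scipy.integrate import quad

p = 5.0

def psi(u):                      # psi_p in the overflow-safe form (6)
    return max(u, 0.0) + np.log1p(np.exp(-p * abs(u))) / p

def varphi(t):                   # varphi_p(t) = integral of psi_p over (-inf, t], Eq. (14)
    value, _ = quad(psi, -np.inf, t)
    return value

bound = np.pi**2 / (6.0 * p**2)
for t in np.linspace(-2.0, 2.0, 9):
    gap = varphi(t) - 0.5 * max(t, 0.0)**2
    assert -1e-8 <= gap <= bound + 1e-8   # 0 <= varphi_p(t) - max{0,t}^2/2 <= pi^2/(6 p^2)
```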

Next we bound the difference between h_p(z) in (13) and f(z) in (5) in Lemma 3.

Lemma 3

The difference between the smooth problem h p (z) and the original problem f(z) is bounded by the following inequalities:

$$0\le h_p (z)- f(z) \le\frac{Cm\pi^2}{6 p^2}. $$

This follows directly by applying Lemma 2 to each of the m terms \(\varphi_p (r_i) - \frac{1}{2}\max\{0, r_i\}^2\) in \(h_p(z)-f(z)\) and summing.

Let z^* be the unique solution of the original problem and \(\hat{z}\) be the unique solution of the smooth problem. Using Lemma 3, we derive the error bound between z^* and \(\hat{z}\), together with the bound between f(z^*) and \(f(\hat{z})\), in Theorem 1.

Theorem 1

If z^* is the unique solution of problem (5) or (11), and \(\hat{z}\) is the unique solution of problem (12) or (13), then the following inequalities hold:

$$\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 \le \frac{{C m\pi^2 }}{6 p^2}\quad\mbox{\textit{and}}\quad f(\hat{z}) - f\bigl(z^* \bigr ) \le \frac{{C m\pi^2 }}{6 p^2}. $$

Proof

Because the eigenvalues of the Hessian matrix of h_p(z) and the eigenvalues of any Clarke generalized Jacobian [4] of ∇f(z) are all no smaller than 1 and \(\nabla f(z^{*})=\nabla h_{p}(\hat{z})=0\), we have

$$h_p \bigl(z^* \bigr) - h_p (\hat{z}) \ge\frac{1}{2}\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 \quad\mbox{and}\quad f(\hat{z}) - f\bigl(z^* \bigr) \ge\frac{1}{2}\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 . $$

Adding these two inequalities and rearranging, we obtain

$$\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) - \bigl( {h_p ( \hat{z}) - f(\hat{z})} \bigr) \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) \le\frac{{C m\pi^2 }}{6 p^2}. $$

The second and the last inequalities follow from Lemma 3. Similarly, we have

$$f(\hat{z}) - f\bigl(z^* \bigr) \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) - \bigl( {h_p \bigl(z^* \bigr) - h_p (\hat{z})} \bigr) \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) \le \frac{{Cm\pi^2 }}{6 p^2}. $$

 □

According to Theorem 1, the error bound of our smoothing method is \(O(\frac{1}{p^{2}})\), while the corresponding result for SSVM in [17] is \(O(\frac{1}{p})\), as listed in Lemma 1. So the error bound of the new method is better than that of SSVM.

3.2 New smoothing algorithm and its convergence rate

Computing the Jacobian matrix of the nonlinear equations (12), we have

$$ \nabla G_p (z) = I + CH^T \widehat{\varLambda}_p ( z )H, $$
(16)

where

$$ \widehat{\varLambda}_p (z) := \operatorname{diag} \bigl( {\psi'_p (r_1 ),\psi'_p (r_2 ), \ldots ,\psi'_p (r_m )} \bigr). $$
(17)

Comparing Λ_p(z) in (10) and \(\widehat{\varLambda}_{p}(z)\) in (17), it can be observed that ∇G_p(z) in (16) is simpler than ∇²f_p(z) defined in (9) for SSVM. The new smooth problem can be solved by the following Newton-type algorithm:

Algorithm 1

New Smoothing SVM Algorithm (NSSVM)

Step 0.:

Input data H and parameters p, C; give an initial point z^0; select ε>0 and σ∈(0,0.5). Set k:=0.

Step 1.:

Calculate g^k = G_p(z^k). If ∥g^k∥<ε, stop; otherwise go to Step 2;

Step 2.:

Calculate d^k by solving the Newton system ∇G_p(z^k)d=−g^k;

Step 3.:

(Armijo line search) Choose λ_k as the largest value in {1, 2^{−1}, 2^{−2}, …} such that

$$h_p\bigl(z^k + \lambda_k d^k \bigr)\le h_p\bigl(z^k \bigr)+ \sigma \lambda_k {g^k}^T d^k; $$
Step 4.:

Let z^{k+1} := z^k + λ_k d^k and k := k+1. Go to Step 1.
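For illustration, a compact NumPy implementation of Algorithm 1 might look as follows (our own sketch, not the released Matlab code; following Remark 1 below, the original objective f(z) of (5) is used as the merit function in the Armijo step, and all parameter defaults are assumptions):

```python
import numpy as np

def psi(t, p):    # overflow-safe psi_p, Eq. (6)
    return np.maximum(t, 0.0) + np.log1p(np.exp(-p * np.abs(t))) / p

def dpsi(t, p):   # psi_p', Eq. (7)
    return np.exp(np.minimum(p * t, 0.0)) / (1.0 + np.exp(-p * np.abs(t)))

def f(z, H, C):   # original objective (5), used as the merit function (cf. Remark 1)
    r = 1.0 - H @ z
    return 0.5 * z @ z + 0.5 * C * np.sum(np.maximum(r, 0.0) ** 2)

def nssvm(H, C=1.0, p=1e4, eps=1e-4, sigma=1e-4, max_iter=100):
    """Newton iteration on G_p(z) = z - C H^T psi_p(1 - H z) = 0, Eq. (12).

    H has rows H_i = y_i [K_i 1] as in Sect. 2; the returned z is [beta^T, b]^T.
    """
    n = H.shape[1]
    z, I = np.zeros(n), np.eye(n)
    for _ in range(max_iter):
        r = 1.0 - H @ z
        g = z - C * H.T @ psi(r, p)                 # G_p(z), Eq. (12)
        if np.linalg.norm(g) < eps:                 # Step 1
            break
        Jac = I + C * (H.T * dpsi(r, p)) @ H        # grad G_p(z), Eqs. (16)-(17)
        d = np.linalg.solve(Jac, -g)                # Step 2
        lam, f0 = 1.0, f(z, H, C)
        while f(z + lam * d, H, C) > f0 + sigma * lam * (g @ d) and lam > 1e-10:
            lam *= 0.5                              # Step 3: Armijo backtracking
        z = z + lam * d                             # Step 4
    return z
```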

The following lemma is needed to establish the convergence rate of Algorithm 1.

Lemma 4

For ∇G_p(z) defined in (16), we have:

(1) \(d^{T} \nabla G_{p} (z)d \ge \Vert d \Vert _{2}^{2} \) for all d and z;

(2) \(\Vert \nabla G_{p} (z^{1})-\nabla G_{p} (z^{2})\Vert _{2}\le\kappa \Vert z^{1}-z^{2}\Vert _{2}\) for all z^1, z^2 and some κ>0.

Proof

(1) It is clear because all the eigenvalues of ∇G_p(z) are no smaller than 1.

(2) According to (16), we have

$$\bigl\Vert \nabla G_p \bigl(z^1 \bigr) - \nabla G_p \bigl(z^2 \bigr) \bigr\Vert _2 = C \bigl\Vert H^T \bigl( \widehat{\varLambda}_p \bigl(z^1 \bigr)-\widehat{\varLambda}_p \bigl(z^2 \bigr) \bigr) H \bigr\Vert _2 \le C \Vert H \Vert _2^2 \max_{1\le i\le m} \bigl|\psi''_p (\eta_i )\bigr| \bigl|H_i \bigl(z^1 - z^2 \bigr)\bigr| \le Cp \Vert H \Vert _2^3 \bigl\Vert z^1 - z^2 \bigr\Vert _2 , $$

where η_i is obtained by the Lagrange mean value theorem, the second inequality is implied by \(\psi''_{p} (t) \le p\) for any t in (7) together with \(|H_{i}(z^{1}-z^{2})|\le\Vert H\Vert_{2}\Vert z^{1}-z^{2}\Vert_{2}\), and \(\kappa := Cp\Vert H \Vert _{2}^{3}\). □

With the help of Lemma 4, the convergence rate of the new smoothing algorithm can be obtained as in the following Theorem 2. The proof of Theorem 3.2 in [17] can be used to prove Theorem 2 and is thus omitted here.

Theorem 2

Let \(\{z^{k}\}\) be the sequence generated by Algorithm 1 and \(\hat{z}\) be the unique solution of problem (12). Then \(\{z^{k}\}\) converges to \(\hat{z}\) quadratically.

Remark 1

In NSSVM, the objective function h_p(z) is expensive to evaluate because it contains the integral function φ_p(t) in (14), which affects the efficiency of the Armijo line-search procedure (Step 3). Since the error between h_p(z) and f(z) is \(O(\frac{1}{p^{2}})\) for a large p, we can use f(z) in place of h_p(z) in the Armijo line search. Experimental results show that this is a good choice.

4 Kernel versions extension

There are three ways to solve the nonlinear classification problem, where the mapped vector \({\bf x}_{i}\in{\mathbb{H}}\) is defined implicitly by the kernel function as \(\langle{\bf{x}}_{i},{\bf{x}}_{j}\rangle _{{\mathbb{H}}}=k(x_{i},x_{j})\). We summarize them here.

(i) According to GSVM [21], as mentioned above, problem (5) and its necessary and sufficient optimality conditions (11) are considered. If the optimal solution is z^*, then the resulting classification surface is \(\theta(x):=[k(x_{1},x),k(x_{2},x),\ldots,k(x_{m},x),\ 1]z^{*}=0\).

(ii) Stemming from the dual problem of (2) with σ=2, we have

    (18)

where \(D=\operatorname{diag}(y)\) and e=[1,1,…,1]^T. Decomposing the kernel matrix as K=UU^T, exactly or approximately, Zhou et al. [31] prove that problem (18) is equivalent to the following unconstrained problem

$$ \min_{z}\ \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2} \Vert r_+ \Vert _2^2 , $$
(19)

with \(r=e-D[U\ e]z\). If the optimal solution to (19) is z^*, then the solution to (18) is \(\alpha^{*} =C(e-Hz^{*})_{+}\), and the classification surface is \(\theta(x):=\sum_{\alpha^{*}_{i}>0}y_{i}\alpha^{*}_{i} (k(x_{i},x)+1)=0\).

(iii) Plugging (3) into (2) with σ=2 and noting \(z=[\beta^{T}\ b]^{T}\), \(r=e-D[K\ e]z\), we have

    (20)

The resulting classification surface is \(\theta(x):=[k(x_{1},x),\ldots,k(x_{m},x),1]z^{*}=0\).

In all three approaches, the optimality conditions of the corresponding programming problems are piecewise linear equations that can be smoothed by our new method, and the \(O(\frac{1}{p^{2}})\) error bound between the approximate solution and the exact optimal solution still holds. The boosting schemes discussed in Sect. 5 also apply to them.

In Sect. 7, we implement the smoothing algorithms based on case (i).

5 Boosting techniques

In this section, we investigate some techniques to boost the performance of these smoothing algorithms. To this end, the SMW identity is adopted to solve the Newton equation, and several schemes are studied to update the Hessian matrix with low computational complexity.

5.1 Solving Newton equation by SMW identity

If D is a diagonal matrix of small size (or with only a few significant diagonal entries), the following SMW identity [9] can be used to calculate \((I+H^{T}DH)^{-1}\) with less computational cost:

$$ \bigl(I + H^TD H \bigr)^{ -1}=I- H^T \bigl(D^{-1}+ HH^T \bigr)^{-1} H. $$
(21)

When Λ_p(z^k) in (10) or \(\widehat{\varLambda}_{p} (z^{k})\) in (17) has a sparse diagonal, the cost of obtaining the corresponding Newton direction based on (21) is much less than \(O(m^{3})\).

Take NSSVM as an example. For a large smoothing parameter p, \((\widehat{\varLambda}_{p} (z^{k}) )_{i,i}=\psi'_{p} (r^{k}_{i} )=\frac{\min\{1, \exp(p r^{k}_{i})\}}{1 + \exp( - p|r^{k}_{i}|)}\) in (17) is numerically zero or very tiny if \(r^{k}_{i} =1-H_{i} z^{k} < 0\), and \((\widehat{\varLambda}_{p} (z^{k}))_{i,i}\) is significantly positive if \(r^{k}_{i} \geq0\), which corresponds to a bounded support vector x_i at the current solution z^k. That is to say, the diagonal of \(\widehat{\varLambda}_{p}(z^{k})\) is commonly as sparse as the solution of SVM (the Lagrange multipliers corresponding to the input samples x_i). Let \(J^{k} = \{ i\mid( \widehat{\varLambda}_{p} (z^{k}) )_{ii} \ge\tau, i =1,2, \ldots, m\}\) with an extremely tiny positive τ. Then ∇G_p(z^k) can be approximated as follows:

$$ \nabla G_p \bigl(z^k \bigr) \approx I + CH_{J^k}^T \widehat{\varLambda}_{J^k ,J^k } H_{J^k}, $$
(22)

where \(H_{J^{k}}\) is the sub-matrix comprised of the rows of H indexed by J^k and \(\widehat{\varLambda}_{J^{k},J^{k}}\) is the sub-matrix of \(\widehat{\varLambda}_{p} (z^{k})\) corresponding to the rows and columns in J^k. Experimental results show that for τ=10^{−10} or less, the approximation error is negligible.

Based on the SMW identity (21) and ∇G_p(z^k) in (22), the Newton direction d^k of NSSVM is obtained as follows:

$$ d^k = - g^k + H_{J^k}^T \biggl( \frac{1}{C} \widehat{\varLambda}_{J^k ,J^k } ^{ - 1} + H_{J^k} H_{J^k}^T \biggr)^{ - 1} H_{J^k} g^k. $$
(23)

Let \(\widetilde{m}_{k}:=|J^{k}|\), which approximately equals the number of support vectors. The computational complexity of the algorithm is thus reduced from \(O(m^{3})\) to \(O(m\widetilde{m}_{k}^{2})\) per iteration. Experiments show that \(\widetilde{m}_{k}\) is much smaller than m after several iterations.

The analysis for SSVM is similar.
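As an illustration of (22)-(23), the direction computation might be organized as below (our own sketch; the variable names and threshold value are assumptions), so that the only dense linear system solved has size |J^k|:

```python
import numpy as np

def newton_direction_smw(H, r, g, C, p, tau=1e-10):
    """Newton direction for NSSVM via the SMW identity, Eqs. (22)-(23).

    H : (m, n) matrix with rows H_i;  r = 1 - H z  and  g = G_p(z) at the current iterate.
    """
    lam = np.exp(np.minimum(p * r, 0.0)) / (1.0 + np.exp(-p * np.abs(r)))  # psi_p'(r_i), Eq. (17)
    J = np.flatnonzero(lam >= tau)              # active index set J^k
    HJ, lamJ = H[J], lam[J]
    # small |J^k| x |J^k| system: ((1/C) diag(lamJ)^{-1} + HJ HJ^T) u = HJ g
    A = np.diag(1.0 / (C * lamJ)) + HJ @ HJ.T
    u = np.linalg.solve(A, HJ @ g)
    return -g + HJ.T @ u                        # d = -g + HJ^T u, Eq. (23)
```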

5.2 Updating the reduced Hessian with boosting skills

To handle large-scale problems, reduced techniques have been proposed to accelerate SVM training algorithms. In RSVM [16], the reduced set W in (4) is selected randomly from the index set {1,2,…,m} with \(\overline{m} :=| W | \le0.1 m\). Let \(\overline{K}\in {\mathbb{R}}^{m\times\overline{m}}\) be the sub-matrix comprised of the columns of the full kernel matrix K corresponding to the reduced set W. Then the formulation of RSVM in [15, 16, 20] is

$$ \min_{z\in{\mathbb{R}}^{\overline{m}+1}}\ \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2}\sum_{i = 1}^m \max \{0, r_i \}^2 , $$
(24)

which is very similar to (5); the differences lie in the dimension of z and in

$$ r_i:=1-\overline{H}_i z\quad \mbox{with } \overline{H}_i:=y_i[\overline{K}_i\ 1],\ i=1,2,\ldots,m. $$
(25)

RSVM smooths the objective function of (24) by the exponential penalty function (6), while our new algorithm smooths the necessary and sufficient optimality conditions of (24). Both algorithms solve the Newton equation

$$B^kd^k = -g^k $$

to obtain the direction d^k, where \(B^{k}:= I + C\overline{H}^{T} \overline{\varLambda}^{k} \overline{H} \in{\mathbb{R}}^{(\overline{m} + 1) \times (\overline{m} + 1)}\) with a diagonal matrix \(\overline{\varLambda}^{k} \in {\mathbb{R}}^{m \times m}\) defined as Λ_p(z^k) in (10) for RSVM and as \(\widehat{\varLambda}_{p} (z^{k} )\) in (17) for the new algorithm, in which r and \(\overline{H}\) are defined by (25).

There are three methods to further boost the performance of the algorithms:

Scheme I: :

Based on the sparsity of the diagonal elements of \(\overline{\varLambda}^{k}\), we have

$$ B^k:= I + C\overline{H}^T_{J^k} \overline{\varLambda}^k_{J^k,J^k} \overline{H} _{J^k}. $$
(26)

where \(J^{k}:= \{ i\mid \overline{\varLambda}_{i,i}^{k} \ge\tau, i=1,\ldots,m \}\) with a tiny positive τ. Let \(\widetilde{m}_{k} := | {J^{k} } |\). The total computational complexity per iteration is \(O(\widetilde{m}_{k} \overline{m}^{2}+\overline{m}^{3}) = O(\max\{\widetilde{m}_{k}, \overline{m}\}\overline{m}^{2})\), where \(O(\widetilde{m}_{k} \overline{m}^{2})\) is for computing B^k and \(O(\overline{m}^{3})\) is for solving the Newton equation.

Scheme II: :

B k in (26) can be updated iteratively as in [31]:

$$ B^k = B^{k-1}+C\overline{H}^T \varUpsilon^k\overline{H} =B^{k-1}+C\overline{H}_{J_{0}^k}^T \varUpsilon_{J_{0}^k,J_{0}^k}^k\overline{H}_{J_{0}^k}, $$
(27)

where \(\varUpsilon^{k} = \overline{\varLambda}^{k}-\overline{\varLambda}^{k-1}\) and \(J_{0}^{k}=\{ {i| {\varUpsilon_{i,i}^{k} \ne0, i=1,\ldots,m} } \}\). Let \(\widehat{m}_{k} :=|J_{0}^{k}|\). The total computational complexity is reduced to \(O(\widehat{m}_{k}\overline{m}^{2} + \overline{m}^{3})= O(\max\{\widehat{m}_{k},\overline{m}\}\overline{m}^{2})\) per iteration.

Scheme III: :

If SMW identity (21) is applied to solve the reduced Newton equation, we have

$$ d^k = - g^k + \overline{H}_{J^k}^T \biggl( \frac{1}{C} \bigl( {\varLambda_{J^k ,J^k }^k } \bigr)^{ - 1} + \overline{H}_{J^k} \overline{H}_{J^k}^T \biggr)^{ - 1} \overline{H}_{J^k} g^k. $$
(28)

The total complexity per iteration is reduced to \(O(\overline{m}\widetilde{m}_{k}^{2}+\widetilde{m}_{k}^{3})= O(\max\{\widetilde{m}_{k},\overline{m}\}\widetilde{m}_{k}^{2})\).

Experimental results in Sect. 7 show that \(\widehat{m}_{k} \leq \widetilde{m}_{k}\) holds clearly after several iterations, and that \(\widehat{m}_{k}\) tends to 0 while \(\widetilde{m}_{k}\) tends to the number of support vectors. That is to say, Scheme II is better than Scheme I.

In our implementation, we combine these schemes to get a better one:

Optimal Scheme: :

If \(\widetilde{m}_{k} > \overline{m}\), Scheme II is chosen, with complexity \(O(\widehat{m}_{k}\overline{m}^{2})\); otherwise, Scheme III is chosen, with complexity \(O(\overline{m} \widetilde{m}_{k}^{2})\).

Its computational complexity is often less than the \(O(m\overline{m}^{2})\) given in [20] for RSVM, and also less than the \(O(md^{2}_{max})\) given in [13] for PSVM, because \(\widehat{m}_{k}\) and \(\widetilde{m}_{k}\) are often much smaller than m.
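The Optimal Scheme can be sketched as follows (our own illustration with assumed variable names): the caller keeps the reduced Hessian B and the diagonal it was built from, Scheme II updates B incrementally as in (27) when the active set exceeds the reduced size, and Scheme III solves the small SMW system (28) otherwise.

```python
import numpy as np

def optimal_scheme_direction(Hbar, lam, g, C, state, tau=1e-10):
    """One reduced Newton direction, choosing between Scheme II and Scheme III.

    Hbar  : (m, mbar+1) reduced matrix with rows y_i [Kbar_i 1], cf. Eq. (25)
    lam   : current diagonal of the smoothing matrix (length m)
    state : dict holding B (reduced Hessian) and lamB (the diagonal B was built from)
    """
    nbar = Hbar.shape[1]
    J = np.flatnonzero(lam >= tau)                       # active set J^k, size m_tilde
    if len(J) > nbar:                                    # Scheme II: incremental update (27)
        ups = lam - state["lamB"]                        # Upsilon^k
        J0 = np.flatnonzero(ups)                         # changed entries, size m_hat
        state["B"] = state["B"] + C * (Hbar[J0].T * ups[J0]) @ Hbar[J0]
        state["lamB"] = lam.copy()
        d = np.linalg.solve(state["B"], -g)              # cost O(max{m_hat, mbar} mbar^2)
    else:                                                # Scheme III: SMW solve (28)
        HJ, lamJ = Hbar[J], lam[J]
        A = np.diag(1.0 / (C * lamJ)) + HJ @ HJ.T
        d = -g + HJ.T @ np.linalg.solve(A, HJ @ g)       # cost O(max{m_tilde, mbar} m_tilde^2)
    return d

# initialize with state = {"B": np.eye(nbar), "lamB": np.zeros(m)}
# so that the first Scheme II update builds B = I + C Hbar^T diag(lam) Hbar in full
```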

6 Tighter error bound for SSVM

Given the same smoothing parameter p, experimental results show that the accuracy of our new method is always very similar to that of SSVM. This is not consistent with the theoretical analysis, since the error bound of SSVM given in [17] is \(O(\frac{1}{p})\) while the error bound of our new method is \(O(\frac{1}{p^{2}})\). In the following we present a new proof that tightens the error bound of SSVM.

The difference between \(\psi^{2} _{p} (t)\) and \(\max\{t,0\}^{2}\) is analyzed in Lemma 5.

Lemma 5

Let ψ_p(t) be defined in (6). Then the following inequalities hold:

$$ 0\le\psi^2 _p (t) - \max\{t,0 \}^2 \leq\frac{1}{p^2}. $$
(29)

Proof

The left inequality is obvious and the right inequality is proven as follows.

If t≤0,

$$\psi^2 _p (t)-\max\{t,0\}^2=\psi^2 _p (t)=\frac{1}{p^2}\log^2\bigl (1+\exp\bigl(-p|t|\bigr)\bigr)\leq\frac{\log^2 2 }{p^2}\leq \frac{1}{p^2}. $$

If t>0,

$$\psi^2 _p (t)-\max\{t,0\}^2 = \frac{2t}{p}\log\bigl(1+\exp(-pt)\bigr)+\frac{1}{p^2}\log^2\bigl(1+\exp(-pt)\bigr) \le\frac{2t}{p}\exp(-pt)+\frac{1}{p^2}\exp(-2pt)\le\frac{1}{p^2}. $$

The first inequality holds because log(1+a)≤a for a≥0, and the last inequality holds because \(\phi(t):=\frac{1}{p^{2}}\exp(-2pt)+\frac{2}{p}t\exp(-pt)\) is a monotonically decreasing function for t≥0 (since \(\phi'(t)=\frac{2}{p}\exp(-pt) (1-pt- \exp(-pt) )\leq0\)) and hence \(\phi(t)\le\phi(0)=\frac{1}{p^{2}}\). □

A tighter error bound for SSVM can be proven using the same method as Theorem 1.

Theorem 3

If z^* is the unique solution of problem (5) or (11), and \(\bar{z}\) is the unique solution of problem (8), then the following inequalities hold:

$$\bigl \Vert {\bar{z} - z^* } \bigr \Vert _2^2 \le\frac{Cm}{2 p^2}\quad\mbox{\textit{and}}\quad f(\bar{z}) - f(z^* ) \le\frac {Cm}{2 p^2}. $$

By Theorem 3, the error bound of SSVM is also \(O(\frac{1}{p^{2}})\).

7 Experimental results

In this section, we perform experiments to compare the new algorithm (NSSVM) with SSVM [17] and PSVM [13], and to show the performance of the proposed boosting techniques; the results reported in [27] are also cited for comparison. All experiments are run on a personal computer with an Intel Core i3-2100 CPU and a maximum of 4 GB of memory available for all processes. The computer runs Windows 7 with Matlab 7.10. Matlab code for our algorithms is available at http://web.xidian.edu.cn/sszhou/en/paper.html

7.1 Comparing NSSVM and SSVM on the checkerboard dataset

This set of experiments uses a non-linearly separable example, the "tried and true" checkerboard dataset, first given in [10], which is widely used to show the effectiveness of nonlinear kernel methods [16, 17, 22, 23, 31]. The data are generated by uniformly discretizing the region [0,199]×[0,199] into 200²=40,000 points and labeling the two classes according to a 4×4 checkerboard grid (see [16, 17, 22, 23, 31] for plots of the dataset). The training set is randomly sampled from the 40,000 data points with different training sizes, and the remaining points are left as the testing set. The chosen kernel function is the Gaussian kernel \(K(x_{i},x_{j})=\exp(-0.001\Vert x_{i}-x_{j}\Vert^{2})\). We set ε=10^{−4} to stop the algorithms and τ=10^{−10} to prune the diagonal elements of the Hessian matrices, and let C=10^4, p=10^8.
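For reference, checkerboard data of this kind can be regenerated along the following lines (a hypothetical reconstruction: we assume a 4×4 board of alternating 50×50 cells on the 200×200 integer grid; the exact labeling convention of [10] may differ):

```python
import numpy as np

xx, yy = np.meshgrid(np.arange(200), np.arange(200))
X = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)          # the 200^2 = 40,000 points
y = np.where(((X[:, 0] // 50) + (X[:, 1] // 50)) % 2 == 0, 1, -1)    # assumed 4x4 checkerboard labels

rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
train, test = perm[:8000], perm[8000:]          # e.g. an 8,000-point training set, rest for testing

def gaussian_kernel(A, B, gamma=0.001):         # K(x_i, x_j) = exp(-0.001 ||x_i - x_j||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)
```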

Figure 1 shows the average change per iteration of \(\widetilde{m}_{k}=|J^{k}|\) in (26) and \(\widehat{m}_{k}=|J_{0}^{k}|\) in (27). We take the new algorithm as an example; the results obtained by SSVM are similar. In Fig. 1(a), the plots are obtained on a training set of 8,000 points with reduced size \(\overline{m}=800\), and the results are averages over the first 55 iterations of 20 random trials. In Fig. 1(b), the plots are obtained on a training set of 15,000 points with reduced size \(\overline{m}=1000\), and the results are averages over the first 65 iterations of 20 random trials. Semi-log plots with a logarithmic (base 10) Y-axis are used.

Fig. 1

Average tendency of \(\widetilde{m}_{k} \) and \(\widehat{m}_{k}\) per iteration on 20 random trials. Logarithmic (base 10) scale for the Y-axis is used. The figures show that our optimal scheme is efficient: The computational complexity is \(O(\widehat{m}_{k}\overline{m}^{2})\) for the first 5 (or 6) iterations (over the reduced size level) and is \(O(\overline{m} \widetilde{m}_{k}^{2})\) for the rest, which is much less than \(O(m\overline{m}^{2})\) for RSVM in [16, 20] and \(O(md^{2}_{max})\) for PSVM in [13]

From Fig. 1, it can be observed that \(\widehat{m}_{k}\) is often smaller than \(\widetilde{m}_{k}\), and both quickly converge to small numbers: \(\widetilde{m}_{k}\) tends to the number of support vectors (near 70), while \(\widehat{m}_{k}\) tends to a very small number, less than 10, during the iterations. Both are much smaller than the reduced size \(\overline{m}\) after 6 iterations. It can be concluded that the boosting techniques dramatically reduce the complexity of the algorithms: the computational complexity is \(O(\widehat{m}_{k}\overline{m}^{2})\) for the first 5 (or 6) iterations (above the reduced-size level) and \(O(\overline{m} \widetilde{m}_{k}^{2})\) for the rest (below the reduced-size level), which is less than the \(O(m\overline{m}^{2})\) for RSVM in [16, 20] and the \(O(md^{2}_{max})\) for PSVM in [13].

More training sets are randomly sampled from the checkerboard dataset with different training sizes for further experiments, where the algorithms with the reduced method and the boosting techniques of Sect. 5 are compared. Table 1 lists the experimental results. The reduced set, with size \(\overline{m} = 0.1 m\) bounded below by 100 and above by 1,000, is randomly selected from the training set. All the data in Table 1 are averages over 20 random experiments, and standard deviations are given in brackets for some columns. In the 5th and 6th columns, the mean training time of the proposed Optimal Scheme in Sect. 5.2 is compared with the mean training time of RSVM [16, 20] with only the sparsity considered, as in (26), which corresponds to Scheme I in Sect. 5.2. In the 7th column, we list the average value of ∥∇f∥ (∇f is defined in (11)) to check the quality of the approximate solutions in terms of the exact solutions of the original problem. In the 8th column, we list the average value of \(f(\hat{z})-f(\bar{z})\) to compare the quality of the solutions achieved by the two algorithms, where \(\hat{z}\) is obtained by the new smoothing algorithm and \(\bar{z}\) is obtained by SSVM.

Table 1 Experimental results with the proposed techniques on some medium-sized training datasets. All data are averages over 20 random trials, and standard deviations are given in brackets for columns 3–6; ∥∇f∥ (∇f is defined in (11)) in the 7th column checks the quality of the approximate solutions in terms of the exact solutions, and the 8th column compares the quality of the solutions achieved by the two algorithms, where \(\hat{z}\) is computed by NSSVM and \(\bar{z}\) by SSVM with the Optimal Scheme. "SS" and "NS" are short for "SSVM" and "NSSVM", respectively

The results in Table 1 show that the new algorithm works as well as SSVM [17]. The two algorithms obtain the same training accuracies (not listed) and testing accuracies, with only small differences in training time and number of iterations. This is consistent with our theoretical analysis that the two algorithms have similar error bounds. At the same time, according to the 5th and 6th columns, our Optimal Scheme saves nearly 1/3 of the training time. The 7th column shows that the obtained solutions are good approximations of the exact solutions of the original problems, since the mean values of ∥∇f∥ are all much smaller than the stopping tolerance ε=10^{−4}, i.e., all solutions satisfy the optimality conditions of the original problems with tiny errors. The average values of \(f(\hat{z})-f(\bar{z})\) in the last column show that the difference between the two algorithms is also very tiny, and they clearly satisfy the inequality \(-\frac{Cm}{2 p^{2}}\le f(\hat{z})-f(\bar{z}) = f(\hat{z})-f(z^{*})-(f(\bar{z})-f(z^{*}))\le\frac{Cm\pi^{2}}{6 p^{2}}\), which is implied by our theorems on the error bound.

7.2 Comparing SSVM, NSSVM and PSVM on some large benchmark datasets

This set of experiments evaluates the performance of SSVM, our new NSSVM, and the PSVM of [13]; the results reported in [27] are also cited for comparison with our algorithm. The Matlab code for PSVM is obtained from the site [2], which accompanies the paper [13]. Our code for SSVM and NSSVM is available at http://web.xidian.edu.cn/sszhou/en/paper.html.

All algorithms are compared on training time and classification accuracy with the reduced method, and the iteration counts of SSVM and our NSSVM are also given (we do not list the iterations of PSVM since they are not easy to count).

The reduced sets for SSVM and NSSVM are randomly chosen, while the reduced set for PSVM is iteratively updated by a more complicated procedure (see [13] for details). With the same reduced set size, PSVM is much slower than SSVM and NSSVM, but most of the time achieves slightly higher classification accuracies. So in Table 2, we halve the reduced set size of PSVM so that the classification accuracies of the three algorithms are comparable.

Table 2 Experimental results of the related algorithms on some larger benchmark datasets. All results are averages over 20 random trials, and standard deviations are given in brackets. Here "Tr", "Ts", "Dm" and "Rd" denote the training size, test size, sample dimension and reduced size of the input datasets, respectively, and "SS", "NS" and "PS" are short for "SSVM", "NSSVM" and "PSVM", respectively. Results reported in [27] for some sets by other popular SVM solvers are listed in the footnotes

Eight large benchmark datasets from the site of [5] are adopted. Some of them also appear in [13] and [27]. For simplicity, the Gaussian kernel \(k(x,y)=\exp(-\gamma\Vert x-y\Vert^{2})\) with different spread parameters γ is used for all datasets. The kernel spread parameter γ and the tradeoff parameter C are roughly chosen by 10-fold cross-validation. The details of the datasets and the corresponding selected parameters are listed as follows:

  • USPS3 and USPS8—USPS is a multi-class data set with 10 classes including 7,291 training samples and 2,007 test samples. Each sample has 256 features. Here USPS3 is used for the task of classifying the digit 3 versus the rest of the classes and USPS8 is used for the task of classifying the digit 8 versus the rest of the classes. The parameters are C=40 and γ=0.02.

  • Adult—It is the version given by Platt which has 32,561 training examples and 16,281 test examples. Each example has 123 binary features, and the parameters are C=1 and γ=0.05.

  • Shuttle—It is a multi-class data set with seven classes including 43,500 training examples and 14,500 test examples. Each example has 9 features. Here a binary classification problem is solved to separate class 1 from the rest, and the parameters used are C=10,000 and γ=2.

  • IJCNN—It has 49,990 training examples and 91,701 test examples. Each example is described by 22 features, and the parameters used are C=1,000 and γ=0.5.

  • MNIST3 and MNIST8—MNIST is a multi-class data set to classify the handwritten digits 0 to 9. It has 60,000 training examples and 10,000 test examples. Each example is described by 28×28 features. Here MNIST3 is the task of classifying digit 3 versus the rest of the classes and MNIST8 is the task of classifying digit 8 versus the other classes. The parameters used are C=40 and γ=0.02.

  • Vehicle—It is the combined SensIT Vehicle dataset in [5]. It has 78,823 training examples and 19,705 test examples in three classes. Each example has 100 features. Here the task of separating class 3 from the rest is trained, and the parameters are set as C=1,000 and γ=0.25.

The experimental results are listed in Table 2, where the abbreviations in the first column are explained in the table caption. All results are averages over 20 random trials, standard deviations are given in brackets, and the best values are in bold.

The results in Table 2 again show that the new algorithm works as well as SSVM [17]. The two algorithms obtain the same testing accuracies with only small differences in training time and number of iterations, which is consistent with our theoretical analysis that the two algorithms have similar error bounds. All the deviations (in brackets) are very small, so the mean values are reliable.

The results in Table 2 also show that the smooth algorithms with the proposed boosting techniques achieve test accuracies comparable to PSVM while reducing the training time by a factor of about 2 to 10. Notice that the PSVM code is more than 200 lines of Matlab, while our SSVM and NSSVM implementations are each less than 40 lines: the new smooth algorithms are better suited to common SVM users than PSVM.

Furthermore, as reported in the footnotes of Table 2, comparing our results against those obtained in [27] (on a more powerful computer), we can conclude that the smooth algorithms with our boosting techniques are faster than some popular SVM solvers while achieving comparable test accuracies.

Incidentally, in this set of experiments we found that Scheme II in Sect. 5.2 is the most effective boosting technique. On the last six large training datasets, whose reduced set size is relatively small, our Optimal Scheme is equivalent to Scheme II since \(\widetilde{m}_{k}\) is larger than the reduced set size \(\overline{m}\).

8 Conclusion

Quadratically convergent algorithms for training SVM in the primal space with smoothing methods are studied in this paper. In the primal space, SVM can be modeled as an unconstrained problem with a piecewise quadratic objective function, whose optimality conditions form a system of piecewise linear equations. We derive a new smoothing, quadratically convergent algorithm by smoothing these piecewise linear equations. This smoothing approach differs from SSVM [17], where the objective function is smoothed. We prove that both algorithms have the same approximation error bound \(O(\frac{1}{p^{2}})\), which is tighter than the result given in [17].

For large-scale nonlinear problems, reduced methods and boosting techniques are introduced to lower the computational complexity and improve the performance of these smoothing methods. Extensive experimental results support that the new smoothing algorithm is not only as good as SSVM, but also has a simpler Hessian (Jacobian) matrix. It can also be observed that the smoothing algorithms with the proposed boosting techniques are comparable with similar algorithms such as PSVM [13], while being simpler and easier to implement.