1 Introduction

Based on Vapnik and Chervonenkis' structural risk minimization principle [28, 29], the Support Vector Machine (SVM) has emerged as a state-of-the-art tool for classification and regression. SVM solves a convex quadratic programming problem, and its solution is not only explicitly defined but also sparse. SVM has been widely used in many application areas [28, 29], such as character recognition, disease diagnosis, and time series prediction.

Given a training dataset \(\{ (x_{i} ,y_{i}) \}_{i = 1}^{m} \subset{\mathbb{R}}^{n} \times \{ { - 1,1} \}\), SVM solves the following formulation (see [3, 13], etc.):

(1)

or the formulation with a regularization term \(\frac{1}{2}b^{2}\) of bias b added to the objective function [16, 17, 22, 23]:

(2)

with σ=1 or 2, tradeoff parameter C>0 and \({\bf{x}}_{i} := \phi(x_{i})\in{\mathbb{H}}\) for nonlinear classification problems, where ϕ(⋅) maps \(x_{i}\) to a high dimensional, even infinite dimensional, feature space. Owing to reproducing kernel theory, \({\mathbb{H}}\) is a reproducing kernel Hilbert space [13, 26] associated with a kernel function \(k:{\mathbb{R}}^{n} \times{\mathbb{R}}^{n} \to{\mathbb{R}}\) satisfying \(k(x_{i},x_{j})=\langle \phi(x_{i}),\phi(x_{j})\rangle_{{\mathbb{H}}}\). The learning result \({\bf{w}}\) can be represented by a linear combination of the kernel functions:

$$ {\bf{w}}=\sum_{i=1}^{m}\beta_i k(x_i,\cdot), $$
(3)

so the high dimensional map ϕ(⋅) need not be used explicitly. By plugging (3) into (1) or (2), the optimal coefficients \(\beta:=[\beta_{1}, \beta_{2},\ldots,\beta_{m}]^{T}\in{\mathbb{R}}^{m}\) and b can be obtained, and the classification surface \(\theta(x):=\sum_{i}\beta_{i} k(x_{i},x)+b=0\) follows.
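For concreteness, the decision rule induced by (3) can be sketched in a few lines of Python/NumPy (our own illustration, not code from the paper; the Gaussian kernel choice and all names here are assumptions):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows of X and Z."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * sq)

def decision_surface(X_train, beta, b, X_new, gamma=0.5):
    """theta(x) = sum_i beta_i k(x_i, x) + b, evaluated at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ beta + b

# a point x is assigned the label sign(theta(x))
```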

Many studies have been devoted to the dual of problem (1) or (2) [1, 7, 14, 22, 23, 25, 30]. For example, the well-known methods SMO [1, 14, 25] and SVMlight [11, 12] are decomposition methods for the dual of problem (1). The successive over-relaxation (SOR) [22] and Lagrangian Support Vector Machine (LSVM) [23] algorithms are simple iterative methods induced by the KKT conditions of the dual of problem (2). These algorithms work efficiently in experiments, although their theoretical convergence rate is at most linear [18, 19, 22, 23].

Lee and Mangasarian [17] proposed a quadratically convergent training algorithm called SSVM (Smooth SVM) for the linear problem, in which the objective function of problem (2) with σ=2 is smoothed by the exponential penalty function (Eq. (6) below); its nonlinear version is based on the Generalized Support Vector Machine [21], stated as Eq. (5) below. However, SSVM cannot be used to solve large-scale nonlinear classification problems because its computational complexity for solving the Newton equation is about \(O(m^{3})\) per iteration (the SMW identity (Eq. (21)) can greatly reduce the computational cost for the linear classification problem).

In [6], the kernel matrix K is precalculated and decomposed as \(K\approx UU^{T}\), where \(U\in{\mathbb{R}}^{m\times \widetilde{m}}\) is a full column rank matrix, and a semi-smooth SVM with quadratic convergence rate is proposed. Another semi-smooth Newton SVM, with improved experimental results, is proposed in [31]. These algorithms can reach \(O(m\widetilde{m}^{2})\) computational cost and are a good choice if \(\widetilde{m} \ll m\). However, they require the whole kernel matrix to fit in memory, hence they are not efficient for large-scale problems.

Some low-rank methods without the whole kernel matrix precalculated have been studied. In RSVM [16], a reduced set W is selected randomly from the index set {1,2,…,m} with \(\overline{m} :=| W | \le0.1m\) and (3) is replaced by a compact representation

$$ {\bf{w}}=\sum_{i\in W}\beta_i k(x_i,\cdot). $$
(4)

Although the reduced set W is a randomly selected small portion of the whole set, many experiments in [16] show that RSVM works extremely well and that its accuracy is quite insensitive to the choice of W. Its computational complexity is about \(O(m\overline{m}^{2})\) per iteration without the whole kernel matrix precalculated. Lin and Lin [20] discuss many implementations of RSVM in detail and conclude that the reduced methods are an appealing alternative for large-scale problems, at the price of slightly lower accuracy. Lee and Huang [15] study RSVM theoretically via the Nyström approximation and provide a good understanding of the reduced kernel techniques.

In [13], another low-rank method is proposed to reduce the training complexity in a primal algorithm called PSVM (primal SVM), where the subset W in (4) is chosen carefully and iteratively to represent \({\bf{w}}\) as in (4), and the resulting optimization problem is solved by the Newton method. Its computational complexity is at most \(O(m d^{2}_{max})\) per iteration, where d_max is a specified maximum size of the subset W. Compared to RSVM, PSVM may not be a better choice for common users because it needs a complicated sub-procedure to maintain W and the corresponding Hessian matrix iteratively.

Some efficient SVM solvers in the primal space have also been proposed recently. Fletcher and Zanghirati [8] proposed a new model for the SVM problem in the primal without introducing the penalization parameter C, and gave an efficient SQP-like algorithm; they observed that the algorithm terminated within a finite number of iterations but did not prove it. LapSVM [24] is a learning algorithm for semi-supervised classification, and Pegasos [27] is designed for large datasets using stochastic sub-gradient descent methods.

In this paper, we focus on quadratically convergent algorithms for training SVM in the primal space with the smoothing method, and the RSVM scheme is adopted to reduce the complexity for large-scale problems. We generalize the results in [16, 17] to derive a new smoothing technique and prove that the error bound between the new smooth problem and the original one is \(O(\frac{1}{p^{2}})\), which is stronger than that of SSVM in [17], where the error bound given is \(O(\frac{1}{p})\). We also tighten the error bound of SSVM to \(O(\frac{1}{p^{2}})\) in this paper. Some efficient reduced techniques, such as updating the Hessian matrix iteratively and solving the Newton equation by the SMW identity, are discussed to accelerate the algorithms for large-scale problems.

The rest of the paper is organized as follows. Lee and Mangasarian’s SSVM is restated in Sect. 2. We derive a new smoothing algorithm called NSSVM and obtain its error bound and convergence rate in Sect. 3. We further summarize the different kernel versions for these smoothing algorithms in Sect. 4, discuss some boosting techniques to reduce its computational complexity in Sect. 5, and tighten the error bound of SSVM in Sect. 6. Experimental results are presented in Sect. 7 to illustrate that the proposed algorithm is comparable to SSVM on generalization errors and training times. Section 8 concludes this paper.

2 SSVM Algorithm

In this section, we restate the Smooth Support Vector Machine (SSVM) proposed by Lee and Mangasarian [17], focusing on the nonlinear classification problem. The corresponding problem can be written as follows:

$$ \min_{z}\ f (z) := \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2}\sum_{i = 1}^m \max \{0, r_i \}^2 , $$
(5)

where \(z:=[\beta^{T},b]^{T}\), \(r_{i} :=1-H_{i} z\) with \(H_{i} := y_{i} [K_{i}\ 1]\in {\mathbb{R}}^{m+1}\) for nonlinear classification (K_i is the ith row of the kernel matrix K and H_i is the ith row of the matrix H). With the solution z^* of (5), the classification surface is \(\theta(x):=[k(x_{1},x),k(x_{2},x),\ldots,k(x_{m},x),\ 1]z^{*} =0\).

In [17], the exponential penalty function \(\psi_{p} (t):=t + \frac{1}{p}\log ( {1 + \exp ({ - p t } )} )\), p>0, is applied to smooth the second part of the objective function in (5). Here the following equivalent form of ψ_p(t) is used instead:

$$ \psi_p (t) = \max \{ t,0 \} + \frac{1}{p}\log \bigl( {1 + \exp \bigl( { - p |t|} \bigr)} \bigr), $$
(6)

and then

$$ \psi'_p (t) = \frac{\min\{1,\exp(p t)\}}{1 + \exp( - p|t|)},\qquad \psi''_p (t) = \frac{p\exp( - p|t|)}{ ( {1 + \exp( - p|t|)} )^2}. $$
(7)

Notice that log(1+exp(a)) may return an incorrect value when exp(a) overflows for a large positive number a, such as a>1000. The equivalent form in (6) and its derivatives in (7) avoid such potential overflow.
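The following NumPy sketch (ours, not the authors' released code) shows one way to evaluate the overflow-safe forms (6) and (7); note that the argument of every exponential is kept non-positive:

```python
import numpy as np

def psi(t, p):
    """psi_p(t) = max(t, 0) + log(1 + exp(-p|t|)) / p, Eq. (6); exp argument <= 0."""
    return np.maximum(t, 0.0) + np.log1p(np.exp(-p * np.abs(t))) / p

def dpsi(t, p):
    """psi_p'(t) = min(1, exp(p t)) / (1 + exp(-p|t|)), Eq. (7)."""
    return np.exp(np.minimum(p * t, 0.0)) / (1.0 + np.exp(-p * np.abs(t)))

def ddpsi(t, p):
    """psi_p''(t) = p exp(-p|t|) / (1 + exp(-p|t|))^2, Eq. (7)."""
    e = np.exp(-p * np.abs(t))
    return p * e / (1.0 + e) ** 2
```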

Let \(r:= (r_{1},r_{2},\ldots,r_{m})^{T} \in{\mathbb{R}}^{m}\) and ψ_p(r):=(ψ_p(r_1),ψ_p(r_2),…,ψ_p(r_m))^T. Then the smoothed version of problem (5) is

$$ \min_{z}\ f_p (z) := \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2} \bigl\Vert \psi_p (r) \bigr\Vert _2^2 . $$
(8)

SSVM solves (8) by the Newton method and converges quadratically; the Newton equation ∇²f_p(z)d=−∇f_p(z) is solved per iteration with

$$ \nabla f_p (z) = z - CH^T \operatorname{diag} \bigl( \psi'_p (r) \bigr) \psi_p (r), \qquad \nabla^2 f_p (z) = I + CH^T \varLambda_p ( z )H, $$
(9)
$$ \varLambda_p (z) := \operatorname{diag} \bigl( \psi'_p (r_1 )^2 + \psi_p (r_1 )\psi''_p (r_1 ), \ldots, \psi'_p (r_m )^2 + \psi_p (r_m )\psi''_p (r_m ) \bigr). $$
(10)

Its computational cost per iteration is \(O(m^{3})\). In Sect. 5, we illustrate that this cost can be reduced by the SMW identity [9]. The following lemma evaluates the approximation error bound:

Lemma 1

(Theorem 2.2 in [17])

Let \(\bar{z}\) be the unique minimizer of the smooth function f_p(z) in (8) and z^* be the unique minimizer of the original function f(z) in (5). Then we have:

$$ \bigl \Vert {\bar{z} - z^* } \bigr \Vert _2^2 \le\frac{{Cm\log^2 2}}{{2p^2 }} + \frac{{Cm\eta\log2}}{p},\qquad f(\bar{z}) - f\bigl(z^* \bigr) \le\frac{{Cm\log^2 2}}{{2p^2 }} + \frac{{Cm\eta\log2}}{p}. $$

where \(\eta=\max_{1\le i\le m} \{|1-H_{i} z^{*} |\}\).

Lemma 1 shows that the error bound between the smooth problem (8) and the original problem (5) is \(O(\frac{1}{p})\). In Sect. 3, we will propose a new smoothing method with a tighter error bound \(O(\frac{1}{p^{2}})\), and will enhance the error bound of SSVM to \(O(\frac{1}{p^{2}})\) in Sect. 6.

3 New proposed smoothing method and its properties

Let \(r_{+}\) be the vector whose ith element is max{r_i, 0}. The necessary and sufficient optimality conditions of problem (5) are

$$ \nabla f(z) = z - CH^T r_+ = 0. $$
(11)

The Newton method cannot be applied to these piecewise linear equations directly because ∇f(z) is non-differentiable at some points. Different from SSVM, which smooths the objective function of problem (5), we smooth the piecewise linear equations (11) with (6), and the resulting problem is the following system of nonlinear equations:

$$ G_p (z) := z-CH^T \psi_p ( r )= 0. $$
(12)

It can be solved by any Newton-type method. Next we will investigate the error bound and convergence rate of our new smoothing method.

3.1 Error bound of the new smoothing method

To analyze the error bound between the solution of the smooth problem and that of the original one, we introduce the following minimization problem (13), whose first-order necessary and sufficient optimality conditions are exactly the smooth nonlinear equations (12):

$$ \min_{z}\ h_p (z) := \frac{1}{2} \Vert z \Vert _2^2 + C\sum_{i = 1}^m \varphi_p (r_i ), $$
(13)

where

$$ \varphi_p (t) := \int_{ - \infty}^t {\psi _p (u)} du, $$
(14)

with ψ_p(u) defined by (6). Comparing f(z) in (5) and h_p(z) in (13), it can be observed that the term \(\frac{1}{2}\max\{0, r_{k}\}^{2}\) is approximated by φ_p(r_k) for k=1,2,…,m. The difference between them is analyzed in the following lemma.

Lemma 2

For φ p (t) as defined in (14), we have

$$ 0 \le\varphi_p (t) - \frac{1}{2} \max\{0, t\}^2 \le\frac{\pi^2 }{{6 p^2 }}. $$
(15)

Proof

First, φ_p(t) is a monotonically increasing function because ψ_p(u)≥0, and

$$\varphi_p (0)=\frac{1}{p}\int_{ - \infty}^0 {\log\bigl( {1 + \exp(pu)} \bigr)du}= \frac{\pi^2}{12 p^2}. $$

If t<0, we have

$$0\le\varphi_p (t) =\varphi_p (t) - \frac{1}{2} \max\{0, t\}^2 \le\varphi_p (0)= \frac{\pi^2}{12 p^2 } \le \frac{\pi^2 }{6 p^2}. $$

If t≥0, then \(\psi_{p} (t) = t + \frac{1}{p}\log ( {1 + \exp ( { - pt} )} )\), and we have

$$0\le\varphi_p (t) - \frac{1}{2} t^2 = \varphi_p (0)+ \int_0^t \psi_p (u)\,du - \frac{1}{2} t^2 = \frac{\pi^2}{12 p^2} + \frac{1}{p}\int_0^t \log\bigl(1+\exp(-pu)\bigr)\,du \le\frac{\pi^2}{12 p^2} + \frac{1}{p}\int_0^\infty \log\bigl(1+\exp(-pu)\bigr)\,du = \frac{\pi^2}{6 p^2}. $$

Thus (15) holds for any p and t. □
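The bound (15) is also easy to check numerically; the short script below (our own sanity check, evaluating the integral in (14) with scipy quadrature) verifies it on a grid of points:

```python
import numpy as np
from scipy.integrate import quad

p = 5.0

def psi(u):                      # psi_p in the overflow-safe form (6)
    return max(u, 0.0) + np.log1p(np.exp(-p * abs(u))) / p

def varphi(t):                   # varphi_p(t) = integral of psi_p over (-inf, t], Eq. (14)
    value, _ = quad(psi, -np.inf, t)
    return value

bound = np.pi**2 / (6.0 * p**2)
for t in np.linspace(-2.0, 2.0, 9):
    gap = varphi(t) - 0.5 * max(t, 0.0)**2
    assert -1e-8 <= gap <= bound + 1e-8   # 0 <= varphi_p(t) - max{0,t}^2/2 <= pi^2/(6 p^2)
```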

Next we bound the difference between h_p(z) in (13) and f(z) in (5) in Lemma 3.

Lemma 3

The difference between the smooth problem h p (z) and the original problem f(z) is bounded by the following inequalities:

$$0\le h_p (z)- f(z) \le\frac{Cm\pi^2}{6 p^2}. $$

This follows directly by applying Lemma 2 to each of the m terms \(\varphi_p (r_i) - \frac{1}{2}\max\{0, r_i\}^2\) in \(h_p(z)-f(z)\) and summing.

Let z^* be the unique solution of the original problem and \(\hat{z}\) be the unique solution of the smooth problem. Using Lemma 3, we derive the error bound between z^* and \(\hat{z}\), together with the bound between f(z^*) and \(f(\hat{z})\), in Theorem 1.

Theorem 1

If z^* is the unique solution of problem (5) or (11), and \(\hat{z}\) is the unique solution of problem (12) or (13), then the following inequalities hold:

$$\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 \le \frac{{C m\pi^2 }}{6 p^2}\quad\mbox{\textit{and}}\quad f(\hat{z}) - f\bigl(z^* \bigr ) \le \frac{{C m\pi^2 }}{6 p^2}. $$

Proof

Because the eigenvalues of the Hessian matrix of h_p(z) and the eigenvalues of any Clarke generalized Jacobian [4] of ∇f(z) are all no smaller than 1 and \(\nabla f(z^{*})=\nabla h_{p}(\hat{z})=0\), we have

$$h_p \bigl(z^* \bigr) - h_p (\hat{z}) \ge\frac{1}{2}\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 \quad\mbox{and}\quad f(\hat{z}) - f\bigl(z^* \bigr) \ge\frac{1}{2}\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 . $$

Adding these two inequalities and rearranging, we obtain

$$\bigl \Vert {\hat{z} - z^* } \bigr \Vert _2^2 \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) - \bigl( {h_p ( \hat{z}) - f(\hat{z})} \bigr) \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) \le\frac{{C m\pi^2 }}{6 p^2}. $$

The second and the last inequalities follow from Lemma 3. Similarly, we have

$$f(\hat{z}) - f\bigl(z^* \bigr) \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) - \bigl( {h_p \bigl(z^* \bigr) - h_p (\hat{z})} \bigr) \le h_p \bigl(z^* \bigr) - f\bigl(z^* \bigr) \le \frac{{Cm\pi^2 }}{6 p^2}. $$

 □

According to Theorem 1, the error bound of our smoothing method is \(O(\frac{1}{p^{2}})\), while the corresponding result for SSVM in [17] is \(O(\frac{1}{p})\), as listed in Lemma 1. So the error bound of the new method is better than that of SSVM.

3.2 New smoothing algorithm and its convergence rate

Computing the Jacobian matrix of the nonlinear equations (12), we have

$$ \nabla G_p (z) = I + CH^T \widehat{\varLambda}_p ( z )H, $$
(16)

where

$$ \widehat{\varLambda}_p (z) := \operatorname{diag} \bigl( {\psi'_p (r_1 ),\psi'_p (r_2 ), \ldots ,\psi'_p (r_m )} \bigr). $$
(17)

Comparing Λ_p(z) in (10) and \(\widehat{\varLambda}_{p}(z)\) in (17), it can be observed that ∇G_p(z) in (16) is simpler than ∇²f_p(z) defined in (9) for SSVM. The new smooth problem can be solved by the following Newton-type algorithm:

Algorithm 1

New Smoothing SVM Algorithm (NSSVM)

Step 0.:

Input data H and parameters p, C; give an initial point z^0; select ε>0 and σ∈(0,0.5). Set k:=0.

Step 1.:

Calculate g^k = G_p(z^k). If ∥g^k∥<ε, stop; otherwise go to Step 2;

Step 2.:

Calculate d^k by solving the Newton system ∇G_p(z^k)d=−g^k;

Step 3.:

(Armijo line search) Choose λ_k as the largest value in {1, 2^{−1}, 2^{−2}, …} such that

$$h_p\bigl(z^k + \lambda_k d^k \bigr)\le h_p\bigl(z^k \bigr)+ \sigma \lambda_k {g^k}^T d^k; $$
Step 4.:

Let z^{k+1} := z^k + λ_k d^k and k := k+1. Go to Step 1.
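For illustration, a compact NumPy implementation of Algorithm 1 might look as follows (our own sketch, not the released Matlab code; following Remark 1 below, the original objective f(z) of (5) is used as the merit function in the Armijo step, and all parameter defaults are assumptions):

```python
import numpy as np

def psi(t, p):    # overflow-safe psi_p, Eq. (6)
    return np.maximum(t, 0.0) + np.log1p(np.exp(-p * np.abs(t))) / p

def dpsi(t, p):   # psi_p', Eq. (7)
    return np.exp(np.minimum(p * t, 0.0)) / (1.0 + np.exp(-p * np.abs(t)))

def f(z, H, C):   # original objective (5), used as the merit function (cf. Remark 1)
    r = 1.0 - H @ z
    return 0.5 * z @ z + 0.5 * C * np.sum(np.maximum(r, 0.0) ** 2)

def nssvm(H, C=1.0, p=1e4, eps=1e-4, sigma=1e-4, max_iter=100):
    """Newton iteration on G_p(z) = z - C H^T psi_p(1 - H z) = 0, Eq. (12).

    H has rows H_i = y_i [K_i 1] as in Sect. 2; the returned z is [beta^T, b]^T.
    """
    n = H.shape[1]
    z, I = np.zeros(n), np.eye(n)
    for _ in range(max_iter):
        r = 1.0 - H @ z
        g = z - C * H.T @ psi(r, p)                 # G_p(z), Eq. (12)
        if np.linalg.norm(g) < eps:                 # Step 1
            break
        Jac = I + C * (H.T * dpsi(r, p)) @ H        # grad G_p(z), Eqs. (16)-(17)
        d = np.linalg.solve(Jac, -g)                # Step 2
        lam, f0 = 1.0, f(z, H, C)
        while f(z + lam * d, H, C) > f0 + sigma * lam * (g @ d) and lam > 1e-10:
            lam *= 0.5                              # Step 3: Armijo backtracking
        z = z + lam * d                             # Step 4
    return z
```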

The following lemma is needed to establish the convergence rate of Algorithm 1.

Lemma 4

For ∇G_p(z) defined in (16), we have:

(1) \(d^{T} \nabla G_{p} (z)d \ge \Vert d \Vert _{2}^{2} \) for all d and z;

(2) \(\Vert \nabla G_{p} (z^{1})-\nabla G_{p} (z^{2})\Vert _{2}\le\kappa \Vert z^{1}-z^{2}\Vert _{2}\) for all z^1, z^2 and some κ>0.

Proof

(1) It is clear because all the eigenvalues of ∇G_p(z) are no smaller than 1.

(2) According to (16), we have

$$\bigl\Vert \nabla G_p \bigl(z^1 \bigr) - \nabla G_p \bigl(z^2 \bigr) \bigr\Vert _2 = C \bigl\Vert H^T \bigl( \widehat{\varLambda}_p \bigl(z^1 \bigr)-\widehat{\varLambda}_p \bigl(z^2 \bigr) \bigr) H \bigr\Vert _2 \le C \Vert H \Vert _2^2 \max_{1\le i\le m} \bigl|\psi''_p (\eta_i )\bigr| \bigl|H_i \bigl(z^1 - z^2 \bigr)\bigr| \le Cp \Vert H \Vert _2^3 \bigl\Vert z^1 - z^2 \bigr\Vert _2 , $$

where η_i is obtained by the Lagrange mean value theorem, the second inequality is implied by \(\psi''_{p} (t) \le p\) for any t in (7) together with \(|H_{i}(z^{1}-z^{2})|\le\Vert H\Vert_{2}\Vert z^{1}-z^{2}\Vert_{2}\), and \(\kappa := Cp\Vert H \Vert _{2}^{3}\). □

With the help of Lemma 4, the convergence rate of the new smoothing algorithm can be obtained as in the following Theorem 2. The proof of Theorem 3.2 in [17] can be used to prove Theorem 2 and is thus omitted here.

Theorem 2

Let \(\{z^{k}\}\) be the sequence generated by Algorithm 1 and \(\hat{z}\) be the unique solution of problem (12). Then \(\{z^{k}\}\) converges to \(\hat{z}\) quadratically.

Remark 1

In NSSVM, the objective function h_p(z) is expensive to evaluate because it contains the integral function φ_p(t) in (14), which affects the efficiency of the Armijo line-search procedure (Step 3). Since the error between h_p(z) and f(z) is \(O(\frac{1}{p^{2}})\) for a large p, we can use f(z) in place of h_p(z) in the Armijo line search. Experimental results show that this is a good choice.

4 Kernel versions extension

There are three ways to solve the nonlinear classification problem, where the mapped vector \({\bf x}_{i}\in{\mathbb{H}}\) is defined implicitly by the kernel function as \(\langle{\bf{x}}_{i},{\bf{x}}_{j}\rangle _{{\mathbb{H}}}=k(x_{i},x_{j})\). We summarize them here.

(i) According to GSVM [21], as mentioned above, problem (5) and its necessary and sufficient optimality conditions (11) are considered. If the optimal solution is z^*, then the resulting classification surface is \(\theta(x):=[k(x_{1},x),k(x_{2},x),\ldots,k(x_{m},x),\ 1]z^{*}=0\).

(ii) Stemming from the dual problem of (2) with σ=2, we have

    (18)

where \(D=\operatorname{diag}(y)\) and e=[1,1,…,1]^T. Decomposing the kernel matrix as K=UU^T, exactly or approximately, Zhou et al. [31] prove that problem (18) is equivalent to the following unconstrained problem

$$ \min_{z}\ \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2} \Vert r_+ \Vert _2^2 , $$
(19)

with \(r=e-D[U\ e]z\). If the optimal solution to (19) is z^*, then the solution to (18) is \(\alpha^{*} =C(e-Hz^{*})_{+}\), and the classification surface is \(\theta(x):=\sum_{\alpha^{*}_{i}>0}y_{i}\alpha^{*}_{i} (k(x_{i},x)+1)=0\).

(iii) Plugging (3) into (2) with σ=2 and noting \(z=[\beta^{T}\ b]^{T}\), \(r=e-D[K\ e]z\), we have

    (20)

The resulting classification surface is \(\theta(x):=[k(x_{1},x),\ldots,k(x_{m},x),1]z^{*}=0\).

In all three approaches, the optimality conditions of the corresponding programming problems are piecewise linear equations that can be smoothed by our new method, and the \(O(\frac{1}{p^{2}})\) error bound between the approximate solution and the exact optimal solution still holds. The boosting schemes discussed in Sect. 5 also apply to them.

In Sect. 7, we implement the smoothing algorithms based on case (i).

5 Boosting techniques

In this section, we investigate some techniques to boost the performance of these smoothing algorithms. To this end, the SMW identity is adopted to solve the Newton equation, and several schemes are studied to update the Hessian matrix with low computational complexity.

5.1 Solving Newton equation by SMW identity

If D is a diagonal matrix of small size (or with only a few significant diagonal entries), the following SMW identity [9] can be used to calculate \((I+H^{T}DH)^{-1}\) with less computational cost:

$$ \bigl(I + H^TD H \bigr)^{ -1}=I- H^T \bigl(D^{-1}+ HH^T \bigr)^{-1} H. $$
(21)

When Λ_p(z^k) in (10) or \(\widehat{\varLambda}_{p} (z^{k})\) in (17) has a sparse diagonal, the cost of obtaining the corresponding Newton direction based on (21) is much less than \(O(m^{3})\).

Take NSSVM as an example. For a large smoothing parameter p, \((\widehat{\varLambda}_{p} (z^{k}) )_{i,i}=\psi'_{p} (r^{k}_{i} )=\frac{\min\{1, \exp(p r^{k}_{i})\}}{1 + \exp( - p|r^{k}_{i}|)}\) in (17) is numerically zero or very tiny if \(r^{k}_{i} =1-H_{i} z^{k} < 0\), and \((\widehat{\varLambda}_{p} (z^{k}))_{i,i}\) is significantly positive if \(r^{k}_{i} \geq0\), which corresponds to a bounded support vector x_i at the current solution z^k. That is to say, the diagonal of \(\widehat{\varLambda}_{p}(z^{k})\) is commonly as sparse as the solution of SVM (the Lagrange multipliers corresponding to the input samples x_i). Let \(J^{k} = \{ i\mid( \widehat{\varLambda}_{p} (z^{k}) )_{ii} \ge\tau, i =1,2, \ldots, m\}\) with an extremely tiny positive τ. Then ∇G_p(z^k) can be approximated as follows:

$$ \nabla G_p \bigl(z^k \bigr) \approx I + CH_{J^k}^T \widehat{\varLambda}_{J^k ,J^k } H_{J^k}, $$
(22)

where \(H_{J^{k}}\) is the sub-matrix comprised of the rows of H indexed by J^k and \(\widehat{\varLambda}_{J^{k},J^{k}}\) is the sub-matrix of \(\widehat{\varLambda}_{p} (z^{k})\) corresponding to the rows and columns in J^k. Experimental results show that for τ=10^{−10} or less, the approximation error is negligible.

Based on the SMW identity (21) and ∇G_p(z^k) in (22), the Newton direction d^k of NSSVM is obtained as follows:

$$ d^k = - g^k + H_{J^k}^T \biggl( \frac{1}{C} \widehat{\varLambda}_{J^k ,J^k } ^{ - 1} + H_{J^k} H_{J^k}^T \biggr)^{ - 1} H_{J^k} g^k. $$
(23)

Let \(\widetilde{m}_{k}:=|J^{k}|\), which approximately equals the number of support vectors. The computational complexity of the algorithm is thus reduced from \(O(m^{3})\) to \(O(m\widetilde{m}_{k}^{2})\) per iteration. Experiments show that \(\widetilde{m}_{k}\) is much smaller than m after several iterations.

The analysis for SSVM is similar.
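As an illustration of (22)-(23), the direction computation might be organized as below (our own sketch; the variable names and threshold value are assumptions), so that the only dense linear system solved has size |J^k|:

```python
import numpy as np

def newton_direction_smw(H, r, g, C, p, tau=1e-10):
    """Newton direction for NSSVM via the SMW identity, Eqs. (22)-(23).

    H : (m, n) matrix with rows H_i;  r = 1 - H z  and  g = G_p(z) at the current iterate.
    """
    lam = np.exp(np.minimum(p * r, 0.0)) / (1.0 + np.exp(-p * np.abs(r)))  # psi_p'(r_i), Eq. (17)
    J = np.flatnonzero(lam >= tau)              # active index set J^k
    HJ, lamJ = H[J], lam[J]
    # small |J^k| x |J^k| system: ((1/C) diag(lamJ)^{-1} + HJ HJ^T) u = HJ g
    A = np.diag(1.0 / (C * lamJ)) + HJ @ HJ.T
    u = np.linalg.solve(A, HJ @ g)
    return -g + HJ.T @ u                        # d = -g + HJ^T u, Eq. (23)
```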

5.2 Updating the reduced Hessian with boosting skills

To handle large-scale problems, reduced techniques have been proposed to accelerate SVM training algorithms. In RSVM [16], the reduced set W in (4) is selected randomly from the index set {1,2,…,m} with \(\overline{m} :=| W | \le0.1 m\). Let \(\overline{K}\in {\mathbb{R}}^{m\times\overline{m}}\) be the sub-matrix comprised of the columns of the full kernel matrix K corresponding to the reduced set W. Then the formulation of RSVM in [15, 16, 20] is

$$ \min_{z\in{\mathbb{R}}^{\overline{m}+1}}\ \frac{1}{2} \Vert z \Vert _2^2 + \frac{C}{2}\sum_{i = 1}^m \max \{0, r_i \}^2 , $$
(24)

which is very similar to (5); the differences lie in the dimension of z and in

$$ r_i:=1-\overline{H}_i z\quad \mbox{with } \overline{H}_i:=y_i[\overline{K}_i\ 1],\ i=1,2,\ldots,m. $$
(25)

RSVM smooths the objective function of (24) by the exponential penalty function (6), while our new algorithm smooths the necessary and sufficient optimality conditions of (24). Both algorithms solve the Newton equation

$$B^kd^k = -g^k $$

to obtain the direction d^k, where \(B^{k}:= I + C\overline{H}^{T} \overline{\varLambda}^{k} \overline{H} \in{\mathbb{R}}^{(\overline{m} + 1) \times (\overline{m} + 1)}\) with a diagonal matrix \(\overline{\varLambda}^{k} \in {\mathbb{R}}^{m \times m}\) defined as Λ_p(z^k) in (10) for RSVM and as \(\widehat{\varLambda}_{p} (z^{k} )\) in (17) for the new algorithm, in which r and \(\overline{H}\) are defined by (25).

There are three methods to further boost the performance of the algorithms:

Scheme I: :

Based on the sparsity of the diagonal elements of \(\overline{\varLambda}^{k}\), we have

$$ B^k:= I + C\overline{H}^T_{J^k} \overline{\varLambda}^k_{J^k,J^k} \overline{H} _{J^k}. $$
(26)

where \(J^{k}:= \{ i\mid \overline{\varLambda}_{i,i}^{k} \ge\tau, i=1,\ldots,m \}\) with a tiny positive τ. Let \(\widetilde{m}_{k} := | {J^{k} } |\). The total computational complexity per iteration is \(O(\widetilde{m}_{k} \overline{m}^{2}+\overline{m}^{3}) = O(\max\{\widetilde{m}_{k}, \overline{m}\}\overline{m}^{2})\), where \(O(\widetilde{m}_{k} \overline{m}^{2})\) is for computing B^k and \(O(\overline{m}^{3})\) is for solving the Newton equation.

Scheme II: :

B k in (26) can be updated iteratively as in [31]:

$$ B^k = B^{k-1}+C\overline{H}^T \varUpsilon^k\overline{H} =B^{k-1}+C\overline{H}_{J_{0}^k}^T \varUpsilon_{J_{0}^k,J_{0}^k}^k\overline{H}_{J_{0}^k}, $$
(27)

where \(\varUpsilon^{k} = \overline{\varLambda}^{k}-\overline{\varLambda}^{k-1}\) and \(J_{0}^{k}=\{ {i| {\varUpsilon_{i,i}^{k} \ne0, i=1,\ldots,m} } \}\). Let \(\widehat{m}_{k} :=|J_{0}^{k}|\). The total computational complexity is reduced to \(O(\widehat{m}_{k}\overline{m}^{2} + \overline{m}^{3})= O(\max\{\widehat{m}_{k},\overline{m}\}\overline{m}^{2})\) per iteration.

Scheme III: :

If SMW identity (21) is applied to solve the reduced Newton equation, we have

$$ d^k = - g^k + \overline{H}_{J^k}^T \biggl( \frac{1}{C} \bigl( {\varLambda_{J^k ,J^k }^k } \bigr)^{ - 1} + \overline{H}_{J^k} \overline{H}_{J^k}^T \biggr)^{ - 1} \overline{H}_{J^k} g^k. $$
(28)

The total complexity per iteration is reduced to \(O(\overline{m}\widetilde{m}_{k}^{2}+\widetilde{m}_{k}^{3})= O(\max\{\widetilde{m}_{k},\overline{m}\}\widetilde{m}_{k}^{2})\).

Experimental results in Sect. 7 show that \(\widehat{m}_{k} \leq \widetilde{m}_{k}\) holds clearly after several iterations, and that \(\widehat{m}_{k}\) tends to 0 while \(\widetilde{m}_{k}\) tends to the number of support vectors. That is to say, Scheme II is better than Scheme I.

In our implementation, we combine these schemes to get a better one:

Optimal Scheme: :

If \(\widetilde{m}_{k} > \overline{m}\), Scheme II is chosen, with complexity \(O(\widehat{m}_{k}\overline{m}^{2})\); otherwise, Scheme III is chosen, with complexity \(O(\overline{m} \widetilde{m}_{k}^{2})\).

Its computational complexity is often less than the \(O(m\overline{m}^{2})\) given in [20] for RSVM, and also less than the \(O(md^{2}_{max})\) given in [13] for PSVM, because \(\widehat{m}_{k}\) and \(\widetilde{m}_{k}\) are often much smaller than m.
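The Optimal Scheme can be sketched as follows (our own illustration with assumed variable names): the caller keeps the reduced Hessian B and the diagonal it was built from, Scheme II updates B incrementally as in (27) when the active set exceeds the reduced size, and Scheme III solves the small SMW system (28) otherwise.

```python
import numpy as np

def optimal_scheme_direction(Hbar, lam, g, C, state, tau=1e-10):
    """One reduced Newton direction, choosing between Scheme II and Scheme III.

    Hbar  : (m, mbar+1) reduced matrix with rows y_i [Kbar_i 1], cf. Eq. (25)
    lam   : current diagonal of the smoothing matrix (length m)
    state : dict holding B (reduced Hessian) and lamB (the diagonal B was built from)
    """
    nbar = Hbar.shape[1]
    J = np.flatnonzero(lam >= tau)                       # active set J^k, size m_tilde
    if len(J) > nbar:                                    # Scheme II: incremental update (27)
        ups = lam - state["lamB"]                        # Upsilon^k
        J0 = np.flatnonzero(ups)                         # changed entries, size m_hat
        state["B"] = state["B"] + C * (Hbar[J0].T * ups[J0]) @ Hbar[J0]
        state["lamB"] = lam.copy()
        d = np.linalg.solve(state["B"], -g)              # cost O(max{m_hat, mbar} mbar^2)
    else:                                                # Scheme III: SMW solve (28)
        HJ, lamJ = Hbar[J], lam[J]
        A = np.diag(1.0 / (C * lamJ)) + HJ @ HJ.T
        d = -g + HJ.T @ np.linalg.solve(A, HJ @ g)       # cost O(max{m_tilde, mbar} m_tilde^2)
    return d

# initialize with state = {"B": np.eye(nbar), "lamB": np.zeros(m)}
# so that the first Scheme II update builds B = I + C Hbar^T diag(lam) Hbar in full
```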

6 Tighter error bound for SSVM

Given the same smoothing parameter p, experimental results show that the accuracy of our new method is always very similar to that of SSVM. This is not consistent with the theoretical analysis, since the error bound of SSVM given in [17] is \(O(\frac{1}{p})\) while the error bound of our new method is \(O(\frac{1}{p^{2}})\). In the following we present a new proof that tightens the error bound of SSVM.

The difference between \(\psi^{2} _{p} (t)\) and \(\max\{t,0\}^{2}\) is analyzed in Lemma 5.

Lemma 5

Let ψ_p(t) be defined in (6). Then the following inequalities hold:

$$ 0\le\psi^2 _p (t) - \max\{t,0 \}^2 \leq\frac{1}{p^2}. $$
(29)

Proof

The left inequality is obvious and the right inequality is proven as follows.

If t≤0,

$$\psi^2 _p (t)-\max\{t,0\}^2=\psi^2 _p (t)=\frac{1}{p^2}\log^2\bigl (1+\exp\bigl(-p|t|\bigr)\bigr)\leq\frac{\log^2 2 }{p^2}\leq \frac{1}{p^2}. $$

If t>0,

$$\psi^2 _p (t)-\max\{t,0\}^2 = \frac{2t}{p}\log\bigl(1+\exp(-pt)\bigr)+\frac{1}{p^2}\log^2\bigl(1+\exp(-pt)\bigr) \le\frac{2t}{p}\exp(-pt)+\frac{1}{p^2}\exp(-2pt)\le\frac{1}{p^2}. $$

The first inequality holds because log(1+a)≤a for a≥0, and the last inequality holds because \(\phi(t):=\frac{1}{p^{2}}\exp(-2pt)+\frac{2}{p}t\exp(-pt)\) is a monotonically decreasing function for t≥0 (since \(\phi'(t)=\frac{2}{p}\exp(-pt) (1-pt- \exp(-pt) )\leq0\)) and hence \(\phi(t)\le\phi(0)=\frac{1}{p^{2}}\). □

A tighter error bound for SSVM can be proven using the same method as Theorem 1.

Theorem 3

If z^* is the unique solution of problem (5) or (11), and \(\bar{z}\) is the unique solution of problem (8), then the following inequalities hold:

$$\bigl \Vert {\bar{z} - z^* } \bigr \Vert _2^2 \le\frac{Cm}{2 p^2}\quad\mbox{\textit{and}}\quad f(\bar{z}) - f(z^* ) \le\frac {Cm}{2 p^2}. $$

By Theorem 3, the error bound of SSVM is also \(O(\frac{1}{p^{2}})\).

7 Experimental results

In this section, we perform experiments to compare the new algorithm (NSSVM) with SSVM [17] and PSVM [13], and to show the performance of the proposed boosting techniques; the results reported in [27] are also cited for comparison. All experiments are run on a personal computer with an Intel Core i3-2100 CPU and a maximum of 4 GB of memory available for all processes. The computer runs Windows 7 with Matlab 7.10. Matlab code for our algorithms is available at http://web.xidian.edu.cn/sszhou/en/paper.html

7.1 Comparing NSSVM and SSVM on the checkerboard dataset

This set of experiments uses a non-linearly separable example, the "tried and true" checkerboard dataset, first given in [10], which is widely used to show the effectiveness of nonlinear kernel methods [16, 17, 22, 23, 31]. The data are generated by uniformly discretizing the region [0,199]×[0,199] into 200²=40,000 points and labeling the two classes according to a 4×4 checkerboard grid (see [16, 17, 22, 23, 31] for plots of the dataset). The training set is randomly sampled from the 40,000 data points with different training sizes, and the remaining points are left as the testing set. The chosen kernel function is the Gaussian kernel \(K(x_{i},x_{j})=\exp(-0.001\Vert x_{i}-x_{j}\Vert^{2})\). We set ε=10^{−4} to stop the algorithms and τ=10^{−10} to prune the diagonal elements of the Hessian matrices, and let C=10^4, p=10^8.
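For reference, checkerboard data of this kind can be regenerated along the following lines (a hypothetical reconstruction: we assume a 4×4 board of alternating 50×50 cells on the 200×200 integer grid; the exact labeling convention of [10] may differ):

```python
import numpy as np

xx, yy = np.meshgrid(np.arange(200), np.arange(200))
X = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)          # the 200^2 = 40,000 points
y = np.where(((X[:, 0] // 50) + (X[:, 1] // 50)) % 2 == 0, 1, -1)    # assumed 4x4 checkerboard labels

rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
train, test = perm[:8000], perm[8000:]          # e.g. an 8,000-point training set, rest for testing

def gaussian_kernel(A, B, gamma=0.001):         # K(x_i, x_j) = exp(-0.001 ||x_i - x_j||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)
```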

Figure 1 shows the average change per iteration of \(\widetilde{m}_{k}=|J^{k}|\) in (26) and \(\widehat{m}_{k}=|J_{0}^{k}|\) in (27). We take the new algorithm as an example; the results obtained by SSVM are similar. In Fig. 1(a), the plots are obtained on a training set of 8,000 points with reduced size \(\overline{m}=800\), and the results are averages over the first 55 iterations of 20 random trials. In Fig. 1(b), the plots are obtained on a training set of 15,000 points with reduced size \(\overline{m}=1000\), and the results are averages over the first 65 iterations of 20 random trials. Semi-log plots with a logarithmic (base 10) Y-axis are used.

Fig. 1

Average tendency of \(\widetilde{m}_{k} \) and \(\widehat{m}_{k}\) per iteration on 20 random trials. Logarithmic (base 10) scale for the Y-axis is used. The figures show that our optimal scheme is efficient: The computational complexity is \(O(\widehat{m}_{k}\overline{m}^{2})\) for the first 5 (or 6) iterations (over the reduced size level) and is \(O(\overline{m} \widetilde{m}_{k}^{2})\) for the rest, which is much less than \(O(m\overline{m}^{2})\) for RSVM in [16, 20] and \(O(md^{2}_{max})\) for PSVM in [13]

From Fig. 1, it can be observed that \(\widehat{m}_{k}\) is often smaller than \(\widetilde{m}_{k}\), and both quickly converge to small numbers: \(\widetilde{m}_{k}\) tends to the number of support vectors (near 70), while \(\widehat{m}_{k}\) tends to a very small number, less than 10, during the iterations. Both are much smaller than the reduced size \(\overline{m}\) after 6 iterations. It can be concluded that the boosting techniques dramatically reduce the complexity of the algorithms: the computational complexity is \(O(\widehat{m}_{k}\overline{m}^{2})\) for the first 5 (or 6) iterations (above the reduced-size level) and \(O(\overline{m} \widetilde{m}_{k}^{2})\) for the rest (below the reduced-size level), which is less than the \(O(m\overline{m}^{2})\) for RSVM in [16, 20] and the \(O(md^{2}_{max})\) for PSVM in [13].

More training sets are randomly sampled from the checkerboard dataset with different training sizes for further experiments, where the algorithms with the reduced method and the boosting techniques of Sect. 5 are compared. Table 1 lists the experimental results. The reduced set, with size \(\overline{m} = 0.1 m\) bounded below by 100 and above by 1,000, is randomly selected from the training set. All the data in Table 1 are averages over 20 random experiments, and standard deviations are given in brackets for some columns. In the 5th and 6th columns, the mean training time of the proposed Optimal Scheme in Sect. 5.2 is compared with the mean training time of RSVM [16, 20] with only the sparsity considered, as in (26), which corresponds to Scheme I in Sect. 5.2. In the 7th column, we list the average value of ∥∇f∥ (∇f is defined in (11)) to check the quality of the approximate solutions in terms of the exact solutions of the original problem. In the 8th column, we list the average value of \(f(\hat{z})-f(\bar{z})\) to compare the quality of the solutions achieved by the two algorithms, where \(\hat{z}\) is obtained by the new smoothing algorithm and \(\bar{z}\) is obtained by SSVM.

Table 1 Experimental results with the proposed techniques on some medium-sized training datasets. All data are averages over 20 random trials, and standard deviations are given in brackets for columns 3–6; ∥∇f∥ (∇f is defined in (11)) in the 7th column checks the quality of the approximate solutions in terms of the exact solutions, and the 8th column compares the quality of the solutions achieved by the two algorithms, where \(\hat{z}\) is computed by NSSVM and \(\bar{z}\) by SSVM with the Optimal Scheme. "SS" and "NS" are short for "SSVM" and "NSSVM", respectively

The results in Table 1 show that the new algorithm works as well as SSVM [17]. The two algorithms obtain the same training accuracies (not listed) and testing accuracies, with only small differences in training time and number of iterations. This is consistent with our theoretical analysis that the two algorithms have similar error bounds. At the same time, according to the 5th and 6th columns, our Optimal Scheme saves nearly 1/3 of the training time. The 7th column shows that the obtained solutions are good approximations of the exact solutions of the original problems, since the mean values of ∥∇f∥ are all much smaller than the stopping tolerance ε=10^{−4}, i.e., all solutions satisfy the optimality conditions of the original problems with tiny errors. The average values of \(f(\hat{z})-f(\bar{z})\) in the last column show that the difference between the two algorithms is also very tiny, and they clearly satisfy the inequality \(-\frac{Cm}{2 p^{2}}\le f(\hat{z})-f(\bar{z}) = f(\hat{z})-f(z^{*})-(f(\bar{z})-f(z^{*}))\le\frac{Cm\pi^{2}}{6 p^{2}}\), which is implied by our theorems on the error bound.

7.2 Comparing SSVM, NSSVM and PSVM on some large benchmark datasets

This set of experiments evaluates the performance of SSVM, our new NSSVM, and the PSVM of [13]; the results reported in [27] are also cited for comparison with our algorithm. The Matlab code for PSVM is obtained from the site [2], which accompanies the paper [13]. Our code for SSVM and NSSVM is available at http://web.xidian.edu.cn/sszhou/en/paper.html.

All algorithms are compared on training time and classification accuracy with the reduced method, and the iteration counts of SSVM and our NSSVM are also given (we do not list the iterations of PSVM since they are not easy to count).

The reduced sets for SSVM and NSSVM are randomly chosen, while the reduced set for PSVM is iteratively updated by a more complicated procedure (see [13] for details). With the same reduced set size, PSVM is much slower than SSVM and NSSVM, but most of the time achieves slightly higher classification accuracies. So in Table 2, we halve the reduced set size of PSVM so that the classification accuracies of the three algorithms are comparable.

Table 2 Experimental results of the related algorithms on some larger benchmark datasets. All results are averages over 20 random trials, and standard deviations are given in brackets. Here "Tr", "Ts", "Dm" and "Rd" denote the training size, test size, sample dimension and reduced size of the input datasets, respectively, and "SS", "NS" and "PS" are short for "SSVM", "NSSVM" and "PSVM", respectively. Results reported in [27] for some sets by other popular SVM solvers are listed in the footnotes

Eight large benchmark datasets from the site of [5] are adopted. Some of them also appear in [13] and [27]. For simplicity, the Gaussian kernel \(k(x,y)=\exp(-\gamma\Vert x-y\Vert^{2})\) with different spread parameters γ is used for all datasets. The kernel spread parameter γ and the tradeoff parameter C are roughly chosen by 10-fold cross-validation. The details of the datasets and the corresponding selected parameters are listed as follows:

  • USPS3 and USPS8—USPS is a multi-class data set with 10 classes including 7,291 training samples and 2,007 test samples. Each sample has 256 features. Here USPS3 is used for the task of classifying the digit 3 versus the rest of the classes and USPS8 is used for the task of classifying the digit 8 versus the rest of the classes. The parameters are C=40 and γ=0.02.

  • Adult—It is the version given by Platt which has 32,561 training examples and 16,281 test examples. Each example has 123 binary features, and the parameters are C=1 and γ=0.05.

  • Shuttle—It is a multi-class data set with seven classes including 43,500 training examples and 14,500 test examples. Each example has 9 features. Here a binary classification problem is solved to separate class 1 from the rest, and the parameters used are C=10,000 and γ=2.

  • IJCNN—It has 49,990 training examples and 91,701 test examples. Each example is described by 22 features, and the parameters used are C=1,000 and γ=0.5.

  • MNIST3 and MNIST8—MNIST is a multi-class data set to classify the handwritten digits 0 to 9. It has 60,000 training examples and 10,000 test examples. Each example is described by 28×28 features. Here MNIST3 is the task of classifying digit 3 versus the rest of the classes and MNIST8 is the task of classifying digit 8 versus the other classes. The parameters used are C=40 and γ=0.02.

  • Vehicle—It is the combined SensIT Vehicle dataset in [5]. It has 78,823 training examples and 19,705 test examples in three classes. Each example has 100 features. Here the task of separating class 3 from the rest is trained, and the parameters are set as C=1,000 and γ=0.25.

The experimental results are listed in Table 2, where the abbreviations in the first column are explained in the table caption. All results are averages over 20 random trials, standard deviations are given in brackets, and the best values are in bold.

The results in Table 2 again show that the new algorithm works as well as SSVM [17]. The two algorithms obtain the same testing accuracies with only small differences in training time and number of iterations, which is consistent with our theoretical analysis that the two algorithms have similar error bounds. All the deviations (in brackets) are very small, so the mean values are reliable.

The results in Table 2 also show that the smooth algorithms with the proposed boosting techniques achieve test accuracies comparable to PSVM while reducing the training time by a factor of about 2 to 10. Notice that the PSVM code is more than 200 lines of Matlab, while our SSVM and NSSVM implementations are each less than 40 lines: the new smooth algorithms are better suited to common SVM users than PSVM.

Furthermore, as reported in the footnotes of Table 2, comparing our results against those obtained in [27] (on a more powerful computer), we can conclude that the smooth algorithms with our boosting techniques are faster than some popular SVM solvers while achieving comparable test accuracies.

Incidentally, in this set of experiments we found that Scheme II in Sect. 5.2 is the most effective boosting technique. On the last six large training datasets, whose reduced set size is relatively small, our Optimal Scheme is equivalent to Scheme II since \(\widetilde{m}_{k}\) is larger than the reduced set size \(\overline{m}\).

8 Conclusion

Quadratically convergent algorithms for training SVM in the primal space with smoothing methods are studied in this paper. In the primal space, SVM can be modeled as an unconstrained problem with a piecewise quadratic objective function, whose optimality conditions form a system of piecewise linear equations. We derive a new smoothing, quadratically convergent algorithm by smoothing these piecewise linear equations. This smoothing approach differs from SSVM [17], where the objective function is smoothed. We prove that both algorithms have the same approximation error bound \(O(\frac{1}{p^{2}})\), which is tighter than the result given in [17].

For large-scale nonlinear problems, reduced methods and boosting techniques are introduced to lower the computational complexity and improve the performance of these smoothing methods. Extensive experimental results support that the new smoothing algorithm is not only as good as SSVM, but also has a simpler Hessian (Jacobian) matrix. It can also be observed that the smoothing algorithms with the proposed boosting techniques are comparable with similar algorithms such as PSVM [13], while being simpler and easier to implement.