1 Introduction

In the information age, the mass production of information has caused serious information overload. Facing this dilemma, the support vector machine (SVM), a fast information classification algorithm, has become an effective solution. As a fully supervised statistical machine learning method, the SVM has been widely applied because of its good performance in information classification. However, to achieve a satisfactory classification standard, the SVM must be trained with large quantities of labeled data. In practice, this condition can rarely be fully met, as labeled data are usually difficult or expensive to acquire. In contrast, unlabeled data are abundant and easy to collect. Furthermore, relatively few labeled data lead to a frequent drawback, namely overfitting to the training data with a consequent loss of generality. To deal with this problem, the semi-supervised support vector machine (S3VM) learning method was proposed [1,2,3].

The semi-supervised support vector machine utilizes both labeled and unlabeled data for learning. The main goal of the S3VM is to employ a large collection of unlabeled data together with a limited amount of labeled data to improve classification accuracy. Because of its elegant properties, such as a unique global optimal solution and avoidance of the curse of dimensionality, many researchers have entered this area and applied the S3VM to many fields, such as text classification [4], multi-class human action recognition [5, 6], biomedical science [7, 8], graph reduction [9], image and video classification [10], and applications in industry and business [11, 12].

However, the main drawback of the S3VM is that its objective function is usually non-smooth. Solving it requires the heavy burden of two quadratic programming problems with matrix inversion, and fast algorithms cannot be used, which increases the computational complexity. Several advanced methods have been proposed to smooth the objective function. In 2005, Chapelle and Zien replaced the non-smooth term \(\max \{ 0,1 - \left| x \right|\}\) with \(\exp ( - 3x^{2} )\) and proposed the low density separation LDS-S3VM [3], but the approximation accuracy is not very high. In 2009, Liu et al. presented the polynomial function [13] \(P(x) = \frac{{1 - x^{2} }}{2} + \frac{1}{8}(1 - x^{2} )^{2} + \frac{1}{16}(1 - x^{2} )^{3} + \frac{5}{128}(1 - x^{2} )^{4} + \frac{7}{256}(1 - x^{2} )^{5} ,\) \(x \in [ - \frac{1}{k},\frac{1}{k}]\). However, this 10-order polynomial is too complex and requires too many calculations. Later, in 2013, Yang et al. offered a new smoothing strategy with the approximate function \(\rho_{\varepsilon } (x) = \sqrt {x^{2} + \varepsilon } \approx \left| x \right|\) [14] based on a robust difference of convex functions. This smooth method applied DC optimization algorithms to solve the S3VMs and did not add new variables or constraints to the corresponding S3VMs; it is a promising direction for facilitating S3VM research. Zhang et al. introduced a cubic spline function [15] \(s(x,k) = \frac{{k^{2} \left| x \right|^{3} }}{3} - kx^{2} - \frac{1}{3k} + 1,\) \((\left| x \right| \le \frac{1}{k}),\) and a quintic spline function [16] \(s(x,k) = - \frac{{k^{4} \left| x \right|^{5} }}{5} + \frac{1}{2}k^{3} x^{4} - kx^{2} - \frac{3}{10k} + 1,\) \((\left| x \right| \le \frac{1}{k})\) in 2015. Nevertheless, the above smoothing techniques are not entirely satisfactory.

Motivated by the works of [3, 13,14,15,16], a new research question gradually arises: is there another smoothing technique that improves accuracy while reducing the calculation scale? In this paper, a new class of Bézier smooth functions is applied. By employing the smooth Bézier function \(B_{n} (x)\) to approximate the non-smooth term \(\max \{ 0,1 - \left| t \right|\}\), a novel class of Bézier smooth semi-supervised support vector machines (BS4VMs) is derived. The new programming possesses the following attractive advantages: firstly, since the objective function becomes smooth and differentiable, fast gradient algorithms can be used to solve the BS4VMs, and much calculation time can be saved. Secondly, a new class of smooth functions is proposed, so the optimal smooth function can be selected for datasets of different scales. Lastly and most importantly, convergence analysis and experimental comparisons verify that BS4VMs are superior to the existing models in classification capability and efficiency.

To make the notation clearer, the definition of each variable involved in the equations is listed in Table 1. For example, all vectors are column vectors, and \(\nabla f(t)\) represents the gradient of the function \(f\).

Table 1 List of symbols

The rest of this paper is organized as follows. The preliminary background knowledge of the S3VM is introduced in Sect. 2. Section 3 shows how the BS4VMs are derived. The nonlinear BS4VMs with kernel functions are presented in Sect. 4, and a fast quasi-Newton algorithm for solving the programming follows in Sect. 5. The convergence analysis of the model is given in Sect. 6. The comparisons of the proposed algorithm with other advanced methods on four kinds of datasets are analyzed in Sect. 7. The discussion and conclusion follow in the last section.

2 Preliminary of semi-supervised support vector machine

The purpose of the S3VM for binary classification is to maximize the margin by using both labeled and unlabeled data. Consider a problem whose training data contain \(l\) labeled points \(\{ (x^{i} ,y_{i} )\}_{i = 1}^{l} ,y_{i} = \pm 1\) and \(u\) unlabeled points \(\{ x^{i} \}_{i = l + 1}^{l + u}\), where \(x^{i} { = (}x_{1}^{i} ,x_{2}^{i} ,...,x_{m}^{i} {)} \in {\mathbb{R}}^{m} .\) For linearly separable data, an optimal separating hyperplane with the largest margin should be sought for the S3VM classifier.

Let \(y \triangleq (y^{l} ,y^{l + u} )\) be a column vector, where \(y^{l} { = (}y_{{1}} {,}y_{2} ,...,y_{l} {)}^{\rm T}\) contains the known labels and \(y^{{l}{\text{ + u}}} { = (}y_{{l{ + 1}}} {,}y_{{l{ + }2}} ,...,y_{{l{ + }u}} {)}^{\rm T}\) contains the unknown labels. The label vector \(y^{l + u} = (y_{l + 1} ,...,y_{l + u} )^{\rm T}\) corresponding to the largest margin is the goal to be pursued. For the linear case, the S3VM can be described as

$$ \begin{gathered} \, J(w){ = }\min \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {\xi_{i} } + C^{*} \sum\limits_{j = l + 1}^{l + u} {\xi_{j} } \hfill \\ s.t. \, y_{i} (w^{\rm T} x^{i} + b) \ge 1 - \xi_{i} ,i = 1,...,l \hfill \\ \, \left| {w^{\rm T} x^{j} + b} \right| \ge 1 - \xi_{j} ,j = l + 1,...,l + u, \hfill \\ \, \xi = \{ \xi_{1} ,\xi_{2} ,...,\xi_{l + u} \} \ge 0 \hfill \\ \end{gathered} $$
(1)

where \(C\) and \(C^{*} ,\) the penalty parameters for both labeled and unlabeled data, are greater than zero. The programming (1) can be changed into the unconstrained form of

$$ \, J(w){ = }\mathop {\min }\limits_{{w \in {\mathbb{R}}^{m} ,b \in {\mathbb{R}}}} \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {L^{2} (y_{i} (w^{\rm T} x_{i} + b)) + } C^{*} \sum\limits_{i = l + 1}^{l + u} {L(\left| {w^{\rm T} x_{i} + b} \right|)} $$
(2)

in which \(L(t)\) is the hinge loss function and \(L(t) = \max (0,1 - t)\)[3].
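
To make the unconstrained form (2) concrete, a minimal numpy sketch of the objective evaluation is given below; the data, parameter values, and function names are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def hinge(t):
    """Hinge loss L(t) = max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)

def s3vm_objective(w, b, X_lab, y_lab, X_unlab, C=1.0, C_star=1.0):
    """Unconstrained S3VM objective of Eq. (2):
    0.5*||w||^2 + C * sum_i L^2(y_i (w^T x_i + b)) + C^* * sum_j L(|w^T x_j + b|)."""
    margins_lab = y_lab * (X_lab @ w + b)       # y_i (w^T x_i + b) for labeled points
    scores_unlab = np.abs(X_unlab @ w + b)      # |w^T x_j + b| for unlabeled points
    reg = 0.5 * np.dot(w, w)
    labeled_term = C * np.sum(hinge(margins_lab) ** 2)
    unlabeled_term = C_star * np.sum(hinge(scores_unlab))
    return reg + labeled_term + unlabeled_term

# tiny illustrative example with random data
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(5, 3)), np.array([1, -1, 1, -1, 1])
X_unlab = rng.normal(size=(8, 3))
print(s3vm_objective(rng.normal(size=3), 0.0, X_lab, y_lab, X_unlab))
```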

3 Bézier smooth semi-supervised support vector for classification

3.1 Background knowledge about the Bézier function

Bézier curves were invented in 1968 by the French engineer Pierre Bézier for the initial purpose of designing automobile bodies [18]. For a series of interpolation points \(P_{0} ,P_{1} , \cdots P_{n - 1} ,P_{n}\) that need to be fitted, the intermediate points \(P_{1} , \cdots P_{n - 1}\) are used to specify the endpoint tangent vectors. Hence the Bézier curve passes through \(P_{0}\) and \(P_{n}\) and approximates the other control points, as illustrated in Fig. 1. To accomplish this goal, some weighting functions representing the influence of the control points at a given point of the Bézier curve are required. Any function satisfying the requirements is allowed, but in most cases the Bernstein polynomial is employed. A Bézier curve of degree \(n\) can be expressed as \(B(t) = \sum\nolimits_{i = 0}^{n} {C_{i}^{n} } (t)P_{i} ,\) where \(P_{i}\) is the control point or anchor point and \(C_{i}^{n} (t)\) is the Bernstein polynomial given by \(C_{i}^{n} (t){ = }\binom{n}{i}(1 - t)^{n - i} t^{i}\), in which \(i \in \{ 0,1,...,n\}\).
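
As a small illustration of the formula \(B(t) = \sum\nolimits_{i = 0}^{n} C_{i}^{n}(t)P_{i}\), the following sketch evaluates a Bézier curve from its control points with the Bernstein basis; the control points in the example are arbitrary values chosen for demonstration.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis C_i^n(t) = binom(n, i) * (1 - t)^(n - i) * t^i."""
    return comb(n, i) * (1.0 - t) ** (n - i) * t ** i

def bezier_curve(control_points, t):
    """Evaluate B(t) = sum_i C_i^n(t) P_i for a degree-n Bézier curve."""
    P = np.asarray(control_points, dtype=float)           # shape (n + 1, dim)
    n = len(P) - 1
    weights = np.array([bernstein(n, i, t) for i in range(n + 1)])
    return weights @ P

# example: a quadratic curve that starts at P0, ends at P2 and is pulled toward P1
P = [(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
for t in (0.0, 0.5, 1.0):
    print(t, bezier_curve(P, t))
```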

Fig. 1 Schematic diagram of the Bézier interpolation function

Many advantages of Bézier curves have been noticed:

  (1) They always pass through the anchor points \(P_{0}\) and \(P_{n}\).

  (2) They are always tangent to the lines of the path \(P_{0} \to P_{1}\) and \(P_{n - 1} \to P_{n}\).

  (3) They always lie within the convex hull of the control points [19].

Owing to these good properties, Bézier curves have been widely applied in computer graphics, such as technical illustration programs, CAD programs, trajectory guidance, and so forth [20,21,22,23].

For approximating the hinge loss function, the quadratic parametric Bézier function can be expressed as \(\left\{ {\begin{array}{*{20}c} {B_{2x} (t) = (2t - 1)/k} \\ {B_{2y} (t) = ( - 2t^{2} + 2t)/k} \\ \end{array} } \right.\) in which \(p_{0} = ( - \frac{1}{k},0),p_{1} = (0,\frac{1}{k}),p_{2} = (\frac{1}{k},0)\). Eliminating the parameter \(t\) gives \(y = B_{2} (x) = - \frac{1}{2k}(k^{2} x^{2} - 1)\). Similarly, the cubic parametric Bézier function \(\left\{ \begin{gathered} B_{3x} (t) = (2t^{3} - 3t^{2} + 3t - 1)/k \hfill \\ B_{3y} (t) = ( - 3t^{2} + 3t)/k \hfill \\ \end{gathered} \right.\) is acquired by interpolating the four points \(p_{0} ,p_{1} ,p_{2} ,p_{3}\), in which \(p_{0} = ( - \frac{1}{k},0),p_{1} = p_{2} = (0,\frac{1}{k}),\) \(p_{3} = (\frac{1}{k},0).\) From the general formula \(B(t) = \sum\nolimits_{i = 0}^{n} {C_{i}^{n} } (t)P_{i} ,\) the n-order Bézier function \(y = B_{n} (x)\) is acquired.
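
The following sketch constructs \(B_{n}(x)\) numerically from the control points listed above (\(p_{0} = (-\frac{1}{k},0)\), \(p_{n} = (\frac{1}{k},0)\), and intermediate points at \((0,\frac{1}{k})\); extending this pattern beyond \(n = 3\) is an assumption of the sketch, although it is consistent with the value \(B_{4}(0,k) = \frac{7}{8k}\) appearing in Theorem 2). With these control points \(x(t)\) is strictly increasing, so the parameter is eliminated numerically by bisection; with \(k = 1\) the surrogate can be compared directly with the non-smooth term \(\max \{ 0,1 - \left| x \right|\}\).

```python
import numpy as np

def bezier_smooth(x, n=4, k=1.0, tol=1e-12):
    """n-order Bézier surrogate B_n(x) on [-1/k, 1/k] (zero outside).

    Control points: p_0 = (-1/k, 0), p_n = (1/k, 0), all intermediate points
    at (0, 1/k).  With this pattern the curve has the parametric form
        x(t) = (t**n - (1 - t)**n) / k,   y(t) = (1 - t**n - (1 - t)**n) / k,
    and x(t) is strictly increasing, so t can be recovered by bisection.
    """
    if abs(x) >= 1.0 / k:
        return 0.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:                       # invert x(t) = x
        mid = 0.5 * (lo + hi)
        if (mid ** n - (1.0 - mid) ** n) / k < x:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return (1.0 - t ** n - (1.0 - t) ** n) / k

def tent(x):
    """The non-smooth term max(0, 1 - |x|)."""
    return max(0.0, 1.0 - abs(x))

# with k = 1 the surrogate and the non-smooth term share the interval [-1, 1]
for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, tent(x), bezier_smooth(x, n=2), bezier_smooth(x, n=4))
# bezier_smooth(0.0, n=2) = 1/2 and bezier_smooth(0.0, n=4) = 7/8, matching B_2 and B_4
```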

Theorem 1

The Bézier curve \(B_{n} (x)\) is \((n - 1)\)-order smooth at the points \(x = \pm \frac{1}{k}\).

Proof

The proof is based on mathematical induction.

(i) \(\forall x \in \Omega ,B_{2} (x) = - \frac{1}{2k}(k^{2} x^{2} - 1)\) satisfies the following equalities at the points \(x = \pm \frac{1}{k}\)

$$ \left\{ \begin{gathered} B_{2} ( - \frac{1}{k}) = 0,B_{2} (\frac{1}{k}) = 0, \hfill \\ B_{2}^{\prime } ( - \frac{1}{k}) = 1,B_{2}^{\prime } (\frac{1}{k}) = - 1. \hfill \\ \end{gathered} \right. $$
(3)

So, \(B_{2} (x,k)\) is first-order smooth.

(ii) \(B_{3} (x)\) satisfies the following equalities at the points \(x = \pm \frac{1}{k}\),

$$ \left\{ {\begin{array}{*{20}c} \begin{gathered} B_{3} ( - \frac{1}{k}) = 0, \, B_{3} (\frac{1}{k}) = 0, \hfill \\ B_{3}^{\prime } ( - \frac{1}{k}) = 1, \, B_{3}^{\prime } (\frac{1}{k}) = - 1, \hfill \\ \end{gathered} \\ { \, B_{3}^{\prime \prime } ( - \frac{1}{k}) = 0, \, B_{3}^{\prime \prime } (\frac{1}{k}) = 0.} \\ \end{array} } \right. $$
(4)

Hence, \(B_{3} (x)\) is second-order smooth.

(iii) Let \(B_{{P_{0} P_{1} ...P_{n - 1} }}\) denote the Bézier curve determined by points \(P_{0} ,P_{1} ,...,P_{n - 1}\). Based on

$$ B(t) = B_{{P_{0} \cdots P_{n - 1} }} (t) = (1 - t)B_{{P_{0} \cdots P_{n - 2} }} (t) + tB_{{P_{1} \cdots P_{n - 1} }} (t), $$
(5)

according to mathematical induction, it can be proved that \(B_{n} (x)\) is \((n - 1)\)-order smooth.

3.2 Bézier smooth semi-supervised support vector for classification

From (2), the last term \(C^{*} \sum\nolimits_{i = l + 1}^{l + u} {L(\left| {w^{\rm T} x_{i} + b} \right|)}\) is non-smooth and difficult to solve [4], making formula (2) a mixed-integer quadratic program that is hard to solve. Replacing this term with the smooth function \(y = B_{n} (x),\) a new class of Bézier smooth semi-supervised support vector machines (BS4VMs) is derived, as described in formula (6)

$$ \mathop {\min }\limits_{w,b} \varphi (w,b) = \min \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {L^{2} (y_{i} (w^{\rm T} x_{i} + b)) + } C^{*} \sum\limits_{i = l + 1}^{l + u} {B_{n} (w^{\rm T} x_{i} + b)} . $$
(6)

In this paper, without loss of generality, the 4-order Bézier interpolation function \(y = B_{4} (x)\) is taken into consideration. The higher the order of the Bézier function, the better the approximation. The approximation comparison of different smooth models can be seen in Fig. 2.
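
As a minimal illustration of (6), the sketch below evaluates the smoothed objective with the closed-form quadratic surrogate \(B_{2}\) on the unlabeled term (the paper adopts \(B_{4}\); \(B_{2}\) is used here only to keep the example short). The data, parameter values, and function names are assumptions for demonstration.

```python
import numpy as np

def b2(x, k=1.0):
    """Quadratic Bézier surrogate B_2(x) = (1 - k^2 x^2) / (2k) on |x| <= 1/k, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0 / k, (1.0 - (k * x) ** 2) / (2.0 * k), 0.0)

def bs4vm_objective(w, b, X_lab, y_lab, X_unlab, C=1.0, C_star=1.0, k=1.0):
    """Smoothed BS4VM objective of Eq. (6): squared hinge on labeled points and a
    Bézier surrogate (here B_2) on the unlabeled points."""
    margins_lab = y_lab * (X_lab @ w + b)
    hinge_sq = np.maximum(0.0, 1.0 - margins_lab) ** 2
    smooth_unlab = b2(X_unlab @ w + b, k)
    return 0.5 * np.dot(w, w) + C * hinge_sq.sum() + C_star * smooth_unlab.sum()

# since the objective is now smooth, it can be handed to any gradient-based optimizer
rng = np.random.default_rng(1)
X_lab, y_lab = rng.normal(size=(6, 4)), np.array([1, 1, -1, -1, 1, -1])
X_unlab = rng.normal(size=(10, 4))
print(bs4vm_objective(rng.normal(size=4), 0.0, X_lab, y_lab, X_unlab))
```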

Fig. 2 Approximation comparison among the proposed models and the Bézier model with \(k = 1\)

From Fig. 2, one can find that (1) the 4-order Bézier function performs best among the 3-order Bézier function, the exponential function, the 10-order polynomial, the cubic spline function, and the quintic spline function in approximating the hinge loss function; (2) the 3-order Bézier function performs almost the same as the 10-order polynomial, while its calculation complexity is much lower.

4 The nonlinear kernel for BS4VM

For the nonlinear case, the kernel function \(k(x^{i} ,x^{j} ) = \phi (x^{i} )^{\rm T} \phi (x^{j} )\) can be applied to map the original data into a high-dimensional Hilbert space. After this transformation, the data become linearly separable in the feature space. Let \(\phi :R^{m} \to R^{d} \left( {d > m} \right)\) be the mapping function of formula (1). The nonlinear kernel-based S3VM can be written as

$$ \begin{gathered} \, J(w){ = }\min \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {\xi_{i} } + C^{*} \sum\limits_{j = l + 1}^{l + u} {\xi_{j} } \hfill \\ s.t. \, y_{i} (w^{\rm T} \phi (x^{i} ) + b) \ge 1 - \xi_{i} ,i = 1,...,l \hfill \\ \, \left| {w^{\rm T} \phi (x^{j} ) + b} \right| \ge 1 - \xi_{j} ,j = l + 1,...,l + u. \hfill \\ \, \xi = \{ \xi_{1} ,\xi_{2} ,...,\xi_{l + u} \} \ge 0 \hfill \\ \end{gathered} $$
(7)

In this paper, the Gaussian kernel \(k(x^{i} ,x^{j} ) = \exp ( - \left\| {x^{i} - x^{j} } \right\|_{2}^{2} /2\sigma^{2} )\) is adopted, and the kernel matrix \(K = [k(x^{i} ,x^{j} )]\) is positive semi-definite [17]. For formula (2), the variable \(w\) is expanded as \(w = \sum\nolimits_{i = 1}^{l + u} {u_{i} } \phi (x^{i} )\), in which \(u \in R^{l + u}\). The nonlinear S3VM is then achieved.

$$ \mathop {\min }\limits_{u,b} \varphi (u,b) = \min \frac{1}{2}\left\| u \right\|^{2} + C\sum\limits_{i = 1}^{l} {L^{2} (y_{i} (\sum\limits_{j = 1}^{l + u} {k(x_{i} ,x_{j} )u_{j} } + b)) + } C^{*} \sum\limits_{i = l + 1}^{l + u} {L(\left| {\sum\limits_{j = 1}^{l + u} {k(x_{i} ,x_{j} )u_{j} } + b} \right|)} . $$
(8)

Applying the n-order Bézier smooth function, the nonlinear BS4VM model with kernel function is offered.

$$ \mathop {\min }\limits_{u,b} \varphi (u,b) = \min \frac{1}{2}\left\| u \right\|^{2} + C\sum\limits_{i = 1}^{l} {L^{2} (y_{i} (\sum\limits_{j = 1}^{l + u} {k(x_{i} ,x_{j} )u_{j} } + b)) + } C^{*} \sum\limits_{i = l + 1}^{l + u} {B_{n} (\sum\limits_{j = 1}^{l + u} {k(x_{i} ,x_{j} )u_{j} } + b)} . $$
(9)

The objective function (9) is \(n - 1\)-order differentiable for any arbitrary kernel.
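
A hedged sketch of the kernelized objective in the spirit of (9) is given below, with the Gaussian kernel and, again, the quadratic surrogate \(B_{2}\) standing in for \(B_{n}\) on the unlabeled term; the expansion over all \(l + u\) training points and all names and values are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """k(x^i, x^j) = exp(-||x^i - x^j||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def b2(x, k=1.0):
    """Quadratic Bézier surrogate used as the smooth unlabeled loss."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0 / k, (1.0 - (k * x) ** 2) / (2.0 * k), 0.0)

def kernel_bs4vm_objective(u, b, K, y_lab, l, C=1.0, C_star=1.0, k=1.0):
    """Kernelized BS4VM objective in the spirit of Eq. (9): K is the (l+u) x (l+u)
    kernel matrix over all training points, u the coefficient vector, and the first
    l rows of K correspond to the labeled examples."""
    f = K @ u + b                      # decision values sum_j k(x_i, x_j) u_j + b
    margins_lab = y_lab * f[:l]
    hinge_sq = np.maximum(0.0, 1.0 - margins_lab) ** 2
    smooth_unlab = b2(f[l:], k)
    return 0.5 * np.dot(u, u) + C * hinge_sq.sum() + C_star * smooth_unlab.sum()

rng = np.random.default_rng(2)
X, l = rng.normal(size=(12, 3)), 5
y_lab = np.array([1, -1, 1, -1, 1])
K = gaussian_kernel(X, X)
print(kernel_bs4vm_objective(rng.normal(size=12), 0.0, K, y_lab, l))
```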

5 One fast quasi-Newton method for solving BS4VM

In this section, the sub-LBFGS algorithm is employed to solve the semi-supervised problem (1) [2, 24, 25]. Differentiating (2) by means of the subgradient, the following is obtained:

$$ \partial J(w) = w + C\sum\limits_{i = 1}^{l} {\beta_{i} y_{i} x_{i} + } C^{*} \sum\limits_{i = l + 1}^{l + n} {\beta_{i} y_{i} x_{i} ,} $$
(10)

where \(\beta_{i} : = \begin{cases} 1 & {\text{if }} i \in E,\;E: = \{ i:1 - y_{i} w^{\rm T} x_{i} > 0\} , \\ \psi ,\;\psi \in (0,1) & {\text{if }} i \in M,\;M: = \{ i:1 - y_{i} w^{\rm T} x_{i} = 0\} , \\ 0 & {\text{if }} i \in W,\;W: = \{ i:1 - y_{i} w^{\rm T} x_{i} < 0\} , \end{cases}\) and \(E,M{\text{ and }}W\) denote the sets of points that are in error, on the margin, and well classified, respectively. For a given direction \(p\), it is required to find a subgradient \(g\). Based on formula (10), Eq. (11) is obtained:

$$ \begin{gathered} \mathop {\sup }\limits_{{g \in \partial J(w_{t} )}} g^{\rm T} p = \mathop {\sup }\limits_{{\beta_{i} ,i \in M_{t} }} (w + C\sum\limits_{{i \in M_{t} }}^{{}} {\beta_{i} y_{i} x_{i} + } C^{*} \sum\limits_{{i \in M_{t} }}^{{}} {\beta_{i} y_{i} x_{i} } )^{\rm T} p \hfill \\ { = }w^{\rm T} p + C\sum\limits_{{i \in M_{t} }}^{{}} {\mathop {\sup }\limits_{{\beta_{i} \in [0,1]}} \beta_{i} y_{i} x_{i}^{\rm T} p + C^{*} \sum\limits_{{i \in M_{t} }}^{{}} {\mathop {\sup }\limits_{{\beta_{i} \in [0,1]}} \beta_{i} y_{i} x_{i}^{\rm T} p.} } \hfill \\ \end{gathered} $$
(11)
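
The index sets \(E,M,W\) used in (10) and (11) can be computed directly from the current margins. The following small sketch illustrates this partition (the bias \(b\) is included in the margin, and the data and names are assumptions):

```python
import numpy as np

def partition_points(w, b, X, y, tol=1e-9):
    """Split training points into the sets E (in error), M (on the margin) and
    W (well classified) according to the sign of 1 - y_i (w^T x_i + b)."""
    r = 1.0 - y * (X @ w + b)
    E = np.where(r > tol)[0]
    M = np.where(np.abs(r) <= tol)[0]
    W = np.where(r < -tol)[0]
    return E, M, W

rng = np.random.default_rng(4)
X, y = rng.normal(size=(10, 3)), rng.choice([-1, 1], size=10)
print(partition_points(rng.normal(size=3), 0.0, X, y))
```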

Now the S3VM algorithm with sub-LBFGS optimization solving procedure can be offered (Algorithm 1).

In step 3 of Algorithm 1, a classifier is obtained by first running the BS4VM on the labeled examples alone. Steps 5–17 show the loop iteration process for solving the objective programming. Step 9 identifies pairs of unlabeled examples with temporary positive and negative labels such that switching these labels would decrease the value of the objective function, as sketched below.
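
The sketch below mirrors that label-switching idea in simplified form: given a callable that returns the objective value for a candidate labelling of the unlabeled points, pairs of opposite temporary labels are switched whenever the swap lowers the objective. It is an illustration of the described procedure, not the paper's Algorithm 1, and the toy objective in the demo is an assumption.

```python
import numpy as np

def switching_loop(objective, y_temp0, max_iter=100):
    """Pairwise label switching: starting from temporary labels y_temp0 for the
    unlabeled points, switch a (+1, -1) pair whenever the swap lowers objective(y)."""
    y_temp = y_temp0.copy()
    best = objective(y_temp)
    for _ in range(max_iter):
        improved = False
        pos = np.where(y_temp == 1)[0]
        neg = np.where(y_temp == -1)[0]
        for i in pos:
            for j in neg:
                y_try = y_temp.copy()
                y_try[i], y_try[j] = -1, 1          # switch one (+, -) pair
                val = objective(y_try)
                if val < best:                       # keep the swap if it helps
                    y_temp, best, improved = y_try, val, True
                    break                            # recompute index sets after a swap
            if improved:
                break
        if not improved:
            break
    return y_temp, best

# toy demo: the "objective" simply counts disagreements with a hidden labelling
hidden = np.array([1, 1, -1, -1, 1, -1])
obj = lambda y: np.sum(y != hidden)
y0 = np.array([1, -1, 1, -1, -1, 1])
print(switching_loop(obj, y0))
```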

6 Convergence analysis of the Bézier function and BS4VM

This section shows the approximation precision of the Bézier function with respect to the hinge loss function and the convergence of the BS4VM. In addition, the convergence conclusion also holds for the nonlinear BS4VM.

6.1 Approximation accuracy analysis of Bézier function

Theorem 2

Let \(x \in R,\) \(k > 0,\) \(L(x)\) stand for the hinge loss function, and \(B_{4} (x,k)\) be the Bézier function with five interpolation points. Then the following results hold:

$$ 0 \le B_{4} (x,k) \le L(\left| x \right|,k) $$
(12)
$$ 0 \le L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) \le \frac{15}{{64k^{2} }} $$

Proof

(i) It is obvious that \(L(\left| x \right|,k) - B_{4} (x,k){ = }0\) holds with \(\left| x \right| > \frac{1}{k}.\) For \(x \in [ - \frac{1}{k},0),\) \(L(\left| x \right|,k)\) and \(B_{4} (x,k)\) are monotonically increasing, and \(L(\left| x \right|,k) - B_{4} (x,k) \ge\) \(L(\left| {\frac{ - 1}{k}} \right|,k) - B_{4} (\frac{ - 1}{k},k){ = }0\) is easy to obtain. For \(x \in \left[ {0,\frac{1}{k}} \right],\) \(L(\left| x \right|,k)\) and \(B_{4} (x,k)\) are monotonically decreasing, and there will be \(L(\left| x \right|,k) - B_{4} (x,k) \ge L(\left| \frac{1}{k} \right|,k) - B_{4} (\frac{1}{k},k){ = }0\). So \(0 \le B_{4} (x,k) \le L(\left| x \right|,k)\) is achieved.

(ii) \(L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k){ = }0\) holds with \(\left| x \right| > \frac{1}{k}.\) For \(x \in [ - \frac{1}{k},0),\) from (i), one can find \(L(\left| x \right|,k)\) and \(B_{4} (x,k)\) are monotonically increasing; therefore, \(0 \le L(\left| x \right|,k){ - }B_{4} (x,k) \le L(0,k){ - }B_{4} (0,k){ = }\frac{1}{8k}\) is established. As is known, \(L(\left| x \right|,k){ + }B_{4} (x,k) \le\) \(L(0,k) + B_{4} (0,k){ = }\frac{15}{{8k}},\)\(L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) \le \frac{1}{8k} \cdot \frac{15}{{8k}}{ = }\frac{15}{{64k^{2} }}\) will be derived. In short, \(0 \le L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) \le \frac{15}{{64k^{2} }}\) is proved.
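
A numerical spot-check of Theorem 2 is sketched below. The term \(L(\left| x \right|,k)\) is read as \(\max \{ 0,\frac{1}{k} - \left| x \right|\}\), which is what the values \(L(0,k) = \frac{1}{k}\) and \(L( \pm \frac{1}{k},k) = 0\) used in the proof imply, and \(B_{4}\) is built from the control-point pattern assumed earlier; both readings are assumptions of the sketch.

```python
import numpy as np

def b4(x, k=1.0, tol=1e-12):
    """4-order Bézier surrogate with control points (-1/k, 0), three copies of
    (0, 1/k), and (1/k, 0): x(t) = (t^4 - (1-t)^4)/k, y(t) = (1 - t^4 - (1-t)^4)/k,
    with the parameter recovered by bisection (x(t) is strictly increasing)."""
    if abs(x) >= 1.0 / k:
        return 0.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (mid ** 4 - (1.0 - mid) ** 4) / k < x:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return (1.0 - t ** 4 - (1.0 - t) ** 4) / k

def L(x, k=1.0):
    """Scaled hinge-type term with L(0, k) = 1/k and L(+-1/k, k) = 0."""
    return max(0.0, 1.0 / k - abs(x))

for k in (0.5, 1.0, 2.0):
    xs = np.linspace(-1.5 / k, 1.5 / k, 2001)
    gap = np.array([L(x, k) ** 2 - b4(x, k) ** 2 for x in xs])
    assert np.all(gap >= -1e-9)                      # 0 <= L^2 - B_4^2
    print(k, gap.max(), 15.0 / (64.0 * k ** 2))      # max gap vs the bound 15/(64 k^2)
```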

6.2 Convergence analysis of the BS4VM

Theorem 3

Let \(A \in R^{m \times n} ,b \in R^{m \times 1} ,\) and define two real functions \(g(x)\) and \(f(x,k)\) as follows:

$$ \begin{gathered} g(x) = \frac{1}{2}\left\| x \right\|_{2}^{2} + \frac{1}{2}\left\| {L(\left| {Ax + b} \right|)} \right\|_{2}^{2} + \frac{1}{2}\left\| {L(\left| {Ax + b} \right|)} \right\|, \hfill \\ f(x,k) = \frac{1}{2}\left\| x \right\|_{2}^{2} + \frac{1}{2}\left\| {B_{{4}} (Ax + b,k)} \right\|_{2}^{2} + \frac{1}{2}\left\| {B_{{4}} (Ax + b,k)} \right\|. \hfill \\ \end{gathered} $$
(13)

The following results can be achieved:

(1) \(\forall k > 0\), there will be \(\left\| {x_{k}^{*} - x^{*} } \right\| \le \frac{{{15}}}{{128k^{2} }}\)

(2) \(\mathop {\lim }\limits_{k \to \infty } \left\| {x_{k}^{*} - x^{*} } \right\| = 0\)

Proof

(i) Applying the first-order optimization condition and convex property of \(g(x)\) and \(f(x,k)\), formula (14) is attained,

$$ g(x_{k}^{*} ) - g(x^{*} ) \ge \nabla g(x^{*} )(x_{k}^{*} - x^{*} ) + \frac{1}{2}\left\| {x_{k}^{*} - x^{*} } \right\|_{2}^{2} = \frac{1}{2}\left\| {x_{k}^{*} - x^{*} } \right\|_{2}^{2} , $$
$$ f(x_{{}}^{*} ,k) - f(x_{k}^{*} ,k) \ge \nabla f(x^{*} )(x^{*} - x_{k}^{*} ) + \frac{1}{2}\left\| {x_{k}^{*} - x^{*} } \right\|_{2}^{2} = \frac{1}{2}\left\| {x_{k}^{*} - x^{*} } \right\|_{2}^{2} {. } $$
(14)

Based on formula (13) and the property \(B_{4} (x,k) \le L(\left| x \right|,k)\) from Theorem 2, formula (15) is acquired,

$$ \begin{gathered} \left\| {x_{k}^{*} - x^{*} } \right\| \le g(x_{k}^{*} ) - g(x^{*} ) + f(x_{{}}^{*} ,k) - f(x_{k}^{*} ,k) \hfill \\ { = (}f(x_{{}}^{*} ,k) - g(x^{*} ){) - (}f(x_{k}^{*} ,k) - g(x_{k}^{*} )) \hfill \\ \, \le g(x^{*} ) - f(x_{{}}^{*} ,k) \hfill \\ \, = \frac{1}{2}\left\| {L(Ax + b)} \right\|_{2}^{2} - \frac{1}{2}\left\| {B_{4} (Ax + b,k)} \right\|_{2}^{2} . \hfill \\ \end{gathered} $$
(15)

According to Theorem 2, for \(x \in [ - \frac{1}{k},\frac{1}{k}]\), \(L_{{}}^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) \le L_{{}}^{2} (0,k) - B_{4}^{2} (0,k){ = }\frac{{{15}}}{{64k^{2} }}\). So \(\left\| {x_{k}^{*} - x^{*} } \right\| \le \frac{1}{2}[L_{{}}^{2} (\left| x \right|,k) - B_{4}^{2} (x,k)] \le \frac{{{15}}}{{128k^{2} }}\) holds.

(ii) As \(\left\| {x_{k}^{*} - x^{*} } \right\| \le \frac{15}{{128k^{2} }}\), it is easy to draw the conclusion of \(\mathop {\lim }\limits_{k \to \infty } \left\| {x_{k}^{*} - x^{*} } \right\| = 0\). Theorem 3 is proved.

7 The experiments and comparisons

This section evaluates the performance, effectiveness and complexity of the proposed BS4VMs from two dimensions. The longitudinal dimension is the comparison of BS4VMs with three other smooth models: LDS4VM (S3VM with low density separation) [3], CS4VM (S3VM with cubic spline function) [15], and QS4VM (S3VM with quintic spline function) [16]. The horizontal dimension is the comparison of BS4VMs of different orders. This part lists three kinds of BS4VMs: BS4VM-I (S3VM with 2-order Bézier function), BS4VM-II (S3VM with 3-order Bézier function), and BS4VM-III (S3VM with 4-order Bézier function). Experiments are carried out on four kinds of datasets: artificial datasets, UCI datasets, the USPS dataset, and the large-scale NDC dataset. These four kinds of datasets differ significantly. Subsection 7.1 shows the experiment on a small artificial dataset named "checkerboard," produced by uniformly distributing points over two-dimensional regions; it is a nonlinearly separable dataset. In subsection 7.2, the UCI datasets are real-world datasets generated by statistical departments, electronic sensors, and reports. Some of them are multi-class and irregular, so preprocessing is required, and they have different data sizes. In subsection 7.3, the handwritten symbol dataset consists of 16 × 16 grayscale images of handwritten digits from '0' to '9'. These data come from the US Postal Service and belong to real-world digital pattern recognition. The last kind of dataset is NDC, namely normally distributed clusters, generated by the NDC algorithm. The algorithm generates a series of random centers for multivariate normal distributions, randomly generates a fraction for each center and a separating plane, chooses classes for the centers based on the plane, and then randomly draws points from the distributions. The size can be changed by the experimenter, so the NDC data are a good choice for testing on large-scale datasets.

Because the complexity of the 10-order polynomial function in [13] is too high and its calculation time exceeds the acceptable range, this section omits the algorithm of [13] from the comparison. As the parameters \(C\) and \(C^{*}\) are not sensitive to the classification accuracy, \(C = C^{*}\) is set, varying from \(10^{-2}\) to \(10^{2}\). All classifiers are implemented on a PC running 64-bit Windows 10 with an Intel i7 processor (1.6 GHz) and 16 GB RAM. The models are coded in MATLAB R2009a.

Experiments are set up according to the following rules: the ratio of labeled points varies from 5 to 65%, and the rest are unlabeled points, similar to the unlabeled data ratio evolving from 20 to 80% in [26]. The labeled ratio is set according to real-world missing-label scenarios. A 5% labeled ratio means the majority of data labels are missing, a demanding condition for detecting a good classifier. On the other hand, if the labeled ratio is more than 70%, so many labels are available that the gap between semi-supervised SVM and fully supervised SVM is quite small. Therefore, the labeled ratio is set from 5 to 65% with an interval of 20%. The labeled data are used for training the LDS4VM, CS4VM, QS4VM, and BS4VM, which then predict the unlabeled points. Before simulation, all databases are normalized and the two classes are labeled − 1 and + 1. Each experiment is carried out with tenfold cross-validation.
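
A small sketch of the data preparation described above is given below: min–max normalization, recoding the two classes to − 1/ + 1, and splitting each dataset into a labeled part at a given ratio and an unlabeled remainder. The random split, the synthetic data, and the function names are illustrative assumptions.

```python
import numpy as np

def prepare_split(X, y, labeled_ratio=0.25, seed=0):
    """Normalize features to [0, 1], recode the two classes to -1/+1 and split the
    data into a labeled subset (given ratio) and an unlabeled remainder."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Xn = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-12)    # min-max normalization
    classes = np.unique(y)
    y_pm = np.where(y == classes[0], -1, 1)                # two classes -> -1 / +1
    idx = rng.permutation(len(y))
    n_lab = max(1, int(labeled_ratio * len(y)))
    lab, unlab = idx[:n_lab], idx[n_lab:]
    # labels of the unlabeled part are kept only for evaluating the predictions
    return Xn[lab], y_pm[lab], Xn[unlab], y_pm[unlab]

# example with a synthetic two-class dataset at a 25% labeled ratio
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
X_lab, y_lab, X_unlab, y_unlab = prepare_split(X, y, 0.25)
print(X_lab.shape, X_unlab.shape)   # (25, 5) (75, 5)
```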

7.1 Experiment based on artificial dataset

The first experiment is designed to demonstrate the effectiveness of the BS4VM on the artificial nonlinear "tried and true" checkerboard dataset [27]. The checkerboard dataset is generated by uniformly distributing points over two-dimensional regions and labeling the two classes "White" and "Black." Each dimension has 100 points, so the checkerboard dataset has 10,000 samples for training and testing the algorithms, as Fig. 3 shows. The comparison result can be seen in Table 2.

Fig. 3 Figure of the checkerboard dataset

Table 2 Test accuracy on the checkerboard dataset with different labeled ratios (the bold part is the best result)

Table 2 demonstrates that (1) with the increase in the labeled ratio, the classification accuracy climbs on the whole; (2) the higher the order of the smoothing polynomial, the better the classification accuracy; (3) the checkerboard dataset is not suitable for too few labeled samples, as the result with a labeled ratio of 5% is not satisfactory. Lastly, the comparison in Table 2 shows that the BS4VM achieves the best classification accuracy.

7.2 Results on UCI datasets

In this subsection, eight real-world UCI datasets are chosen to test the four classification algorithms. This collection of databases was created in 1987 and has been widely used by the machine learning community for empirical analysis. It provides various datasets from many areas of real life, such as disease diagnosis, manufacturing, business, and so on. The calculation results are given in Table 3.

Table 3 Tenfold cross-validation results of the average correction with different ratios of labeled points on eight public datasets for the four algorithms (the bold part is the best result)

Table 3 illustrates the detailed comparisons of the proposed model with the other three models on eight different datasets. From Table 3, one can find that with the increase in the labeled ratio, all the algorithms show better classification accuracy. For the Clean dataset with the labeled ratio varying from 25 to 65%, the experimental results of BS4VM (accuracy 68.35%, 71.28%, 75.45%) outperform the other three algorithms, LDS4VM (66.95%, 65.94%, 72.90%), CS4VM (66.11%, 66.13%, 70.66%), and QS4VM (66.53%, 67.94%, 71.41%). This conclusion also holds for the Lympho, Bupa, Tumor, WDBC, and Adult datasets in most scenarios. For the Balance and German datasets, the advantage in classification accuracy fluctuates, and BS4VM performs a little better than the other three methods.

To describe the dynamic behavior of the test accuracy for each dataset under various labeled ratios, Fig. 4 is given. It presents the overall trend of these algorithms: all the curves tend to climb as the labeled ratio increases. Taking Data (4) for example, the red line stands for the proposed BS4VM method, the blue and black lines denote QS4VM and CS4VM, and the purple line denotes LDS4VM. For the labeled ratio of 5%, the accuracy of CS4VM is better than that of BS4VM. But as the ratio rises, the red line is always above the other three lines, indicating that BS4VM performs best at higher labeled ratios.

Fig. 4 The accuracy comparison of LDS4VM, CS4VM, QS4VM, and BS4VM on eight publicly available datasets, with 5%, 25%, 45%, and 65% labeled data: (1) Lympho, (2) Bupa, (3) Tumor, (4) Clean, (5) Balance, (6) German, (7) WDBC, (8) Adult

To further analyze the statistical accuracy more clearly, the average ranks of all the classifiers are computed and listed in Table 4 and Fig. 5. Table 4 indicates the average ranks on the eight datasets, calculated from the average value of each algorithm under different labeled ratios. The smaller the rank value, the higher the simulation accuracy. From the last row of Table 4, one can notice that BS4VM ranks in first place for the eight datasets, whereas the others take the second, third and fourth places.

Table 4 Accuracy average ranks of LDS4VM, CS4VM, QS4VM, BS4VM with linear kernel
Fig. 5 Correction average ranks of LDS4VM, CS4VM, QS4VM, BS4VM in each dataset with different labeled ratios

To verify the advantage of the proposed BS4VM algorithm, the Friedman statistical test is employed. The Friedman statistic is distributed according to \(\chi_{F}^{2}\) with \(k - 1\) degrees of freedom, where \(k\) is the number of algorithms and \(N\) is the number of datasets.

For the above experiment on the UCI datasets, under the null hypothesis that all the algorithms are equivalent, the Friedman statistic can be calculated as [28]

$$ \chi_{F}^{2} = \frac{12N}{{k(k + 1)}}\left[\sum\limits_{i = 1}^{4} {R_{i}^{2} } - \frac{{k(k + 1)^{2} }}{4}\right] = \frac{12 \times 8}{{4 \times 5}}\left[3.2188^{2} + 2.7031^{2} + 2.8281^{2} + 1.25^{2} - \frac{{4 \times 5^{2} }}{4}\right] = 10.6954 $$
$$ F_{F} = \frac{{(N - 1)\chi_{F}^{2} }}{{N(k - 1) - \chi_{F}^{2} }} = \frac{7 \times 10.6954}{{8 \times 3 - 10.6954}} = 5.6264 $$

For four algorithms and eight datasets, \(F_{F}\) is distributed with \((k - 1) = 3\) and \((k - 1)(N - 1) = 21\) degrees of freedom. The critical value of \(F(3,21)\) at significance level \(\alpha = 0.05\) is 3.072. Obviously, \(F_{F} = 5.6264 > F(3,21) = 3.072\), so the null hypothesis is rejected, which verifies that the four algorithms differ significantly.

After the null hypothesis is rejected, the Nemenyi test can proceed, in which all classifiers are compared to each other [28]. The performance of two classifiers differs significantly if the corresponding average ranks differ by at least the critical difference \(CD = q_{\alpha } \sqrt {\frac{k(k + 1)}{{6N}}}\). For the UCI experiment, \(CD = 2.291\sqrt {\frac{4 \times 5}{{6 \times 8}}} = 1.4788\) at \(\alpha = 0.1.\) As the average rank difference between LDS4VM and BS4VM (3.2188 − 1.25 = 1.9688) is larger than the critical difference 1.4788, the performance of BS4VM is significantly better than that of LDS4VM. Similarly, the performance of BS4VM is clearly superior to that of QS4VM (2.8281 − 1.25 = 1.5781 > 1.4788). Since 2.7031 − 1.25 = 1.4531 < 1.4788, the Nemenyi test cannot detect a significant difference between CS4VM and BS4VM.
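
The Friedman and Nemenyi quantities above can be reproduced from the average ranks alone; a short sketch is given below (the value \(q_{\alpha } = 2.291\) for four classifiers at \(\alpha = 0.1\) is taken from the text).

```python
import numpy as np

def friedman_and_nemenyi(avg_ranks, N, q_alpha=2.291):
    """Friedman chi-square and F statistics from average ranks over N datasets,
    plus the Nemenyi critical difference CD = q_alpha * sqrt(k(k+1) / (6N))."""
    R = np.asarray(avg_ranks, dtype=float)
    k = len(R)
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    CD = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
    return chi2, F, CD

# average ranks of LDS4VM, CS4VM, QS4VM, BS4VM on the eight UCI datasets (Table 4)
print(friedman_and_nemenyi([3.2188, 2.7031, 2.8281, 1.25], N=8))
# -> roughly (10.695, 5.626, 1.4788), matching the values reported above
```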

Figure 5 visually presents the accuracy ranks of the experimental results with different labeled ratios. One can find that the advantage of BS4VM varies, but from a statistical point of view the BS4VM performs best, as Table 4 shows. The proposed algorithm shows satisfactory performance in Fig. 5b–d for most cases. This reminds us that, when comparing machine learning algorithms, statistical results over a quantity of datasets are more precise and credible than a single calculation on one dataset.

7.3 Results on handwritten symbol recognition

In this section, the USPS handwritten dataset is investigated to show the impact of the number of labeled data on the classification accuracy. The handwritten database consists of grayscale images of handwritten digits from '0' to '9', as shown in Fig. 6.

Fig. 6 Ten number symbols of the USPS database

The comparisons of four pairwise digit tasks, '0' versus '8', '2' versus '4', '1' versus '7', and '3' versus '6', are given, respectively. The calculated accuracies and the dynamic process can be seen in Table 5 and Fig. 7. From Table 5 and Fig. 7, the classification accuracies of the pairs '0' versus '8' and '1' versus '7' reach more than 80%, even almost 99%, while the classification accuracies of the pairs '2' versus '4' and '3' versus '6' are less than 80%, even below 52%. Thus, the generalization ability of the S3VM varies, and a suitable dataset should be considered if one plans to carry out the identification process.

Table 5 Tenfold cross-validation results of the average correction and the number of labeled points on USPS database for four algorithms (the bold part is the best result)
Fig. 7 The average test accuracy of LDS4VM, CS4VM, QS4VM, BS4VM on the USPS dataset with various labeled ratios

Table 6 and Fig. 8 present the accuracy ranks of each dataset with various labeled percentages. Table 6 shows that the BS4VM ranks in first place, while the other three algorithms perform similarly. Figure 8 shows the accuracy rank of each calculation. Taking Fig. 8d for example, when the labeled data exceed 50%, the proposed learning algorithm is well trained and shows satisfactory precision.

Table 6 Average ranks of the four algorithms with linear kernel on USPS accuracy values
Fig. 8 Correction average ranks of LDS4VM, CS4VM, QS4VM, BS4VM in each dataset with different labeled ratios

The Friedman statistical method can also be applied on USPS dataset to compare these algorithms from a quantitative perspective. For the four algorithms and four datasets,

$$ \chi_{F}^{2} = \frac{12N}{{k(k + 1)}}\left[\sum\limits_{i = 1}^{4} {R_{i}^{2} } - \frac{{k(k + 1)^{2} }}{4}\right] = \frac{12 \times 4}{{4 \times 5}}\left[2.65625^{2} + 2.8125^{2} + 2.75^{2} + 1.78125^{2} - \frac{{4 \times 5^{2} }}{4}\right] = 10.1203 $$
$$ F_{F} = \frac{{(N - 1)\chi_{F}^{2} }}{{N(k - 1) - \chi_{F}^{2} }} = \frac{3 \times 10.1203}{{4 \times 3 - 10.1203}} = 16.1521 $$

\(F_{F}\) is distributed with \((k - 1) = 3\) and \((k - 1)(N - 1) = 9\) degrees of freedom. The critical value of \(F(3,9)\) at significance level \(\alpha = 0.05\) is 3.863. Obviously, \(F_{F} = 16.1521 > F(3,9) = 3.863\), so the null hypothesis is rejected and the four algorithms are shown to differ significantly. This means the generalization ability and robustness of the BS4VM are promising.

7.4 Results on large-scale NDC dataset for nonlinear Gaussian kernel

In the last subsection, to further verify which algorithm performs best in both accuracy and calculation time among the BS4VMs, experiments on the NDC dataset with the nonlinear Gaussian kernel are carried out. The NDC data are designed with a large number of attributes or a large number of samples to test the robustness of the new algorithms; as described above, they consist of normally distributed clusters produced by the NDC generator, and their size can be scaled as needed. Since large-scale datasets are commonly classified in the real world, the test accuracy and the calculation time should both be considered.

Table 7 and Fig. 9 show the performance of the three kinds of BS4VMs, namely BS4VM-I, BS4VM-II, and BS4VM-III, with different orders of the Bézier function. One can notice that (1) the BS4VMs classify the NDC datasets very well, and most of the results exceed 96%; (2) as the labeled ratio and the number of attributes climb from NDC1 to NDC5, the computing time increases quickly, whereas with the rise in the number of samples from NDC6 to NDC10, the calculation time does not go up dramatically; (3) because these three algorithms belong to the same kind of smoothing technique, the accuracy differences are quite small, but the accuracy of BS4VM-III ranks first in most cases. Meanwhile, the computing times of BS4VM-I and BS4VM-II rank in the top two on account of the higher complexity of BS4VM-III.

Table 7 The test correction and calculation time comparisons for Gaussian kernel (the bold part is the best result)
Fig. 9 The average test accuracy and calculation time of BS4VM-I, BS4VM-II, and BS4VM-III on ten NDC datasets with various labeled ratios

To clarify the comparison results, Table 8 lists the average ranks of BS4VM-I, BS4VM-II and BS4VM-III with the Gaussian kernel on accuracy and calculation time for NDC. From the statistics, the accuracy average rank of BS4VM-III is 1.8875, smaller than the other two, indicating that this method is superior. The calculation-time ranks of BS4VM-I and BS4VM-II are equal, revealing that their computing complexity is essentially the same even though BS4VM-II uses a higher-order Bézier function.

Table 8 Average ranks of BS4VM-I, BS4VM-II and BS4VM-III with Gaussian kernel on NDC correction and time values

For the purpose of verifying whether the performances of the three algorithms have significant difference, the Friedman statistical method is utilized. For this experiment with three methods and ten datasets, statistical results \(\chi_{F}^{2}\) and \(F_{F}\) will be

$$ \chi_{F}^{2} = \frac{12N}{{k(k + 1)}}\left[\sum\limits_{i = 1}^{3} {R_{i}^{2} } - \frac{{k(k + 1)^{2} }}{4}\right] = \frac{12 \times 10}{{3 \times 4}}\left[2.175^{2} + 1.95^{2} + 1.8875^{2} - \frac{{3 \times 4^{2} }}{4}\right] = 0.958 $$
$$ F_{F} = \frac{{(N - 1)\chi_{F}^{2} }}{{N(k - 1) - \chi_{F}^{2} }} = \frac{9 \times 0.958}{{10 \times 3 - 0.958}} = 0.4528 $$

The critical value of \(F(2,18)\) at significance level \(\alpha = 0.05\) is 3.555. Visibly, \(F_{F} = 0.4528 < F(2,18) = 3.555\), so it is verified quantitatively that these three algorithms have no significant differences. It is suggested that if high accuracy matters most, a higher-order BS4VM should have priority, whereas if calculation time weighs heavily, a lower-order BS4VM should be chosen.

For the goal of visual expression, the diversities of classification correction and calculation time of each dataset with the variety of labeled ratio, the histogram Figs. 10 and 11 are given.

Fig. 10 Correction average ranks of BS4VM-I, BS4VM-II and BS4VM-III in each dataset with different labeled ratios

Fig. 11 Time average ranks of BS4VM-I, BS4VM-II and BS4VM-III in each dataset with different labeled ratios

From Fig. 10c and d, the classification precision of BS4VM-III lies at the forefront when the labeled proportion is above 45%. However, this superior performance comes at the cost of more complex calculation, as Fig. 11c and d shows. From the ranks of calculation time, BS4VM-I shows the best performance in Fig. 11a, b and d, since a lower-order Bézier function entails less computational complexity.

8 Conclusion

Considering that the non-smooth term of semi-supervised support vector machines blocks the improvement in classification accuracy, a new class of Bézier functions is utilized to approximate the hinge loss function, and a novel kind of Bézier smooth semi-supervised support vector machines (BS4VMs) is constructed. The convergence analysis proves that the proposed model theoretically approaches the non-smooth objective function. As the n-order Bézier function is \(n - 1\)-order smooth and differentiable, fast algorithms can be used to solve the programming. In contrast to the LDS4VM, CS4VM, and QS4VM, experiments on artificial data, UCI data, the USPS handwritten database, and NDC datasets clearly show that the BS4VMs have better performance and efficiency than the exponential function, cubic spline function, and quintic spline function. Moreover, the proposed algorithms show good performance on large-scale datasets. Since the advantages of BS4VMs of different orders vary, attention should be paid to whether performance or efficiency has priority when applying them. For further research, feature selection and fuzzy membership should be good ways to improve the accuracy on different kinds of datasets. Bézier functions for semi-supervised SVM regression and their generalization performance will be explored as well.