Abstract
The semi-supervised support vector machine (S3VM) for classification is designed to exploit the large quantities of unlabeled data found in the real world: labeled data are used to train the algorithm, which is then applied to classify the unlabeled data. However, the algorithm has several drawbacks. The non-smooth term in the semi-supervised objective function degrades classification precision, and solving two quadratic programming problems with matrix inversion imposes a heavy computational burden. To cope with these problems, this article puts forward a novel class of Bézier smooth semi-supervised support vector machines (BS4VMs), based on the ability of the Bézier function to approximate the non-smooth term. Thanks to this approximation, a fast quasi-Newton method can be used to solve the BS4VMs, reducing the computation time. The new algorithm also enhances the generalization and robustness of the S3VM in the nonlinear case. Further, to show how the BS4VMs can be implemented in practice, experiments on synthetic data, UCI datasets, the USPS dataset, and the large-scale NDC database are offered. Theoretical analysis and experimental comparisons clearly confirm the superiority of BS4VMs in both classification accuracy and computation time.
1 Introduction
In the information age, the mass production of information has caused serious information overload. Facing this dilemma, the support vector machine (SVM), a fast information classification algorithm, has become an effective solution. As a fully supervised statistical machine learning method, the SVM has been widely applied for its good performance in information classification. However, to achieve a satisfactory classification standard, the SVM must be trained on large quantities of labeled data. In practice, this condition can seldom be met, as labeled data are usually difficult or expensive to acquire. In contrast, unlabeled data are abundant and easy to collect. Furthermore, relatively few labeled data lead to a frequent drawback: overfitting to the training data with a consequent loss of generality. To deal with this problem, the semi-supervised support vector machine (S3VM) learning method has been proposed [1,2,3].
The semi-supervised support vector machine utilizes both labeled and unlabeled data for learning. The main goal of the S3VM is to employ a large collection of unlabeled data together with limited labeled data to improve classification accuracy. Because of its elegant properties, namely a unique global optimal solution and avoidance of the curse of dimensionality, many scholars have entered this area and applied the S3VM to fields such as text classification [4], multi-class human action recognition [5, 6], biomedical science [7, 8], graph reduction [9], image and video classification [10], and applications in industry and business [11, 12].
However, the main drawback of the S3VM is that its objective function is usually non-smooth. Solving it requires two quadratic programming problems with matrix inversion, a heavy computational burden, and fast gradient-based algorithms cannot be applied, which increases the computational complexity. Several advanced methods have been proposed to smooth the objective function. In 2005, Chapelle and Zien replaced the non-smooth term \(\max \{ 0,1 - \left| x \right|\}\) with \(\exp ( - 3x^{2} )\) and proposed the low density separation LDS-S3VM [3], but the approximation accuracy is not high. In 2009, Liu et al. introduced the polynomial function [13] \(P(x) = \frac{{1 - x^{2} }}{2} + \frac{1}{8}(1 - x^{2} )^{2} + \frac{1}{16}(1 - x^{2} )^{3} + \frac{5}{128}(1 - x^{2} )^{4} + \frac{7}{256}(1 - x^{2} )^{5} ,\) \(x \in [ - \frac{1}{k},\frac{1}{k}]\). However, this 10-order polynomial is too complex and requires too many calculations. Later, in 2013, Yang et al. offered a new smoothing strategy with the approximate function \(\rho_{\varepsilon } (x) = \sqrt {x^{2} + \varepsilon } \approx \left| x \right|\) [14] based on a robust difference of convex functions. This method applies DC optimization algorithms to solve the smoothed S3VMs without adding new variables or constraints to the corresponding S3VMs, a promising direction for S3VM research. Zhang et al. introduced the cubic spline function [15] \(s(x,k) = \frac{{k^{2} \left| x \right|^{3} }}{3} - kx^{2} - \frac{1}{3k} + 1,\) \((\left| x \right| \le \frac{1}{k}),\) and the quintic spline function [16] \(s(x,k) = - \frac{{k^{4} \left| x \right|^{5} }}{5} + \frac{1}{2}k^{3} x^{4} - kx^{2} - \frac{3}{10k} + 1,\) \((\left| x \right| \le \frac{1}{k})\) in 2015. However, none of these smoothing techniques is fully satisfactory.
Motivated by the works of [3, 13,14,15,16], a new research question gradually arises: is there another smoothing technique that improves accuracy while decreasing the calculation scale? In this paper, a new class of Bézier smooth functions is applied. By employing the smooth Bézier function \(B_{n} (x)\) to approximate the non-smooth term \(\max \{ 0,1 - \left| t \right|\}\), a novel class of Bézier smooth semi-supervised support vector machines (BS4VMs) is derived. The new program possesses the following attractive advantages. Firstly, because the objective function becomes smooth and differentiable, fast gradient algorithms can be used to solve the BS4VMs, saving much calculation time. Secondly, a whole class of smooth functions is proposed, so the optimal smooth function can be selected for datasets of different scales. Lastly and most importantly, convergence analysis and experimental comparisons verify that BS4VMs are superior to the existing models in classification capability and efficiency.
To make the exposition clearer, the definition of each variable involved in the equations is listed in Table 1. For example, all vectors are column vectors, and \(\nabla f(t)\) represents the gradient of the function \(f\).
The rest of this paper is organized as follows. Preliminary background on the S3VM is introduced in Sect. 2. Section 3 shows how the BS4VMs are derived. The nonlinear kernel extension is presented in Sect. 4, and a fast quasi-Newton algorithm for solving the program follows in Sect. 5. The convergence analysis of the model is given in Sect. 6. Comparisons of the proposed algorithm with other advanced methods on four kinds of datasets are analyzed in Sect. 7. The discussion and conclusion follow in the last section.
2 Preliminary of semi-supervised support vector machine
The purpose of the S3VM for binary classification is to maximize the margin using both labeled and unlabeled data. Consider a program whose training data contain the \(l\) labeled points \(\{ (x^{i} ,y_{i} )\}_{i = 1}^{l} ,y_{i} = \pm 1\) and the \(u\) unlabeled points \(\{ x^{i} \}_{i = l + 1}^{l + u}\), where \(x^{i} { = (}x_{1}^{i} ,x_{2}^{i} ,...,x_{m}^{i} {)} \in {\mathbb{R}}^{m} .\) For linearly separable data, an optimal separating hyperplane with the largest distance should be explored for the S3VM classifier.
Let \(y \triangleq (y^{l} ,y^{l + u} )\) be a column vector, where \(y^{l} = (y_{1} ,y_{2} ,...,y_{l} )^{\rm T}\) is the known label vector and \(y^{l + u} = (y_{l + 1} ,y_{l + 2} ,...,y_{l + u} )^{\rm T}\) is the unknown label vector. The unknown label vector \((y_{l + 1} ,...,y_{l + u} )^{\rm T}\) achieving the largest margin is the goal pursued. For the linear case, the S3VM can be described as
where \(C\) and \(C^{*}\), the penalty parameters for the labeled and the unlabeled data respectively, are greater than zero. Program (1) can be changed into the unconstrained form of
in which \(L(t)\) is the hinge loss function \(L(t) = \max (0,1 - t)\) [3].
3 Bézier smooth semi-supervised support vector for classification
3.1 Background knowledge about the Bézier function
Bézier curves were invented in 1968 by the French engineer Pierre Bézier, initially for designing automobile bodies [18]. For a series of interpolation points \(P_{0} ,P_{1} , \cdots P_{n - 1} ,P_{n}\) to be fitted, the intermediate points \(P_{1} , \cdots P_{n - 1}\) are used to specify the endpoint tangent vectors. Hence the Bézier curve passes through \(P_{0}\) and \(P_{n}\) and approximates the other control points, as in Fig. 1. To accomplish this, weighting functions are required that represent the influence of the control points at a given point of the curve. Any function satisfying the requirements is allowed, but in most cases the Bernstein polynomials are employed. A Bézier curve of degree \(n\) can be expressed as \(B(t) = \sum\nolimits_{i = 0}^{n} {C_{i}^{n} } (t)P_{i} ,\) where \(P_{i}\) is a control point (or anchor point) and \(C_{i}^{n} (t)\) is the Bernstein polynomial \(C_{i}^{n} (t) = \binom{n}{i}(1 - t)^{n - i} t^{i}\), with \(i \in \{ 0,1,...,n\}\).
Many advantages for Bézier Curves have been noticed:
(1) They always pass through the anchor points \(P_{0}\) and \(P_{n}\).

(2) They are always tangent to the lines of the paths \(P_{0} \to P_{1}\) and \(P_{n - 1} \to P_{n}\).

(3) They always lie within the convex hull formed by the control points [19].

Owing to these good properties, Bézier curves have been widely applied in computer graphics, such as technical illustration programs, CAD programs, trajectory guidance, and so forth [20,21,22,23].
For approximating the hinge loss function, the quadratic parametric Bézier function can be expressed as \(\left\{ {\begin{array}{*{20}c} {B_{2x} (t) = (2t - 1)/k} \\ {B_{2y} (t) = ( - 2t^{2} + 2t)/k} \\ \end{array} } \right.\) with \(p_{0} = ( - \frac{1}{k},0),p_{1} = (0,\frac{1}{k}),p_{2} = (\frac{1}{k},0)\). Eliminating the parameter \(t\) gives \(y = B_{2} (x) = - \frac{1}{2k}(k^{2} x^{2} - 1)\). Similarly, the cubic parametric Bézier function \(\left\{ \begin{gathered} B_{3x} (t) = (2t^{3} - 3t^{2} + 3t - 1)/k \hfill \\ B_{3y} (t) = ( - 3t^{2} + 3t)/k \hfill \\ \end{gathered} \right.\) is acquired by interpolating the four points \(p_{0} ,p_{1} ,p_{2} ,p_{3}\), in which \(p_{0} = ( - \frac{1}{k},0),p_{1} = p_{2} = (0,\frac{1}{k}),\) \(p_{3} = (\frac{1}{k},0).\) From the general formula \(B(t) = \sum\nolimits_{i = 0}^{n} {C_{i}^{n} } (t)P_{i} ,\) the n-order Bézier function \(y = B_{n} (x)\) is acquired.
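As a quick numerical check (a Python sketch, not part of the original paper), the general Bernstein form can be evaluated and compared against the closed form obtained by eliminating the parameter; the value k = 10 is an arbitrary choice:

```python
from math import comb

def bezier(ctrl, t):
    """Evaluate a degree-n Bézier curve B(t) = sum_i C(n,i) (1-t)^(n-i) t^i P_i."""
    n = len(ctrl) - 1
    w = [comb(n, i) * (1 - t) ** (n - i) * t ** i for i in range(n + 1)]
    x = sum(wi * p[0] for wi, p in zip(w, ctrl))
    y = sum(wi * p[1] for wi, p in zip(w, ctrl))
    return x, y

k = 10.0
quad = [(-1 / k, 0.0), (0.0, 1 / k), (1 / k, 0.0)]  # p0, p1, p2 from the text

# Eliminating t from the parametric form should give y = -(k^2 x^2 - 1)/(2k)
max_err = max(
    abs(bezier(quad, t)[1] - (-(k ** 2 * bezier(quad, t)[0] ** 2 - 1) / (2 * k)))
    for t in (i / 100 for i in range(101))
)
```

Running this confirms that the parametric and implicit forms agree to machine precision on the whole parameter range.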
Theorem 1
The n-order Bézier curve \(B_{n} (t)\) is \(n - 1\)-order smooth at the points \(x = \pm \frac{1}{k}\).
Proof
The proof is based on mathematical induction.
(i) \(\forall x \in \Omega ,B_{2} (x) = - \frac{1}{2k}(k^{2} x^{2} - 1)\) satisfies the following equalities at the points \(x = \pm \frac{1}{k}\):
So, \(B_{2} (x,k)\) is one-order smooth.
(ii) \(B_{3} (x)\) satisfies the following equalities at the points \(x = \pm \frac{1}{k}\),
Hence, \(B_{3} (x)\) is two-order smooth.
(iii) Let \(B_{{P_{0} P_{1} ...P_{n - 1} }}\) denote the Bézier curve determined by points \(P_{0} ,P_{1} ,...,P_{n - 1}\). Based on
according to mathematical induction, it can be proved that \(B_{n} (x)\) is \(n - 1\)-order smooth.
3.2 Bézier smooth semi-supervised support vector for classification
From (2), the last term \(C^{*} \sum\nolimits_{i = l + 1}^{l + u} {L(\left| {w^{\rm T} x_{i} + b} \right|)}\) is non-smooth and difficult to handle [4], making formula (2) a hard-to-solve mixed-integer quadratic program. Replacing this term with the smooth function \(y = B_{n} (x)\), a new class of Bézier smooth semi-supervised support vector machines (BS4VMs) is derived, described in formula (6)
In this paper, without loss of generality, the 4-order Bézier interpolation function \(y = B_{4} (x)\) is taken into consideration. The higher the order of the Bézier function, the better the approximation. The approximation comparison of the different smooth models can be seen in Fig. 2.
From Fig. 2, one can find that (1) the 4-order Bézier function performs best among the 3-order Bézier function, the exponential function, the 10-order polynomial, the cubic spline function, and the quintic spline function in approximating the hinge loss function; and (2) the 3-order Bézier function performs almost the same as the 10-order polynomial, while its calculation complexity is much lower.
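The gap at the kink \(x = 0\) makes this ranking concrete. The following Python sketch is my own illustration, assuming each surrogate replaces the piecewise-linear target \(1 - \left| x \right|\) on \([ - \frac{1}{k},\frac{1}{k}]\) and taking the 4-order Bézier gap \(\frac{1}{8k}\) from Theorem 2; none of this code appears in the paper:

```python
# Maximum approximation gap at the kink x = 0, where each smooth surrogate
# deviates most from the piecewise-linear target on [-1/k, 1/k].
k = 4.0

def cubic_spline(x):    # s(x,k) from [15], valid for |x| <= 1/k
    return k**2 * abs(x)**3 / 3 - k * x**2 - 1 / (3 * k) + 1

def quintic_spline(x):  # s(x,k) from [16], valid for |x| <= 1/k
    return -k**4 * abs(x)**5 / 5 + k**3 * x**4 / 2 - k * x**2 - 3 / (10 * k) + 1

target = lambda x: 1 - abs(x)   # the hinge loss near its kink

grid = [i / 1000 * (2 / k) - 1 / k for i in range(1001)]
gap_cubic = max(target(x) - cubic_spline(x) for x in grid)       # 1/(3k)
gap_quintic = max(target(x) - quintic_spline(x) for x in grid)   # 3/(10k)
gap_bezier4 = 1 / (8 * k)       # Theorem 2 bound for the 4-order Bézier surrogate

# 1/(8k) < 3/(10k) < 1/(3k): the Bézier surrogate is tightest at the kink
assert gap_bezier4 < gap_quintic < gap_cubic
```

The ordering of the three gaps is consistent with the ranking observed in Fig. 2.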
4 The nonlinear kernel for BS4VM
For the nonlinear case, the kernel function \(k(x^{i} ,x^{j} ) = \phi (x^{i} )^{\rm T} \phi (x^{j} )\) can be applied to map the original data into a high-dimensional Hilbert space. After this transformation, a linear program is obtained. Let \(\phi :R^{m} \to R^{d} \left( {d > m} \right)\) be the mapping function of formula (1). The nonlinear kernel-based S3VM can be shown as
In this paper, the Gaussian kernel \(k(x^{i} ,x^{j} ) = \exp ( - \left\| {x^{i} - x^{j} } \right\|_{2}^{2} /2\sigma^{2} )\) is adopted; the kernel matrix \(K = [k(x^{i} ,x^{j} )]\) is positive semi-definite [17]. In formula (2), the variable \(w\) is replaced by \(w = \sum\nolimits_{i = 1}^{m} {u_{i} } y_{i} x_{i}\), in which \(u \in R^{m}\), and the nonlinear S3VM is achieved.
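As a small illustration (not the authors' code), the Gaussian Gram matrix can be built and its positive semi-definiteness spot-checked by evaluating the quadratic form \(v^{\rm T} K v\) on random vectors; the data sizes here are arbitrary:

```python
import math, random

def gaussian_kernel(X, sigma):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-sq / (2 * sigma ** 2))
    return K

random.seed(0)
X = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(8)]
K = gaussian_kernel(X, sigma=0.5)

# Positive semi-definiteness: v^T K v >= 0 for any v (spot-checked on random v)
min_quad = min(
    sum(v[i] * K[i][j] * v[j] for i in range(8) for j in range(8))
    for v in ([random.gauss(0, 1) for _ in range(8)] for _ in range(200))
)
```

The diagonal is identically 1 and the matrix is symmetric, as the kernel formula requires.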
Applying the n-order Bézier smooth function, the nonlinear BS4VM model with a kernel function is obtained.
The objective function (9) is \(n - 1\)-order differentiable for an arbitrary kernel.
5 A fast quasi-Newton method for solving BS4VM
In this section, the sub-LBFGS algorithm is employed to solve the semi-supervised problem (1) [2, 24, 25]. Differentiating (2) by means of subgradients, the following is obtained:
where \(\beta_{i} := \begin{cases} 1 & \text{if } i \in E, \; E := \{ i:1 - y_{i} w^{\rm T} x_{i} > 0\} , \\ \psi , \; \psi \in (0,1) & \text{if } i \in M, \; M := \{ i:1 - y_{i} w^{\rm T} x_{i} = 0\} , \\ 0 & \text{if } i \in W, \; W := \{ i:1 - y_{i} w^{\rm T} x_{i} < 0\} , \end{cases}\) and \(E\), \(M\), and \(W\) denote the sets of points that are in error, on the margin, and well-classified, respectively. For a given direction \(p\), a subgradient \(g\) must be found. Based on formula (10), Eq. (11) is given:
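A minimal sketch of this index-set computation (hypothetical Python, with a toy dataset of my own choosing) is:

```python
def hinge_subgradient_sets(w, b, X, y, tol=1e-9):
    """Partition labeled points by the hinge margin m_i = 1 - y_i (w.x_i + b):
    E (in error, m > 0), M (on the margin, m == 0), W (well classified, m < 0)."""
    E, M, W = [], [], []
    for i, (x, yi) in enumerate(zip(X, y)):
        m = 1 - yi * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        (E if m > tol else W if m < -tol else M).append(i)
    return E, M, W

# beta_i = 1 on E, an arbitrary psi in (0,1) on M, 0 on W
def beta(i, E, M, psi=0.5):
    return 1.0 if i in E else (psi if i in M else 0.0)

X = [(2.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # toy points
y = [+1, +1, -1]
w, b = (1.0, 0.0), 0.0
E, M, W = hinge_subgradient_sets(w, b, X, y)
```

For this toy classifier the first point is well classified, the second lies exactly on the margin, and the third is in error.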
Now the S3VM algorithm with the sub-LBFGS optimization procedure can be offered (Algorithm 1).

In step 3 of Algorithm 1, a classifier is obtained by first running BS4VM on the labeled examples alone. Steps 5–17 show the loop iterations for solving the objective program. Step 9 identifies pairs of unlabeled examples with temporary positive and negative labels such that switching these labels would decrease the value of the objective function.
6 Convergence analysis of the Bézier function and BS4VM
This section shows the approximation precision of the Bézier function to the hinge loss function and the convergence of the BS4VM. In addition, the convergence condition also holds for the nonlinear BS4VM.
6.1 Approximation accuracy analysis of Bézier function
Theorem 2
Let \(x \in R\), \(k > 0\), let \(L(x)\) stand for the hinge loss function, and let \(B_{4} (x,k)\) be the Bézier function with five interpolation points. The following results hold:
Proof
(i) It is obvious that \(L(\left| x \right|,k) - B_{4} (x,k) = 0\) holds for \(\left| x \right| > \frac{1}{k}.\) For \(x \in [ - \frac{1}{k},0),\) \(L(\left| x \right|,k)\) and \(B_{4} (x,k)\) are monotonically increasing, and \(L(\left| x \right|,k) - B_{4} (x,k) \ge L(\left| {\frac{ - 1}{k}} \right|,k) - B_{4} (\frac{ - 1}{k},k) = 0\) is easy to obtain. For \(x \in \left[ {0,\frac{1}{k}} \right],\) \(L(\left| x \right|,k)\) and \(B_{4} (x,k)\) are monotonically decreasing, and \(L(\left| x \right|,k) - B_{4} (x,k) \ge L(\left| \frac{1}{k} \right|,k) - B_{4} (\frac{1}{k},k) = 0\). So \(0 \le B_{4} (x,k) \le L(\left| x \right|,k)\) is achieved.
(ii) \(L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) = 0\) holds for \(\left| x \right| > \frac{1}{k}.\) For \(x \in [ - \frac{1}{k},0),\) from (i), \(L(\left| x \right|,k)\) and \(B_{4} (x,k)\) are monotonically increasing; therefore, \(0 \le L(\left| x \right|,k) - B_{4} (x,k) \le L(0,k) - B_{4} (0,k) = \frac{1}{8k}\) is established. Since \(L(\left| x \right|,k) + B_{4} (x,k) \le L(0,k) + B_{4} (0,k) = \frac{15}{8k},\) the bound \(L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) \le \frac{1}{8k} \cdot \frac{15}{8k} = \frac{15}{{64k^{2} }}\) is derived. In short, \(0 \le L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) \le \frac{15}{{64k^{2} }}\) is proved.
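These bounds can be verified numerically. The Python sketch below makes two assumptions not stated explicitly in the paper: the five control points are taken as \(p_{0} = ( - \frac{1}{k},0), p_{1} = p_{2} = p_{3} = (0,\frac{1}{k}), p_{4} = (\frac{1}{k},0)\) (by analogy with the quadratic and cubic cases), and \(L(\left| x \right|,k)\) is read as the scaled hinge \(\max \{ 0,\frac{1}{k} - \left| x \right|\}\), consistent with \(L(0,k) - B_{4} (0,k) = \frac{1}{8k}\):

```python
from math import comb

def b4(t, k):
    """Quartic Bézier with assumed control points (-1/k,0), (0,1/k) x 3, (1/k,0)."""
    pts = [(-1 / k, 0.0), (0.0, 1 / k), (0.0, 1 / k), (0.0, 1 / k), (1 / k, 0.0)]
    w = [comb(4, i) * (1 - t) ** (4 - i) * t ** i for i in range(5)]
    return (sum(wi * p[0] for wi, p in zip(w, pts)),
            sum(wi * p[1] for wi, p in zip(w, pts)))

k = 5.0
hat = lambda x: max(0.0, 1 / k - abs(x))   # reading of L(|x|, k) near the kink

gaps, sq_gaps = [], []
for i in range(1001):
    x, y = b4(i / 1000, k)
    gaps.append(hat(x) - y)                # Theorem 2(i): stays in [0, 1/(8k)]
    sq_gaps.append(hat(x) ** 2 - y ** 2)   # Theorem 2(ii): stays in [0, 15/(64 k^2)]
```

On a dense parameter grid, both gaps remain nonnegative and reach their maxima at the midpoint \(t = \frac{1}{2}\) (i.e., at \(x = 0\)), matching the stated bounds.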
6.2 Convergence analysis of the BS4VM
Theorem 3
Let \(A \in R^{m \times n} ,b \in R^{m \times 1} ,\) and define two real functions \(g(x)\) and \(f(x,k)\) as follows:
The following results can be achieved:
(1) \(\forall k > 0\), there will be \(\left\| {x_{k}^{*} - x^{*} } \right\| \le \frac{15}{{128k^{2} }}\);

(2) \(\mathop {\lim }\limits_{k \to \infty } \left\| {x_{k}^{*} - x^{*} } \right\| = 0\).
Proof
(i) Applying the first-order optimality condition and the convexity of \(g(x)\) and \(f(x,k)\), formula (14) is attained:
Based on formula (13) and the property \(B_{4} (x,k) \le h(x)\), formula (15) is acquired:
According to Theorem 2, for \(x \in [ - \frac{1}{k},\frac{1}{k}]\), \(L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k) \le L^{2} (0,k) - B_{4}^{2} (0,k) = \frac{15}{{64k^{2} }}\). So \(\left\| {x_{k}^{*} - x^{*} } \right\| = \frac{1}{2}[L^{2} (\left| x \right|,k) - B_{4}^{2} (x,k)] \le \frac{15}{{128k^{2} }}\) holds.
(ii) As \(\left\| {x_{k}^{*} - x^{*} } \right\| \le \frac{15}{{128k^{2} }}\), the conclusion \(\mathop {\lim }\limits_{k \to \infty } \left\| {x_{k}^{*} - x^{*} } \right\| = 0\) follows immediately. Theorem 3 is proved.
7 The experiments and comparisons
This section evaluates the performance, effectiveness, and complexity of the proposed BS4VMs along two dimensions. The longitudinal dimension is the comparison of BS4VMs with three other smooth models: LDS4VM (S3VM with low density separation) [3], CS4VM (S3VM with cubic spline function) [15], and QS4VM (S3VM with quintic spline function) [16]. The horizontal dimension is the comparison among BS4VMs of different orders; three kinds are listed: BS4VM-I (S3VM with 2-order Bézier function), BS4VM-II (S3VM with 3-order Bézier function), and BS4VM-III (S3VM with 4-order Bézier function). Experiments are carried out on four kinds of datasets: artificial datasets, UCI datasets,Footnote 1 the USPS dataset, and the large-scale NDC dataset. These four kinds of datasets differ significantly. Subsection 7.1 reports the experiment on the small artificial dataset named "checkboard," produced by uniformly distributing points over regions in two dimensions; the "checkboard" data are nonlinearly separable. In subsection 7.2, the UCI datasets are real-world datasets generated by statistical departments, electronic sensors, and reports; some are multi-class and irregular, so preprocessing is required, and they vary in size. In subsection 7.3, the handwritten-symbol data consist of 16*16 grayscale pixel images of handwritten digits from '0' to '9'. These data come from the United States Postal Service (USPS) and belong to real-world digit pattern recognition. The last kind of dataset is NDC, namely normally distributed clusters, generated by the NDC algorithm. The algorithm generates a series of random centers for multivariate normal distributions, randomly chooses a separating plane that assigns classes to the centers, and then randomly draws the points from the distributions.

The size can be changed by the experimenter, so the NDC generator is a good choice for testing on large-scale datasets.
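A rough sketch of such a generator (my own simplified reading of the NDC procedure; all sizes, ranges, and the unit cluster variance are chosen arbitrarily) is:

```python
import random

def make_ndc(n_centers=4, n_points=200, dim=3, seed=0):
    """Sketch of NDC-style data: random cluster centers are labeled +1/-1 by the
    side of a random separating plane they fall on, then points are drawn from
    a normal distribution around each center."""
    rng = random.Random(seed)
    plane = [rng.uniform(-1, 1) for _ in range(dim)]   # random separating plane through the origin
    centers = [[rng.uniform(-10, 10) for _ in range(dim)] for _ in range(n_centers)]
    data = []
    for _ in range(n_points):
        c = rng.choice(centers)
        x = [rng.gauss(mu, 1.0) for mu in c]           # point around the chosen center
        label = 1 if sum(p * mu for p, mu in zip(plane, c)) >= 0 else -1
        data.append((x, label))
    return data

data = make_ndc()
```

Scaling `n_points` or `dim` up reproduces the "large samples" and "large-scale attributes" variants used in this section.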
Because of the excessive complexity of the 10-order polynomial function in [13], its calculation time exceeds the acceptable range; thus, this section omits the algorithm of [13] from the comparison. As the parameters \(C\) and \(C^{*}\) are not sensitive with respect to classification accuracy, \(C = C^{*}\) is set, varying from \(10^{-2}\) to \(10^{2}\). All classifiers are implemented on a PC running 64-bit Windows 10 with an Intel i7 processor (1.6 GHz) and 16 GB RAM. The models are coded in MATLAB R2009a.
Experiments are set up according to the following rules: the ratio of labeled points \(m\) varies from 5 to 65%, and the rest are unlabeled points, similar to the unlabeled-data ratio evolving from 20 to 80% in [26]. The labeled ratio is set according to missing-label scenarios in the real world. A 5% labeled ratio means the majority of data labels are missing, a demanding condition for detecting a good classifier. On the other hand, if the labeled ratio exceeds 70%, so many labels are available that the gap between semi-supervised and fully supervised SVMs becomes quite small. Therefore, the labeled ratio is set from 5 to 65% in intervals of 20%. The labeled data are used to train LDS4VM, CS4VM, QS4VM, and BS4VM, which then predict the unlabeled points. Before simulation, all databases are normalized and the two classes are labeled − 1 and + 1. Each experiment is carried out with tenfold cross-validation.
7.1 Experiment based on artificial dataset
The first experiment demonstrates the effectiveness of BS4VM on the artificial nonlinear "tried and true" checkboard dataset [27]. The checkboard dataset is generated by uniformly distributing points over regions in two dimensions and labeling the two classes "White" and "Black." Each dimension has 100 points, so the checkboard dataset has 10,000 samples for training and testing the algorithms, as Fig. 3 shows. The comparison results can be seen in Table 2.
Table 2 demonstrates that (1) as the labeled ratio increases, the classification accuracy climbs on the whole; (2) the higher the order of the smoothing polynomial, the better the classification accuracy; and (3) the checkboard dataset is not suitable for very few labeled samples, as the result is unsatisfactory at a labeled ratio of 5%. Lastly, the comparison in Table 2 shows that BS4VM attains the best classification accuracy.
7.2 Results on UCI datasets
In this subsection, eight real-world UCI datasetsFootnote 2 are chosen to test the four classification algorithms. This collection of databases was created in 1987 and has been widely used by the machine learning community for empirical analysis. It provides datasets from many areas of real life, such as disease diagnosis, manufacturing, and business. The calculation results are given in Table 3.
Table 3 gives detailed comparisons of the proposed model with the other three models on eight datasets. From Table 3, one can find that as the labeled ratio increases, all the algorithms show better classification accuracy. For the Clean dataset with the labeled ratio varying from 25 to 65%, the results of BS4VM (accuracy 68.35%, 71.28%, 75.45%) outperform the other three algorithms: LDS4VM (66.95%, 65.94%, 72.90%), CS4VM (66.11%, 66.13%, 70.66%), and QS4VM (66.53%, 67.94%, 71.41%). This conclusion also holds for the Lympho, Bupa, Tumor, WDBC, and Adult datasets in most scenarios. For the Balance and German datasets, the accuracy advantages fluctuate, and BS4VM performs a little better than the other three methods.
To describe the dynamic behavior of the test accuracy for each dataset under various labeled ratios, Fig. 4 presents the overall trend of these algorithms. All the lines tend to climb as the labeled ratio increases. Taking Data (4) for example, the red line stands for the proposed BS4VM method, the blue and black lines denote QS4VM and CS4VM, and the purple line denotes LDS4VM. At a labeled ratio of 5%, the accuracy of CS4VM is better than that of BS4VM, but as the ratio rises, the red line stays above the other three lines, indicating that BS4VM performs best at high labeled ratios.
To analyze the statistical accuracy more clearly, the average ranks of all the classifiers are computed and listed in Table 4 and Fig. 5. Table 4 indicates the average ranks over the eight datasets, calculated from the average accuracy of each algorithm across the different labeled ratios. The smaller the rank, the higher the simulation accuracy. The last row of Table 4 shows that BS4VM ranks first across the eight datasets, whereas the others take the second, third, and fourth places.
To verify the advantage of the proposed BS4VM algorithm, the Friedman statistical test is employed. The Friedman statistic is distributed according to \(\chi_{F}^{2}\) with \(k - 1\) degrees of freedom, where \(k\) is the number of algorithms and \(N\) the number of datasets.

For the above experiment on UCI datasets, under the null hypothesis that all the algorithms are equivalent, the Friedman statistic can be calculated as [28]
For four algorithms and eight datasets, \(F_{F}\) is distributed with \((k - 1) = 3\) and \((k - 1)(N - 1) = 21\) degrees of freedom. The critical value of \(F(3,21)\) at significance level \(\alpha = 0.05\) is 3.072. Obviously, \(F_{F} = 5.6264 > F(3,21) = 3.072\); thus the null hypothesis is rejected, and the four algorithms are verified to differ significantly.
After the null hypothesis is rejected, the Nemenyi test can proceed, comparing all classifiers to each other [28]. The performance of two classifiers differs significantly if their average ranks differ by at least the critical difference \(CD = q_{\alpha } \sqrt {\frac{k(k + 1)}{{6N}}}\). For the UCI experiment, \(CD = 2.291\sqrt {\frac{4 \times 5}{{6 \times 8}}} = 1.4788\) at \(\alpha = 0.1.\) As the average rank difference between LDS4VM and BS4VM (3.2188 − 1.25 = 1.9688) is larger than the critical difference 1.4788, the performance of BS4VM is significantly better than that of LDS4VM. Similarly, BS4VM is significantly superior to QS4VM (2.8281 − 1.25 = 1.5781 > 1.4788). Since 2.7031 − 1.25 = 1.4531 < 1.4788, the Nemenyi test cannot detect a significant difference between CS4VM and BS4VM.
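The statistics quoted above can be reproduced from the average ranks reported here (1.25, 2.7031, 2.8281, 3.2188 for BS4VM, CS4VM, QS4VM, and LDS4VM). A short Python sketch, using the Iman–Davenport F-form of the Friedman statistic:

```python
from math import sqrt

def friedman_F(avg_ranks, N):
    """Friedman chi-square and its F-distributed (Iman-Davenport) variant."""
    k = len(avg_ranks)
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    FF = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, FF

def nemenyi_cd(q_alpha, k, N):
    """Nemenyi critical difference CD = q_alpha * sqrt(k(k+1)/(6N))."""
    return q_alpha * sqrt(k * (k + 1) / (6 * N))

# Average ranks over the 8 UCI datasets: BS4VM, CS4VM, QS4VM, LDS4VM
ranks = [1.25, 2.7031, 2.8281, 3.2188]
chi2, FF = friedman_F(ranks, N=8)
cd = nemenyi_cd(2.291, k=4, N=8)   # q_0.1 for 4 classifiers
```

Evaluating these expressions recovers \(F_{F} \approx 5.6264\) and \(CD \approx 1.4788\), in agreement with the values reported above.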
Figure 5 visually presents the accuracy ranks of the experimental results at different labeled ratios. The advantage of BS4VM varies, but from a statistical point of view BS4VM performs best, just as Table 4 shows. The proposed algorithm shows satisfactory performance in Fig. 5b–d for most cases. This reminds us that, when comparing machine learning algorithms, statistical results over many datasets are more precise and credible than any single specific calculation.
7.3 Results on handwritten symbol recognition
In this section, the USPS handwritten datasetsFootnote 3 are investigated to show the impact of the number of labeled data on classification accuracy. The handwritten database consists of grayscale images of handwritten digits from '0' to '9', as shown in Fig. 6.
Comparisons of four pairwise digit tasks, '0' versus '8', '2' versus '4', '1' versus '7', and '3' versus '6', are given. The accuracy results and dynamic behavior can be seen in Table 5 and Fig. 7. From Table 5 and Fig. 7, the classification accuracies of the pairs '0' versus '8' and '1' versus '7' reach more than 80%, even almost 99%, while those of '2' versus '4' and '3' versus '6' are below 80%, sometimes below 52%. Thus, the generalization ability of the S3VM varies, and a suitable dataset should be considered before carrying out the identification process.
Table 6 and Fig. 8 present the accuracy ranks for each dataset at various labeled percentages. Table 6 shows that BS4VM ranks first, while the other three algorithms perform similarly. Figure 8 shows the accuracy rank of each calculation. Taking Fig. 8d for example, when more than 50% of the data are labeled, the proposed learning algorithm is well trained and shows satisfactory precision.
The Friedman test can also be applied to the USPS dataset to compare these algorithms quantitatively. For the four algorithms and four datasets,

\(F_{F}\) is distributed with \((k - 1) = 3\) and \((k - 1)(N - 1) = 9\) degrees of freedom. The critical value of \(F(3,9)\) at significance level \(\alpha = 0.05\) is 3.863. Obviously, \(F_{F} = 16.1521 > F(3,9) = 3.863\); thus the null hypothesis is rejected, and the four algorithms are shown to differ significantly. This means the generalization and robustness of BS4VM are promising.
7.4 Results on large-scale NDC dataset for nonlinear Gaussian kernel
In this last subsection, to further verify which algorithm performs best in both accuracy and calculation time among the BS4VMs, experiments with the nonlinear Gaussian kernel are carried out on the NDC dataset.Footnote 4 The NDC dataset can be designed with large-scale attributes or with large sample counts to test the robustness of the new algorithms. As large-scale datasets are commonly encountered in real-world classification, test accuracy and calculation time should both be considered.
Table 7 and Fig. 9 show the performance of the three kinds of BS4VMs, namely BS4VM-I, BS4VM-II, and BS4VM-III, with different orders of the Bézier function. One can notice that (1) the BS4VMs classify the NDC datasets very well, with most results above 96%; (2) as the labeled ratio and the number of attributes climb over NDC1–NDC5, the computing time increases quickly, whereas with the rise in sample counts over NDC6–NDC10 the calculation time does not go up dramatically; and (3) because the three algorithms belong to the same kind of smoothing technique, the accuracy differences are quite small, but BS4VM-III ranks first in accuracy for most cases, while BS4VM-I and BS4VM-II take the top two places in computing time on account of the higher complexity of BS4VM-III.
To clarify the comparison, Table 8 lists the average ranks of BS4VM-I, BS4VM-II, and BS4VM-III with the Gaussian kernel on accuracy and calculation time for NDC. From the statistics, the average accuracy rank of BS4VM-III is 1.8875, smaller than the other two, indicating that this method is superior. The computing-time ranks of BS4VM-I and BS4VM-II are equal, revealing that their computational complexity is the same even though BS4VM-II uses a higher-order Bézier function.
To verify whether the performances of the three algorithms differ significantly, the Friedman test is again utilized. For this experiment with three methods and ten datasets, the statistics \(\chi_{F}^{2}\) and \(F_{F}\) are
The critical value of \(F(2,18)\) at significance level \(\alpha = 0.05\) is 3.555. Visibly, \(F_{F} = 0.4528 < F(2,18) = 3.555\), so it is verified quantitatively that the three algorithms have no significant differences. This suggests that if high accuracy matters most, a higher-order BS4VM should have priority, whereas if calculation time weighs heavily, a lower-order BS4VM should be chosen.
To visualize the diversity of classification accuracy and calculation time for each dataset as the labeled ratio varies, the histograms of Figs. 10 and 11 are given.
From Fig. 10c and d, the classification precision of BS4VM-III lies at the forefront when the labeled proportion is above 45%. However, this superior performance comes at the cost of more complex calculation, as Fig. 11c and d shows. From the calculation-time ranks, BS4VM-I performs best in Fig. 11a, b and d, since a lower-order Bézier function means less computational complexity.
8 Conclusion
Considering that the non-smooth term of semi-supervised support vector machines blocks further improvement in classification accuracy, a new class of Bézier functions is utilized to approximate the hinge loss function, and a novel kind of Bézier smooth semi-supervised support vector machine (BS4VM) is constructed. The convergence analysis proves that the proposed model approaches the non-smooth objective function theoretically. As an n-order Bézier function is \(n - 1\)-order smooth and differentiable, a fast quasi-Newton algorithm can be used to solve the resulting program. Compared with LDS4VM, CS4VM, and QS4VM, experiments on artificial data, UCI data, the USPS handwritten database, and NDC datasets clearly show that the BS4VMs achieve the best performance and efficiency among the exponential, cubic spline, and quintic spline smoothing functions. Moreover, the proposed algorithms perform well on large-scale datasets. Since the advantages of BS4VMs vary with their order, one should weigh performance against efficiency when applying them. For further research, feature selection and fuzzy membership should be good ways to improve the accuracy on different kinds of datasets; Bézier functions for semi-supervised SVM regression and their generalization performance will be explored as well.
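As an illustration of the smoothing idea summarized above, the sketch below evaluates a cubic Bézier curve by De Casteljau's algorithm, with control points chosen so that the curve's value and first derivative match the plus function \(x_{+} = \max(x, 0)\) at both ends of a window \([-h, h]\). These control points are an illustrative choice, not the paper's exact construction of the BS4VM smoothing term.

```python
import numpy as np

def de_casteljau(ctrl, t):
    """Evaluate a Bézier curve with control points `ctrl` (m x 2) at parameter t."""
    pts = np.asarray(ctrl, dtype=float)
    while len(pts) > 1:
        pts = (1 - t) * pts[:-1] + t * pts[1:]   # repeated linear interpolation
    return pts[0]

def bezier_plus(x, h=0.5):
    """C^1 Bézier smoothing of max(x, 0) on [-h, h]: the cubic's endpoints
    and end slopes (0 on the left, 1 on the right) match the plus function."""
    if x <= -h:
        return 0.0
    if x >= h:
        return float(x)
    ctrl = [(-h, 0.0), (-h / 3, 0.0), (h / 3, h / 3), (h, h)]
    t = (x + h) / (2 * h)      # x control points evenly spaced, so x(t) is linear
    return de_casteljau(ctrl, t)[1]
```

With these control points the curve reduces to \((x + h)^{2}/(4h)\) on the window, so the smoothed function joins \(x_{+}\) continuously and differentiably at \(\pm h\); a higher-order Bézier curve would likewise allow matching higher derivatives.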
Notes
The UCI datasets are available at https://archive.ics.uci.edu/ml/datasets.php and https://cs.nyu.edu/~roweis/data.html.
The USPS datasets are available at http://www.cs.nyu.edu/~roweis/data.html.
References
Bennett KP, Demiriz A (1999) Semi-supervised support vector machines. In: Kearns Michael S, Solla Sara A, Cohn David A (eds) Advances in neural information processing systems. MIT Press, London, pp 368–374
Reddy IS, Shevade S, Murty MN et al (2011) A fast quasi-Newton method for semi-supervised SVM. Pattern Recogn 44(10):2305–2313
Chapelle O, Zien A (2005) Semi-supervised classification by low density separation. In: AISTATS 2005—Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp 57–64
Lanquillon C (2000) Learning from labeled and unlabeled documents: a comparative study on semi-supervised text classification. In: Zighed Djamel A, Komorowski Jan, Żytkow Jan (eds) Lecture notes in computer science. Springer, Berlin, pp 490–497
Liu CY, Jiang ZS, Su XX (2019) Detection of human fall using floor vibration and multi-features semi-supervised SVM. Sensors 19(17):3720
Kumar MP, Rajagopal MK (2019) Detecting facial emotions using normalized minimal feature vectors and semi-supervised twin support vector machines classifier. Appl Intell 49:4150–4174
Lang RL, Lu RB, Zhao CQ (2020) Graph-based semi-supervised one class support vector machine for detecting abnormal lung sounds. Appl Math Comput 364:124487
Ju Z, Gu H (2016) Predicting pupylation sites in prokaryotic proteins using semi-supervised self-training support vector machine algorithm. Anal Biochem 507:1–6
Xie XJ (2020) Multi-view semi-supervised least squares twin support vector machines with manifold-preserving graph reduction. Int J Mach Learn Cybern 11(11):2489–2499
Mygdalis V, Iosifidis A, Tefas A et al (2018) Semi-supervised subclass support vector data description for image and video classification. Neurocomputing 278:51–61
Liu CY, Gryllias K (2020) A semi-supervised support vector data description-based fault detection method for rolling element bearings based on cyclic spectral analysis. Mech Syst Signal Process 140:106682
Li Z, Tian Y, Li K et al (2017) Reject inference in credit scoring using semi-supervised support vector machines. Expert Syst Appl 74:105–114
Liu YQ, Liu SY, Gu MT (2009) Polynomial smooth classification algorithm of vector machines. Comput Sci (in Chinese) 36(7):179–181
Yang L, Wang L (2013) A class of smooth semi-supervised SVM by difference of convex functions programming and algorithm. Knowl-Based Syst 41:1–7
Zhang XD, Ma JG (2015) A general cubic spline smooth semi-supervised support vector machine. Chin J Eng 37:385–389
Zhang XD, Ma JG, Li AH et al (2015) Quintic spline smooth semi-supervised support vector classification machine. J Syst Eng Electron 26:626–632
Deng N, Tian Y, Zhang C (2012) Support vector machines: optimization based theory, algorithms, and extensions. Chapman and Hall/CRC, London
Bézier P (1968) How Renault uses numerical control for car body design and tooling. SAE paper 680010, Society of Automotive Engineers Congress
Choi JW, Elkaim GH (2008) Bézier curve for trajectory guidance. World Congr Eng Comput Sci WCECS 2173(1):22–24
Mandad M, Campen M (2020) Bézier guarding: precise higher-order meshing of curved 2D domains. ACM Trans Graph 39(4):103–118
Raja SP (2020) Bézier and B-spline curves - a study and its application in wavelet decomposition. Int J Wavelets Multiresolut Inf Process 18(4):2050030
Zhu YF, Xu G, Ling CN (2019) Construction of energy-minimizing Bézier surfaces interpolating given diagonal curves. J Image Graph 24(11):1998–2008
Wu Q, Wang E (2015) Bézier function smooth support vector regression. ICIC Express Letters, Part B: Applications 6:1773–1779
Nocedal J, Wright SJ (1999) Numerical optimization. Springer, New York
Yu J, Vishwanathan SVN, Günter S et al (2010) A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. J Mach Learn Res 11:1145–1200
Chen W, Shao Y, Hong N (2014) Laplacian smooth twin support vector machine for semi-supervised classification. Int J Mach Learn Cybern 5:459–468
Ho TK, Kleinberg EM (1996) Checkerboard dataset. http://www.cs.wisc.edu/~musicant/data/ndc/. Accessed 20 July 2020
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Acknowledgements
This work was supported by the Social Science Foundation of China under Grant (18ZDA027).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Wang, E., Wang, ZY. & Wu, Q. One novel class of Bézier smooth semi-supervised support vector machines for classification. Neural Comput & Applic 33, 9975–9991 (2021). https://doi.org/10.1007/s00521-021-05765-6