1 Introduction

Support Vector Machine (SVM) is a machine learning algorithm based on statistical learning theory proposed by Vapnik [3, 4, 10]. SVM seeks an optimal separating hyperplane from limited positive and negative samples and pursues the best compromise between model complexity and generalization ability. With its support vectors obtained through convex quadratic programming, it has shown its effectiveness and efficiency in classification and regression tasks [2, 20, 31].

In order to improve the classification performance of SVM, various improved methods have been proposed. One such method applies mutual information (MI) to measure the relevance between two random variables [18, 24, 25] and to estimate the MI between each feature and the given class labels [12, 22]. The feature weights estimated by the MI method improve the generalization ability of traditional SVMs, but perform poorly in high dimensions. Therefore, a radius-margin-based SVM model for joint learning of a feature transformation and the SVM classifier was proposed [7, 21, 26,27,28,29]. However, most of these methods suffer from computational expense and rely on simplified forms of transformation. A central SVM (CSVM) [1, 9, 35], which uses class centers to construct the support vector machine, was also proposed, and the Euclidean metric criterion was extended to the Minkowski metric to directly calculate the weight of each feature [5, 16, 17]. Nevertheless, the additional parameter is difficult to tune.

To the best of our knowledge, the importance of training samples before they are fed into a model has not been considered in SVM. It is well known that all samples are assumed to contribute identically to the optimal hyperplane in conventional SVM and its improved variants [8, 15, 19]. However, available training data are often contaminated by noise and outliers in many practical applications, so the performance of SVM may be dominated by weakly related or even irrelevant samples. A robust support vector machine was proposed [11, 23], in which an adaptive margin is formed by using the distance between each class center and the data points. Lin et al. proposed the Fuzzy Support Vector Machine (FSVM) [13, 14], which applies fuzzy membership degrees to the training data to relax the influence of outliers. However, the selection of the membership function has always been a difficult problem in fuzzy support vector machines.

Motivated by the above discussion, we propose a novel Iterative Factoring Support Vector Machine (If-SVM), in which sample factoring is introduced to measure the significance of each data point. It effectively decreases the negative impact of trivial or noisy data points and thus avoids training the classifier on them. Compared with existing weighted SVM methods, our approach derives a better training set, so the influence of non-critical samples in SVM is decreased. By introducing this iterative sample factoring into SVM, the classification accuracy of our proposed method is 1.45% higher than that of the other comparative methods on image recognition datasets. We will also extend and apply our idea to other image processing applications in future work, such as image segmentation [30, 32, 34]. The main contributions of this paper are summarized as follows:

1) We introduce a sample factor into the proposed model to measure the significance of each data point. This indicator variable determines whether a data point is a critical sample or not.

2) We further propose a novel Iterative Factoring Support Vector Machine (If-SVM) method, which iteratively evaluates the importance of each sample to reduce the influence of non-critical samples. This significantly decreases the negative impact of trivial or noisy data points on the classifier model.

3) Our further experiments demonstrate that the performance of state-of-the-art SVM methods can also be improved by incorporating our sample factoring idea into their models, which shows that our idea is a useful tool for improving state-of-the-art SVM models.

The remainder of this paper is organized as follows. Section 2 briefly reviews the basic theory of standard SVM and introduces some improved methods. Section 3 presents the theoretical derivation of our proposed algorithm in detail and extends it into kernel space. Experimental evaluations are reported and discussed in Section 4. Finally, we conclude the paper with future work in Section 5.

2 Related work

In order to effectively reduce noise and maximally improve classification accuracy, many researchers proposed some methods to improve the performance of support vector machine. In this section, we briefly review the traditional SVM for classification and present several improved methods.

Suppose the training set of the classification problem is \({T=\left \{ \left ({\textbf {x}_{i}},{{y}_{i}} \right ) \right \}_{i=1}^{n}}\), where \(\textbf{x}_{i}\) represents the i-th training sample and \({{y}_{i}}\in \left \{ -1,1 \right \}\) is the corresponding class label of \(\textbf{x}_{i}\), i = 1, ... , n. The classic SVM algorithm obtains the optimal hyperplane by solving the following optimization problem

$$ \begin{aligned} \min_{\mathbf{w},b,\boldsymbol{\xi}}\quad & \frac{1}{2}\left\| \mathbf{w} \right\|^{2} + C\sum\limits_{i = 1}^{n} \xi_{i} \\ \text{s.t.}\quad & y_{i}\left(\mathbf{w}^{T}\mathbf{x}_{i} + b\right) \ge 1 - \xi_{i},\quad i = 1, \dots, n, \\ & \xi_{i} \ge 0,\quad i = 1, \dots, n, \end{aligned} $$
(1)

where w is the normal vector of the hyperplane \(\textbf{w}^{T}\textbf{x} + b = 0\) in the feature space, b is the scalar offset of the hyperplane, ξi is a slack variable and C is a penalty parameter.

In this way, the optimization problem can be transformed into a convex quadratic programming problem. To solve this quadratic programming problem, we construct a Lagrangian and transform it into the dual

$$ \begin{aligned} \max_{\boldsymbol{\alpha}}\quad & \sum\limits_{i = 1}^{n} \alpha_{i} - \frac{1}{2}\sum\limits_{i = 1}^{n} \sum\limits_{j = 1}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}\,\mathbf{x}_{i}^{T}\mathbf{x}_{j} \\ \text{s.t.}\quad & \sum\limits_{i = 1}^{n} \alpha_{i}y_{i} = 0, \\ & 0 \le \alpha_{i} \le C,\quad i = 1, \dots, n, \end{aligned} $$
(2)

where α = [α1, ... , αn]T is the vector of nonnegative Lagrange multipliers. The corresponding decision function is \(\operatorname{sgn}\left( \sum_{i=1}^{n} y_{i}\alpha_{i}\mathbf{x}_{i}^{T}\mathbf{x} + b \right)\).
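To make the formulation concrete, the following toy sketch (in Python with scikit-learn, which is not the implementation used in this paper) fits the soft-margin SVM of (1)-(2) on synthetic two-dimensional data; the dataset and the choice C = 1.0 are illustrative assumptions only.

```python
# A toy, self-contained illustration of (1)-(2); the data and C are arbitrary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),   # negative class
               rng.normal(+1.0, 1.0, (50, 2))])  # positive class
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)      # solves the dual problem (2)
w = clf.coef_.ravel()                            # w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]                            # hyperplane offset
pred = np.sign(X @ w + b)                        # decision rule sgn(w^T x + b)
print("support vectors:", clf.support_.size)     # samples with alpha_i > 0
```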

Since the traditional SVM classification algorithm assumes that each sample vector has the same importance for classification, the discrepancies among training samples are ignored when they are fed into the model. This may affect the classification performance of the support vector machine when predicting a sample's category. To solve this problem, researchers proposed sample-weighted methods, which give large weights to training samples with high relevance and small weights to training samples with low relevance.

Extending the original probabilistic c-means (PCM) algorithm into a kernel space, Yang developed the kernel-based probabilistic c-means (KPCM) algorithm, whose partition values are used as the weights of the proposed weighted SVM (W-SVM) [29]. With the weights generated by the KPCM algorithm, the corresponding optimization problem can be formulated as

$$ \begin{aligned} \max_{\boldsymbol{\alpha}}\quad & \sum\limits_{i = 1}^{n} \alpha_{i} - \frac{1}{2}\sum\limits_{i = 1}^{n} \sum\limits_{j = 1}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}K(\mathbf{x}_{i},\mathbf{x}_{j}) \\ \text{s.t.}\quad & \sum\limits_{i = 1}^{n} \alpha_{i}y_{i} = 0, \\ & 0 \le \alpha_{i} \le C u_{i},\quad i = 1, \dots, n, \end{aligned} $$
(3)

where K(xi,xj) can be represented by a dot product in the feature space as \(K(\mathbf{x}_{i},\mathbf{x}_{j}) = \varphi(\mathbf{x}_{i})^{T}\varphi(\mathbf{x}_{j})\), and the nonlinear mapping function φ maps an input vector x in the input space X onto φ(x) in the feature space F, that is, \(\varphi : X \to F\). Note that a weight ui is assigned to the data point xi in (3). The penalty parameter C is a constant, so Cui sets a different penalty for each training sample. It can be seen from (3) that the larger Cui is, the smaller the possibility that sample xi is misclassified.
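As a rough illustration of how the per-sample bound Cui in (3) can be realized with an off-the-shelf solver, the sketch below uses scikit-learn's sample_weight argument, which rescales the penalty per sample; the KPCM weights u are assumed to be precomputed and are passed in as placeholders, so this is not the authors' W-SVM implementation.

```python
# A minimal sketch of per-sample weighting as in (3): sample_weight rescales
# the penalty so the box constraint becomes 0 <= alpha_i <= C * u_i.
import numpy as np
from sklearn.svm import SVC

def weighted_svm(X, y, u, C=1.0, gamma=0.5):
    """X: (n, p) samples, y: (n,) labels in {-1, +1}, u: (n,) KPCM-style weights in (0, 1]."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    clf.fit(X, y, sample_weight=u)   # per-sample penalty C * u_i
    return clf
```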

Cui et al. [6] combined an outlier detection approach with adaptive weights for the training samples. Supposing the weight is ui and the squared error \({{\xi }_{i}}^{2}\) is weighted, the optimization problem is presented as follows

$$ \begin{aligned} \min_{\mathbf{w},b,\boldsymbol{\xi}}\quad & \frac{1}{2}\left\| \mathbf{w} \right\|^{2} + C\sum\limits_{i = 1}^{n} u_{i}\xi_{i}^{2} \\ \text{s.t.}\quad & y_{i}\left(\mathbf{w}^{T}\mathbf{x}_{i} + b\right) \ge 1 - \xi_{i},\quad i = 1, \dots, n, \\ & \xi_{i} \ge 0,\quad i = 1, \dots, n, \end{aligned} $$
(4)

where the initial weight is calculated according to the fitting error of each data sample: the larger the fitting error, the smaller the weight. In this way, the interference of noise and isolated points in classification can be reduced, while normal samples remain unaffected.
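A hedged sketch of this error-based re-weighting scheme is given below; the specific mapping from slack to weight, u_i = 1/(1 + ξ_i), is an illustrative choice of ours and not necessarily the rule used in [6].

```python
# A rough sketch of error-based re-weighting: fit an initial SVM, measure each
# sample's slack xi_i, down-weight samples with large fitting error, and refit.
import numpy as np
from sklearn.svm import SVC

def error_weighted_svm(X, y, C=1.0):
    base = SVC(kernel="linear", C=C).fit(X, y)
    margin = y * base.decision_function(X)       # y_i (w^T x_i + b)
    xi = np.maximum(0.0, 1.0 - margin)           # slack of each training sample
    u = 1.0 / (1.0 + xi)                         # larger error -> smaller weight (illustrative)
    return SVC(kernel="linear", C=C).fit(X, y, sample_weight=u)
```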

In addition, with regard to the features of a sample vector, some features are strongly correlated with the classification, whereas others are weakly correlated or even unrelated. If the distinct importance of different features is not considered, the kernel function may be dominated by weakly related or unrelated features, leading to performance degradation. Therefore, many improved feature-weighted methods have been proposed.

Do et al. [7] introduced a vector of parameters ui, which in fact performs feature weighting. The weight ui is used in the calculation of the kernel function, and the radius R is bounded with \({u_{i}}{R_{i}}^{2} \le {R_{i}}^{2} \le \sum \limits _{i = 1}^{n} {{u_{i}}{R}_{i}}^{2} \le 1\). The MR-SVM solves the following convex relaxation problem

$$ \begin{aligned} \min_{\mathbf{w},b,\boldsymbol{\xi}}\quad & \frac{1}{2}\sum\limits_{i = 1}^{n} \frac{\mathbf{w}_{i}^{2}}{u_{i}} + \frac{C}{\sum\nolimits_{i = 1}^{n} u_{i}R_{i}^{2}}\sum\limits_{i = 1}^{n} \xi_{i}^{2} \\ \text{s.t.}\quad & y_{i} = \mathbf{w}^{T}\mathbf{x}_{i} + b + \xi_{i},\quad i = 1, \dots, n, \\ & \xi_{i} \ge 0,\quad i = 1, \dots, n, \end{aligned} $$
(5)

where ui is the weight of the i-th feature and Ri is the radius along dimension i. This vector weights the different features in the feature space, and under the sparsity constraint it forces many trivial features to have zero weight. Therefore, it can remove a large number of trivial features at each iteration.
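The following minimal sketch illustrates the feature-weighting idea: scaling feature j by the square root of u_j before a linear kernel realizes the weighted kernel K(x, z) = Σ_j u_j x_j z_j. The weights u are placeholders here, whereas MR-SVM learns them jointly with the classifier.

```python
# A small sketch of feature weighting via input scaling; test points must be
# scaled by the same sqrt(u) before prediction.
import numpy as np
from sklearn.svm import SVC

def feature_weighted_svm(X, y, u, C=1.0):
    """u: (p,) nonnegative weight per feature (placeholder values, not learned here)."""
    Xw = X * np.sqrt(u)                  # realizes K(x, z) = sum_j u_j x_j z_j
    return SVC(kernel="linear", C=C).fit(Xw, y)
```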

Wu et al. [27] proposed a convex radius-margin-based SVM model (F-SVM) for joint learning of the feature transformation and the SVM classifier. The generalized block coordinate descent method is used to solve the F-SVM model, and the feature transformation is updated by gradient descent. F-SVM introduces a linear transformation matrix A and integrates the radius information; the radius-margin-based model is given as follows

$$ \begin{aligned} \min_{\mathbf{w},b,\boldsymbol{\xi}}\quad & \frac{1}{2}\left\| \mathbf{w} \right\|^{2}R^{2} + C\sum\limits_{i = 1}^{n} u_{i}\xi_{i}^{2} \\ \text{s.t.}\quad & y_{i} = \mathbf{w}^{T}\mathbf{A}\mathbf{x}_{i} + b + \xi_{i},\quad i = 1, \dots, n, \\ & \xi_{i} \ge 0,\quad i = 1, \dots, n, \end{aligned} $$
(6)

where the radius R is bounded with \(\left \| {\textbf {A}{\textbf {x}_{\textbf {i}}} - \textbf {A}{\textbf {x}_{\textbf {0}}}} \right \| \le 1\), i = 1, ... , n. In this way, the computation of the kernel function avoids being dominated by irrelevant or trivial features.

3 The proposed If-SVM

This section presents our proposed Iterative Factoring Support Vector Machine (If-SVM) method, which effectively decreases the influence of non-critical samples and improves classification performance. We introduce a scaling factor to measure the significance of each data point. The proposed If-SVM is formulated in the following subsections.

3.1 The proposed method

Given a training data set T in the feature space, to obtain the optimal separating hyperplane we extend the traditional SVM model in (1) as follows

$$ \begin{aligned} \min_{\mathbf{w},b,\boldsymbol{\xi},\mathbf{d}}\quad & \frac{1}{2}\left\| \mathbf{w} \right\|^{2} + C\sum\limits_{i = 1}^{n} \xi_{i} \\ \text{s.t.}\quad & y_{i}\left(\mathbf{w}^{T}\mathbf{x}_{i}d_{i} + b\right) \ge 1 - \xi_{i},\quad i = 1, \dots, n, \\ & \xi_{i} \ge 0,\quad i = 1, \dots, n, \\ & d_{i} \ge 0,\quad i = 1, \dots, n, \end{aligned} $$
(7)

where C is a constant that determines the trade-off between margin maximization and the amount of misclassification. Note that, as shown in (7), a sample factor di is introduced into the standard SVM to measure the significance of each data point.

The above optimization problem can be solved by its dual problem. Then we construct the Lagrange function as follows

$$ \begin{aligned} L(\mathbf{w},b,\boldsymbol{\xi},\mathbf{d},\boldsymbol{\alpha}) = {} & \frac{1}{2}\left\| \mathbf{w} \right\|^{2} + C\sum\limits_{i = 1}^{n} \xi_{i} - \sum\limits_{i = 1}^{n} \alpha_{i}\left[ y_{i}\left(\mathbf{w}^{T}\mathbf{x}_{i}d_{i} + b\right) - 1 + \xi_{i} \right] \\ & - \sum\limits_{i = 1}^{n} V_{i}\xi_{i} - \gamma\left(\sum\limits_{i = 1}^{n} d_{i} - 1\right) - \sum\limits_{i = 1}^{n} \varphi_{i}d_{i}, \end{aligned} $$
(8)

By setting the derivatives of the Lagrangian \(L(\mathbf{w},b,\boldsymbol{\xi},\mathbf{d},\boldsymbol{\alpha})\) with respect to \(\mathbf{w}\), \(b\), \(\xi_{i}\) and \(d_{i}\) to zero, we obtain

$$ \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum\limits_{i = 1}^{n} \alpha_{i}y_{i}d_{i}\mathbf{x}_{i} = 0, $$
(9)
$$ \frac{\partial L}{\partial b} = - \sum\limits_{i = 1}^{n} {{\alpha_{i}}{y_{i}}} = 0, $$
(10)
$$ \frac{\partial L}{\partial \xi_{i}} = C - \alpha_{i} - V_{i} = 0, $$
(11)
$$ \frac{\partial L}{\partial d_{i}} = - \alpha_{i}y_{i}\mathbf{w}^{T}\mathbf{x}_{i} + \gamma - \varphi_{i} = 0. $$
(12)

According to the Karush-Kuhn-Tucker (KKT) conditions, the following expression holds

$$ {\varphi_{i}}{d_{i}} = 0. $$
(13)

It can be observed from (13) that if the Lagrange multiplier φi ≠ 0 then di = 0, and we further obtain \(\textbf {w}^{\textbf {T}}\textbf {x}_{\textbf {i}}{d_{i}} = 0\), which means xi is not a support vector since it is not involved in training the model. From this observation, the sample factor di can be interpreted as an indicator variable that determines whether the data point xi is a critical sample or not. That is, if di ≠ 0, then αi has a non-trivial solution with high probability.

With this interpretation of the sample factor, substituting (9)-(12) into the Lagrangian (8) yields the following dual optimization problem

$$ \begin{aligned} \min_{\boldsymbol{\alpha}}\quad & \frac{1}{2}\sum\limits_{i = 1}^{n} \sum\limits_{j = 1}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}\,\mathbf{x}_{i}^{T}\mathbf{x}_{j}d_{i}d_{j} - \sum\limits_{i = 1}^{n} \alpha_{i} \\ \text{s.t.}\quad & \sum\limits_{i = 1}^{n} \alpha_{i}y_{i} = 0, \\ & 0 \le \alpha_{i} \le C,\quad i = 1, \dots, n. \end{aligned} $$
(14)

For a test sample \(\mathbf{x}\), its class label can be determined by the following decision function

$$ F(\mathbf{x}) = \operatorname{sign}\left(\sum\limits_{i = 1}^{n} \alpha_{i}y_{i}d_{i}\,\mathbf{x}_{i}^{T}\mathbf{x} + b\right). $$
(15)

For the nonlinear SVM, the optimization problem can be generalized for nonlinear kernels as follows

$$ \begin{aligned} \min_{\boldsymbol{\alpha}}\quad & \frac{1}{2}\sum\limits_{i = 1}^{n} \sum\limits_{j = 1}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}K(\mathbf{x}_{i},\mathbf{x}_{j})d_{i}d_{j} - \sum\limits_{i = 1}^{n} \alpha_{i} \\ \text{s.t.}\quad & \sum\limits_{i = 1}^{n} \alpha_{i}y_{i} = 0, \\ & 0 \le \alpha_{i} \le C,\quad i = 1, \dots, n. \end{aligned} $$
(16)

Finally, the decision function can be generalized for nonlinear kernels as follows:

$$ F(\mathbf{x}) = \operatorname{sign}\left(\sum\limits_{i = 1}^{n} \alpha_{i}y_{i}d_{i}K(\mathbf{x}_{i},\mathbf{x}) + b\right) $$
(17)
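For illustration, the sketch below performs one solve of the kernel problem (16) with a precomputed Gram matrix scaled entrywise by d_i d_j, which is equivalent to scaling each mapped sample by its factor in feature space. The factors d are assumed given, since their update is only described in Section 3.2, and this Python/scikit-learn code is not the authors' implementation.

```python
# A minimal sketch of one solve of (16) using a precomputed, factor-scaled kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def kernel_if_svm_step(X, y, d, C=1.0, gamma=0.5):
    K = rbf_kernel(X, X, gamma=gamma) * np.outer(d, d)   # K(x_i, x_j) d_i d_j
    return SVC(kernel="precomputed", C=C).fit(K, y)

# Prediction for test points X_new, following (17):
# K_new = rbf_kernel(X_new, X_train, gamma=gamma) * d    # scales column i by d_i
# labels = clf.predict(K_new)
```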

3.2 Obtaining sample factors

Since the sample factor di indicates whether the data point xi is a critical sample or not, as illustrated in Fig. 1, we obtain the value of the sample factor as follows

$$ s_{i} = \frac{\mathbf{w}^{T}\mathbf{x}_{i} + b}{\left\| \mathbf{w} \right\|^{2}}, $$
(18)
$$ {d_{i}} = \max (0,1 - {s_{i}}), $$
(19)

where si is the distance between the data point xi and the hyperplane. Since di indicates whether xi is a support vector or not, we use the hinge-type factor di to evaluate the importance of xi. If xi is close to the hyperplane, it receives a larger factor value. On the contrary, if xi is far away from the hyperplane, it is less important and receives a smaller factor value. If di is 0, that is, xi is not a support vector, the point is discarded and has no impact on the next iteration of modeling. In this way, our proposed If-SVM model is capable of focusing on the critical data falling around the hyperplane and abandoning the data that are far away from it. With this sample factoring setting, If-SVM obtains a better hyperplane through this iterative data-reducing procedure. For conciseness, the main steps of our If-SVM optimization algorithm are listed in Table 1 and Fig. 2. In the initialization stage, we set di = 1 and initialize w, b using an off-the-shelf SVM solver. Then, di is updated according to (18) and (19), and the dataset is updated by \(T_{i+1} = Td_{i}\). Our If-SVM algorithm alternates between updating w, b and di until convergence. As a result, the proposed model can concentrate on the critical data around the hyperplane to achieve a better classification hyperplane.
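The following compact sketch (in Python with scikit-learn rather than the authors' MATLAB implementation) follows the steps in Table 1 for the linear case; the use of the absolute distance |w^T x_i + b| / ||w|| and a fixed maximum number of iterations are simplifying assumptions of this sketch.

```python
# A compact sketch of the If-SVM training loop (linear case): initialize d_i = 1,
# fit an off-the-shelf SVM on the factored data, update the factors with (18)-(19),
# drop points with d_i = 0, and repeat.
import numpy as np
from sklearn.svm import SVC

def if_svm_fit(X, y, C=1.0, n_iter=5):
    keep = np.arange(len(X))            # indices of samples still considered critical
    d = np.ones(len(X))                 # sample factors, all equal to 1 initially
    clf = None
    for _ in range(n_iter):
        Xd = X[keep] * d[keep, None]    # replace x_i by d_i * x_i as in (7)
        clf = SVC(kernel="linear", C=C).fit(Xd, y[keep])
        w, b = clf.coef_.ravel(), clf.intercept_[0]
        s = np.abs(X[keep] @ w + b) / np.linalg.norm(w)   # distance to hyperplane, cf. (18)
        d_new = np.maximum(0.0, 1.0 - s)                  # factor update, (19)
        mask = d_new > 0
        if not mask.any():              # degenerate case: keep the current model
            break
        keep = keep[mask]               # non-critical points (d_i = 0) are discarded
        d[keep] = d_new[mask]
        if mask.all():                  # no point was removed: stop iterating
            break
    return clf                          # the learned (w, b) is applied to raw test points

# usage: model = if_svm_fit(X_train, y_train); y_pred = model.predict(X_test)
```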

Fig. 1 SVM optimal separation hyperplane

Fig. 2 The flowchart of If-SVM

Table 1 The steps of If-SVM

4 Experiments

In this section, we evaluate the performance of our proposed If-SVM method in comparison with several state-of-the-art methods including SVM [33], WSVM [16], RMM [14] and FWSVM [34]. Experiments have been conducted on the 11 UCI data sets as described in Table 3, the LFW database and four large-scale image data sets, USPS, Extended Yale B, CIFAR-10 and MNIST.

4.1 Dataset description

We select 15 publicly available datasets to evaluate the performance of our algorithm: 11 of them are taken from the UCI repository, and the remaining four are the USPS, Extended Yale B, CIFAR-10 and MNIST image datasets. The datasets are described in Tables 2 and 3, and the results of the various methods are shown in Tables 4 and 5, with the best results in bold.

Table 2 Description of the USPS, MNIST, Extended Yale B and CIFAR-10 datasets used in the experiments
Table 3 Description of the 11 UCI datasets used in the experiments

The USPS dataset: This dataset consists of handwritten numbers from 0 to 9. The training and testing sets consist of 7291 examples and 2007 examples respectively. Each example has 256 attributes or pixels that describe each number.

The Extended Yale B dataset: The Extended Yale B database consists of 2,414 frontal-face images of 38 subjects. The cropped 192×168 face images were captured under various laboratory-controlled lighting conditions and with different facial expressions. For each subject, half of the images are randomly selected for training (i.e., about 32 images per subject), and the remaining half for testing.

The CIFAR-100 dataset: The CIFAR-100 dataset consists of 60000 natural color images and has 100 classes. The dataset contains a training set of 50000 images and a test set of 10000 images. Each example has 1024 attributes or pixels that describe each image.

The MNIST dataset: The MNIST dataset comes from the National Institute of Standards and Technology (NIST). The handwritten digits were written by approximately 250 different writers. The training and testing sets consist of 60000 examples and 10000 examples respectively. Each example has 784 attributes or pixels that describe each digit.

4.2 Experiment settings

For each dataset, we use the average classification accuracy obtained over 10 runs of tenfold cross validation (CV) as the performance indicator. In our tenfold CV, the training set of n samples is randomly divided into 10 folds of size n/10. The classifier is then trained on 9 folds and evaluated on the retained test fold, so that every sample in the dataset is used for both training and testing and each sample is tested exactly once. Finally, the results on the ten test folds are averaged to produce a single estimate. Moreover, the running time of each method is reported over the ten runs of the tenfold CV. All experiments are conducted on a desktop PC with an Intel(R) Xeon(R) CPU (3.30 GHz) and 32 GB RAM under the MATLAB 2017b programming environment. In the experiments, a coarse-to-fine search strategy is adopted to determine the hyperparameters: the grid search method is first adopted for coarse searching, and the line bisection method is then exploited to refine the hyperparameters within a small range. Concretely, we set the penalty parameter \(C \in \{10^{-3}, 10^{-2}, \dots, 10^{5}\}\) in linear If-SVM and \(\sigma \in \{2^{-10}, 2^{-9}, \dots, 2^{5}\}\) for the Gaussian RBF kernel in kernel If-SVM.
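For reference, a minimal sketch of this coarse grid search with 10-fold CV is shown below in Python/scikit-learn rather than the MATLAB environment used in our experiments; note that scikit-learn parameterizes the RBF kernel by gamma = 1/(2σ²), so the power-of-two grid on gamma only mirrors, and is not identical to, the search over σ described above.

```python
# A sketch of the coarse hyperparameter search with 10-fold cross validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": 10.0 ** np.arange(-3, 6),      # C in {10^-3, ..., 10^5}
    "gamma": 2.0 ** np.arange(-10, 6),  # coarse power-of-two grid for the RBF width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
# search.fit(X_train, y_train); print(search.best_params_, search.best_score_)
```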

4.3 Experimental results on UCI datasets

For each UCI dataset, the If-SVM method is compared with several existing methods, including the LIBSVM, WSVM [29], RMM [21] and FWSVM [33].

1) Evaluation of linear If-SVM: Table 4 presents the classification accuracy of our proposed linear If-SVM and the competing methods. As shown in Table 4, If-SVM achieves the best or the second-best classification accuracy on the 11 UCI data sets. The average classification accuracy of If-SVM is 85.29%, which is 2.10% higher than that of traditional SVM, 1.15% higher than WSVM, 0.89% higher than FWSVM and 1.21% higher than RMM. Specifically, the improvement of If-SVM over SVM exceeds 3.0% in accuracy on 3 data sets, i.e., Breast, Titanic and Wpbc. We also evaluate the effect of the hyperparameter C of linear If-SVM on the Wpbc dataset. It can be seen from Fig. 3 that the accuracy is relatively low when C < 0.1, improves as C increases up to 10, and decreases significantly when C > 100.

2) Evaluation of kernel If-SVM: Table 5 lists the classification accuracy of our proposed kernel If-SVM and the competing methods. As shown in Table 5, kernel If-SVM achieves the highest classification accuracy on 10 of the 11 data sets among the competing methods. The average classification accuracy of If-SVM is 88.67%, which is 2.13% higher than that of traditional SVM, 1.46% higher than W-SVM, 1.20% higher than FWSVM and 1.52% higher than RMM. Specifically, the improvement of If-SVM over SVM exceeds 3.0% in accuracy on 4 data sets, i.e., Ionosphere, Parkinsons, Wpbc and German. As shown in Fig. 4, we evaluate the effect of the hyperparameters on the Liver dataset, including the trade-off C and the kernel parameter σ in kernel If-SVM. Better accuracy can be obtained by using a larger C (e.g., C = 100) and a smaller σ (e.g., σ = 0.25). Similar conclusions can be drawn from the other data sets.

Fig. 3 Parameter C of If-SVM on the Wpbc dataset

Fig. 4 Parameter analysis on the Liver dataset

Table 4 Comparison of the average classification accuracy by linear SVM, linear WSVM, linear RMM, linear FWSVM and linear If-SVM
Table 5 Comparison of the average classification accuracy by kernel SVM, kernel WSVM, kernel RMM, kernel FWSVM and kernel If-SVM

4.4 Experimental results on image classification datasets

For each dataset, the If-SVM method is compared with several existing methods, including LIBSVM, WSVM [29], RMM [21] and FWSVM [33].

1) Evaluation of linear If-SVM: Table 6 and Fig. 5 present the classification accuracy and the average classification accuracy of our proposed linear If-SVM and the competing methods. As shown in Table 6, If-SVM achieves the best or the second-best classification accuracy on the USPS, Extended Yale B, CIFAR-100 and MNIST datasets. Figure 5 shows that the average classification accuracy of If-SVM is 76.37%, which is 2.50% higher than that of traditional SVM, 1.59% higher than WSVM, 1.13% higher than FWSVM and 1.01% higher than RMM.

2) Evaluation of kernel If-SVM: Table 7 and Fig. 6 present the classification accuracy and the average classification accuracy of our proposed kernel If-SVM and the competing methods. As shown in Table 7, the average classification accuracy of If-SVM is 83.13%, which is 2.75% higher than that of traditional SVM, 1.83% higher than W-SVM, 1.09% higher than FWSVM and 1.73% higher than RMM. To further compare the performance of the various methods, we use box plots and a line chart for an intuitive display. The accuracy of the five methods on the MNIST dataset is depicted in Fig. 7, which shows that the median accuracy of If-SVM is higher than that of the comparative methods on MNIST, meaning that the proposed method can effectively improve classification accuracy. In addition, each box of If-SVM is relatively narrow, which indicates that our method is more stable than the others.

Fig. 5 The average classification accuracy of linear SVM, linear WSVM, linear RMM, linear FWSVM and linear If-SVM on the LFW database

Fig. 6 The average classification accuracy of kernel SVM, kernel WSVM, kernel RMM, kernel FWSVM and kernel If-SVM on the LFW database

Fig. 7 The accuracy of five methods on the MNIST dataset

Table 6 Comparison of the average classification accuracy by linear SVM, linear WSVM, linear RMM, linear FWSVM and linear If-SVM
Table 7 Comparison of the average classification accuracy by kernel SVM, kernel WSVM, kernel RMM, kernel FWSVM and kernel If-SVM

4.5 Experimental results on LFW database

Face recognition methods can be evaluated with two test protocols on LFW: the restricted and the unrestricted settings. In our experiments, performance is evaluated by our tenfold CV on a set of 300 positive and 300 negative image pairs under the restricted setting, where the only available information is whether each pair of training images shows the same person. We use SIFT features to extract 128-dimensional descriptors at nine fiducial points on three scales, yielding a 3456-dimensional feature vector. Due to the high dimensionality and the limitation of computational overhead, we use PCA to reduce the dimensionality to 100. We compare the face recognition accuracy with other state-of-the-art algorithms, and the results are reported in Table 8. From Table 8, the accuracy of our If-SVM is higher than that of the other state-of-the-art methods: kernel If-SVM is 0.85% higher than W-SVM on the LFW database, and it still achieves an improvement of 0.75% over FWSVM. Figure 8 shows the ROC curves of the competing methods, from which it can be seen that the If-SVM algorithm has better classification performance than the other algorithms.
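An illustrative sketch of this verification pipeline is given below; the SIFT descriptors are assumed to be pre-extracted and supplied as a feature matrix, and the C and gamma values are placeholders rather than the tuned values used in our experiments.

```python
# An illustrative sketch of the LFW pipeline: PCA to 100 dimensions, then a kernel SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def lfw_pair_classifier(features, labels, C=1.0, gamma=0.01):
    """features: (n_pairs, 3456) pre-extracted SIFT-based vectors;
    labels: +1 for 'same person' pairs, -1 for 'different person' pairs."""
    model = make_pipeline(PCA(n_components=100), SVC(kernel="rbf", C=C, gamma=gamma))
    return model.fit(features, labels)
```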

Fig. 8 ROC curves on the LFW database

Table 8 Comparison of the average classification accuracy by SVM, W-SVM, RMM, FWSVM and If-SVM on the LFW database

The advantages and disadvantages of the five methods are listed in Table 9. Compared with the existing SVM and RMM methods, our If-SVM model improves robustness and shows better classification performance by reducing the adverse impact of trivial or noisy data points on the classifier. Unlike WSVM and FWSVM, whose additional parameters are difficult to tune, our proposed method remains robust under a rough parameter search.

Table 9 Summary of the characteristics of five methods

4.6 Applying sample factoring idea into state-of-the-art methods

Moreover, our further experiments demonstrate that applying our sample factoring idea to other state-of-the-art kernel methods also improves their performance.

To clearly show the advantages of our method over the comparative methods, we discuss and analyze the results in this section. From Table 10, it is obvious that classification accuracy is improved by combining our sample factoring idea with other state-of-the-art kernel algorithms. On the image classification datasets, the average classification accuracy of If-SVM is 1.49% higher than that of F-SVM, 1.26% higher than that of R-SVM and 1.72% higher than that of W-SVM. This indicates that the samples selected by the proposed If-SVM are more discriminative than those selected by the comparative methods, and that If-SVM is more stable than the others. The reason is that our method can reduce the effect of imperfect samples through factor weighting in the model. For a comprehensive illustration, we also present experimental results on the number of training samples and the run time in training.

Table 10 Comparison of the average classification accuracy by F-SVM, F-SVM++, RSVM+ [35], RSVM+++, W-SVM and W-SVM++ (++ means adding the sample factor idea to the original method)

In each step of the iteration, we have to compute the solution of a standard SVM, which has complexity $O(n^{2})$, where n is the number of training samples. Moreover, we also need to update the factors $d_{i}$, which has complexity $O(n^{3})$. However, because our algorithm removes non-critical samples, the number of samples used for training drops dramatically from the second iteration onward. Therefore, the increase in time complexity is not very high.

To verify that our algorithm does not substantially increase running time, we use the proposed method to classify four data sets from UCI. As can be seen from Fig. 9, the scale of the training set is reduced and noisy samples are removed during the iterations. In our proposed method, only 60% of the samples in the training set are ultimately used for training. Therefore, the run time of the algorithm does not increase too much (Figs. 10, 11 and 12).

Fig. 9 The number of training samples on the Titanic, Image, Breast and German datasets

Fig. 10 Comparison of the run time (in seconds) of SVM and SVM++ (++ means adding the sample factor idea to the original method)

Fig. 11 Comparison of the run time (in seconds) of F-SVM and F-SVM++ (++ means adding the sample factor idea to the original method)

Fig. 12 Comparison of the run time (in seconds) of RSVM+ and RSVM+++ (++ means adding the sample factor idea to the original method)

To further demonstrate the effect of our method, we use bar charts to show the run time on the 7 datasets in Figs. 10 to 13. If-SVM is about three times slower than SVM on the Titanic and German datasets. However, because our algorithm decreases the number of non-critical samples, If-SVM is only about 70% slower than SVM on large data sets such as Ringnorm, Musk and Twonorm. As the size of the training set increases, this effect of our proposed method becomes more obvious.

Fig. 13 Comparison of the run time (in seconds) of W-SVM and W-SVM++ (++ means adding the sample factor idea to the original method)

F-SVM++, RSVM+++ and W-SVM++ are moderately quicker than the original methods without the sample factor. Figure 11 shows the training run time of F-SVM++ and the original method on the 7 datasets; F-SVM++ spends less run time on 6 of the 7 data sets. Figure 12 shows the run time of RSVM+++ and the original method on the 7 datasets; RSVM+++ is faster than the original method on all 7 data sets. Figure 13 shows the run time of W-SVM++ and the original method on the 7 datasets; W-SVM++ achieves less run time on 5 of the 7 data sets. Although our proposed method has a slight disadvantage in training time on the Titanic, German and Breast datasets, it is evident from Figs. 10 to 13 that F-SVM++, RSVM+++ and W-SVM++ spend less run time than the original methods on datasets with more than about 2000 samples. This reveals that the proposed method maintains the classification performance of the SVM model while effectively reducing the amount of training data, which reduces the run time in training.

5 Conclusion

In this paper, we propose a novel If-SVM method. A sample factor is introduced into the standard SVM to measure the significance of each data point. It prevents the classifier from being trained on trivial or noisy samples, and therefore the influence of non-critical samples in SVM is decreased. In this way, our proposed model can pay more attention to the critical data that fall around the hyperplane, which helps to achieve a better classification hyperplane. Experimental results on several UCI datasets show that If-SVM decreases the number of support vectors and achieves better classification performance than state-of-the-art SVM methods. Extensive experiments on different image classification datasets demonstrate that our proposed method achieves better image classification accuracy than the other comparative SVM methods. Our further experiments also demonstrate that applying our sample factoring idea to other state-of-the-art kernel methods helps to improve their performance as well. Finally, motivated by the recent success of image segmentation, we will extend and apply our idea to other image processing applications in future work.