Abstract
This paper proposes parametric non-parallel support vector machines (PN-SVM) for binary pattern classification. Through an intelligent redesign of the support vector machine optimisation, we not only bring noise resilience into the model but also retain its sparsity. Our model exhibits properties similar to support vector machines, hence many SVM-related learning algorithms can be extended to make it scalable to large-scale problems. Experimental results on several benchmark UCI datasets validate our claims.
1 Introduction
The support vector machine (SVM) is a celebrated binary classifier built as a direct practical application of statistical learning theory. The practicality and strong theoretical backing of SVMs have enabled their successful adoption in a diverse variety of real-world applications, ranging from bioinformatics (Yin et al., 2015) and scene classification (Subasi, 2013) to power applications (Hao & Lewin, 2010). SVM has also attracted many researchers and inspired the development of new classifiers, like generalised eigenvalue proximal support vector machines (GEPSVM) (Mangasarian & Wild, 2005) and twin support vector machines (TSVM) (Khemchandani & Chandra, 2007), each of which gave rise to an entirely new line of research.
TSVM showed that a two-hyperplane rule is better suited to dealing with XOR (cross-planes) datasets without the use of kernel methods. Moreover, by leveraging a divide-and-conquer strategy and solving an individual quadratic programming problem (QPP) for each class, it halves the number of constraints per problem and is therefore around four times faster than the conventional SVM, while delivering comparable performance.
Nevertheless, TSVM has its own set of limitations, such as matrix inversion, loss of sparsity, and sensitivity to outliers due to the L\(_2\)-norm. Each of these problems has captivated researchers' attention and given rise to a new line of research.
TSVM adopts a two-hyperplane rule to classify samples, which results in better accuracy than SVM. However, the inconsistency between its training and prediction processes leads to sub-optimal results in certain cases. The authors in Shao et al. (2014) incorporated the two-hyperplane prediction rule into their optimisation to address this issue. Similar to TSVM, their model minimises the respective class variances while simultaneously maximising the projection difference between the classes.
The authors in Tian et al. (2013) presented an interesting extension to TSVM that avoids matrix inversion, is sparse, and shows uniformity between the linear and non-linear formulations. The first two sets of constraints in their optimisation enforce the respective class patterns to lie in the \(\epsilon\)-bands; their formulation corresponds to the L\(_{1}\)-norm as \(\epsilon\) tends to zero. The last set of constraints models the hinge loss to allow optimal separation from the opposite-class patterns. Since the data points are implicitly modelled in the optimisation objective, matrix inversion is avoided, while the explicit modelling of the data points in the constraints bestows on the model properties such as sparsity and uniformity. Further, to speed up the training process, fast SVM solvers can be adopted for their model with little modification.
All the aforementioned models in one way or another implement the hinge loss function, which is unstable in the presence of noise. Traditionally, this problem has been dealt with by replacing the hinge loss with the pinball loss function. The pinball loss focuses on maximising the quantile distance between the classes, in contrast to the hinge loss, which maximises the distance between the closest sets of points in the convex hulls of the two classes. In effect, the pinball loss grants noise resilience by penalising correctly classified samples by a small amount, thereby minimising the scatter as much as permissible. However, the pinball loss trades away the sparsity of the classifier in exchange for noise resilience. To regain sparsity, several pinball loss variants have been proposed, the most popular being the \(\epsilon\)-pinball loss and the truncated pinball loss. These succeed in equipping the classifier with noise resilience and sparsity, but at the cost of an increased number of variables, constraints and parameters. The net effect is a more complicated optimisation with a much higher time complexity.
In this research, we focus on a set of critical problems in SVM whose solution can make SVM better applicable to real-world settings: achieving noise resilience while simultaneously maintaining sparsity and being more time-efficient than the current state-of-the-art pinball loss models.
Taking inspiration from TSVM's proximal classification rule, and through an intelligent redesign of the SVM optimisation, we achieve a solution to the problems above, and our model compares favourably with the pinball loss models.
The present work is organised as follows. Related work is reviewed in Sect. 2. Our proposed model, PN-SVM, is introduced in Sect. 3, which also discusses the motivation behind the proposed model and its connection with existing approaches. The experimental results are reported in Sect. 4, and finally Sect. 5 concludes the paper.
2 Related work
In this paper, we denote by class A the set of samples with positive labels and by class B the set of samples with negative labels, i.e. \(A = \{(x_{i}, y_{i} = 1)\}\) and \(B = \{ (x_{i}, y_{i} = -1) \}\), \(i=1, 2,\ldots , m_{1} + m_{2}\), \(D = [A;B]\), where \(x_i \in X \subset R^{N}\). The labels of the samples are \(y_i = +1\) for \(i=1,2,\ldots , m_{1}\) and \(y_i = -1\) for \(i=m_{1}+1, m_{1}+2,\ldots , n=m_{1}+m_{2}\). For simplicity we write \(Y=diag(y_1,\ldots ,y_{m_{1}}, y_{m_{1}+1},\ldots ,y_{m_{1}+m_{2}})\). Note that the numbers of samples in class A and class B are denoted \(m_{1}\) and \(m_{2}\), respectively.
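As a concrete illustration, the notation above can be instantiated in a few lines of NumPy; the data here is random and purely illustrative, and the variable names mirror the symbols of the text:

```python
import numpy as np

# Illustrative data: m1 positive samples (class A), m2 negative (class B)
m1, m2, N = 3, 2, 2
rng = np.random.default_rng(0)
A = rng.normal(size=(m1, N))   # class A patterns, labels +1
B = rng.normal(size=(m2, N))   # class B patterns, labels -1

D = np.vstack([A, B])          # D = [A; B], samples stacked row-wise
y = np.concatenate([np.ones(m1), -np.ones(m2)])
Y = np.diag(y)                 # Y = diag(y_1, ..., y_{m1+m2})
```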
2.1 Support vector machines
The state-of-the-art binary classification algorithm SVM is backed by the structural risk minimisation principle of statistical learning theory (Vapnik, 1999). For a binary classification dataset D, SVM finds, from a family of separating hyperplanes, the one that maximises the width of the separation between the two classes. Consider the optimal hyperplane \(J(w,b) = w^Tx + b = 0\), where \(b \in R\) is the bias term and \(w \in R^N\) is the normal vector of J(w, b). The optimal hyperplane J can be obtained by solving the following optimisation problem:
Here, minimising the first term (the L\(_2\)-norm regularisation) maximises the margin, or separation, between the classes; e is a vector of ones of appropriate dimension. The second term is the sum of errors (q) of samples that lie in the dead-zone or are incorrectly classified. The parameter c controls the trade-off between the two terms in the objective function. The set of constraints dictates that the data points' projections be at least unit distance away from the hyperplane; if this is violated, the error variable takes the minimum value needed to satisfy the constraint, resulting in a soft-margin hyperplane. For more details we refer the reader to the original SVM paper (Cortes & Vapnik, 1995).
The dual solution is generally sparse, implying that SVM needs only a fraction of the total data points to determine the optimal hyperplane. This property is called sparsity. Several research works aim to identify these data points in the training set; since the classifier depends on these points only, the training set can be reduced, thereby making the model faster (Jung & Kim, 2013; Nalepa & Kawulok, 2019).
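For intuition, the soft-margin objective can be sketched as follows. This is a simplification on our part: it minimises the same primal objective by subgradient descent instead of solving the QP, and all function and variable names are ours:

```python
import numpy as np

def train_linear_svm(X, y, c=1.0, lr=0.01, epochs=200):
    """Minimise 0.5*||w||^2 + c * sum(max(0, 1 - y*(w@x + b)))
    by subgradient descent -- an illustrative stand-in for the QP."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points violating the margin
        grad_w = w - c * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -c * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two linearly separable clusters
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
```

On this toy data the learned hyperplane separates the two clusters; only the margin-violating points contribute to the gradient, which mirrors the sparsity discussed above.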
2.2 Support vector machines with pinball loss function
The authors in Huang et al. (2014) proposed support vector machines with pinball loss (Pin-SVM), taking into consideration two problems with SVMs, namely sensitivity to noise and instability with respect to resampling. Their idea was to penalise the correctly classified samples, via a parameter \(\frac{1}{\tau }\), so as to keep them close to the separating hyperplane while maximising the margin as in SVMs. The Pin-SVM formulation is as follows:
Here c and \(\tau\) are model parameters. As is evident from the above set of equations, Pin-SVM loses sparsity in exchange for noise insensitivity and resampling robustness. The issue of sparsity is handled by introducing an \(\epsilon\)-zone, which is discussed in the subsequent subsection.
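The contrast between the two losses can be sketched numerically. We write both in terms of the margin \(u = yf(x)\); this is a sketch of the standard definitions, not a transcription of the exact equations of Huang et al.:

```python
import numpy as np

def hinge(u):
    # zero loss once u >= 1: only margin violators matter -> sparsity
    return np.maximum(0.0, 1.0 - u)

def pinball(u, tau=0.5):
    # L_tau(u) = 1 - u        if u <= 1  (misclassified / inside margin)
    #          = tau*(u - 1)  if u  > 1  (correct side: still penalised)
    return np.where(u <= 1, 1.0 - u, tau * (u - 1.0))

margins = np.array([-0.5, 0.5, 1.0, 2.0])   # u = y * f(x)
hinge_vals = hinge(margins)
pin_vals = pinball(margins)
```

Because the pinball loss is non-zero even for well-classified points (u > 1), every sample enters the dual and sparsity is lost, which is exactly the trade-off described above.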
2.3 Modified support vector machine with pinball loss
Motivated by the properties of \(\epsilon\)-Pin-SVM, the authors in Rastogi et al. (2018) built a data-driven, asymmetric-\(\epsilon\)-zone Pin-SVM termed the \((\epsilon _{1},\epsilon _{2})\) modified pinball support vector machine (Mod-Pin-SVM). Note that we consider version 2 from their paper; we refer the reader to Rastogi et al. (2018) for more details.
Here \(c_{1}\), \(c_{2}\), \(\tau _{1}\), \(\tau _{2}\) and \(\epsilon _{1}\) are the model parameters. The optimisation enforces the data points to lie in their respective \(\epsilon\)-zones, wherein the error variable q is zero; thus the obtained solution is naturally sparse. From the results reported, it is evident that their model achieves sparsity while retaining the noise-insensitivity and robustness-to-resampling properties. The downside is the large number of model parameters and the increased training time due to the increased number of constraints and variables.
2.4 Twin support vector machines
Twin support vector machines (TSVM) find a pair of hyperplanes such that each hyperplane is proximal to the patterns of its own class while the opposite-class patterns are at least unit distance away. It solves the following two sets of quadratic problems:
For brevity, we omit the equation for the second hyperplane. Here \(q_1\) is the error vector, and the parameter \(c_1\) controls the trade-off between the squared L\(_2\)-norm term and the sum of errors. TSVM deals with only half of the constraints in each QPP and is hence four times faster than the conventional SVM (Khemchandani & Chandra, 2007).
TSVM loses sparsity because of the two loss functions (L\(_{2}\) loss and hinge loss) used on the classes. Moreover, for large datasets, matrix inversion might be problematic; to overcome this, a suitably small term \(\epsilon I\) is added (i.e. \(H^{T}H\,+\,\epsilon I\)) to make the matrices invertible. Once the class-proximal hyperplanes \(J_{1}(x) = w_{1}^Tx\,+\, b_{1}\) and \(J_{2}(x) = w_{2}^Tx\,+\, b_{2}\) are found, the test data can be annotated according to the rule:
To overcome the matrix inversion step in the dual, many alternatives have been put forward (Chen et al., 2020; Shao et al., 2011). One solution is replacing the L\(_2\)-norm with the L\(_1\)-norm. Another reason for using the L\(_1\)-norm in the primal objective of TSVM is that the L\(_2\)-norm is far more heavily influenced by outliers and noise; the L\(_1\)-norm is also more robust from both an experimental and a statistical point of view (Li et al., 2016; Kwak, 2008).
Moreover, the use of the L\(_1\)-norm can make the twin solution sparse. Note that a technique for solving the L\(_1\)-norm problem has been applied in twin support vector clustering (Wang et al., 2015). It has been shown to converge to the optimal solution, but it requires more iterations to converge, in a way increasing the complexity of the model.
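The proximal decision rule of TSVM described above can be sketched as follows; the hyperplane coefficients here are hand-picked for illustration, not learned:

```python
import numpy as np

def tsvm_predict(x, w1, b1, w2, b2):
    """Assign x to the class whose proximal hyperplane is nearer;
    distances are perpendicular distances to J1 and J2."""
    d1 = abs(w1 @ x + b1) / np.linalg.norm(w1)
    d2 = abs(w2 @ x + b2) / np.linalg.norm(w2)
    return 1 if d1 <= d2 else -1

# Hypothetical proximal hyperplanes: J1 near class A, J2 near class B
w1, b1 = np.array([1.0, 0.0]), -1.0   # plane x1 = 1
w2, b2 = np.array([1.0, 0.0]),  1.0   # plane x1 = -1
label = tsvm_predict(np.array([0.9, 0.0]), w1, b1, w2, b2)   # -> 1
```

Note that prediction requires evaluating both hyperplanes for every test point, which is the inconsistency between training and prediction remarked upon earlier.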
2.5 L\(_1\)-norm twin support vector machines
Motivated by the problems caused by the L\(_2\)-norm, the authors in Peng et al. (2016) introduced the L\(_1\)-norm based TSVM (L\(_1\)TSVM). The optimisation is as follows:
Hyperplane 1 (\(J_{1}\)):
This formulation avoids the matrix inversion step, and the solution obtained is partly sparse. From the empirical observations reported in their paper (Peng et al., 2016), it is evident that the model becomes less sensitive to outliers with the introduction of the L\(_1\)-norm. The problem with L\(_1\)TSVM is that it is highly dependent on its parameters. Looking at the constraints, both optimisation Eqs. (6) and (7) have n constraints, so it loses the time efficiency of the original TSVM; in fact, it requires almost twice the training time of the conventional SVM. Overall, it succeeds in introducing partial sparsity into TSVM while avoiding the matrix inversion step.
2.6 Twin parametric margin support vector machine
Inspired by the TSVM framework, the authors in Peng (2011) introduced the twin parametric margin support vector machine (TPMSVM). The main optimisation of their model is as follows:
Here \(v_{1}\) and \(c_{1}\) are the model parameters. For brevity, we omit the set of equations for the second hyperplane. The constraints dictate that the class A samples lie in the positive half-space. Due to the minimisation of the second term in the objective function, class B samples lie in the negative half-space, as far as possible from the hyperplane. Notice that in this formulation class A appears in the constraints and class B in the objective while calculating the hyperplane for class A. TPMSVM obtains marginal hyperplanes, whereas TSVM obtains proximal hyperplanes. A key advantage of this formulation is that it does not require any matrix inversion, while enjoying a time complexity similar to TSVM. The model mainly finds applicability in scenarios where heteroscedastic noise is prevalent.
2.7 Twin-support vector machines with pinball loss
The authors in Xu et al. (2016) introduced the pinball loss into twin support vector machines (Pin-TSVM) to make them less sensitive to noise. They take the twin parametric margin support vector machine (TPMSVM) as the building block of their research and remodel it to incorporate the pinball loss function. Their paper shows how Pin-TSVM is an extension of TPMSVM; we refer the reader to Peng (2011) and Xu et al. (2016) for more details. The optimisation of Pin-TSVM is as follows:
For brevity, we omit the set of equations for the second hyperplane. Among the pinball loss models, Pin-TSVM is faster in terms of training time when there is no class imbalance, but it does not have a sparse solution.
3 Proposed model
3.1 Motivation
The conventional SVM has proved effective in many practical applications owing to its SRM principle, which avoids overfitting.
However, SVM has cubic training complexity, which means it requires an enormous training time on large datasets. TSVM is around four times faster than SVM due to the reduction in the number of constraints, but it still needs to solve two QPPs, as its decision rule dictates, to classify samples. Moreover, it is sensitive to outliers and needs to invert matrices due to the use of a squared loss on the proximal class. L\(_{1}\)TSVM addresses these problems by replacing the L\(_{2}\)-norm with the L\(_{1}\)-norm; however, it manages to be only partly sparse. This model served as the starting point of our research, i.e. to further improve upon sparsity in the TSVM framework. Surprisingly, the design of PN-SVM turned out to be very similar to the pinball loss models instead. Moreover, our proposed model is approximately four times (or more) faster than Pin-SVM and its variants. To sum up, not only does PN-SVM address the noise-instability problem in the SVM framework, it is also sparse, faster, and less complicated than the Pin-SVM based models.
3.2 Formulation
3.2.1 Linear model
The mathematical formulation of our proposed model is as follows:
Hyperplane 1 (\(J_{1}\)):
Hyperplane 2 (\(J_{2}\)):
Consider the equation for hyperplane 1 (Eq. 9). The second term in the optimisation forces the hyperplane to be close to class A, so that the mean of the class A projections is minimised. The first constraint dictates that the hyperplane keep the class A projections above it; if not, \(q_{1}\) (the slack vector) takes the minimum value needed to satisfy the constraint. Similarly, class B projections need to be at least \(\epsilon\) units away from the hyperplane, with \(q_{2}\) the slack vector for class B samples. Minimising the first term of the optimisation, together with the third, adds the SRM principle to the model to avoid overfitting. Notice that Eq. (9), without the second term, is simply the bounding-hyperplane problem of SVM for class A. The minimisation of the mean of the class A projections in the objective pulls the hyperplane close to class A. Note that the hyperplane remains bounding if the sum of training errors is emphasised more than the mean of the projections, i.e. \(c_1 > v_1\). If \(c_1 \gg v_1\), i.e. the effect of the second term in the objective is very small, the hyperplanes of Eqs. (9) and (10) become parallel and the model converges to the conventional SVM. The minimisation of the mean allows the bounding hyperplanes to minimise the scatter of each class individually. Since this is similar to penalising the correctly classified samples by some amount, it points towards an analogy with the pinball loss popularly used to equip SVMs with noise robustness.
For class A, only the points which lie on the opposite side of the class A hyperplane and the class B points which lie in the \(\epsilon\)-margin of the hyperplane contribute to the training error; hence only a few points contribute to the model building for class A, allowing sparsity. The same holds for class B.
Eq. (9) can be simplified as follows:
here \(q = [q_{1}\,; \,q_{2}]\), \(e = [e_{1}\,; \,e_{2}]\), \(D_{1}=[A;B]\), and \(p_{1}=[0*e_{1}\,; \,\epsilon *e_{2}]\).
To calculate the dual of Eq. (11), we write the Lagrangian \(L_{1}\) as follows:
Here \(\alpha _{1}\) and \(\beta _{1}\) are the Lagrangian multipliers. The Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions of Eq. (11) are given by:
Substituting Eqs. (13)–(17) into Eq. (12), we get the dual of the primal problem Eq. (11), which is as follows:
Here \(Q_{1} = (Y*D_{1})(Y*D_{1})^{T}\) and \(F_{1} = -(\frac{v_{1}}{m_1}(Y*D_{1})A^{T}e_{1}\,+\,p_{1})\).
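The matrices \(Q_{1}\) and \(F_{1}\) above can be assembled directly from the data. A NumPy sketch (the function name and default parameter values are ours):

```python
import numpy as np

def build_dual_terms(A, B, v1=0.5, eps=0.1):
    """Assemble Q1 = (Y*D1)(Y*D1)^T and
    F1 = -((v1/m1)(Y*D1) A^T e1 + p1) of the PN-SVM dual."""
    m1, m2 = len(A), len(B)
    D1 = np.vstack([A, B])                               # D1 = [A; B]
    Y = np.diag(np.concatenate([np.ones(m1), -np.ones(m2)]))
    YD = Y @ D1
    Q1 = YD @ YD.T                                       # Gram-type matrix
    p1 = np.concatenate([np.zeros(m1), eps * np.ones(m2)])
    e1 = np.ones(m1)
    F1 = -((v1 / m1) * (YD @ (A.T @ e1)) + p1)
    return Q1, F1
```

Since \(Q_{1}\) is an outer product of \(Y D_{1}\) with itself, it is symmetric positive semi-definite, so the dual QPP is convex and can be handed to any standard QP solver.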
Once the optimal \(\alpha _{1}\) is obtained, \(w_{1}\) can be recovered from Eq. (13), and \(b_{1}\) for the required hyperplane can be obtained by substituting \(w_{1}\) into Eq. (16) as:
Note that the diag() function stacks the diagonal elements of a square matrix into a column vector. Hence the required hyperplane is \(J_{1}(w_{1},b_{1}) = w_{1}^Tx + b_{1} = 0\).
Similarly, the hyperplane for class B can be calculated.
Here \(Q_{2} = ({\bar{Y}}*D_{2})({\bar{Y}}*D_{2})^{T}\) and \(F_{2} = -(\frac{v_{1}}{m_2}({\bar{Y}}*D_{2})B^{T}e_{2}\,+\,p_{2})\).
\(p_{2}=[0*e_{2}; \epsilon *e_{1}]\), \({\bar{Y}} = -Y\), and we redefine \(D_{2}=[B;A]\).
Hence the required hyperplane for class B is \(J_{2}(w_{2},b_{2}) = w_{2}^{T}x + b_{2} = 0\). A test sample x can be annotated according to the rule in Eq. (5).
3.2.2 Nonlinear model
In this section we extend the linear model to the nonlinear case, using kernel matrices directly to deal with non-linear data. The nonlinear model solves the following optimisation problem:
Note that we have used the rectangular kernel technique here, which means the linear and nonlinear formulations are not uniform.
Similar to the linear case we can obtain the dual of the primal problem Eq. (22) which is as follows:
Here \(\hat{Q_{1}} = (Y*\hat{D_{1}})(Y*\hat{D_{1}})^{T}\), \(\hat{F_{1}} = -(\frac{v_{1}}{m_1}(Y*\hat{D_{1}})K(A,D_{1})^{T}e_{1}\,+\,p_{1})\) and \(\hat{D_{1}}=K(D_{1},D_{1})\).
Solving the above Eq. (24), we get:
Similarly we can obtain the dual of the primal problem Eq. (23) which is as follows:
Here \(\hat{Q_{2}} = ({\bar{Y}}*\hat{D_{2}})({\bar{Y}}*\hat{D_{2}})^{T}\), \(\hat{F_{2}} = -(\frac{v_{1}}{m_2}({\bar{Y}}*\hat{D_{2}})K(B,D_{2})^{T}e_{2}\,+\,p_{2})\) and \(\hat{D_{2}}=K(D_{2},D_{2})\).
Solving the above Eq. (26) we get:
A test sample x can be annotated according to the kernelised version of the rule in Eq. (5), which is as follows:
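As an illustration, a minimal sketch of the kernelised prediction step under the Gaussian kernel used later in the experiments; the dual coefficient vectors \(u_{1}\), \(u_{2}\), the biases, and the function names are our assumptions, and the normalisation of the exact rule is omitted:

```python
import numpy as np

def gaussian_kernel(X1, X2, rho=0.5):
    # K(x1, x2) = exp(-rho * ||x1 - x2||^2), evaluated pairwise
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-rho * sq)

def kernel_predict(x, D, u1, b1, u2, b2, rho=0.5):
    """Evaluate both kernel surfaces J_i(x) = K(x, D) @ u_i + b_i
    and assign x to the class whose surface value is nearer to zero."""
    k = gaussian_kernel(x[None, :], D, rho)[0]
    d1, d2 = abs(k @ u1 + b1), abs(k @ u2 + b2)
    return 1 if d1 <= d2 else -1

# Toy usage with hand-picked (hypothetical) dual coefficients
D = np.array([[0.0, 0.0], [2.0, 2.0]])
u1, bb1 = np.array([0.0, 1.0]), 0.0
u2, bb2 = np.array([1.0, 0.0]), 0.0
label = kernel_predict(np.array([0.1, 0.1]), D, u1, bb1, u2, bb2)
```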
3.3 Algorithm
The algorithm for the proposed model is as follows:
3.4 Connection with existing approaches
In this section, we briefly describe how our proposed model shares its connection with the existing approaches.
3.4.1 L\(_1\)-norm TSVM versus PN-SVM
In L\(_1\)TSVM, the first set of constraints has an equality sign (Eq. 7) so as to develop a class-A-centric hyperplane. In our case, we do not keep the hyperplane centred on class A but instead keep it close to the boundary of class A while minimising the class A projections. This modification not only makes our model sparse but also allows the hyperplane to act as a marginal/bounding or even a proximal one.
3.4.2 SVM versus PN-SVM
Our proposed model has a formulation similar to SVM; the differences are the margin and the introduction of the second term in the optimisation Eq. (9). If the mean of class A is ignored, Eq. (9) is simply the bounding-hyperplane equation for class A in SVM. This change in the optimisation makes SVM capture the variance of the respective class, hence the name PN-SVM.
3.4.3 Pin-SVM versus PN-SVM
The introduction of the pinball loss destroys the sparsity of SVMs. To bring back sparsity, \(\epsilon\)-zone based Pin-SVM models (Rastogi et al., 2018) were developed, which in turn makes the pinball loss models complicated. Adding the class A projections explicitly into the objective makes this process much simpler and more time-efficient. Along similar lines, our model can be regarded as an improved version of Pin-TSVM with respect to sparsity.
3.4.4 MCM versus PN-SVM
Minimal complexity machines (MCM) (Jayadeva, 2015) also share homology with the Pin-SVM models. Our model can be extended into a two-hyperplane version of MCM, as it minimises the scatter of the two classes separately. However, it does not enjoy the time-efficiency relationship to MCM that TSVM has with respect to SVM. This can be a subject of further research.
3.4.5 TPMSVM versus PN-SVM
TPMSVM extends the idea of the parametric margin SVM (Hao, 2010) by solving two independent QPPs. Their model does not implement the SRM principle, resulting in sub-optimal results in certain cases. Similar to TPMSVM, PN-SVM also models marginal hyperplanes, but it is not limited to them: it can also model proximal hyperplanes for a suitable choice of parameters. Moreover, PN-SVM accounts for the variance of the respective classes while implementing the SRM principle and maintaining sparsity. In TPMSVM, the mean of the opposite class is used in the objective to model separation from the opposite-class patterns, whereas in PN-SVM the mean of the respective class is used to model proximity to its own class patterns. On the basis of the aforementioned points, PN-SVM can analogously be regarded as an improved version of TPMSVM. For more details on TSVM-based models we refer the reader to a recently published survey (Tanveer et al., 2022).
3.5 Computation complexity
In this section, we compare the computational cost of our proposed PN-SVM with related algorithms. During the training phase, similar to the conventional SVM, the proposed model needs to find the coefficients of n constraints when solving each dual QPP; since it solves two such QPPs, its computational cost is twice that of SVM, i.e. 2\({\mathcal {O}}(n^{3})\). TSVM optimises a pair of smaller QPPs with approximately n/2 constraints each, so its computational complexity is 2\({\mathcal {O}}(\frac{n^{3}}{8})\), four times faster than SVM, although it incurs some extra cost to invert a pair of matrices. TPMSVM likewise only needs to solve two such QPPs and hence has computational complexity 2\({\mathcal {O}}(\frac{n^{3}}{8})\). The pinball loss models have 2n constraints, so their computation is of order \({\mathcal {O}}({(2n)^{3}})\), except for Pin-TSVM, which has a computation of approximately \(2{\mathcal {O}}({n^{3}})\) only if the datasets are balanced (i.e. \(m_1=m_2\)). Note that PN-SVM is therefore theoretically four times faster than Pin-SVM.
During the testing phase, since both PN-SVM and SVM are sparse, their computational cost depends only on the non-zero coefficients (support vectors, sv), i.e. \({\mathcal {O}}(sv)\). Generally \(sv \ll n\), hence due to sparsity these models are faster in the testing phase. For TSVM and the other non-sparse models, the testing complexity is simply \({\mathcal {O}}(n)\).
4 Experiments
The experiments are performed on well-known, diverse datasets using ten-fold cross-validation in MATLAB (Matlab, 2012) version 9.4 under a Microsoft Windows environment on a machine with a 3.40 GHz i7 CPU and 16 GB RAM. The optimal values of the user-defined parameters of the different models are obtained by fine-tuning on a validation set generated from ten percent of the training data. After the model parameters are known, the validation set is returned to the training data for retraining. All datasets are normalised to the range \([-1,1]\). All comparison models are implemented in Matlab, and the quadprog() function is used for all models to ensure a uniform comparison. For all two-hyperplane models, the parameters of both hyperplanes are kept the same. To measure the sparsity of the sparse models we use the Gini index; the authors in Hurley and Rickard (2009) show that the Gini index is well suited to evaluating sparsity. Note that the lower the Gini index, the better the sparsity of the model. TSVM, TPMSVM, Pin-TSVM and Pin-SVM are non-sparse models, hence their Gini index is reported as 1 without any calculation. The time (in seconds) of the algorithms is measured with tic-toc (Matlab); for uniformity's sake, we measure only the time taken by the quadprog function in each algorithm. The best testing accuracy and Gini index are highlighted in bold, and the training accuracy is reported to check for overfitting. We use the Gaussian kernel (\(K(x_1,x_2) = e^{-\rho \left\Vert x_1-x_2\right\Vert ^{2}}\)) for all kernelised models; \(\rho\) is tuned in the range [0.1 : 0.1 : 1].
For our proposed model the setting of the model parameters is very important. We tune \(v_{1}\) and \(c_{1}\) in the ranges [0.1 : 0.1 : 0.9] and [1 : 1 : 10], respectively. For brevity, we select the same values for the second hyperplane; however, the parameters of the two hyperplanes can be tuned separately for better accuracy. Note that \(c_{1}\) should be sufficiently larger than \(v_{1}\) (i.e. \(c_{1} > v_{1}\)).
For all our experiments we fix \(\epsilon =0.1\), simply to avoid a trivial solution to the optimisation.
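The Gini index used above for measuring sparsity can be computed as follows. We use the Hurley–Rickard definition, under which higher values mean sparser vectors; since the paper reports lower values as sparser (with non-sparse models at 1), we return the complement of the standard index, which is an assumption on our part:

```python
import numpy as np

def gini_index(alpha):
    """Complement of the Hurley-Rickard Gini index of a coefficient
    vector: 1 for a uniform (dense) vector, small for a sparse one."""
    c = np.sort(np.abs(alpha))               # ascending magnitudes
    N = len(c)
    weights = (N - np.arange(1, N + 1) + 0.5) / N
    gi = 1.0 - 2.0 * np.sum((c / c.sum()) * weights)
    return 1.0 - gi
```

A fully dense, uniform coefficient vector scores exactly 1 under this convention, matching the value reported for the non-sparse models.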
4.1 Illustration on toy dataset
To show how our proposed model works, we plot it on a synthetic two-dimensional dataset generated using the mvnrnd function in Matlab. We take the mean of class A as \([0.1, -1.8]^{T}\) and the mean of class B as \([-0.1, 1.8]^{T}\); the covariance matrices are [0.3, 0.2; 0.2, 0.3] and \([0.3, -0.2; -0.2, 0.3]\) for the two classes, respectively, and each class contains 200 samples. Referring to Fig. 1, it is easy to observe the differences among the various two-hyperplane classifiers. Our proposed model does not run through the centre of each class; rather, it runs along the margin of the class such that the sum of projections of the respective class is minimised. Most of the samples lie on one side of the hyperplane, but the trade-off between the second and third terms in Eq. (9) forces some noisy samples to lie on the opposite side (note that we have kept the parameter of the third term higher than that of the second).
The separation can be improved by significantly increasing the value of \(c_{1}\), which allows the hyperplane to focus more on the separation between the classes. In fact, if \(c_{1} \gg v_{1}\), the hyperplanes become parallel and PN-SVM effectively behaves as the conventional SVM.
Figure 2 shows how increasing the value of \(c_{1}\) changes the hyperplanes and the results for PN-SVM, so it is very important to select the parameters correctly.
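The synthetic dataset used in this illustration can be reproduced outside Matlab as well; the following is a NumPy equivalent of the mvnrnd calls (the random seed is ours):

```python
import numpy as np

rng = np.random.default_rng(42)
mu_a, mu_b = np.array([0.1, -1.8]), np.array([-0.1, 1.8])
cov_a = np.array([[0.3, 0.2], [0.2, 0.3]])
cov_b = np.array([[0.3, -0.2], [-0.2, 0.3]])

# 200 samples per class, as in the illustration above
A = rng.multivariate_normal(mu_a, cov_a, size=200)
B = rng.multivariate_normal(mu_b, cov_b, size=200)
```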
4.2 Illustration on noise-corrupted datasets
Figure 3 shows the effect of noise on the classification accuracy of the two comparison models, PN-SVM and Pin-SVM. For a clearer comparison, the natural log of the testing accuracy is reported. The datasets are corrupted by adding noise generated with the mvnrnd function in Matlab, with the dataset mean and covariance as inputs; a given percentage of the samples is then corrupted with this generated noise. From the plots, it is evident that PN-SVM behaves approximately like Pin-SVM in the presence of noise, and a comparison between their optimisations points toward a functional analogy.
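The corruption procedure described above can be sketched as follows; this is our reconstruction of the described steps, and the function name and default fraction are ours:

```python
import numpy as np

def corrupt(X, fraction=0.1, seed=0):
    """Corrupt a given fraction of samples with noise drawn from a
    multivariate normal fitted to the data's mean and covariance."""
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.choice(n, size=int(fraction * n), replace=False)
    noise = rng.multivariate_normal(X.mean(0), np.cov(X.T), size=len(idx))
    Xc = X.copy()
    Xc[idx] += noise            # only the selected rows are perturbed
    return Xc
```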
4.3 Datasets
The datasets are selected with diversity of size in mind. The characteristics of the datasets (instances \(\times\) features) are described alongside the results. The datasets used are obtained from the UCI machine learning repository (Asuncion & Newman, 2007) and Gunnar Rätsch's repository (Diethe, 2015).
4.4 Results for linear models
Table 1 shows the linear results for our proposed model PN-SVM against the comparison models. Compared with TSVM and SVM, PN-SVM needs more training time; on the other hand, it has comparable performance with these two and in many cases outputs higher accuracy. Unlike TSVM, PN-SVM introduces sparsity into its framework, comparable to the sparsity of SVM, and in many cases it is superior in this regard.
In comparison with the pinball loss models, i.e. Pin-SVM, Pin-TSVM and Mod-Pin-SVM, the time efficiency of PN-SVM is clearly visible. Since the pinball loss adds another set of constraints, it makes these models slower. Our model is not the first attempt to introduce sparsity into pinball loss based models; however, it can be regarded as a better and simpler model in terms of training time, accuracy and sparsity, with superior sparsity compared to Mod-Pin-SVM. Our model inherits its sparsity from the SVM framework, while noise resilience is granted by minimising the mean of the respective class projections.
4.5 Results for NDC datasets
To clearly demonstrate the effect of dataset size on the training time of the linear classifiers, we conduct experiments on the NDC datasets (Musicant, 1998). The characteristics of these datasets are described in Table 2 as samples \(\times\) dimension \(\times\) imbalance, where imbalance is the ratio of the number of positive-labelled samples to the number of negative-labelled samples. Table 2 shows the linear results for our proposed model PN-SVM against the comparison models. The dimension of the NDC datasets is fixed (32) while the sample size is varied, and for all the datasets the model parameters are fixed, in order to study the change in training time with increasing data size. It can be seen that as the imbalance ratio increases, the Pin-TSVM model becomes more time-consuming; in fact, Fig. 4 shows that its curve grows the fastest among all the models. It is also clear that the pinball loss models are very heavy on training time compared with our proposed PN-SVM. Among the remaining three, TSVM is the fastest, followed by SVM and PN-SVM. It is therefore important to improve upon the time requirement of the pinball models.
4.6 Results for non-linear models
Table 3 shows the kernelised results for our proposed model PN-SVM against the comparing models. Due to the higher computational requirements of the pinball models, we experiment on medium-sized datasets. PN-SVM outputs comparable results and achieves superior sparsity. Similar to the observations in the linear case, the training time of our model is higher than that of SVM and TSVM, but still lower than that of the pinball models.
4.7 Statistical analysis
To evaluate the results more rigorously, we use a ranking criterion. On each dataset, ranks are allotted to the algorithms according to their performance with respect to a metric: the best-performing model is ranked 1, the second best is ranked 2, and so on. If two algorithms obtain the same result, they are given the same rank. The mean rank, obtained by averaging an algorithm's ranks over all datasets for a given metric, is reported in Tables 4 and 5. It can be seen that the proposed algorithm PN-SVM achieves the best mean rank among the comparing algorithms for both the linear and the non-linear classifiers.
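The ranking scheme described above can be sketched in a few lines. The accuracy values below are purely illustrative, not taken from the paper's tables, and ties are resolved by giving all tied algorithms the best shared rank, which is one reasonable reading of the scheme:

```python
def rank_min(values):
    """Rank 1 = best (highest value); tied values share the same rank."""
    s = sorted(values, reverse=True)
    first_pos = {}
    for i, v in enumerate(s):
        first_pos.setdefault(v, i + 1)  # position of first occurrence
    return [first_pos[v] for v in values]

# Illustrative accuracies: rows are datasets, columns are algorithms.
accuracies = [
    [0.91, 0.89, 0.91, 0.87],
    [0.85, 0.86, 0.88, 0.84],
    [0.78, 0.78, 0.80, 0.79],
]

ranks = [rank_min(row) for row in accuracies]
# Mean rank per algorithm, averaged over datasets; lower is better.
mean_ranks = [sum(col) / len(ranks) for col in zip(*ranks)]
print(mean_ranks)
```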
5 Conclusions
In this paper, we have proposed the PN-SVM model for binary classification, which effectively addresses sparsity and noise resilience by reformulating the SVM optimisation framework to minimise the average projections of the respective classes. Results on benchmark datasets demonstrate the efficacy of the proposed model against the comparing approaches. Furthermore, PN-SVM is faster and considerably simpler than the pinball loss models. Since PN-SVM is structurally similar to SVM, fast SVM-type training approaches can be adapted to our model. It would also be interesting to develop its extension to multiclass classification.
Code availability
The code for the proposed model can be made available on request from the corresponding author.
References
Asuncion, A., & Newman, D. (2007). UCI machine learning repository.
Chen, W. J., Shao, Y. H., Li, C. N., et al. (2020). \(\nu\)-projection twin support vector machine for pattern classification. Neurocomputing, 376, 10–24.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273–297.
Diethe, T. (2015). 13 benchmark datasets derived from the UCI, DELVE and STATLOG repositories. https://github.com/tdiethe/gunnar_raetsch_benchmark_datasets/, https://doi.org/10.5281/zenodo.18110
Hao, L., & Lewin, P. (2010). Partial discharge source discrimination using a support vector machine. IEEE Transactions on Dielectrics and electrical Insulation, 17(1), 189–197.
Hao, P. Y. (2010). New support vector algorithms with parametric insensitive/margin model. Neural Networks, 23(1), 60–73.
Huang, X., Shi, L., & Suykens, J. (2014). Support vector machine classifier with pinball loss. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2013.178
Hurley, N., & Rickard, S. (2009). Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10), 4723–4741.
Jayadeva. (2015). Learning a hyperplane classifier by minimizing an exact bound on the VC dimension. Neurocomputing, 149, 683–689.
Jung, H. G., & Kim, G. (2013). Support vector number reduction: Survey and experimental evaluations. IEEE Transactions on Intelligent Transportation Systems, 15(2), 463–476.
Khemchandani, R., Chandra, S., et al. (2007). Twin support vector machines for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 905–910.
Kwak, N. (2008). Principal component analysis based on l1-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9), 1672–1680.
Li, C. N., Shao, Y. H., & Deng, N. Y. (2016). Robust l1-norm non-parallel proximal support vector machine. Optimization, 65(1), 169–183. https://doi.org/10.1080/02331934.2014.994627
Mangasarian, O. L., & Wild, E. W. (2005). Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1), 69–74.
MATLAB. (2012). Natick, MA: The MathWorks Inc.
Musicant, D. R. (1998). NDC: Normally distributed clustered datasets. www.cs.wisc.edu/dmi/svm/ndc/
Nalepa, J., & Kawulok, M. (2019). Selecting training sets for support vector machines: A review. Artificial Intelligence Review, 52(2), 857–900.
Peng, X. (2011). Tpmsvm: A novel twin parametric-margin support vector machine for pattern recognition. Pattern Recognition, 44(10–11), 2678–2692.
Peng, X., Xu, D., Kong, L., et al. (2016). L1-norm loss based twin support vector machine for data recognition. Information Sciences. https://doi.org/10.1016/j.ins.2016.01.023.
Rastogi, R., Pal, A., & Chandra, S. (2018). Generalized pinball loss SVMs. Neurocomputing, 322, 151–165.
Shao, Y. H., Zhang, C. H., Wang, X. B., et al. (2011). Improvements on twin support vector machines. IEEE Transactions on Neural Networks, 22(6), 962–968.
Shao, Y. H., Chen, W. J., & Deng, N. Y. (2014). Nonparallel hyperplane support vector machine for binary classification problems. Information Sciences, 263, 22–35.
Subasi, A. (2013). Classification of EMG signals using PSO optimized SVM for diagnosis of neuromuscular disorders. Computers in Biology and Medicine, 43(5), 576–586.
Tanveer, M., Rajani, T., Rastogi, R., et al. (2022). Comprehensive review on twin support vector machines. Annals of Operations Research, 1–46.
Tian, Y., Qi, Z., Ju, X., et al. (2013). Nonparallel support vector machines for pattern classification. IEEE Transactions on Cybernetics, 44(7), 1067–1079.
Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
Wang, Z., Shao, Y. H., Bai, L., et al. (2015). Twin support vector machine for clustering. IEEE Transactions on Neural Networks and Learning Systems, 26(10), 2583–2588.
Xu, Y., Yang, Z., & Pan, X. (2016). A novel twin support-vector machine with pinball loss. IEEE Transactions on Neural Networks and Learning Systems, 28(2), 359–370.
Yin, H., Jiao, X., Chai, Y., et al. (2015). Scene classification based on single-layer SAE and SVM. Expert Systems with Applications, 42(7), 3368–3380.
Acknowledgements
The authors of this manuscript would like to express their gratitude to Dr S. Chandra for his valuable feedback and to Dr Aman Pal for providing the Matlab codes for Mod-Pin-SVM and Pin-SVM. We would also like to thank South Asian University for the monthly stipend to Sambhav Jain, which helped in conducting this research.
Funding
No funding was received for conducting this study.
Author information
Contributions
SJ: Conceptualization, data curation, formal analysis, investigation, methodology, software, supervision, validation, visualization, writing - original draft, writing - review and editing. RR: Conceptualization, formal analysis, investigation, methodology, software, supervision, validation, visualization, writing - review and editing.
Ethics declarations
Conflict of interest
There are no conflicts of interest in this study.
Ethics approval
This research paper does not involve any studies with human participants or animals performed by any authors.
Additional information
Editors: Yu-Feng Li, Prateek Jain.
Cite this article
Jain, S., Rastogi, R. Parametric non-parallel support vector machines for pattern classification. Mach Learn 113, 1567–1594 (2024). https://doi.org/10.1007/s10994-022-06238-0