Abstract
This paper presents an efficient and robust Large-margin Distribution Machine formulation for regression. The proposed model is termed as ‘Large-margin Distribution Machine-based Regression’ (LDMR) model, and it is in the spirit of Large-margin Distribution Machine (LDM) (Zhang and Zhou, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014) classification model. The LDM model optimizes the margin distribution instead of minimizing a single-point margin as is done in the traditional SVM. The optimization problem of the LDMR model has been mathematically derived from the optimization problem of the LDM model using an interesting result of Bi and Bennett (Neurocomputing 55(1):79–108, 2003). The resulting LDMR formulation attempts to minimize the \(\epsilon\)-insensitive loss function and the quadratic loss function simultaneously. Further, the successive over-relaxation technique (Mangasarian and Musicant, IEEE Trans Neural Netw 10(5):1032−1037, 1999) has also been applied to speed up the training procedure of the proposed LDMR model. The experimental results on artificial datasets, UCI datasets and time-series financial datasets show that the proposed LDMR model owns better generalization ability than other existing models and is less sensitive to the presence of outliers.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Support vector machine (SVM) is one of the most popular machine learning algorithms (Cortes and Vapnik [1]; Burges [2]; Cherkassky and Mulier [3]; Vapnik [4]). SVM has emerged from the research in statistical learning theory on how to regulate the trade-off between the structural complexity and empirical risk. It has outperformed the other existing tools in a wide variety of applications. Some of these applications can be found in Osuna et al. [5], Joachims [6], Schlkopf et al. [7] and Lal et al. [8]. SVM has been initially developed for the problem of pattern classification, but it has been extended to solve the problems of regression and clustering as well.
SVM classifier attempts to reduce the generalization error by maximizing the minimum margin, i.e. the minimum distance of the training points from the classification boundary (Cortes and Vapnik [1]; Burges [2]; Cherkassky and Mulier [3]; Vapnik [4]; Bradely and Mangasarian [9]). The maximization of margin in SVM classification problem amounts to the minimization of an upper bound on the VC dimension (Burges [2]; Vapnik [4]) of the classifying hyperplane. The maximum margin theory is not only relevant in the case of the SVM, but it has also been extended to interpret the good generalization ability of many other learning approaches such as AdaBoost (Freund and Schapire [10]) which is a major representative of ensemble methods (Zhou [11]). The theoretical studies which advocate the relevance of the maximum margin principle in these learning approaches can be found in Breiman [12] and Schapire et al. [13]. Moreover, some of the recent theoretical results (Reyzin and Schapire [14]; Wang et al. [15]; Gao and Zhou [16]) suggest that rather than simply considering a single-point margin, the margin distribution is important in these algorithms. Taking motivation from these studies, Teng and Zhou [17] have proposed Large-margin Distribution Machine (LDM) which tries to achieve better generalization performance by considering the margin mean as well as margin variance in the objective function of its optimization problem.
The support vector methodology has also been extended for handling the problem of regression (Vapnik et al. [18]; Drucker et al. [19]) The standard \(\epsilon \)-SVR is an \(\epsilon \)-insensitive model which sets an epsilon tube around the data points. The data points outside the epsilon tube contribute to the errors which are penalized in the objective function via a user-specified parameter. Bi and Bennett [20] have developed a geometric framework for SVR, showing that it can be related to an appropriate SVM problem. This result of Bi and Bennett [20] is very significant as it provides a classification eye to see the problem of regression.
This paper proposes an efficient Large-margin Distribution Machine-based Regression (LDMR) model which is similar in principle to LDM model for classification (Zhang and Zhou [17]). The proposed LDMR model is a more general regression model as the standard \(\epsilon \)-SVR and LS-SVR models are special cases of LDMR model with particular choice of parameters. The optimization problem of LDMR model has been derived from the LDM model for classification using a well-known result of Bi and Bennett [20]. The resulting optimization problem simultaneously minimizes the quadratic loss function used in LS-SVR (Suykens et al. [21, 22]) and \(\epsilon \)-insensitive loss function used in \(\epsilon \)-SVR. It is noteworthy that the LS-SVR model fails to predict well on noisy datasets, whereas \(\epsilon \)- SVR model totally ignores all data points which lie inside of \(\epsilon \)-tube for the determination of the regressor. This strategy makes \(\epsilon \)-SVR model sparse but does not minimize the scatter inside of the \(\epsilon \)- tube. Our proposed LDMR model aims to take advantage of both of these models. Further, an effective successive over-relaxation (SOR) (Mangasarian and Musicant [25]) technique has also been applied for the efficient solution of the LDMR problem to reduce its training time complexity. Experimental results on the artificial datasets, UCI benchmark datasets (Blake and Merz [28]) and time-series datasets show that the LDMR model owns better generalization ability than the existing SVR models.
Taking forward the arguments of Huang et al. [29] to the regression analogue, it makes sense that apart form the sparsity, we should also minimize the scatter of data points which lie inside the \(\epsilon \)-tube. Therefore, it is meaningful to have both of these loss functions in the proposed formulation so as to do a trade-off between sparsity and scatter minimization. It also enables the proposed model to utilize the full information of the training set and avoid over-fitting simultaneously.
We now briefly describe notations used in the rest of this paper. All vectors will be taken as column vectors, unless it has been specified otherwise. For any vector \(x \in R^{n}\), ||x|| will denote the \(l_2\) norm. A vector of ones of arbitrary dimension will be denoted by e. Let (A, Y) denote the training set where \(A = [A_1;A_2;\ldots ;\,A_l]\) contains the l points in \({\mathbb {R}}^n\) represented by l rows of the matrix A and \(Y = [y_1,y_2,\ldots ,y_l]\in {\mathbb {R}}^{l\times 1}\) contains the corresponding response value of the row of matrix A.
The rest of this paper is organized as follows. Section 2 discusses LDM formulation proposed by Zhang and Zhou [17]. Section 3 briefly describes SVR models. Section 4 proposes the Linear Distribution Machine-based Regression (LDMR) and its extension for the nonlinear case. Section 5 contains the mathematical derivation of the optimization problem of the LDMR from the LDM formulation (Zhang and Zhou [17]) using a result of Bi and Bennett [20]. Section 6 describes the experimental results, while Sect. 7 is devoted to the conclusions.
2 Large-margin distribution machine
Inspired by the theoretical result of Gao and Zhou [16], the LDM model (Zhang and Zhou [17]) attempts to maximize the margin mean and minimize the margin variance simultaneously for optimizing the margin distribution. The margin mean \(\bar{\gamma }\) and margin variance \(\hat{\gamma }\) are given by
Here the matrix \(X = [x_1, x_2\ldots , x_l]\) contains the given l data points in \({\mathbb {R}}^n,\) and vector \(d \in {\mathbb {R}}^{l\times 1}\) of \(\pm 1\) represents the corresponding class.
The linear LDM model finds the separating hyperplane \(w^\mathrm{T}x = 0 \) by solving the following optimization problem
\(\lambda _1\) and \(\lambda _2\) are positive parameters for trading off the margin mean and margin variance. It is also notable that the LDM formulation reduces to the SVM when \(\lambda _1\) and \(\lambda _2\) = 0 in (3).
3 Support vector regression
3.1 \(\epsilon \)-Support Vector Regression
Linear \(\epsilon \)-SVR model finds a linear function \(f(x) = w^\mathrm{T}x + b\), where \(w \in {\mathbb {R}}^n\) and \(b \in {\mathbb {R}}\). To measure the empirical risk, it uses the \(\epsilon \)-insensitive loss function
where \(|y_i-f(x_i)|_\epsilon =\) max\((0,|y_i-f(x_i)|-\epsilon )\). The \(\epsilon \)-SVR model minimizes the \(\epsilon \)-insensitive loss function with a regularization term \(\frac{1}{2}||w|| ^{2}\) in its optimization problem which is given as follows
Here \(C >0\) is the user-specified parameter that balances the trade-off between the fitting error and the flatness of the function \(f(x) =w^\mathrm{T}x+b\).
3.2 Least-squares support vector regression model
Similar to \(\epsilon \)-SVR model, least-squares support vector regression (LS-SVR) model (Suykens et al. [21, 22]) also finds a linear function \(f(x) = w^\mathrm{T}x + b\), where \(w \in {\mathbb {R}}^n\) and \(b \in {\mathbb {R}}\). The LS-SVR model minimizes the quadratic loss function
in its optimization problem along with the regularization term \(\frac{1}{2}||w||^2\). The optimization problem of the LS-SVR model can be expressed as
where \(C_1 > 0\) is a user-defined parameter.
4 Large-margin distribution machine-based regression
In this section, we propose an efficient Large-margin Distribution Machine-Based Regression (LDMR) model.
4.1 Linear LDMR model
Given the training set (A, Y), the LDMR model finds a linear function \(f(x) = w^\mathrm{T}x+ b, \) where \(w \in R^{n}\) and \(b \in R\). The proposed LDMR model minimizes the following generalized loss function
Along with the regularization term, here \(k>0\) and \(C>0\). By introducing the regularization term \(\frac{1}{2}||w||^2\) and slack variables \(\xi _1\) and \(\xi _2\), the primal form of the LDMR can be expressed as
where C, k, \(\epsilon \) and c are user-defined positive parameters.
The proposed LDMR model minimizes three terms in its optimization problem. The first term \(\frac{1}{2} w^\mathrm{T}w\) attempts to make the regressor as flat as possible. The second term \(\sum _{i=1}^{l}(y_i- f(x_i))^2\) attempts to minimize the scatter of the data points, while the third term \(\sum _{i=1}^{l}|y_i- f(x_i)|_\epsilon \) tries to have better sparsity. These three terms in the objective functions are traded off appropriately to make full use of the training set and avoid over-fitting of the data points simultaneously.
The \(\epsilon \)-SVR model only minimizes the \(\epsilon \)-insensitive loss function which ignores the error up to \(\epsilon \). The points lying on the bounding regressor \(f(x)+\epsilon \) and \(f(x)-\epsilon \) and outside the \(\epsilon \)-insensitive zone are support vectors and participate in the construction of the final regressor. The data points which lie inside the \(\epsilon \)-insensitive zone have been ignored to achieve the sparsity, but it also causes \(\epsilon \)-SVR to lose the information contained in the training set. The outliers which are supposed to reside outside the \(\epsilon \)-insensitive zone also affect the orientation and position of the regressor. On the other hand, the LS-SVR model minimizes the quadratic loss function where all of the data points are participating in the construction of the final regressor but cannot avoid over-fitting of the regressor. That is why, the LS-SVR model fails to perform well on datasets which contain much noise. The proposed LDMR model obtains better generalization ability by finding a good trade-off between the \(\epsilon \)-insensitive loss function and the quadratic loss function via the user-defined parameters c, k and C. Since the proposed LDMR model also assigns some weights to the points which are inside of the \(\epsilon \)-insensitive zone in the construction of the final regressor, it is less sensitive to the presence of outliers and therefore can avoid the over-fitting as well. In this sense, the proposed LDMR model combines the benefits of both the SVR and LS-SVR models.
The proposed LDMR model is in the true spirit of the LDM classification model. The optimization problem of the LDMR model has been mathematically derived by the optimization problem of the LDM model using a result of the Bi and Bennett [20]. Unlike SVM, which minimizes the single-point margin only, the LDM model also minimizes the margin mean and margin variance together. It makes LDM formulation insensitive towards noises. These advantages of the LDM model have also been inherited in the LDMR model.
It is also noteworthy that the proposed LDMR model is a very general model. When \(k=0,\) the primal problem (9) of proposed LDMR model reduces to primal problem (5) of \(\epsilon \)-SVR. Also, when C becomes zero in the primal problem (9) of the proposed LDMR formulation, the variables \(\xi _1~,\xi _2 \ge 0\) are no more minimized in (9) and hence can take any values. Therefore, the constraints of the optimization problem (9) do not make any sense as these are always satisfied for any value of (w,b). Thus, these constraints become redundant. So with \(C=0\), the proposed LDMR model only minimizes \(\frac{c}{2}||w||^2 + \frac{k}{2}||Y - (Aw + eb)||^2\) in its optimization problem (9) which is equivalent to solving the optimization problem (7) of the LS-SVR model with \(C_1 = \frac{k}{2}\).
In order to find the solution of the primal problem (9), we need to derive its corresponding dual problem. Let us assume \(H = [A, e]\) is a augmented matrix and \(v = \begin{bmatrix} w \\ b \end{bmatrix},\) then \(||w||^2\) can be written as \(||w||^2 = v^\mathrm{T}I_0v\), where \(I_0 = \begin{bmatrix} I&0\\&. \\&. \\ 0&\ldots 0 \end{bmatrix}\) and I is \(n\times n\) identity matrix. Now the Lagrangian function for the primal problem (9) can be given by
where \(\alpha _1 = {(\alpha _1^1,\alpha _1^2,\ldots ,\alpha _1^l)},~\alpha _2 = {(\alpha _2^1,\alpha _2^2,\ldots ,\alpha _2^l)},~\beta _1\) and \(\beta _2\) are the vector of Lagrangian multipliers. The KKT optimality conditions are given by
Using the above KKT conditions, the dual problem of the primal problem (9) can be obtained as
After obtaining the optimal value of the \(\alpha _1\) and \(\alpha _2\) from (19), we can obtain v using (10) as follows \(v = \begin{bmatrix} w \\ b \end{bmatrix} = (cI_0+kH^\mathrm{T}H) ^{-1}H^\mathrm{T}(\alpha _1-\alpha _2+Y).\)
For the given \(x \in {R}^n\), the estimated regressor is obtained as follows
4.2 Nonlinear LDMR model
The nonlinear LDMR model will seek to estimate the function \( f(x) = K(x^\mathrm{T},A^\mathrm{T})u +b\), where K is an appropriately chosen positive definite kernel.
The nonlinear LDMR model solves the following optimization problem
where C, k, \(\epsilon \) and \(c_3\) are user-supplied positive parameters. Let us assume \(G = [K(A,A^\mathrm{T}), e]\) be a augmented matrix and \(v = \begin{bmatrix} u \\ b \end{bmatrix},\) then \(||u||^2\) can be written as \(||u||^2 = v^\mathrm{T}I_0v\), where \(I_0 = \begin{bmatrix} I&0\\&. \\&. \\ 0&\ldots 0 \end{bmatrix}\) and I is \(m\times m\) identity matrix. Similar to the line of the linear case, the dual problem of the primal problem (20) can be obtained as follows
After obtaining the optimal value of the \(\alpha _1\) and \(\alpha _2\) from (21), we can obtain v as \(v = \begin{bmatrix} u \\ b \end{bmatrix} = (cI_0+kG^\mathrm{T}G) ^{-1}G^\mathrm{T}(\alpha _1-\alpha _2+Y).\) For the given \(x \in {R}^n\), the estimated regressor is obtained as follows
4.3 A fast LDMR model using successive over-relaxation technique
The dual problem of the proposed LDMR model (19) or (21) can be written in the following unified form
For example, the problem (22) becomes the problem (19) when \(\beta = \begin{bmatrix} \alpha _1 \\\alpha _2 \end{bmatrix}\)\(Q= \begin{bmatrix} H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}&-H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T} \\ -H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}&H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T} \end{bmatrix}\) and \(p = -\begin{bmatrix} Y^\mathrm{T}H(cI_0+kH^\mathrm{T}H)H^\mathrm{T}-Y^\mathrm{T}+e^\mathrm{T}\epsilon,& -Y^\mathrm{T}H(cI_0+kH^\mathrm{T}H)H^\mathrm{T}+Y^\mathrm{T}+e^\mathrm{T}\epsilon \end{bmatrix}\)
Now problem (22) can be efficiently solved by following the successive over-relaxation (SOR) technique (Mangasarian OL and Musicant DR [25]) as follows.
Algorithm 1
We choose \(t \in (0,2)\) and start with any initial value of \(\beta \) say \(\beta ^0 \in R^{n}\). After computing \(\beta _i,\) we compute
until \(||\beta _{i+1}-\beta _{i}||\) is not less than some prescribed tolerance. Here nonzero elements of \(L \in R^{l \times l}\) constitute the strictly lower triangular part of the symmetric matrix Q, and the nonzero elements of \(E\in R^{l \times l}\) constitute the diagonal of Q (Shao et al. [23]).
It should be noted that it is well justified in [25] and [26] that the iterates \(\{\beta _i\}\) converges R-linearly to the optimal solution \(\bar{\beta }\) of the problem (22).
5 LDMR: regression via LDM
In this section, we derive the optimization problem of the proposed LDMR model from the optimization problem of the LDM model by making use of a result of the Bi and Bennett [20]. It has been shown in [20] that for a given \(\bar{\epsilon } > 0\) and regression training set (A, Y), a regressor \( y=\frac{w}{-\eta }^\mathrm{T}x+\frac{b}{-\eta }\), \((\eta > 0)\) is an \(\epsilon \)-insensitive regressor if and only if the sets \(D^+\) and \(D^-\) locate on different sides of \(n+1\)-dimensional hyperplane \(w^\mathrm{T}x+ \eta y+b=0,\) respectively, where
In view of this result of Bi and Bennett [20], the regression problem is equivalent to the classification problem of sets \(D^+\) and \(D^-\) in \(R^{n+1}\). If we use the LDM methodology (Zhang and Zhou [17]) for the classification of these two sets \(D^+\) and \(D^-,\) then we can find the LDMR formulation. For this we calculate the margin mean \(\bar{\gamma }\) and margin variance \(\hat{\gamma }\) as follows
where \(w \in R^n\), \( \eta \in R\) and \(b \in R\). Now the classification of sets \(D^+\) and \(D^-\) using LDM model will result in the following QPP
Here we note that \(\eta \ne 0\) and therefore, without loss of generality, we can assume that \(\eta >0\). The constraint of (26) can then be rewritten as
Further, the objective function of (26) can also be written as
On replacing \(\epsilon := \bar{\epsilon }-\frac{1}{\eta }\), \(w:= (\frac{w}{-\eta })\), \(b:=(\frac{b}{-\eta })\), \(\xi _i: = \frac{\xi _i}{\eta }\), \(\xi _i^*: = \frac{\xi ^*_i}{\eta }, C:=\frac{C}{\eta }\) and \(\lambda := \frac{\lambda }{\eta }\) and noting that \(\eta \ge 0,\) the objective function of the optimization problem (26) can be rewritten as
Also the constraints of the optimization problem (26) can be expressed as
We further need to show that \(\epsilon = \bar{\epsilon }-\frac{1}{\eta } \) is always non-negative. We prove this assertion as follows.
Let (\({\bar{w}}\), \(\bar{\eta }\), \(\bar{\xi }\)\(\bar{\xi ^*}\)) be the solution of QPP (26) which finds the classifier for the sets \(D^+\) and \(D^-\). There would always exists an index j such that
Adding (28) and (29), we get \(\bar{\epsilon } \ge \frac{1}{\eta }\), which proves that \(\epsilon = \bar{\epsilon }-\frac{1}{\eta } \ge 0.\)
Now for \(\eta > 0 \), the classifier \(w^\mathrm{T}x+ \eta y +b =0\) for the classes \(D^+\) and \(D^-\) gives the regressor \(y = (w^\mathrm{T}x+ b)\) with \(w:= \left( \frac{w}{-\eta }\right) \), \(b:=\left( \frac{b}{-\eta }\right) \). Since constraints in (27) do not involve \(\eta \), \(\eta \) does not play any role now in the determination of the regressor. Therefore, problem (26) becomes
Also, since \(\epsilon \) has been taken as constant, it can be removed from the objective function of (30). Further, after replacing \(k:= 2\lambda _2\frac{1}{l}\), the problem (30) can be written in the vector form as follows
6 Experimental results
We have performed a number of experiments to verify the efficacy of the our proposed LDMR model. For this, we have compared the LDMR model with existing SVR models, namely standard SVR, SMO-based SVR, LS-SVR and \(L_1\)-Norm SVR (Tanveer [24]) on certain artificial datasets, UCI benchmark datasets (Blake and Merz [28]) and time-series financial datasets.
All the simulations have been performed in MATLAB 12.0 environment (http://in.mathworks.com/) on Intel XEON processor with 16.0 GB RAM. The \(L_1\)-Norm SVR model has been solved by using the ‘linprog’ function of the MATLAB. For small- and medium-scale datasets, the proposed LDMR and standard SVR models have been solved by using the ‘quadprog’ function of MATLAB. For the large-scale datasets, we have used the SOR technique (Mangasarian and Musicant [25]) to solve the proposed LDMR model efficiently, whereas the SMO (Chang and Lin [27]) method has been used for obtaining the solution of standard SVR for large-scale datasets. For this, we have downloaded its code form (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). Throughout these experiments, we have used RBF kernel \(\mathrm{exp}(\frac{-||x-y||^2}{q})\) where q is the kernel parameter.
The optimal values of parameters in SVR models have been obtained using the exhaustive search method (Hsu and Lin [30]) with cross-validation. For all SVR models, the value of the kernel parameter q has been searched in the set \(\{2^i,~ i = -10,-2,\ldots ,10\}\). The value of the parameter \(\epsilon \) in \(\epsilon \)-SVR, \(L_1\)-Norm SVR and proposed LDMR model has been searched in the set of \(\{ 0.05,0.1,0.2,0.3 \ldots ,1,1.5,2\}\). The value of the parameter k of the proposed LDMR has been fixed to 1 throughout the experiments. The value of parameters C in \(\epsilon \)-SVR, LS-SVR, \(L_1\)-Norm SVR and proposed LDMR model has been searched in the set \(\{2^i,~ i = -10,-2,\ldots ,12 \}\). The value of parameter c in \(L_1\)-Norm SVR and proposed LDMR model has also been searched in the set \(\{2^i,~ i = -10,-2,\ldots ,12 \}\).
6.1 Performance criteria
In order to evaluate the performance of the regression methods, we first summarize some commonly used evaluation criteria. Without the loss of generality, let l and k be the number of the training samples and testing samples, respectively. Furthermore, for \(i=1,2,\ldots k\), let \({y'_i}\) be the predicted value for the response value \(y_i\) and \(\bar{y}\) = \(\frac{1}{k}\sum _{i}^{k}y_i\) is the average of \(y_1, y_2,\ldots ,y_k\). The definition and significance of the some evaluation criteria have been listed as follows.
- (i)
SSE Sum of squared error of testing, which is defined as SSE = \(\sum _{i=1}^{k}(y_i-y'_i)^2\). SSE represents the fitting precision.
- (ii)
SST Sum of squared deviation of testing samples, which is defined as SST = \(\sum _{i=1}^{k}(y_i-\overline{y})^2\). SST shows the underlying variance of the testing samples.
- (iii)
SSR Sum of square deviation of the testing samples which can be explained by the estimated regressor. It is defined as SSR = \(\sum _{i=1}^{k}(y'_i-\overline{y})^2\).
- (iv)
RMSE Root mean square of the testing error, which is defined as RMSE = \(\sqrt{\frac{1}{k}\sum _{i=1}^{k}(y_i-y'_i)^2}\).
- (v)
SSE/SST SSE/SST is the ratio between the sum of the square of the testing error and sum of the square of the deviation of testing samples. In most cases, small SSE/SST means good agreement between estimations and real values.
- (vi)
SSR/SST It is the ratio between the variance obtained by the estimated regressor on testing samples and actual underlying variance of the testing samples.
6.2 Experiment 1 (artificial datasets)
To compare the performance of the proposed methods with the existing methods, we have synthesized some artificial datasets. For the training samples \((x_i,y_i)\) for \(i =1,2 ,\ldots ,l \), datasets have been generated as follows.
TYPE 1
TYPE 2
TYPE 3
TYPE 4
TYPE 5
TYPE 6
TYPE 6 dataset contains 200 training samples and 400 non-noise testing samples, while other datasets contain 100 training samples and 500 non-noise testing samples. To avoid the biased comparison, ten independent groups of noisy samples were generated randomly using MATLAB toolbox for all types of training sets.
Figures 1 and 2 illustrate the estimated function obtained by the LDMR, standard SVR, \(L_1\)-Norm SVR and LS-SVR models for TYPE 3 and TYPE 6 datasets, respectively. Table 1 shows the comparison of the proposed LDMR with SVR, \(L_1\)-Norm SVR and LS-SVR models on artificial datasets. It can be observed that the proposed LDMR, irrespective of the nature of noises present in the training set, owns always better generalization ability than other regression methods.
6.3 Experiment 2 (artificial datasets with outliers)
As compared to other regression methods, the proposed LDMR is a robust method and is less sensitive to the presence of outliers. To realize this, we have generated datasets by adding five and ten outliers points in the TYPE 3 and TYPE 6 datasets, respectively. Figure 3 shows the estimated function obtained by LDMR, standard SVR, \(L_1\)-Norm SVR and LS-SVR models on TYPE 3 dataset with outliers. For \(L_1\)-Norm SVR and standard SVR models, the optimal RMSE has been found at \(\epsilon =0.3\) but, still at this value of the \(\epsilon \), outliers lie outside of the \(\epsilon \)-insensitive zone and affect the orientation and position of the estimated function. The LDMR model reduces the effect of these outliers by also assigning some weight to the points lying inside of the \(\epsilon \)-insensitive zone in the optimization problem. Table 2 shows the comparison of the proposed LDMR, standard SVR, \(L_1\)-Norm SVR and LS-SVR models on artificial datasets with outliers. In these datasets, 200 points were used for training and 400 non-noise points were used for testing.
6.4 Experiment 3 (benchmark datasets)
We have checked the performance of the proposed methods on eight UCI benchmark datasets, namely Yacht Hydrodynamics, Concrete Slump, Pyrims, Motorcycle, NO2, Chwirut, Auto MPG and Boston Housing which are commonly used in evaluating a regression method. For the Motorcycle dataset, the criterion leave-one out (Kohavi [32]) was used to report the numerical results. For other datasets, we have used tenfold cross-validation (Duda and Hart [31]) to report the numerical results. For all the datasets, only feature vectors were normalized in the range of [0,1].
Table 3 lists the performance of LDMR, standard SVR, LS-SVR and \(L_1\)-Norm SVR using different evaluation criteria described in Sect. 6.1 for eight different UCI datasets. It can be observed that the proposed LDMR model outperforms the other existing regression methods in most of the datasets. For the statistical analysis of performance of the regression methods, their average rank has been computed using the values of SSE/SST. The obtained average ranks are summarized in Table 4. It can be observed that on an average, the proposed LDMR models obtain better ranks than existing regression methods.
For all SVR models, we have tuned their parameters to obtain their best choice in each datasets. We list the tuned parameter values of SVR models for UCI datasets in Table 5. Figure 6 shows the effect of the parameters C and c on SSE/SST values in LDMR model for Pyrims and Auto MPG datasets. It shows that the performance of the LDMR model is sensitive to the choice of parameters c and C.
6.5 Experiment 4 (large-scale datasets)
We have also compared the proposed model with SVR model on large--scale datasets. Since the ‘quadprog’ implementation of the proposed LDMR model and SVR model is not efficient for the large-scale datasets, we have used the SOR method in the proposed LDMR model for these datasets. For the SVR model, we have used its SMO implementation. We have downloaded Parkinsons Telemonitoring (5847 \(\times \) 22), Wine Quality Red (1599 \(\times \) 12) and Wine Quality White (4898 \(\times \) 12) datasets from the UCI Repository [28]. For all these datasets, we have normalized their input feature vectors in the range of [0,1]. For each dataset, we have fixed the number of training points and testing points and randomly permuted the data points in training set and testing set in ten different trials. The regression methods have been evaluated for each trial.
Table 6 shows the comparison of the SMO SVR and proposed SOR LDMR on large-scale datasets using different evaluation criteria. It can be observed that the proposed LDMR model with its SOR implementation outperforms the existing SVR model with SMO implementation.
6.6 Experiment 5 (time-series financial dataset)
For further evaluation of the proposed methods, we have simulated the existing as well as proposed regression methods on real-world time-series financial datasets. We have downloaded the daily stock prices of the IBM and SBI.INS for the period of 27 March 2016 to 27 March 2017 and 27 March 2015 to 27 March 2017, respectively, from the Yahoo financial website. The datasets were generated by taking the last three days closing stock prices as input feature and next day closing stock price as response value. In experiments, the first \(50\%\) data points of the datasets were used for training and rest of them have been used for the testing. The input features were normalized in the range of [\(-1,1\)].
Table 7 lists the comparison of the proposed methods with other existing regression methods using different evaluation criteria on IBM and SBI.INS finance datasets along with their training time. Figures 4 and 5 show the performance of the proposed regression methods along with existing regression methods on IBM and SBI.INS finance dataset, respectively. It can be easily observed in Table 5 and Figures 5 and 6 that the proposed LDMR formulations outperform the existing regression methods.
7 Conclusion
An efficient LDM-based regression model has been proposed in this paper which can be mathematically derived from the optimization problem of the LDM (Zhang and Zhou [17]) by making use of a result of Bi and Bennett [20]. The proposed LDMR model simultaneously minimizes the \(\epsilon \)-insensitive loss function as well as the quadratic loss function. In this sense, the proposed LDMR formulation combines the benefits of the LS-SVR model (Suykens et al. [21, 22]) as well as the \(\epsilon \)-SVR model. The proposed LDMR model obtains better generalization ability by finding a trade-off between the \(\epsilon \)-insensitive loss and the quadratic loss via the user-defined parameters k and C. The proposed LDMR model has also been observed to be less sensitive to the presence of outliers. Further, the application of the SOR technique (Mangaserian and Musicant [25]) significantly reduces the training time of the proposed LDMR model.
One of the major difficulties with the proposed LDMR model is that it requires more parameters to be tuned. This requires future study so as to develop a model which can autolearn parameters \(\epsilon \), c and C from the data. Further, we have observed that the SOR technique works well in the proposed LDMR formulation but, as compared to SMO method in SVR model, it takes more time to train the model. So we would like to test the SMO method for solving the optimization problem in the proposed LDMR model in future.
References
Cortes C, Vapnik V (1995) Support vector networks. Mach Learn 20(3):273–297
Burges JC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167
Cherkassky V, Mulier F (2007) Learning from data: concepts, theory and methods. Wiley, New York
Vapnik V (1998) Statistical learning theory, vol 1. Wiley, New York
Osuna E, Freund R, Girosit F (1997) Training support vector machines: an application to face detection. In: Proceedings of IEEE computer vision and pattern recognition, San Juan, Puerto Rico, pp 130–136
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features, European conference on machine learning. Springer, Berlin
Schlkopf B, Tsuda K, Vert JP (2004) Kernel methods in computational biology. MIT Press, Cambridge
Lal TN, Schroder M, Hinterberger T, Weston J, Bogdan M, Birbaumer N, Scholkopf B (2004) Support vector channel selection in BCI. IEEE Trans Biomed Eng 51(6):10031010
Bradley P, Mangasarian OL (2000) Massive data discrimination via linear support vector machines. Optim Methods Softw 13(1):1–10
Freund Y, Schapire RE (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In: Proceedings of the 2nd European conference on computational learning theory, Barcelona, Spain, p. 2337
Zhou ZH (2012) Ensemble methods: foundations and algorithms. CRC Press, Boca Raton
Breiman L (1999) Prediction games and arcing classifiers. Neural Comput 11(7):14931517
Schapire RE, Freund Y, Bartlett PL, Lee WS (1998) Boosting the margin: a new explanation for the effectives of voting methods. Annu Stat 26(5):16511686
Reyzin L, Schapire RE (2006) How boosting the margin can also boost classifier complexity. In: Proceedings of 23rd international conference on machine learning, Pittsburgh, PA, p 753–760
Wang L, Sugiyama M, Yang C, Zhou ZH, Feng J (2008) On the margin explanation of boosting algorithm. In: Proceedings of the 21st annual conference on learning theory, Helsinki, Finland, p 479–490
Gao W, Zhou ZH (2013) On the doubt about margin explanation of boosting. Artif Intell 199–200:2244
Zhang T, Zhou ZH (2014) Large margin distribution machine. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM
Vapnik V, Golowich SE, Smola AJ (1997) Support vector method for function approximation, regression estimation and signal processing. In: Mozer M, Jordan M, Petsche T (eds) Advances in neural information processing systems. MIT Press, Cambridge, pp 281–287
Drucker H, Burges CJ, Kaufman L, Smola AJ, Vapnik V (1997) Support vector regression machines. In: Mozer MC, Jordan MI, Petsche T (eds) Advances in neural information processing systems. MIT Press, Cambridge, pp 155–161
Bi J, Bennett KP (2003) A geometric approach to support vector regression. Neurocomputing 55(1):79–108
Suykens JAK, Lukas L, van Dooren P, De Moor B, Vandewalle J (1999) Least squares support vector machine classifiers: a large scale algorithm. In: Proceedings of European conference of circuit theory design, pp 839–842
Suykens JAK, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3):293300
Shao YH, Zhang C, Yang Z, Deng N (2013) An \(\epsilon \)-twin support vector machine for regression. Neural Comput Appl 23(1):175–185
Tanveer M, Mangal M, Ahmad I, Shao YH (2016) One norm linear programming support vector regression. Neurocomputing 173:1508–1518
Mangasarian OL, Musicant DR (1999) Successive overrelaxation for support vector machines. IEEE Trans Neural Netw 10(5):1032–1037
Luo ZQ, Tseng P (1993) Error bounds and convergence analysis of feasible descent methods: a general approach. Ann Oper Res 46(1):157–178
Chang CC, Lin CJ (2011) LIBSVM, a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):27
Blake CI, Merz CJ (1998) UCI repository for machine learning databases [http://www.ics.uci.edu/*mlearn/MLRepository.html]
Huang X, Shi L, Suykens JA (2014) Support vector machine classifier with pinball loss. IEEE Trans Pattern Anal Mach Intell 36(5):984–997
Hsu CW, Lin CJ (2002) A comparison of methods for multi class support vector machines. IEEE Trans Neural Netw 13:415–425
Duda RO, Hart PR, Stork DG (2001) Pattern classification, 2nd edn. Wiley, Hoboken
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Ijcai, Vol. 14, No. 2
Acknowledgements
We would like to thank the learned referees for their valuable comments and suggestions which has substantially improved the contents and presentation of the manuscript. We would also like to acknowledge Ministry of Electronics and Information Technology, Government of India, as this work has been funded by them under Visvesvaraya Ph.D. Scheme for Electronics and IT, Order No. Phd-MLA/4(42)/2015-16.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations
Rights and permissions
About this article
Cite this article
Rastogi, R., Anand, P. & Chandra, S. Large-margin Distribution Machine-based regression. Neural Comput & Applic 32, 3633–3648 (2020). https://doi.org/10.1007/s00521-018-3921-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-018-3921-3