1 Introduction

The support vector machine (SVM) is one of the most popular machine learning algorithms (Cortes and Vapnik [1]; Burges [2]; Cherkassky and Mulier [3]; Vapnik [4]). SVM has emerged from research in statistical learning theory on how to regulate the trade-off between structural complexity and empirical risk. It has outperformed other existing tools in a wide variety of applications. Some of these applications can be found in Osuna et al. [5], Joachims [6], Schölkopf et al. [7] and Lal et al. [8]. SVM was initially developed for the problem of pattern classification, but it has been extended to solve the problems of regression and clustering as well.

The SVM classifier attempts to reduce the generalization error by maximizing the minimum margin, i.e. the minimum distance of the training points from the classification boundary (Cortes and Vapnik [1]; Burges [2]; Cherkassky and Mulier [3]; Vapnik [4]; Bradley and Mangasarian [9]). The maximization of the margin in the SVM classification problem amounts to the minimization of an upper bound on the VC dimension (Burges [2]; Vapnik [4]) of the classifying hyperplane. The maximum margin theory is not only relevant in the case of the SVM, but it has also been extended to interpret the good generalization ability of many other learning approaches such as AdaBoost (Freund and Schapire [10]), which is a major representative of ensemble methods (Zhou [11]). The theoretical studies which advocate the relevance of the maximum margin principle in these learning approaches can be found in Breiman [12] and Schapire et al. [13]. Moreover, some of the recent theoretical results (Reyzin and Schapire [14]; Wang et al. [15]; Gao and Zhou [16]) suggest that rather than simply considering a single-point margin, the margin distribution is important in these algorithms. Taking motivation from these studies, Zhang and Zhou [17] have proposed the Large-margin Distribution Machine (LDM), which tries to achieve better generalization performance by considering the margin mean as well as the margin variance in the objective function of its optimization problem.

The support vector methodology has also been extended for handling the problem of regression (Vapnik et al. [18]; Drucker et al. [19]). The standard \(\epsilon \)-SVR is an \(\epsilon \)-insensitive model which sets an \(\epsilon \)-tube around the data points. The data points outside the \(\epsilon \)-tube contribute to the errors, which are penalized in the objective function via a user-specified parameter. Bi and Bennett [20] have developed a geometric framework for SVR, showing that it can be related to an appropriate SVM problem. This result of Bi and Bennett [20] is very significant as it provides a classification perspective on the problem of regression.

This paper proposes an efficient Large-margin Distribution Machine-based Regression (LDMR) model which is similar in principle to the LDM model for classification (Zhang and Zhou [17]). The proposed LDMR model is a more general regression model, as the standard \(\epsilon \)-SVR and LS-SVR models are special cases of the LDMR model for particular choices of parameters. The optimization problem of the LDMR model has been derived from the LDM model for classification using a well-known result of Bi and Bennett [20]. The resulting optimization problem simultaneously minimizes the quadratic loss function used in LS-SVR (Suykens et al. [21, 22]) and the \(\epsilon \)-insensitive loss function used in \(\epsilon \)-SVR. It is noteworthy that the LS-SVR model fails to predict well on noisy datasets, whereas the \(\epsilon \)-SVR model totally ignores all data points which lie inside the \(\epsilon \)-tube for the determination of the regressor. This strategy makes the \(\epsilon \)-SVR model sparse but does not minimize the scatter inside the \(\epsilon \)-tube. Our proposed LDMR model aims to take advantage of both of these models. Further, an effective successive over-relaxation (SOR) technique (Mangasarian and Musicant [25]) has also been applied for the efficient solution of the LDMR problem to reduce its training time. Experimental results on artificial datasets, UCI benchmark datasets (Blake and Merz [28]) and time-series datasets show that the LDMR model has better generalization ability than the existing SVR models.

Taking forward the arguments of Huang et al. [29] to the regression analogue, it makes sense that, apart from sparsity, we should also minimize the scatter of the data points which lie inside the \(\epsilon \)-tube. Therefore, it is meaningful to have both of these loss functions in the proposed formulation so as to trade off sparsity against scatter minimization. It also enables the proposed model to utilize the full information of the training set and avoid over-fitting simultaneously.

We now briefly describe the notation used in the rest of this paper. All vectors will be taken as column vectors, unless specified otherwise. For any vector \(x \in R^{n}\), ||x|| will denote the \(l_2\) norm. A vector of ones of arbitrary dimension will be denoted by e. Let \((A, Y)\) denote the training set, where \(A = [A_1;A_2;\ldots ;\,A_l]\) contains the l points in \({\mathbb {R}}^n\) represented by the l rows of the matrix A, and \(Y = [y_1,y_2,\ldots ,y_l]\in {\mathbb {R}}^{l\times 1}\) contains the corresponding response values of the rows of matrix A.

The rest of this paper is organized as follows. Section 2 discusses the LDM formulation proposed by Zhang and Zhou [17]. Section 3 briefly describes the SVR models. Section 4 proposes the Large-margin Distribution Machine-based Regression (LDMR) model and its extension to the nonlinear case. Section 5 contains the mathematical derivation of the optimization problem of the LDMR from the LDM formulation (Zhang and Zhou [17]) using a result of Bi and Bennett [20]. Section 6 describes the experimental results, while Sect. 7 is devoted to the conclusions.

2 Large-margin distribution machine

Inspired by the theoretical result of Gao and Zhou [16], the LDM model (Zhang and Zhou [17]) attempts to maximize the margin mean and minimize the margin variance simultaneously for optimizing the margin distribution. The margin mean \(\bar{\gamma }\) and margin variance \(\hat{\gamma }\) are given by

$$\begin{aligned}&\bar{\gamma } = \frac{1}{l}\sum _{i=1}^{l} d_i\left( w^\mathrm{T}x_i\right) = \frac{1}{l}d^\mathrm{T}Xw, \end{aligned}$$
(1)
$$\begin{aligned}&\hat{\gamma } = \frac{1}{l} \sum _{i=1}^{l} \left( d_i(w^\mathrm{T}x_i)-\bar{\gamma }\right) ^2. \end{aligned}$$
(2)

Here the matrix \(X = [x_1; x_2;\ldots ; x_l]\) contains the given l data points in \({\mathbb {R}}^n\) as its rows, and the vector \(d \in {\mathbb {R}}^{l\times 1}\) with entries \(\pm 1\) represents the corresponding class labels.
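As a small illustration (not part of the original LDM formulation), the two statistics in (1) and (2) can be computed directly from the data; the following NumPy sketch assumes X stores the training points row-wise, as above.

```python
import numpy as np

def margin_statistics(X, d, w):
    """Margin mean (1) and margin variance (2) of a linear scorer w^T x.

    X : (l, n) array whose rows are the training points,
    d : (l,) array of +/-1 class labels,
    w : (n,) normal vector of the hyperplane w^T x = 0.
    """
    margins = d * (X @ w)                              # d_i (w^T x_i) for every point
    gamma_bar = margins.mean()                         # margin mean, Eq. (1)
    gamma_hat = ((margins - gamma_bar) ** 2).mean()    # margin variance, Eq. (2)
    return gamma_bar, gamma_hat
```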

The linear LDM model finds the separating hyperplane \(w^\mathrm{T}x = 0 \) by solving the following optimization problem

$$\begin{aligned}&\min _{w, \xi }\, \frac{1}{2} w^\mathrm{T}w + \lambda _1\hat{\gamma } - \lambda _2\bar{\gamma } + C\sum _{i=1}^{l}\xi _i \nonumber \\&{\text{ subject } \text{ to, }}\nonumber \\&\qquad d_i(w^\mathrm{T}x_i) \ge 1-\xi _i,~ \xi _i \ge 0,~ i = 1,2,\ldots ,l, \end{aligned}$$
(3)

Here \(\lambda _1\) and \(\lambda _2\) are positive parameters for trading off the margin variance and the margin mean, respectively. It is also notable that the LDM formulation (3) reduces to the standard SVM when \(\lambda _1 = \lambda _2 = 0\).

3 Support vector regression

3.1 \(\epsilon \)-Support Vector Regression

The linear \(\epsilon \)-SVR model finds a linear function \(f(x) = w^\mathrm{T}x + b\), where \(w \in {\mathbb {R}}^n\) and \(b \in {\mathbb {R}}\). To measure the empirical risk, it uses the \(\epsilon \)-insensitive loss function

$$\begin{aligned} R^{\epsilon }_\mathrm{emp} = \frac{1}{l}\sum _{i=1}^{l}|y_i-f(x_i)|_\epsilon , \end{aligned}$$
(4)

where \(|y_i-f(x_i)|_\epsilon = \max (0,\,|y_i-f(x_i)|-\epsilon )\). The \(\epsilon \)-SVR model minimizes the \(\epsilon \)-insensitive loss function together with a regularization term \(\frac{1}{2}||w||^{2}\); its optimization problem is given as follows

$$\begin{aligned}&\min _{_{w,b,\xi ,\xi ^{*}}} ~ ~ \frac{1}{2}||w|| ^{2} + C \sum \limits _{i=1}^l (\xi _i + \xi _i^{*})\nonumber \\&{\hbox {subject to,}} \nonumber \\&\quad y_i - (A_iw +b) \le \epsilon + \xi _i, ( i=1,2,\ldots ,l), \nonumber \\&\quad (A_iw +b) -y_i \le \epsilon +\xi _i^{*}, (i=1,2,\ldots ,l ),\nonumber \\&\quad \xi _{i}\ge 0, ~ \xi _i^{*}\ge 0 ~~, (i=1,2,\ldots ,l). \end{aligned}$$
(5)

Here \(C >0\) is the user-specified parameter that balances the trade-off between the fitting error and the flatness of the function \(f(x) =w^\mathrm{T}x+b\).
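For illustration only (the paper's experiments use MATLAB), the \(\epsilon \)-insensitive risk (4) and the objective value of (5) can be evaluated as follows; for a fixed (w, b) the optimal slacks reduce to \(\max (0, |y_i - f(x_i)| - \epsilon )\).

```python
import numpy as np

def eps_insensitive_risk(y, f_x, eps):
    """Empirical eps-insensitive risk of Eq. (4)."""
    return np.maximum(0.0, np.abs(y - f_x) - eps).mean()

def eps_svr_objective(w, b, A, y, C, eps):
    """Objective of (5) evaluated at (w, b), with the slacks at their optimal values."""
    slack = np.maximum(0.0, np.abs(y - (A @ w + b)) - eps)   # xi_i + xi_i^*
    return 0.5 * w @ w + C * slack.sum()
```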

3.2 Least-squares support vector regression model

Similar to \(\epsilon \)-SVR model, least-squares support vector regression (LS-SVR) model (Suykens et al. [21, 22]) also finds a linear function \(f(x) = w^\mathrm{T}x + b\), where \(w \in {\mathbb {R}}^n\) and \(b \in {\mathbb {R}}\). The LS-SVR model minimizes the quadratic loss function

$$\begin{aligned} R_\mathrm{emp} = \frac{1}{l}\sum _{i=1}^{l}(y_i-f(x_i))^2, \end{aligned}$$
(6)

in its optimization problem along with the regularization term \(\frac{c}{2}||w||^2\), \(c>0\). The optimization problem of the LS-SVR model can be expressed as

$$\begin{aligned}&\min _{w,b,\xi } ~ ~ \frac{c}{2}||w||^{2} + C_1 \sum \limits _{i=1}^l(\xi _i^2)\nonumber \\&{\hbox {subject to,}} \nonumber \\&\qquad y_i - (A_iw +b) = \xi _i, ( i=1,2,\ldots ,l), \end{aligned}$$
(7)

where \(c > 0\) and \(C_1 > 0\) are user-defined parameters.
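Since the constraints of (7) are equalities, they can be eliminated, which reduces LS-SVR to a regularized least-squares problem with a closed-form solution. The sketch below is an editorial illustration of this reduction (using the augmented matrix \([A,\, e]\) and the stacked variable \([w;\, b]\)); it is not the implementation of [21, 22].

```python
import numpy as np

def lssvr_fit(A, y, c, C1):
    """Closed-form solution of the linear LS-SVR primal (7).

    Eliminating xi_i = y_i - (A_i w + b) turns (7) into
        min_v  (c/2) v^T I0 v + C1 ||y - H v||^2,   H = [A, e],  v = [w; b],
    whose stationarity condition is (c I0 + 2 C1 H^T H) v = 2 C1 H^T y.
    """
    l, n = A.shape
    H = np.hstack([A, np.ones((l, 1))])          # augmented matrix [A, e]
    I0 = np.eye(n + 1)
    I0[-1, -1] = 0.0                             # the bias b is not regularized
    v = np.linalg.solve(c * I0 + 2.0 * C1 * H.T @ H, 2.0 * C1 * H.T @ y)
    return v[:-1], v[-1]                         # (w, b)
```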

4 Large-margin distribution machine-based regression

In this section, we propose an efficient Large-margin Distribution Machine-Based Regression (LDMR) model.

4.1 Linear LDMR model

Given the training set \((A, Y)\), the LDMR model finds a linear function \(f(x) = w^\mathrm{T}x+ b, \) where \(w \in R^{n}\) and \(b \in R\). The proposed LDMR model minimizes the following generalized loss function

$$\begin{aligned} R_\mathrm{emp}^f = \frac{k}{2}\sum _{i=1}^{l}(y_i -f(x_i))^2+ C\cdot \sum _{i=1}^{l}|y_i-f(x_i)|_\epsilon . \end{aligned}$$
(8)

where \(k>0\) and \(C>0\) are trade-off parameters. By introducing the regularization term \(\frac{c}{2}||w||^2\) and slack variables \(\xi _1\) and \(\xi _2\), the primal form of the LDMR model can be expressed as

$$\begin{aligned}&\min _{w,b,\xi _1,\xi _2} \frac{c}{2}||w||^2 + \frac{k}{2}||Y - (Aw + eb)||^2 + {C}e^\mathrm{T}(\xi _1 + \xi _2)\nonumber \\&{\text{ subject } \text{ to, }} \nonumber \\&\qquad Y - (Aw + eb) \le e\epsilon + \xi _1, \nonumber \\&\qquad ( Aw + eb)-Y \le e\epsilon + \xi _2, \nonumber \\ &\xi _1~,\xi _2 \ge 0, \end{aligned}$$
(9)

where C, k, \(\epsilon \) and c are user-defined positive parameters.

The proposed LDMR model minimizes three terms in its optimization problem. The first term \(\frac{c}{2}||w||^2\) attempts to make the regressor as flat as possible. The second term \(\sum _{i=1}^{l}(y_i- f(x_i))^2\) attempts to minimize the scatter of the data points, while the third term \(\sum _{i=1}^{l}|y_i- f(x_i)|_\epsilon \) tries to obtain better sparsity. These three terms in the objective function are traded off appropriately to make full use of the training set and avoid over-fitting of the data points simultaneously.

The \(\epsilon \)-SVR model only minimizes the \(\epsilon \)-insensitive loss function, which ignores errors of up to \(\epsilon \). The points lying on the bounding regressors \(f(x)+\epsilon \) and \(f(x)-\epsilon \) and outside the \(\epsilon \)-insensitive zone are support vectors and participate in the construction of the final regressor. The data points which lie inside the \(\epsilon \)-insensitive zone are ignored to achieve sparsity, but this also causes \(\epsilon \)-SVR to lose information contained in the training set. The outliers, which typically reside outside the \(\epsilon \)-insensitive zone, also affect the orientation and position of the regressor. On the other hand, the LS-SVR model minimizes the quadratic loss function, where all of the data points participate in the construction of the final regressor, but it cannot avoid over-fitting of the regressor. That is why the LS-SVR model fails to perform well on datasets which contain much noise. The proposed LDMR model obtains better generalization ability by finding a good trade-off between the \(\epsilon \)-insensitive loss function and the quadratic loss function via the user-defined parameters c, k and C. Since the proposed LDMR model also assigns some weight to the points which lie inside the \(\epsilon \)-insensitive zone in the construction of the final regressor, it is less sensitive to the presence of outliers and therefore can also avoid over-fitting. In this sense, the proposed LDMR model combines the benefits of both the \(\epsilon \)-SVR and LS-SVR models.

The proposed LDMR model is in the true spirit of the LDM classification model. The optimization problem of the LDMR model has been mathematically derived from the optimization problem of the LDM model using a result of Bi and Bennett [20]. Unlike the SVM, which considers only the single-point minimum margin, the LDM model also optimizes the margin mean and margin variance together. This makes the LDM formulation less sensitive to noise. These advantages of the LDM model are inherited by the LDMR model.

It is also noteworthy that the proposed LDMR model is a very general model. When \(k=0,\) the primal problem (9) of the proposed LDMR model reduces to the primal problem (5) of \(\epsilon \)-SVR. Also, when \(C = 0\) in the primal problem (9) of the proposed LDMR formulation, the slack variables \(\xi _1, \xi _2 \ge 0\) are no longer penalized and can take arbitrarily large values; the constraints of (9) are then satisfied for any value of (w, b) and become redundant. So with \(C=0\), the proposed LDMR model only minimizes \(\frac{c}{2}||w||^2 + \frac{k}{2}||Y - (Aw + eb)||^2\) in its optimization problem (9), which is equivalent to solving the optimization problem (7) of the LS-SVR model with \(C_1 = \frac{k}{2}\).
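The special cases discussed above can be seen directly from the generalized loss (8); the following small sketch (an illustration, not paper code) evaluates it and reduces to the \(\epsilon \)-SVR loss when k = 0 and to the LS-SVR loss when C = 0.

```python
import numpy as np

def ldmr_empirical_loss(y, f_x, k, C, eps):
    """Generalized LDMR loss of Eq. (8): quadratic part plus eps-insensitive part."""
    residual = y - f_x
    quad = 0.5 * k * np.sum(residual ** 2)                          # LS-SVR-type term
    eps_ins = C * np.sum(np.maximum(0.0, np.abs(residual) - eps))   # eps-SVR-type term
    return quad + eps_ins
```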

In order to find the solution of the primal problem (9), we need to derive its corresponding dual problem. Let us assume \(H = [A, e]\) is an augmented matrix and \(v = \begin{bmatrix} w \\ b \end{bmatrix},\) then \(||w||^2\) can be written as \(||w||^2 = v^\mathrm{T}I_0v\), where \(I_0 = \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix}\) and I is the \(n\times n\) identity matrix. Now the Lagrangian function for the primal problem (9) can be given by

$$\begin{aligned}&L(v,\xi _1,\xi _2,\alpha _1,\alpha _2,\beta _1,\beta _2) = \frac{c}{2}v^\mathrm{T}I_0v + \frac{k}{2}(Y-Hv)^\mathrm{T}(Y-Hv) \\&\quad + Ce^\mathrm{T}(\xi _1 + \xi _2) + \alpha _1^\mathrm{T}(Y -Hv - e\epsilon - \xi _1) \\&\quad + ~\alpha _2^\mathrm{T}(Hv-Y - e\epsilon - \xi _2) -\beta _1^\mathrm{T}\xi _1-\beta _2^\mathrm{T}\xi _2, \end{aligned}$$

where \(\alpha _1 = {(\alpha _1^1,\alpha _1^2,\ldots ,\alpha _1^l)^\mathrm{T}},~\alpha _2 = {(\alpha _2^1,\alpha _2^2,\ldots ,\alpha _2^l)^\mathrm{T}},~\beta _1\) and \(\beta _2\) are the vectors of Lagrange multipliers. The KKT optimality conditions are given by

$$\begin{aligned}&\frac{\partial L}{\partial v} = (cI_0 + kH^\mathrm{T}H)v-kH^\mathrm{T}Y- H^\mathrm{T}\alpha _1+ H^\mathrm{T}\alpha _2 = 0, \end{aligned}$$
(10)
$$\begin{aligned}&\frac{\partial L}{\partial \xi _1}=Ce -\alpha _1 -\beta _1 =0, \end{aligned}$$
(11)
$$\begin{aligned}&\frac{\partial L}{\partial \xi _2}= Ce -\alpha _2 -\beta _2 =0, \end{aligned}$$
(12)
$$\begin{aligned}&Y-Hv \le e\epsilon + \xi _1 ~, \xi _1 \ge 0, \end{aligned}$$
(13)
$$\begin{aligned}&Hv-Y \le e\epsilon + \xi _2 ~, \xi _2 \ge 0, \end{aligned}$$
(14)
$$\begin{aligned}&\alpha _1^\mathrm{T}(Y-Hv-e\epsilon -\xi _1)=0,\ \end{aligned}$$
(15)
$$\begin{aligned}&\alpha _2^\mathrm{T}(Hv-Y-e\epsilon -\xi _2)=0,\ \end{aligned}$$
(16)
$$\begin{aligned}&\beta ^\mathrm{T}_1\xi _1 = 0, \beta ^\mathrm{T}_2\xi _2 = 0, \end{aligned}$$
(17)
$$\begin{aligned}&\alpha _1 \ge 0, \alpha _2 \ge 0, \beta _1 \ge 0, \beta _2 \ge 0 . \end{aligned}$$
(18)

Using the above KKT conditions, the dual problem of the primal problem (9) can be obtained as

$$\begin{aligned}&\min _{\alpha _1,\alpha _2} \frac{1}{2}(\alpha _1- \alpha _2)^\mathrm{T}H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}(\alpha _1 - \alpha _2)+kY^\mathrm{T}H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}(\alpha _1-\alpha _2) \nonumber \\&\quad - Y^\mathrm{T}(\alpha _1-\alpha _2) + \epsilon e^\mathrm{T}(\alpha _1+\alpha _2) \nonumber \\&{\text{ subject } \text{ to, }}\nonumber \\&\quad 0 \le \alpha _1 \le Ce, \nonumber \\&\quad 0 \le \alpha _2 \le Ce. \end{aligned}$$
(19)

After obtaining the optimal values of \(\alpha _1\) and \(\alpha _2\) from (19), we can obtain v using (10) as follows \(v = \begin{bmatrix} w \\ b \end{bmatrix} = (cI_0+kH^\mathrm{T}H) ^{-1}H^\mathrm{T}(\alpha _1-\alpha _2+kY).\)

For the given \(x \in {R}^n\), the estimated regressor is obtained as follows

$$\begin{aligned} f(x) = w^\mathrm{T}x+ b \end{aligned}$$
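For concreteness, the linear LDMR training step can be sketched as follows: build \(H=[A, e]\), solve the box-constrained dual (19) with any bounded solver, and recover \(v=[w; b]\) from (10). A generic quasi-Newton solver with box constraints is used here purely as an illustration; the paper solves the dual with MATLAB's 'quadprog' or with the SOR scheme of Sect. 4.3.

```python
import numpy as np
from scipy.optimize import minimize

def ldmr_fit_linear(A, Y, c, k, C, eps):
    """Illustrative solver for the linear LDMR dual (19) with recovery of (w, b) via (10)."""
    l, n = A.shape
    H = np.hstack([A, np.ones((l, 1))])              # H = [A, e]
    I0 = np.eye(n + 1)
    I0[-1, -1] = 0.0
    M_inv = np.linalg.inv(c * I0 + k * H.T @ H)      # (c I0 + k H^T H)^{-1}
    S = H @ M_inv @ H.T

    def dual(alpha):
        a1, a2 = alpha[:l], alpha[l:]
        d = a1 - a2
        val = 0.5 * d @ S @ d + k * Y @ S @ d - Y @ d + eps * alpha.sum()
        grad_d = S @ d + k * S @ Y - Y               # gradient w.r.t. (a1 - a2)
        return val, np.concatenate([grad_d + eps, -grad_d + eps])

    res = minimize(dual, np.zeros(2 * l), jac=True,
                   bounds=[(0.0, C)] * (2 * l), method="L-BFGS-B")
    a1, a2 = res.x[:l], res.x[l:]
    v = M_inv @ H.T @ (a1 - a2 + k * Y)              # KKT condition (10)
    return v[:-1], v[-1]                             # w and b; predict with f(x) = w @ x + b
```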

4.2 Nonlinear LDMR model

The nonlinear LDMR model seeks to estimate the function \( f(x) = K(x^\mathrm{T},A^\mathrm{T})u +b\), where K is an appropriately chosen positive definite kernel.

The nonlinear LDMR model solves the following optimization problem

$$\begin{aligned}&\min _{u,b} \frac{c}{2}||u||^2 + \frac{k}{2}||Y - (K(A,A^\mathrm{T})u + eb)||^2 + {C}e^\mathrm{T}(\xi _1 + \xi _2)\nonumber \\&{\text{ subject } \text{ to, }} \nonumber \\&\quad Y - (K(A,A^\mathrm{T})u + eb) \le e\epsilon + \xi _1, \nonumber \\&\quad ( K(A,A^\mathrm{T})u + eb)-Y \le e\epsilon + \xi _2, \nonumber \\&\quad \xi _1~,\xi _2 \ge 0, \end{aligned}$$
(20)

where C, k, \(\epsilon \) and c are user-supplied positive parameters. Let us assume \(G = [K(A,A^\mathrm{T}), e]\) is an augmented matrix and \(v = \begin{bmatrix} u \\ b \end{bmatrix},\) then \(||u||^2\) can be written as \(||u||^2 = v^\mathrm{T}I_0v\), where \(I_0 = \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix}\) and I is the \(l\times l\) identity matrix. Along the lines of the linear case, the dual problem of the primal problem (20) can be obtained as follows

$$\begin{aligned}&\min _{\alpha _1,\alpha _2} \frac{1}{2}(\alpha _1- \alpha _2)^\mathrm{T}G(cI_0+kG^\mathrm{T}G)^{-1}G^\mathrm{T}(\alpha _1 - \alpha _2) + kY^\mathrm{T}G(cI_0+kG^\mathrm{T}G)^{-1}G^\mathrm{T}(\alpha _1-\alpha _2) \nonumber \\&\quad - Y^\mathrm{T}(\alpha _1-\alpha _2) + \epsilon e^\mathrm{T}(\alpha _1+\alpha _2) \nonumber \\&{\text{ subject } \text{ to, }}\nonumber \\&\quad 0 \le \alpha _1 \le Ce, \nonumber \\&\quad 0 \le \alpha _2 \le Ce. \end{aligned}$$
(21)

After obtaining the optimal values of \(\alpha _1\) and \(\alpha _2\) from (21), we can obtain v as \(v = \begin{bmatrix} u \\ b \end{bmatrix} = (cI_0+kG^\mathrm{T}G) ^{-1}G^\mathrm{T}(\alpha _1-\alpha _2+kY).\) For the given \(x \in {R}^n\), the estimated regressor is obtained as follows

$$\begin{aligned} f(x) = K(x^\mathrm{T},A^\mathrm{T})u+ b. \end{aligned}$$
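As a brief illustration, the nonlinear prediction step only needs the kernel row \(K(x^\mathrm{T}, A^\mathrm{T})\); the sketch below uses the RBF kernel employed later in the experiments, but any positive definite kernel can be substituted.

```python
import numpy as np

def rbf_kernel(X1, X2, q):
    """RBF kernel K(x, z) = exp(-||x - z||^2 / q)."""
    sq_dist = (np.sum(X1 ** 2, axis=1)[:, None]
               + np.sum(X2 ** 2, axis=1)[None, :] - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dist / q)

def ldmr_predict_nonlinear(X_new, A, u, b, q):
    """Prediction f(x) = K(x^T, A^T) u + b of the nonlinear LDMR model."""
    return rbf_kernel(np.atleast_2d(X_new), A, q) @ u + b
```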

4.3 A fast LDMR model using successive over-relaxation technique

The dual problem of the proposed LDMR model (19) or (21) can be written in the following unified form

$$\begin{aligned}&\max _{\beta } ~~ p\beta - \frac{1}{2}\beta ^\mathrm{T}Q\beta \nonumber \\&{\text{ subject } \text{ to, }}\nonumber \\&\quad \beta \in S = \{0\le \beta \le Ce\}. \end{aligned}$$
(22)

For example, the problem (22) becomes the problem (19) when \(\beta = \begin{bmatrix} \alpha _1 \\ \alpha _2 \end{bmatrix}\), \(Q= \begin{bmatrix} H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}&-H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T} \\ -H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}&H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T} \end{bmatrix}\) and \(p = -\begin{bmatrix} kY^\mathrm{T}H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}-Y^\mathrm{T}+\epsilon e^\mathrm{T},&-kY^\mathrm{T}H(cI_0+kH^\mathrm{T}H)^{-1}H^\mathrm{T}+Y^\mathrm{T}+\epsilon e^\mathrm{T} \end{bmatrix}\).

Now problem (22) can be efficiently solved by the successive over-relaxation (SOR) technique (Mangasarian and Musicant [25]) as follows.

Algorithm 1

We choose \(t \in (0,2)\) and start with an initial value \(\beta ^0 \in R^{2l}\). After computing \(\beta _i,\) we compute

$$\begin{aligned} \beta _{i+1} = (\beta _i-tE^{-1}(Q\beta _i-p+L(\beta _{i+1}-\beta _{i}))), \end{aligned}$$
(23)

until \(||\beta _{i+1}-\beta _{i}||\) becomes less than some prescribed tolerance. Here the nonzero elements of \(L \in R^{2l \times 2l}\) constitute the strictly lower triangular part of the symmetric matrix Q, and the nonzero elements of \(E\in R^{2l \times 2l}\) constitute the diagonal of Q (Shao et al. [23]).

It should be noted that it is well established in [25] and [26] that the iterates \(\{\beta _i\}\) converge R-linearly to the optimal solution \(\bar{\beta }\) of problem (22).
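A minimal sketch of Algorithm 1 is given below. The implicit term \(L(\beta _{i+1}-\beta _{i})\) in (23) is handled by sweeping through the coordinates in order (a Gauss-Seidel-style update), and each update is projected back onto the box S, as in Mangasarian and Musicant [25]; this Python version is illustrative and is not the MATLAB code used in the experiments.

```python
import numpy as np

def sor_solve(Q, p, C, t=1.0, tol=1e-5, max_iter=1000):
    """Successive over-relaxation for the box-constrained QP (22):
    min 0.5 * beta^T Q beta - p beta  subject to  0 <= beta <= C e,
    with relaxation factor t in (0, 2)."""
    m = Q.shape[0]
    beta = np.zeros(m)
    for _ in range(max_iter):
        beta_prev = beta.copy()
        for j in range(m):
            # residual of coordinate j; coordinates < j are already updated,
            # which realizes the L(beta_{i+1} - beta_i) term of (23)
            r = Q[j] @ beta - p[j]
            beta[j] = np.clip(beta[j] - t * r / Q[j, j], 0.0, C)   # projection onto S
        if np.linalg.norm(beta - beta_prev) < tol:
            break
    return beta
```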

5 LDMR: regression via LDM

In this section, we derive the optimization problem of the proposed LDMR model from the optimization problem of the LDM model by making use of a result of Bi and Bennett [20]. It has been shown in [20] that, for a given \(\bar{\epsilon } > 0\) and regression training set \((A, Y)\), the regressor \( y=\left( \frac{w}{-\eta }\right) ^\mathrm{T}x+\frac{b}{-\eta }\) with \(\eta > 0\) is an \(\bar{\epsilon }\)-insensitive regressor if and only if the sets \(D^+\) and \(D^-\) lie on opposite sides of the \((n+1)\)-dimensional hyperplane \(w^\mathrm{T}x+ \eta y+b=0,\) where

$$\begin{aligned} &D^+ = \{(A_i,y_i+\bar{\epsilon }),i=1,2,\ldots ,l\}\\ &{\text{ and }} D^- = \{(A_i,y_i-\bar{\epsilon }),i=1,2,\ldots ,l\}. \end{aligned}$$

In view of this result of Bi and Bennett [20], the regression problem is equivalent to the problem of classifying the sets \(D^+\) and \(D^-\) in \(R^{n+1}\). If we apply the LDM methodology (Zhang and Zhou [17]) to the classification of these two sets \(D^+\) and \(D^-,\) we obtain the LDMR formulation. For this, we calculate the margin mean \(\bar{\gamma }\) and margin variance \(\hat{\gamma }\) as follows

$$\begin{aligned}&\bar{\gamma } = \frac{1}{2l}\left\{ -\sum _{i=1}^{l}\left( A_iw + \eta (y_i-\bar{\epsilon }) + b\right) + \sum _{i=1}^{l}\left( A_iw + \eta (y_i+\bar{\epsilon }) + b\right) \right\} = {\eta }\bar{\epsilon }, \end{aligned}$$
(24)
$$\begin{aligned}&\hat{\gamma } = \frac{1}{2l}\left\{ \sum _{i=1}^{l} {\left( -(A_iw + \eta (y_i-\bar{\epsilon }) + b) -\eta \bar{\epsilon }\right) }^2 + \sum _{i=1}^{l} {\left( (A_iw + \eta (y _i+\bar{\epsilon })+ b) -\eta \bar{\epsilon }\right) }^2 \right\} \nonumber \\&\quad = \frac{1}{2l}\left\{ \sum _{i=1}^{l}(-A_iw -\eta y_i -b)^2 + \sum _{i=1}^{l}(A_iw +\eta y_i + b)^2 \right\} \nonumber \\&\quad = \frac{1}{l}\left\{ \sum _{i=1}^{l}(A_iw +\eta y_i + b)^2\right\} , \end{aligned}$$
(25)

where \(w \in R^n\), \( \eta \in R\) and \(b \in R\). Now the classification of sets \(D^+\) and \(D^-\) using LDM model will result in the following QPP

$$\begin{aligned}&\min _{w,\eta ,b,\xi ,\xi ^*} ~~ \frac{1}{2} (w^\mathrm{T}w + \eta ^2) - \lambda _2 {\eta }\bar{\epsilon } + \lambda _1\frac{1}{l}\left\{ \sum _{i=1}^{l}(A_iw +\eta y_i + b)^2 \right\} + C\left( \sum _{i=1}^{l}\xi _i + \sum _{i=1}^{l}\xi ^*_i\right) \nonumber \\&{ \text{ subject } \text{ to, }}\nonumber \\&\quad A_iw + \eta (y_i+\bar{\epsilon }) + b \ge 1 -\xi _i \quad {\hbox{for}}\quad i = 1,2,\ldots, l, \nonumber \\&\quad -(A_iw + \eta (y_i-\bar{\epsilon }) + b) \ge 1 -\xi ^*_i \quad {\hbox{for}}\quad i = 1,2,\ldots, l, \nonumber \\&\quad \xi _i,\xi ^*_i \ge 0\quad {\hbox{for}}\quad i =1,2,\ldots ,l. \end{aligned}$$
(26)

Here we note that \(\eta \ne 0\), and therefore, without loss of generality, we can assume that \(\eta >0\). The constraints of (26) can then be rewritten as

$$\begin{aligned}&A_i\left( \frac{w}{-\eta }\right) - \left( y_i+\bar{\epsilon }\right) + \left( \frac{b}{-\eta }\right) \le \frac{1}{-\eta } -\left( \frac{\xi _i}{-\eta }\right),\quad i=1,2,\ldots ,l\\&\quad -A_i\left( \frac{w}{-\eta } \right) + \left( y_i-\bar{\epsilon }\right) - \left( \frac{b}{-\eta }\right) \le \frac{1}{-\eta }- \left( \frac{\xi _i^*}{-\eta }\right) ,\quad i=1,2,\ldots ,l\\&\quad \frac{-\xi _i}{\eta } \le 0, \frac{-\xi ^*_i}{\eta } \le 0,\quad i=1,2,\ldots ,l \end{aligned}$$

Further, the objective function of (26) can also be written as

$$\begin{aligned}&\min _{w,\eta ,b,\xi ,\xi ^*} \eta ^2 \left[ \frac{1}{2} \left( \left( \frac{w}{-\eta }\right) ^\mathrm{T}\left( \frac{w}{-\eta }\right) + 1 \right) - \frac{\lambda _2}{\eta }\bar{\epsilon } + \lambda _1\frac{1}{l}\sum _{i=1}^{l}\left( A_i\left( \frac{w}{-\eta }\right) - y_i + \left( \frac{b}{-\eta }\right) \right) ^2 \right. \\&\quad \left. + \frac{C}{\eta }\left( \sum _{i=1}^{l}\frac{\xi _i}{\eta } + \sum _{i=1}^{l}\frac{\xi ^*_i}{\eta }\right) \right] . \end{aligned}$$

On replacing \(\epsilon := \bar{\epsilon }-\frac{1}{\eta }\), \(w:= \left( \frac{w}{-\eta }\right) \), \(b:=\left( \frac{b}{-\eta }\right) \), \(\xi _i := \frac{\xi _i}{\eta }\), \(\xi _i^* := \frac{\xi ^*_i}{\eta }\), \(C:=\frac{C}{\eta }\) and \(\lambda _2 := \frac{\lambda _2}{\eta }\), and noting that \(\eta > 0,\) the objective function of the optimization problem (26) can be rewritten as

$$\begin{aligned} \min _{w,b,\xi ,\xi ^*}~~ \eta ^2 \left[ \frac{1}{2} w^\mathrm{T}w - \lambda _2\epsilon + \lambda _1\frac{1}{l}\left\{ \sum _{i=1}^{l}( y_i-(A_iw + b))^2 \right\} + C\left( \sum _{i=1}^{l}\xi _i + \sum _{i=1}^{l}\xi ^*_i\right) + \frac{1}{2} \right] . \end{aligned}$$

Also the constraints of the optimization problem (26) can be expressed as

$$\begin{aligned}&(A_iw + b) - y_i \le \epsilon + \xi _i, \qquad i = 1,2,\ldots, l, \nonumber \\&y_i - (A_iw + b) \le \epsilon + \xi ^*_i, \qquad i = 1,2,\ldots, l, \nonumber \\&\xi _i,\xi ^*_i \ge 0, \qquad i =1,2,\ldots ,l, \qquad {\text{where}} \quad \epsilon \ge 0. \end{aligned}$$
(27)

We further need to show that \(\epsilon = \bar{\epsilon }-\frac{1}{\eta } \) is always non-negative. We prove this assertion as follows.

Let \(({\bar{w}}, \bar{\eta }, \bar{b}, \bar{\xi }, \bar{\xi }^*)\) be the solution of QPP (26) which finds the classifier for the sets \(D^+\) and \(D^-\). There always exists an index j such that

$$\begin{aligned}&(A_jw + \eta (y_j+\bar{\epsilon }) + b) \ge 1, \end{aligned}$$
(28)
$$\begin{aligned}&-(A_jw + \eta (y_j-{\bar{\epsilon }}) + b) \ge 1. \end{aligned}$$
(29)

Adding (28) and (29), we get \(\bar{\epsilon } \ge \frac{1}{\eta }\), which proves that \(\epsilon = \bar{\epsilon }-\frac{1}{\eta } \ge 0.\)

Now for \(\eta > 0 \), the classifier \(w^\mathrm{T}x+ \eta y +b =0\) for the classes \(D^+\) and \(D^-\) gives the regressor \(y = (w^\mathrm{T}x+ b)\) with \(w:= \left( \frac{w}{-\eta }\right) \), \(b:=\left( \frac{b}{-\eta }\right) \). Since constraints in (27) do not involve \(\eta \), \(\eta \) does not play any role now in the determination of the regressor. Therefore, problem (26) becomes

$$\begin{aligned}&\min _{w,b,\xi ,\xi ^*}~~\frac{1}{2} w^\mathrm{T}w - \lambda _2\epsilon + \lambda _1\frac{1}{l}\left\{ \sum _{i=1}^{l}( y_i-(A_iw + b))^2 \right\} + C\left( \sum _{i=1}^{l}\xi _i + \sum _{i=1}^{l}\xi ^*_i\right) \nonumber \\&{ \text{ subject } \text{ to, }}\nonumber \\&\quad (A_iw + b) - y_i \le \epsilon + \xi _i \qquad {\hbox{for}}\; i = 1,2,\ldots, l, \nonumber \\&\quad y_i - (A_iw + b) \le \epsilon + \xi ^*_i \qquad {\hbox{for}}\; i = 1,2,\ldots, l, \nonumber \\&\qquad \xi _i,\xi ^*_i \ge 0 \qquad {\hbox{for}}\; i =1,2,\ldots ,l. \end{aligned}$$
(30)

Also, since \(\epsilon \) is treated as a constant, the term \(-\lambda _2\epsilon \) can be dropped from the objective function of (30). Further, on replacing \(k:= \frac{2\lambda _1}{l}\) and allowing a general positive weight c on the regularization term, the problem (30) can be written in the vector form as follows

$$\begin{aligned}&\min _{w,b,\xi ,\xi ^*} \frac{c}{2}||w||^2 + \frac{k}{2}||Y - (Aw + eb)||^2 + {C}e^\mathrm{T}(\xi + \xi ^*)\nonumber \\&{\text{ subject } \text{ to, }} \nonumber \\&\quad Y - (Aw + eb) \le e\epsilon + \xi , \nonumber \\&\quad (Aw + eb)-Y \le e\epsilon + \xi ^*, \nonumber \\&\qquad \xi ~,\xi ^* \ge 0. \end{aligned}$$
(31)

6 Experimental results

We have performed a number of experiments to verify the efficacy of our proposed LDMR model. For this, we have compared the LDMR model with existing SVR models, namely the standard SVR, SMO-based SVR, LS-SVR and \(L_1\)-Norm SVR (Tanveer [24]) models, on artificial datasets, UCI benchmark datasets (Blake and Merz [28]) and time-series financial datasets.

All the simulations have been performed in the MATLAB 12.0 environment (http://in.mathworks.com/) on an Intel Xeon processor with 16.0 GB RAM. The \(L_1\)-Norm SVR model has been solved using the ‘linprog’ function of MATLAB. For small- and medium-scale datasets, the proposed LDMR and standard SVR models have been solved using the ‘quadprog’ function of MATLAB. For large-scale datasets, we have used the SOR technique (Mangasarian and Musicant [25]) to solve the proposed LDMR model efficiently, whereas the SMO method (Chang and Lin [27]) has been used for obtaining the solution of the standard SVR. For this, we have downloaded its code from (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). Throughout these experiments, we have used the RBF kernel \(\exp \left( \frac{-||x-y||^2}{q}\right) \), where q is the kernel parameter.

The optimal values of the parameters in the SVR models have been obtained using the exhaustive search method (Hsu and Lin [30]) with cross-validation. For all SVR models, the value of the kernel parameter q has been searched in the set \(\{2^i,~ i = -10,-2,\ldots ,10\}\). The value of the parameter \(\epsilon \) in the \(\epsilon \)-SVR, \(L_1\)-Norm SVR and proposed LDMR models has been searched in the set \(\{ 0.05,0.1,0.2,0.3, \ldots ,1,1.5,2\}\). The value of the parameter k of the proposed LDMR model has been fixed to 1 throughout the experiments. The value of the parameter C in the \(\epsilon \)-SVR, LS-SVR, \(L_1\)-Norm SVR and proposed LDMR models has been searched in the set \(\{2^i,~ i = -10,-2,\ldots ,12 \}\). The value of the parameter c in the \(L_1\)-Norm SVR and proposed LDMR models has also been searched in the set \(\{2^i,~ i = -10,-2,\ldots ,12 \}\).
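The tuning protocol above can be summarized by the following sketch. The `fit` and `predict` callables are hypothetical placeholders standing for any of the SVR trainers and predictors; the grids and the fold count follow the description above, while using SSE/SST on the held-out folds as the selection score (see Sect. 6.1) is an assumption made purely for illustration.

```python
import numpy as np
from itertools import product

def grid_search_cv(A, Y, fit, predict, C_grid, q_grid, n_folds=10, seed=0):
    """Exhaustive search over (C, q) with n-fold cross-validation on SSE/SST."""
    l = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(l), n_folds)
    best_params, best_score = None, np.inf
    for C, q in product(C_grid, q_grid):
        scores = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(l), test_idx)
            model = fit(A[train_idx], Y[train_idx], C=C, q=q)   # hypothetical trainer
            pred = predict(model, A[test_idx])                  # hypothetical predictor
            sse = np.sum((Y[test_idx] - pred) ** 2)
            sst = np.sum((Y[test_idx] - Y[test_idx].mean()) ** 2)
            scores.append(sse / sst)
        if np.mean(scores) < best_score:
            best_params, best_score = (C, q), np.mean(scores)
    return best_params, best_score
```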

6.1 Performance criteria

In order to evaluate the performance of the regression methods, we first summarize some commonly used evaluation criteria. Without loss of generality, let l and k be the numbers of training samples and testing samples, respectively. Furthermore, for \(i=1,2,\ldots, k\), let \({y'_i}\) be the predicted value for the response value \(y_i\), and let \(\bar{y} = \frac{1}{k}\sum _{i=1}^{k}y_i\) be the average of \(y_1, y_2,\ldots ,y_k\). The definitions and significance of these evaluation criteria are listed as follows; a short sketch computing them is given after the list.

(i) SSE: the sum of squared errors on the testing samples, defined as SSE = \(\sum _{i=1}^{k}(y_i-y'_i)^2\). SSE represents the fitting precision.

(ii) SST: the sum of squared deviations of the testing samples, defined as SST = \(\sum _{i=1}^{k}(y_i-\overline{y})^2\). SST shows the underlying variance of the testing samples.

(iii) SSR: the sum of squared deviations of the testing samples that can be explained by the estimated regressor, defined as SSR = \(\sum _{i=1}^{k}(y'_i-\overline{y})^2\).

(iv) RMSE: the root mean squared testing error, defined as RMSE = \(\sqrt{\frac{1}{k}\sum _{i=1}^{k}(y_i-y'_i)^2}\).

(v) SSE/SST: the ratio between the sum of squared testing errors and the sum of squared deviations of the testing samples. In most cases, a small SSE/SST indicates good agreement between the estimated and real values.

(vi) SSR/SST: the ratio between the variance captured by the estimated regressor on the testing samples and the actual underlying variance of the testing samples.
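The criteria above translate directly into code; the following sketch (illustrative, assuming NumPy arrays of true and predicted test responses) computes all six quantities.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Evaluation criteria of Sect. 6.1 on a test set of size k."""
    y_bar = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)    # fitting precision
    sst = np.sum((y_true - y_bar) ** 2)     # variance of the testing samples
    ssr = np.sum((y_pred - y_bar) ** 2)     # variance explained by the regressor
    rmse = np.sqrt(sse / len(y_true))
    return {"SSE": sse, "SST": sst, "SSR": ssr, "RMSE": rmse,
            "SSE/SST": sse / sst, "SSR/SST": ssr / sst}
```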

6.2 Experiment 1 (artificial datasets)

To compare the performance of the proposed methods with the existing methods, we have synthesized several artificial datasets. The training samples \((x_i,y_i)\), \(i =1,2 ,\ldots ,l\), have been generated as follows.

TYPE 1

$$\begin{aligned}&y_i = \frac{\sin (x_i)}{x_i} + \lambda _i,~~\lambda _i\sim U[-0.2,0.2]\\&{\hbox {and}}\,x_i\hbox { is from } U[-4\pi ,4\pi ]. \end{aligned}$$

TYPE 2

$$\begin{aligned}&y_i = \frac{\sin (x_i)}{x_i} + \lambda _i, ~~\lambda _i\sim U[-0.3,0.3]\\&{\hbox {and}}\,x_i\hbox { is from }U[-4\pi ,4\pi ]. \end{aligned}$$

TYPE 3

$$\begin{aligned}&y_i = \frac{\sin (x_i)}{x_i} + \lambda _i, ~~\lambda _i\sim U[-0.4,0.4]\\&{\hbox {and}}\,x_i\, {\hbox {is from}}\,U[-4\pi ,4\pi ]. \end{aligned}$$

TYPE 4

$$\begin{aligned}&y_i = \frac{\sin (x_i)}{x_i} + \lambda _i, ~~\lambda _i\sim N[0,0.2]\\&{\hbox {and}}\,x_i\hbox { is from }U[-4\pi ,4\pi ]. \end{aligned}$$

TYPE 5

$$\begin{aligned}&y_i = \frac{\sin (x_i)}{x_i} + \lambda _i, ~~\lambda _i\sim N[0,0.3]\\& {\hbox {and}}\,x_i\hbox { is from }U[-4\pi ,4\pi ]. \end{aligned}$$

TYPE 6

$$\begin{aligned}&y_i = ~ \left| \frac{x_i-1}{4}\right| + \left| \sin (\pi (1+\frac{x_i-1}{4})) \right| + 1 +\lambda _i,\\& \lambda _i\sim U[-0.5,0.5] \,{\hbox {and}} \,x_i\hbox { is from }U[-10, 10]. \end{aligned}$$

The TYPE 6 dataset contains 200 training samples and 400 noise-free testing samples, while the other datasets contain 100 training samples and 500 noise-free testing samples. To avoid a biased comparison, ten independent groups of noisy samples were generated randomly using the MATLAB toolbox for all types of training sets.
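As an example of the generation procedure (shown here in NumPy rather than the MATLAB toolbox actually used), a TYPE 1 dataset can be produced as follows; the other types only change the noise distribution or the target function.

```python
import numpy as np

def make_type1(l=100, n_test=500, seed=0):
    """TYPE 1 data: y = sin(x)/x + uniform noise on [-0.2, 0.2] for training,
    noise-free samples for testing, with x drawn from U[-4*pi, 4*pi]."""
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(-4 * np.pi, 4 * np.pi, l)
    y_train = np.sin(x_train) / x_train + rng.uniform(-0.2, 0.2, l)
    x_test = rng.uniform(-4 * np.pi, 4 * np.pi, n_test)
    y_test = np.sin(x_test) / x_test               # noise-free testing samples
    return (x_train, y_train), (x_test, y_test)
```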

Figures 1 and 2 illustrate the estimated functions obtained by the LDMR, standard SVR, \(L_1\)-Norm SVR and LS-SVR models on the TYPE 3 and TYPE 6 datasets, respectively. Table 1 shows the comparison of the proposed LDMR with the SVR, \(L_1\)-Norm SVR and LS-SVR models on the artificial datasets. It can be observed that the proposed LDMR, irrespective of the nature of the noise present in the training set, consistently shows better generalization ability than the other regression methods.

Fig. 1 Performance of (a) LDMR, (b) SVR, (c) \(L_1\)-Norm SVR and (d) LS-SVR on the TYPE 3 dataset

Fig. 2 Performance of (a) LDMR, (b) SVR, (c) \(L_1\)-Norm SVR and (d) LS-SVR on the TYPE 6 dataset

Table 1 Results on artificial datasets

6.3 Experiment 2 (artificial datasets with outliers)

Compared to the other regression methods, the proposed LDMR is a robust method and is less sensitive to the presence of outliers. To demonstrate this, we have generated datasets by adding five and ten outlier points to the TYPE 3 and TYPE 6 datasets, respectively. Figure 3 shows the estimated functions obtained by the LDMR, standard SVR, \(L_1\)-Norm SVR and LS-SVR models on the TYPE 3 dataset with outliers. For the \(L_1\)-Norm SVR and standard SVR models, the optimal RMSE has been found at \(\epsilon =0.3\), but even at this value of \(\epsilon \), the outliers lie outside the \(\epsilon \)-insensitive zone and affect the orientation and position of the estimated function. The LDMR model reduces the effect of these outliers by also assigning some weight, in its optimization problem, to the points lying inside the \(\epsilon \)-insensitive zone. Table 2 shows the comparison of the proposed LDMR, standard SVR, \(L_1\)-Norm SVR and LS-SVR models on the artificial datasets with outliers. In these datasets, 200 points were used for training and 400 noise-free points were used for testing.

Fig. 3 Performance of (a) LDMR, (b) SVR, (c) \(L_1\)-Norm SVR and (d) LS-SVR on the TYPE 3 dataset with outliers

Table 2 Results on artificial datasets with outliers

6.4 Experiment 3 (benchmark datasets)

We have checked the performance of the proposed methods on eight UCI benchmark datasets, namely Yacht Hydrodynamics, Concrete Slump, Pyrims, Motorcycle, NO2, Chwirut, Auto MPG and Boston Housing, which are commonly used in evaluating regression methods. For the Motorcycle dataset, leave-one-out cross-validation (Kohavi [32]) was used to report the numerical results. For the other datasets, we have used tenfold cross-validation (Duda and Hart [31]). For all datasets, only the feature vectors were normalized to the range [0, 1].

Table 3 lists the performance of the LDMR, standard SVR, LS-SVR and \(L_1\)-Norm SVR models on eight different UCI datasets using the evaluation criteria described in Sect. 6.1. It can be observed that the proposed LDMR model outperforms the other existing regression methods on most of the datasets. For a statistical analysis of the performance of the regression methods, their average ranks have been computed using the SSE/SST values. The obtained average ranks are summarized in Table 4. It can be observed that, on average, the proposed LDMR model obtains a better rank than the existing regression methods.

For all SVR models, we have tuned the parameters to obtain their best choices for each dataset. The tuned parameter values of the SVR models on the UCI datasets are listed in Table 5. Figure 4 shows the effect of the parameters C and c on the SSE/SST values of the LDMR model for the Pyrims and Auto MPG datasets. It shows that the performance of the LDMR model is sensitive to the choice of the parameters c and C.

Table 3 Results on commonly used benchmark datasets
Table 4 Average ranks of SVR, LS-SVR, \(L_1\)-Norm SVR and LDMR models on SSE/SST values
Table 5 Tuned parameter values of SVR models on UCI datasets
Table 6 Results on large-scale datasets

6.5 Experiment 4 (large-scale datasets)

We have also compared the proposed model with the SVR model on large-scale datasets. Since the ‘quadprog’ implementation of the proposed LDMR model and the SVR model is not efficient for large-scale datasets, we have used the SOR method for the proposed LDMR model on these datasets. For the SVR model, we have used its SMO implementation. We have downloaded the Parkinsons Telemonitoring (5847 \(\times \) 22), Wine Quality Red (1599 \(\times \) 12) and Wine Quality White (4898 \(\times \) 12) datasets from the UCI Repository [28]. For all these datasets, we have normalized the input feature vectors to the range [0, 1]. For each dataset, we have fixed the number of training and testing points and randomly permuted the data points between the training set and testing set in ten different trials. The regression methods have been evaluated on each trial.

Table 6 shows the comparison of the SMO-based SVR and the proposed SOR-based LDMR on large-scale datasets using different evaluation criteria. It can be observed that the proposed LDMR model with its SOR implementation outperforms the existing SVR model with its SMO implementation.

6.6 Experiment 5 (time-series financial dataset)

For further evaluation of the proposed methods, we have simulated the existing as well as the proposed regression methods on real-world time-series financial datasets. We have downloaded the daily stock prices of IBM and SBI.INS for the periods 27 March 2016 to 27 March 2017 and 27 March 2015 to 27 March 2017, respectively, from the Yahoo finance website. The datasets were generated by taking the last three days' closing stock prices as the input features and the next day's closing stock price as the response value. In the experiments, the first \(50\%\) of the data points were used for training and the rest for testing. The input features were normalized to the range [\(-1,1\)].
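The construction of the financial regression task can be sketched as follows (an illustration assuming a plain array of closing prices; the normalization to [-1, 1] is omitted).

```python
import numpy as np

def make_lag_dataset(prices, n_lags=3, train_fraction=0.5):
    """Previous n_lags closing prices form the input vector, the next day's closing
    price is the response; the first half of the samples is used for training."""
    prices = np.asarray(prices, dtype=float)
    X = np.column_stack([prices[i:len(prices) - n_lags + i] for i in range(n_lags)])
    y = prices[n_lags:]
    split = int(train_fraction * len(y))
    return (X[:split], y[:split]), (X[split:], y[split:])
```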

Table 7 Results on financial dataset

Table 7 lists the comparison of the proposed methods with the other existing regression methods using different evaluation criteria on the IBM and SBI.INS finance datasets, along with their training times. Figures 5 and 6 show the performance of the proposed regression methods along with the existing regression methods on the IBM and SBI.INS finance datasets, respectively. It can be easily observed from Table 7 and Figs. 5 and 6 that the proposed LDMR formulations outperform the existing regression methods.

Fig. 4 SSE/SST values obtained by the LDMR model for different values of the parameters C and c on the (a) Pyrims and (b) Auto MPG datasets

Fig. 5 Performance of \(L_1\)-Norm SVR, LDMR and SOR LDMR on the IBM dataset

Fig. 6 Performance of \(L_1\)-Norm SVR, LDMR and SOR LDMR on the SBI dataset

7 Conclusion

An efficient LDM-based regression model has been proposed in this paper, which can be mathematically derived from the optimization problem of the LDM (Zhang and Zhou [17]) by making use of a result of Bi and Bennett [20]. The proposed LDMR model simultaneously minimizes the \(\epsilon \)-insensitive loss function as well as the quadratic loss function. In this sense, the proposed LDMR formulation combines the benefits of the LS-SVR model (Suykens et al. [21, 22]) as well as the \(\epsilon \)-SVR model. The proposed LDMR model obtains better generalization ability by finding a trade-off between the \(\epsilon \)-insensitive loss and the quadratic loss via the user-defined parameters k and C. The proposed LDMR model has also been observed to be less sensitive to the presence of outliers. Further, the application of the SOR technique (Mangasarian and Musicant [25]) significantly reduces the training time of the proposed LDMR model.

One of the major difficulties with the proposed LDMR model is that it requires more parameters to be tuned. A future study should therefore aim to develop a model which can automatically learn the parameters \(\epsilon \), c and C from the data. Further, we have observed that the SOR technique works well in the proposed LDMR formulation, but compared to the SMO method in the SVR model, it takes more time to train the model. We would therefore like to explore the SMO method for solving the optimization problem of the proposed LDMR model in future work.