1 Introduction

Support vector regression (SVR) is a widely used regression technique extended from support vector classification (SVC) by Boser et al. [1]. SVR has received much attention due to its competitive performance in various fields such as financial prediction and wind speed prediction [2,3,4]. SVR was originally designed for small-sample problems, where it obtains good results. However, with the massive growth of data volumes in recent decades, training SVR models efficiently on large-scale datasets has become a hot topic.

Linear SVR was the first variant studied for big data, and many improved algorithms for large-scale linear SVR have been proposed. In 2012, Chia-Hua Ho and Chih-Jen Lin extended state-of-the-art training methods for linear SVC to linear SVR [5]. There were two main types of methods for large-scale linear SVC: a Newton-type method for the primal form [6] and a coordinate descent approach for the dual form [7]; both were extended to linear SVR in their paper. Similarly, in 2015, Xie et al. [8] proposed a mini-batch quasi-Newton optimization algorithm to speed up the training of linear SVR. Their method combined first- and second-order gradient information, estimated from a small set of training data, with the framework of the quasi-Newton algorithm. More recently, in 2016, Wang et al. [9] developed a novel e-Distance Weighted SVR (e-DWSVR) to address the limitations of the original SVR and make it scalable to large-scale problems through dual coordinate descent and averaged SGD strategies.

Although many algorithms target linear SVR, large-scale kernel SVR learning is less explored. The main reason is that training large-scale kernelized SVR faces the same challenge as kernelized SVC, known as the curse of kernelization [10]: the set of support vectors (SVs) keeps growing during optimization, which results in non-linear growth of both model update time and prediction time with the data size. Some algorithms have been proposed to address this problem. In 2015, Zheng [11] proposed a smoothed objective function in the primal form, combined with a conjugate gradient algorithm, to improve computational efficiency; however, that work did not evaluate the method on large-scale data. Recently, Lu et al. [12] developed kernel functional approximation techniques to solve large-scale kernel regression problems. These techniques, including FOGD and NOGD, approximate the kernel function or kernel matrix, transforming the kernel regression problem into an approximate linear regression problem.

In this paper, we employ the well-known budgeted SGD (BSGD) algorithm, originally proposed for large-scale kernelized SVC [10], to solve large-scale non-linear SVR. The BSGD algorithm maintains a fixed number of SVs in the model and updates them during the SGD training iterations. It is based on SGD, a recently popularized approach that can handle very large data or data streams efficiently. Before applying SGD, the objective of the SVM is cast as an unconstrained optimization problem. The algorithm then proceeds by iteratively receiving a labeled example and updating the model weights through SGD over the corresponding instantaneous objective function. In addition, the SGD method is combined with budget maintenance strategies to control the number of SVs. Budgeted online SVM algorithms were first proposed by Crammer et al. [13]. There are three main budget maintenance strategies: removal, projection and merging. For removal, several variants exist. In 2006, Cavallanti et al. [14] proposed the randomized budget perceptron (RBP), which removes a random SV; despite its simplicity, RBP achieves satisfactory performance. The Forgetron [15] removes the oldest SV, which was created when the quality of the hypothesis was lowest, so its removal is considered the least hurtful to the model. Another removal strategy, used by Wang et al. [10], removes the SV with the smallest norm in order to minimize the averaged gradient error. Beyond removal, Orabona et al. [16] proposed the Projectron algorithm in 2009, which projects the SV that would otherwise be removed onto the remaining SVs; the projection minimizes the model weight degradation caused by removing an SV. In the merging strategy, two SVs are merged into a newly created one that carries the information of both, with the total weight remaining unchanged [10]; merging has been shown to be the most effective budget maintenance strategy. Furthermore, Levesque et al. [17] proposed a method in 2013 that combines removal and merging for budgeted kernel SVMs trained with SGD, and comprehensive empirical results demonstrated the effectiveness of these algorithms. Therefore, in this paper we extend the BSGD algorithm for large-scale kernelized SVM to SVR, so that the new algorithm can handle large-scale kernelized SVR learning.

The rest of this paper is organized as follows. Section 2 introduces the formulation of SVR. Section 3 applies the SGD method to linear and non-linear SVR, and then introduces the BSGD method for kernel SVR on large-scale datasets. Section 4 reports experiments on several UCI and LIBSVM datasets to evaluate the efficiency of the proposed method. Section 5 concludes the work.

2 Support vector regression (SVR)

Given a set of training samples \(S=\left\{ x_i,y_i \right\}\), \(x_i \in R^n ,y_i \in R,i = 1,2, \ldots ,l\), where \(x_i \in R^n\) is an n-dimensional input vector, \(y_i \in R\) is the corresponding target value, and l is the number of samples. The goal of SVR is to find a function f(x) that approximates the targets \(y_i\) of all training data with high generalization performance [18]. In the linear case, f(x) is given as follows (we omit the bias term b because it hardly affects the experimental performance [5]),

$$\begin{aligned} f(x) = \left\langle {w,x} \right\rangle , \end{aligned}$$
(1)

where \(\left\langle {\cdot , \cdot } \right\rangle\) denotes the dot product. In order to obtain the function f, SVR solves the following regularized optimization problem:

$$\begin{aligned} \mathop {\min }\limits _w \;P(w) = \frac{1}{2}w^T w + \frac{C}{l}\sum \limits _{i = 1}^l {L(w,x_i ,y_i )}, \end{aligned}$$
(2)

where \(L(w, x_i ,y_i) = \max (|w^T x_i - y_i| - \varepsilon ,0)\) is the \(\varepsilon\)-insensitive loss function, C is the regularization parameter, and \(\varepsilon\) is chosen so that the loss is zero if \(|w^T x_i - y_i | \le \varepsilon\).
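For concreteness, the \(\varepsilon\)-insensitive loss can be evaluated as in the following minimal Python sketch (purely illustrative; the function name is ours):

    import numpy as np

    def eps_insensitive_loss(w, x, y, eps):
        """L(w, x, y) = max(|w^T x - y| - eps, 0): zero inside the epsilon-tube."""
        return max(abs(float(np.dot(w, x)) - y) - eps, 0.0)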

Furthermore, kernel functions are used in SVR to solve nonlinear problems [19]. The input data are mapped into a high-dimensional feature space through a nonlinear function \(\Phi\) and x is replaced with \(\Phi (x)\); the optimization problem then becomes:

$$\begin{aligned} \mathop {\min }\limits _w \;P(w) = \frac{1}{2}w^T w + \frac{C}{l}\sum \limits _{i = 1}^l {L(w,\Phi (x_i) ,y_i )}. \end{aligned}$$
(3)

To solve this problem, we can work with the dual problem of SVR by employing the Lagrange multiplier method [5], which gives

$$\begin{aligned} \begin{array}{l} w= \sum \limits _{j = 1}^l {(\alpha _j - \alpha _j ^* )\Phi (x_j )} \\ \quad \, = \sum \limits _{j = 1}^l {\theta _j \Phi (x_j )}, \\ \end{array} \end{aligned}$$
(4)

where \(\theta\) is defined as the difference between the Lagrange multipliers \(\alpha\) and \(\alpha ^*\). The function f(x) then becomes

$$\begin{aligned} f(x) = \left\langle {w,\Phi (x)} \right\rangle = \sum \limits _{i = 1}^l {\theta _i \left\langle {\Phi (x_i ),\Phi (x)} \right\rangle } = \sum \limits _{i = 1}^l {\theta _{i} K(x_i,x)}. \end{aligned}$$

After employing the Lagrange multiplier method, we only need to compute \({\left\langle {\Phi (x_i ),\Phi (x)} \right\rangle }\) to obtain f(x). The Gaussian kernel is one of the most widely used kernels; its formula is as follows:

$$\begin{aligned} K(x,x') = e^{ - \frac{{||x - x' ||^2 }}{{2\sigma ^2 }}}. \end{aligned}$$
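As an illustration, the Gaussian kernel and the resulting regression function \(f(x) = \sum _{i} \theta _{i} K(x_i ,x)\) can be written in Python as follows (a minimal sketch with names of our choosing, not the implementation used in the experiments):

    import numpy as np

    def gaussian_kernel(x, x_prime, sigma):
        """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
        diff = np.asarray(x) - np.asarray(x_prime)
        return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))

    def kernel_svr_predict(x, support_vectors, thetas, sigma):
        """f(x) = sum_i theta_i K(x_i, x) over the stored support vectors."""
        return sum(theta * gaussian_kernel(sv, x, sigma)
                   for sv, theta in zip(support_vectors, thetas))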

3 Budgeted stochastic gradient descent for SVR

In this section, we first introduce the classical SGD algorithm for large-scale SVR. Then budget maintenance strategies are combined with SGD to solve large-scale kernelized SVR.

3.1 Stochastic gradient descent for SVR

SGD is an appealing algorithm widely used for learning SVMs [20,21,22,23]. In SGD, the gradient is approximated by evaluating it on individual training samples, so that a near-optimal model weight can be reached within a small number of iterations. SVM models trained with SGD are therefore efficient. We now apply the SGD algorithm to SVR and describe its details in the following.

SGD works iteratively. It starts with an initial model weight \(w_1\), and at the \(t\)-th round it updates the current weight \(w_t\) as

$$\begin{aligned} w_{t + 1} \leftarrow w_t - \eta _t \nabla _w P_t (w_t ), \end{aligned}$$
(5)

where \(\eta _t\) is the learning rate; different SGD algorithms use different \(\eta _t\). Pegasos and Norma are the two main SGD algorithms [22, 23]: the former uses \(\eta_t = C/t\), while the latter uses \(\eta /\sqrt{t}\) (\(\eta\) is a constant). Here we choose Pegasos because it is an improved SGD method [23]. \(\nabla _{w} P_t(w_t )\) is the sub-gradient of the objective function \(P_t(w)\), which is defined on the latest example \((x_t ,y_t )\). That is,

$$\begin{aligned} P_t (w) = \frac{1}{2}w^T w + C * L(w,x_t ,y_t ). \end{aligned}$$
(6)

Therefore Eq. 5 can be written as

$$\begin{aligned} w_{t + 1} \leftarrow (1 - \eta _t )w_t + \beta _t x_t, \end{aligned}$$
(7)

where

$$\begin{aligned} \beta _t \leftarrow \left\{ \begin{array}{ll} 0 &amp; \text {if } |w^T x_t - y_t | \le \varepsilon \\ -C &amp; \text {if } w^T x_t - y_t > \varepsilon \\ C &amp; \text {if } w^T x_t - y_t < -\varepsilon . \end{array} \right. \end{aligned}$$
(8)

From Eq. 8 we see that if \(|w^T x_t - y_t| \le \varepsilon\), then \(\beta _{t} = 0\); the sample \((x_{t},y_{t})\) does not contribute to the output and can therefore be ignored.
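A minimal Python sketch of one SGD round for linear SVR, following Eqs. (5)–(8); the learning-rate schedule \(\eta_t = C/t\) is the one stated above for Pegasos, and all names are ours:

    import numpy as np

    def linear_svr_sgd_step(w, x_t, y_t, t, C, eps):
        """One round of Eq. (7): w <- (1 - eta_t) w + beta_t x_t, with beta_t from Eq. (8)."""
        eta_t = C / t                   # Pegasos-style learning rate as stated in Sect. 3.1
        err = float(np.dot(w, x_t)) - y_t
        if abs(err) <= eps:
            beta_t = 0.0                # inside the epsilon-tube: sample is ignored
        elif err > eps:
            beta_t = -C
        else:
            beta_t = C
        return (1.0 - eta_t) * np.asarray(w) + beta_t * np.asarray(x_t)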

3.2 Kernelized SGD for SVR

The SGD algorithm can easily be combined with kernels to train nonlinear SVR models. Introducing a non-linear function \(\Phi\) that maps x from the input space to the feature space and replacing x with \(\Phi (x)\), \(w_t\) can be written as

$$\begin{aligned} w_t = \sum \limits _{j = 1}^t {\theta _j \Phi (x_j )}, \end{aligned}$$

where

$$\begin{aligned} \theta _j = \beta _j \prod \limits _{k = j + 1}^t {(1 - \eta _k )}. \end{aligned}$$

After kernelization, the update rule (7) for the weight w becomes:

$$\begin{aligned} w_{t + 1} \leftarrow (1 - \eta _t )w_t + \beta _t \Phi (x_t). \end{aligned}$$
(9)

According to the above expression for \(\theta\), a sample can be ignored if its \(\theta\) value is zero. Therefore, the kernel SVR model can be described either by a weight vector w or by a set of coefficients \(\theta\) and the corresponding samples. The pseudo-code for training the non-linear SVR is given in Algorithm 1, where the Gaussian kernel is used. We compute the value of \(\beta\) in the different cases (lines 8 to 13 of Algorithm 1), then update the value of \(\theta\) repeatedly, and finally output the optimal function.

Algorithm 1 Kernelized SGD for SVR
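The following Python sketch illustrates the training loop behind Algorithm 1, keeping the model as a list of SVs and coefficients \(\theta\); it is only an illustration under the assumptions above (Gaussian kernel, \(\eta_t = C/t\)), reuses gaussian_kernel from the earlier sketch, and all names are ours:

    import numpy as np

    def kernel_svr_sgd(X, y, C, eps, sigma, T, seed=0):
        """Kernelized SGD for SVR (cf. Algorithm 1): shrink existing coefficients
        each round and add a new SV whenever the sample falls outside the eps-tube."""
        rng = np.random.default_rng(seed)
        svs, thetas = [], []
        for t in range(1, T + 1):
            i = rng.integers(len(X))                        # pick a random training sample
            x_t, y_t = X[i], y[i]
            eta_t = C / t
            f_t = sum(th * gaussian_kernel(sv, x_t, sigma)  # f(x_t) = sum_j theta_j K(x_j, x_t)
                      for sv, th in zip(svs, thetas))
            thetas = [(1.0 - eta_t) * th for th in thetas]  # theta_j <- (1 - eta_t) theta_j
            err = f_t - y_t
            if abs(err) > eps:                              # beta_t != 0: store a new SV
                svs.append(np.asarray(x_t))
                thetas.append(-C if err > eps else C)
        return svs, thetas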

3.3 Budgeted maintenance strategies

In this subsection, we employ a budget maintenance strategy to train non-linear SVR models. The strategy predefines a fixed budget B, meaning that the number of SVs cannot exceed B during the iterations. When the number of SVs exceeds B, a budget maintenance step is executed. There are three main budget maintenance strategies: merging, projection, and removal. Based on the performance of the different strategies reported in [10], we combine the SGD method with the merging strategy. The process of applying the merging strategy to SVR is described below.

The goal of the budgeted strategy is to minimize the averaged gradient error \(\bar{E}\), as in [10]. This is achieved by minimizing the current gradient error \(\Vert E_{t}\Vert\) at each round, according to the theorem in the original paper [10]. The gradient error is defined as \(E_{t}=\frac{\Delta _{t}}{\eta _{t}}\), where \(\eta _{t}\) is the learning rate. Therefore, we finally solve the following objective:

$$\begin{aligned} min\Vert \bigtriangleup _{t}\Vert ^{2}, \end{aligned}$$
(10)

where \(\Vert \bigtriangleup _{t}\Vert\) is the weight degradation. In the following, we address Eq. 10 using the merging strategy, which merges two SVs into a newly created one. Suppose we merge \(\Phi (x_{m})\) and \(\Phi (x_{n})\) into a new point M; then M is represented by

$$\begin{aligned} M = \frac{\alpha _{m}\Phi (x_{m}) + \alpha _{n}\Phi (x_{n})}{(\alpha _{m} + \alpha _{n})}, \end{aligned}$$

assuming that \(\alpha _{m} + \alpha _{n} \ne 0\). In order to keep the weight unchanged, the coefficient of M is set to \(\alpha _{m} + \alpha _{n}\). Since M generally has no exact pre-image in the input space, the problem is turned into finding an input vector z whose image \(\Phi (z)\) is at minimum distance from M [10]. The objective function then becomes:

$$\begin{aligned} \begin{array}{l} \quad \mathop {\min }\limits _z\;\Vert M - \Phi (z)\Vert ^{2}\\ \qquad = \mathop {\min }\limits _z\;\left(M^{T}M + \Phi (z)^{T}\Phi (z) - 2M^{T}\Phi (z)\right).\\ \end{array} \end{aligned}$$
(11)

In this paper, the Gaussian kernel is used in the experiments. It is a radial kernel, so it can be expressed as \(k(x,x') = \tilde{k}(\Vert x-x'\Vert ^{2})\), with \(k(x,x')\le 1\) and \(k(x,x)=1\). With this property, Eq. 11 can be written as:

$$\begin{aligned} \mathop {\min }\limits _z\;\left(2 - 2M^{T}\Phi (z)\right). \end{aligned}$$
(12)

Eq. 12 can then be reduced to:

$$\begin{aligned} \begin{array}{rl} \mathop {\max }\limits _z\;f(z) &amp;= \mathop {\max }\limits _z\;M^{T}\Phi (z) \\ &amp;= \left( \frac{\alpha _{m}\Phi (x_{m}) + \alpha _{n}\Phi (x_{n})}{\alpha _{m} + \alpha _{n}}\right) ^{T}\Phi (z)\\ &amp;= \frac{1}{\alpha _{m} + \alpha _{n}} \left( \alpha _{m}\Phi (x_{m})^{T}\Phi (z) + \alpha _{n}\Phi (x_{n})^{T}\Phi (z)\right) \\ &amp;= \frac{1}{\alpha _{m} + \alpha _{n}} \left( \alpha _{m}\tilde{k}(\Vert x_{m}-z\Vert ^{2}) + \alpha _{n}\tilde{k}(\Vert x_{n}-z\Vert ^{2})\right) . \end{array} \end{aligned}$$
(13)

Next, we take the derivative of Eq. 13 with respect to z and set it to zero:

$$\begin{aligned} \nabla _{z}f(z) = 0. \end{aligned}$$

This yields a formula for z parameterized by a scalar h:

$$\begin{aligned} z = hx_{m} + (1-h)x_{n}, \end{aligned}$$
(14)

where \(h = \frac{\alpha _{m}\tilde{k}'(\Vert x_{m}-z\Vert ^{2})}{\alpha _{m}\tilde{k}'(\Vert x_{m}-z\Vert ^{2}) + \alpha _{n}\tilde{k}'(\Vert x_{n}-z\Vert ^{2})}\), and \(\tilde{k}'(x)\) is the first derivative of \(\tilde{k}\). Substituting Eq. 14 into Eq. 13, the objective function reduces to finding the optimal h:

$$\begin{aligned} \mathop {\max }\limits _h \left( \frac{\alpha _{m}}{\alpha _{m} + \alpha _{n}}k_{1-h}(x_{m},x_{n}) + \frac{\alpha _{n}}{\alpha _{m} + \alpha _{n}}k_{h}(x_{m},x_{n})\right) , \end{aligned}$$

where \(k_{h}(x,x') = k(hx,hx')\). We have thus transformed the objective into a search for the optimal h. Following [10], we use the golden section search to find it; the optimal z is then obtained from Eq. 14.
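A possible Python sketch of this merging step for the Gaussian kernel: the golden section search looks for h in [0, 1] (which assumes \(\alpha _{m}\) and \(\alpha _{n}\) have the same sign and \(\alpha _{m} + \alpha _{n} \ne 0\)) and returns the merged input z of Eq. (14). The routine is ours and only illustrative:

    import numpy as np

    def merge_two_svs(x_m, x_n, a_m, a_n, sigma, tol=1e-4):
        """Golden section search for h maximizing
        (a_m k_{1-h}(x_m,x_n) + a_n k_h(x_m,x_n)) / (a_m + a_n); then z = h x_m + (1-h) x_n."""
        d2 = float(np.sum((np.asarray(x_m) - np.asarray(x_n)) ** 2))   # ||x_m - x_n||^2

        def obj(h):
            # k_h(x_m, x_n) = exp(-h^2 ||x_m - x_n||^2 / (2 sigma^2)) for the Gaussian kernel
            return (a_m * np.exp(-(1.0 - h) ** 2 * d2 / (2.0 * sigma ** 2))
                    + a_n * np.exp(-h ** 2 * d2 / (2.0 * sigma ** 2))) / (a_m + a_n)

        lo, hi = 0.0, 1.0
        ratio = (np.sqrt(5.0) - 1.0) / 2.0
        c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
        while hi - lo > tol:
            if obj(c) > obj(d):           # keep the sub-interval containing the maximum
                hi = d
            else:
                lo = c
            c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
        h = 0.5 * (lo + hi)
        return h * np.asarray(x_m) + (1.0 - h) * np.asarray(x_n)       # Eq. (14)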

After obtaining the optimal z, the remaining weight degradation to be minimized is:

$$\begin{aligned} \Vert \bigtriangleup _{t}\Vert ^{2} \equiv \min \Vert \alpha _{m}\Phi (x_{m}) + \alpha _{n}\Phi (x_{n}) - \alpha _{z}\Phi (x_{z})\Vert ^{2}. \end{aligned}$$
(15)

A remaining question is which pair of SVs to merge. To reduce the computational cost, we first select the SV with the smallest value of \(\Vert \alpha _{m}\Vert ^{2}\) as one member of the pair, and then choose its partner through a loop over the remaining SVs. The overall framework of BSGD is listed in Algorithm 2. At the beginning, a budget B is predefined to bound the number of SVs, and a counter b records the current number of SVs. During the iterations, whenever b exceeds B, the budget maintenance step is executed.

Algorithm 2 BSGD for kernelized SVR
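To make the framework concrete, the sketch below mirrors the BSGD loop of Algorithm 2 in Python, reusing gaussian_kernel and merge_two_svs from the earlier sketches; the pair selection (smallest \(\Vert \alpha \Vert ^{2}\) plus a loop over the remaining SVs) follows the description above, and all helper names are ours:

    import numpy as np

    def weight_degradation(x_m, x_n, z, a_m, a_n, sigma):
        """||a_m Phi(x_m) + a_n Phi(x_n) - (a_m + a_n) Phi(z)||^2 for a Gaussian kernel (Eq. (15))."""
        k = lambda u, v: np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / (2.0 * sigma ** 2))
        a_z = a_m + a_n
        return (a_m ** 2 + a_n ** 2 + a_z ** 2 + 2 * a_m * a_n * k(x_m, x_n)
                - 2 * a_m * a_z * k(x_m, z) - 2 * a_n * a_z * k(x_n, z))

    def bsgd_svr(X, y, C, eps, sigma, B, T, seed=0):
        """BSGD for kernel SVR (cf. Algorithm 2): SGD updates plus merging when |SVs| > B."""
        rng = np.random.default_rng(seed)
        svs, thetas = [], []
        for t in range(1, T + 1):
            i = rng.integers(len(X))
            x_t, y_t = X[i], y[i]
            eta_t = C / t
            f_t = sum(th * gaussian_kernel(sv, x_t, sigma) for sv, th in zip(svs, thetas))
            thetas = [(1.0 - eta_t) * th for th in thetas]
            err = f_t - y_t
            if abs(err) > eps:
                svs.append(np.asarray(x_t))
                thetas.append(-C if err > eps else C)
            if len(svs) > B:                                    # budget exceeded: merge a pair
                m = int(np.argmin([th ** 2 for th in thetas]))  # SV with the smallest coefficient norm
                best = None
                for n in range(len(svs)):
                    if n == m:
                        continue
                    # assumes thetas[m] + thetas[n] != 0, as in the text
                    z = merge_two_svs(svs[m], svs[n], thetas[m], thetas[n], sigma)
                    deg = weight_degradation(svs[m], svs[n], z, thetas[m], thetas[n], sigma)
                    if best is None or deg < best[0]:
                        best = (deg, n, z)
                _, n, z = best
                new_theta = thetas[m] + thetas[n]               # coefficient of the merged SV
                svs = [sv for j, sv in enumerate(svs) if j not in (m, n)] + [z]
                thetas = [th for j, th in enumerate(thetas) if j not in (m, n)] + [new_theta]
        return svs, thetas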

4 Experiments

In this section, we evaluate the BSGD method with the merging strategy and compare it to related algorithms, including the original SGD method [23] and LIBSVM [24], on 9 datasets.

4.1 Experimental settings

We select 9 datasets in total for our experiments; all except the 3 wind-speed datasets are publicly available from the LIBSVM or UCI repositories. We preprocess the datasets as follows. First, all features and target values are normalized into [0, 1], except for the 3 wind-speed datasets. Second, to make the experiments large-scale, the datasets Cadata and CASP are replicated 10 times, and Cpusmall and Space_ga are replicated 100 times. The properties of these datasets, including the numbers of features, training and test samples, are listed in Table 1.

Table 1 Properties of different datasets

In addition, the optimal parameters are selected by a grid search over each parameter. For non-linear SVR, the RBF kernel is used in all algorithms. SVR with the SGD or BSGD method uses a cross loop of regularization \(C = [10^{-4},10^{-3},\ldots ,10^{4}]\), RBF kernel width \(\sigma = [10^{-4},10^{-3},\ldots ,10^{4}]\) and \(\epsilon = [2^{-10},2^{-9},\ldots ,2^{0}]\), while LIBSVM uses a cross loop of \(penalty\_C = [2^{-2},2^{-1},\ldots ,2^{6}]\), \(\epsilon = [2^{-10},2^{-9},\ldots ,2^{2}]\) and RBF kernel width \(\sigma = [10^{-4},10^{-3},\ldots ,10^{4}]\). Table 2 lists the optimal parameter values.
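A schematic of this grid search for the SGD/BSGD setting in Python; validate is a hypothetical placeholder that is assumed to return the validation MSE for a given parameter triple:

    import itertools

    def grid_search(validate):
        """Pick (C, sigma, eps) minimizing validate(C, sigma, eps) over the grids of Sect. 4.1."""
        C_grid     = [10.0 ** p for p in range(-4, 5)]    # 10^-4 ... 10^4
        sigma_grid = [10.0 ** p for p in range(-4, 5)]    # 10^-4 ... 10^4
        eps_grid   = [2.0 ** p for p in range(-10, 1)]    # 2^-10 ... 2^0
        return min(itertools.product(C_grid, sigma_grid, eps_grid),
                   key=lambda params: validate(*params))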

All experiments are conducted on a machine with 8 GB RAM and a 3.6 GHz Intel Core processor. Our proposed algorithms are implemented in MATLAB 2013b.

Table 2 The optimal parameters of different algorithms

4.2 Experimental results and analysis

In this subsection, we report the performance of the algorithms on the large-scale datasets described above. Mean squared error (MSE), R-square, training time, test time and the number of SVs are the 5 indexes used to evaluate the proposed algorithms. Tables 3, 4 and 5 detail the performance of the algorithms on the different datasets. In the tables, the numbers in brackets denote the number of SVs (Table 3) and the test time (Table 4) obtained by each algorithm. The best MSE and R-square, and the shortest training and test times, are indicated in bold.

Figures 1, 2, 3, 4, 5, 6, 7, 8 and 9 show how the accuracy, training time, test time and number of SVs of each algorithm change as the number of samples increases on all datasets. The algorithms are represented by lines with different shapes and colors. From these figures, we can see that when the dataset is small, the MSE is high and the training and test times of the BSGD methods are not significantly shorter than those of LIBSVM. This shows that the BSGD methods are more suitable for large datasets than LIBSVM.

As shown in the tables and figures, the BSGD algorithm clearly achieves competitive accuracy with significantly improved computational efficiency. Pegasos and LIBSVM have an advantage in accuracy owing to the large number of SVs in their models, which, however, results in a high computational cost. With the two budget methods, we sacrifice a little accuracy to achieve a great improvement in efficiency. In terms of training time, the proposed BSGD method is several times, even dozens of times, faster than the non-budget algorithms; in terms of test time, it is dozens, even hundreds, of times faster. Compared with the traditional methods, in which the number of SVs increases dramatically, the budget method keeps the number of SVs fixed. Moreover, as the number of training samples increases, the MSE tends to decrease, while the training and test times increase gradually. However, the training and test times of the BSGD methods increase more slowly than those of the non-budget methods; the test time in particular grows very slowly, with almost no increase for the BSGD methods. This demonstrates the efficiency and effectiveness of the proposed BSGD method.

Finally, Fig. 10 presents the MSE and training time of SVR with the BSGD method on the CASP dataset as the budget size increases. It shows that as the budget size increases, the MSE tends to decrease and the training time increases gradually. Tables 3 and 4 also show the results of the BSGD algorithm with two budgets, 200 and 500. As the budget size decreases, both the accuracy and the training time decrease gradually, so we can balance accuracy and efficiency by regulating the budget size. As for how to choose a suitable budget, one can normally pick a rough value at the point where the precision starts to improve relatively slowly; for different datasets, the budget can be chosen depending on the preference for efficiency or accuracy.

Table 6 shows the per-update time and space complexities of the different methods. Our method has constant space and constant time complexity per update, namely O(B). In contrast, the time and space complexities of the SGD method grow with the number of SVs and samples, respectively. Therefore, for large datasets the complexity of SGD becomes large, while that of our method depends only on the predefined budget B.

Table 3 Testing MSE and the number of SVs using the different algorithms
Table 4 Training time (s) and test time (s) using the different algorithms
Table 5 R-square of different algorithms
Table 6 Complexities on different algorithms
Fig. 1 MSE, training time, test time and SVs of Space_ga

Fig. 2 MSE, training time, test time and SVs of Cpusmall

Fig. 3 MSE, training time, test time and SVs of TFIDF-2006

Fig. 4 MSE, training time, test time and SVs of Cadata

Fig. 5 MSE, training time, test time and SVs of CASP

Fig. 6 MSE, training time, test time and SVs of Wind-ningxia

Fig. 7 MSE, training time, test time and SVs of Wind-speed

Fig. 8 MSE, training time, test time and SVs of MSD

Fig. 9 MSE, training time, test time and SVs of Wind-saihanba

Fig. 10 MSE and training time of CASP with different budget sizes

4.3 Experimental summary

In this subsection, we summarize the conclusions obtained in our experiments.

  1. Our proposed non-linear SVR with the BSGD method can solve large-scale non-linear regression problems effectively and avoids the huge time cost incurred by traditional methods.

  2. By regulating the budget size, we can balance the accuracy and efficiency of the SVR models when dealing with non-linear regression problems.

5 Conclusions

In this paper, we propose a BSGD method to solve large-scale non-linear SVR problems. It extends the SGD method to the optimization of SVR and combines it with a budget strategy that controls the number of SVs to overcome the curse of kernelization. Our empirical results on different large-scale datasets demonstrate that the proposed SVR with the BSGD method solves large-scale non-linear regression problems well: it achieves competitive accuracy with significantly improved computational efficiency. Moreover, by controlling the budget size, we can balance the accuracy and efficiency of non-linear regression tasks.