
1 Introduction

In recent decades, the progress in computer hardware, the increasing number of internet users and mobile device access to the internet have enabled an explosion of data. Researchers from the University of California, Berkeley estimate that about 1 Exabyte (\(10^9\) Gigabytes) of data is generated every year [1]. A recent book [2] reports that Google, Yahoo!, Microsoft, Facebook, Twitter, YouTube and other internet-based companies hold Exabytes of data due to hundreds of millions of users and billions of daily active users. Such a huge amount of data poses challenges for data analysis because current analytical techniques are not well suited to that data scale. It is therefore a high priority to create new learning algorithms that can address massive datasets. Our research aims to propose parallel learning algorithms of local support vector regression (local SVR) for dealing with large datasets.

Support vector machines (SVM), proposed by [3], and kernel-based methods have shown practical relevance for classification, regression and novelty detection. Successful applications are reported for face identification, text categorization and bioinformatics [4]. Nevertheless, SVM learning is accomplished through quadratic programming (QP), so the computational cost of an SVM approach is at least quadratic in the number of training datapoints, making SVM impractical for large datasets. There is a need to scale up SVM learning algorithms to deal with massive datasets.

In this paper, our investigation aims at developing new parallel algorithms of local SVR to efficiently handle the non-linear regression of large datasets.

Instead of building a global SVR model, which the classical algorithm can hardly do on large datasets, our kSVR algorithm [5] learns in parallel an ensemble of local models that are easily trained by a standard SVR implementation such as LibSVM [6]. The kSVR algorithm performs the training task in two main steps. The first step uses the kmeans algorithm [7] to partition the large training dataset into k clusters (subsets); the idea is to reduce the data size for training local non-linear SVR models in the second step. The algorithm then learns k non-linear SVR models in parallel on multi-core computers, where one SVR model is trained on each cluster to predict the data locally. We also propose a new ensemble-based learning algorithm, called krSVR, whose training task learns multiple random kSVR models to improve the generalization capacity of a single kSVR.

The performance analysis in terms of algorithmic complexity and generalization capacity, together with the numerical test results on five large datasets from the UCI repository [8], shows that our proposed kSVR and krSVR algorithms are faster than the standard SVR for the non-linear regression of large datasets while achieving high prediction correctness.

The paper is organized as follows. Section 2 briefly introduces the SVR algorithm. Section 3 presents our proposed parallel algorithm kSVR of local SVR models for non-linear regression on large datasets. The learning algorithm krSVR for the ensemble of T random kSVR models is described in Sect. 4. The experimental results are presented in Sect. 5, before the discussion of related works in Sect. 6. We then conclude in Sect. 7.

2 Support Vector Regression

Consider the regression task for a training dataset with m datapoints \(x_i\) (\(i=1,\ \ldots ,m\)) in the n-dimensional input space \(R^n\), having corresponding targets \(y_i \in R\). Support vector regression (SVR), proposed by [3], tries to find the best hyperplane (denoted by the normal vector \(w \in R^n\) and the scalar \(b \in R\)) that has at most \(\varepsilon \) deviation from the target values \(y_i\). The SVR pursues this goal with the quadratic programming problem (1) (Fig. 1).

$$\begin{aligned} \begin{array}{l} \min {(1/2) \displaystyle \sum _{i=1}^m \displaystyle \sum _{j=1}^m{(\alpha _i - {\alpha _i}^*)(\alpha _j - {\alpha _j}^*)K\langle x_i,x_j\rangle } - \displaystyle \sum _{i=1}^m(\alpha _i - {\alpha _i}^*)y_i + \varepsilon \displaystyle \sum _{i=1}^m(\alpha _i + {\alpha _i}^*)} \\ \\ s.t.\left\{ \begin{array}{l} \displaystyle \sum _{i=1}^m(\alpha _i - {\alpha _i}^*) = 0 \\ 0 \le \alpha _i, {\alpha _i}^* \le C \quad \forall i = 1,2,\ldots ,m \end{array} \right. \end{array} \end{aligned}$$
(1)

where C is a positive constant used to tune the trade-off between the margin and the error, and \(K\langle x_i,x_j \rangle = \langle x_i \cdot x_j\rangle \) is the linear kernel function.

Fig. 1. Linear support vector regression

The support vectors (for which \(\alpha _i, {\alpha _i}^* > 0\)) are given by the solution of the quadratic programming (1), and then, the predictive hyperplane and the scalar b are determined by these support vectors. The prediction of a new datapoint x based on the SVR model is as follows:

$$\begin{aligned} predict(x, SVR\ model) = \sum _{i=1}^{\sharp SV}{(\alpha _i - {\alpha _i}^*)K\langle x, x_i\rangle } - b. \end{aligned}$$
(2)

SVR algorithms can use different kernel functions [9] to handle variations of prediction problems. One only needs to replace the usual linear kernel function \(K\langle x_i,x_j \rangle = \langle x_i \cdot x_j\rangle \) with other popular non-linear kernel functions, including:

  • a polynomial function of degree d: \(K\langle x_i,x_j\rangle = (\langle x_i \cdot x_j\rangle + 1)^d .\)

  • a RBF (Radial Basis Function): \(K\langle x_i,x_j\rangle = e^{-\gamma \Vert x_i - x_j\Vert ^2}.\)

SVR models are highly accurate and of practical relevance, as shown by the many successful applications reported in [4].
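To make the training and prediction steps of this section concrete, the following is a minimal sketch (not the paper's C/C++ implementation) of training a non-linear SVR with an RBF kernel and predicting a new datapoint, written in Python with scikit-learn, whose SVR class wraps LibSVM; the synthetic data and hyper-parameter values are placeholders chosen only for illustration.

# Minimal sketch of non-linear SVR training (QP (1)) and prediction (Eq. (2)).
# The paper's actual implementation calls LibSVM from C/C++; scikit-learn's SVR,
# a LibSVM wrapper, is used here only for brevity.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 1))                  # 500 datapoints, 1 dimension
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.normal(size=500)    # noisy non-linear target

# Hyper-parameters: C (margin/error trade-off), gamma (RBF kernel), epsilon (tolerance)
model = SVR(kernel="rbf", C=1e5, gamma=10.0, epsilon=0.05)
model.fit(X, y)

x_new = np.array([[0.25]])
print(model.predict(x_new))    # prediction of a new datapoint x, as in Eq. (2)
print(len(model.support_))     # number of support vectors (alpha_i, alpha_i^* > 0)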

3 Parallel Algorithm of Local Support Vector Regression

The study in [10] showed that the computational cost of the SVR solution in (1) is at least \(O(m^2)\) (where m is the number of training datapoints), making standard SVM intractable for large datasets. Learning a global SVR model on a full massive dataset is challenging due to the very high computational cost.

Fig. 2. Global SVR model (left part) versus 3 local SVR models (right part)

Fig. 3. Training k local SVR models (kSVR)

3.1 Learning k Local SVR Models

Our proposed kSVR algorithm learns an ensemble of local SVR models that are easily trained by the standard SVR algorithm. As illustrated in Fig. 3, the kSVR handles the regression task in two main steps. The first step uses the kmeans algorithm [7] to partition the full training set into k clusters, and the second step trains an ensemble of local SVR models in which a non-linear SVR is learnt from each cluster to predict the data locally. We consider the simplest regression task, with a target variable y and a single predictor variable x. Figure 2 compares a global SVR model (left part) with 3 local SVR models (right part) for this regression task, using a non-linear RBF kernel function with \(\gamma = 10\), a positive constant \(C=10^5\) (i.e. the hyper-parameters \(\theta = \{\gamma , C\}\)) and a tolerance \(\varepsilon = 0.05\).

Algorithm 1. Parallel training of k local SVR models (kSVR)

Since the cluster size is much smaller than the full training data size, the standard SVR can easily perform the training task on each data cluster. Furthermore, the kSVR learns the k local models from the k clusters independently, so this training task is easily parallelized to exploit high-performance computing, e.g. multi-core computers or grids. The simplest development of the parallel kSVR algorithm is based on the shared-memory multiprocessing programming model OpenMP [11] on multi-core computers. The parallel training of kSVR is described in Algorithm 1.
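The following is a minimal Python sketch of this training procedure, assuming scikit-learn's KMeans and SVR (a LibSVM wrapper) and joblib for the parallel loop; the function and parameter names are illustrative, and the paper's actual implementation is written in C/C++ with OpenMP and LibSVM.

# Sketch of kSVR training: kmeans partition, then one local non-linear SVR per cluster,
# trained in parallel. Illustrative only; not the authors' C/C++/OpenMP code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR
from joblib import Parallel, delayed

def train_ksvr(X, y, k, gamma, C, epsilon, n_jobs=4):
    # Step 1: partition the full training set into k clusters.
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels = kmeans.labels_

    # Step 2: train one local SVR model per cluster, in parallel.
    def train_local(c):
        idx = np.where(labels == c)[0]
        lsvr = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
        lsvr.fit(X[idx], y[idx])
        return lsvr

    local_models = Parallel(n_jobs=n_jobs)(delayed(train_local)(c) for c in range(k))
    # kSVR-model = {(c_1, lsvr_1), ..., (c_k, lsvr_k)}, with the centres held by kmeans
    return kmeans, local_models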

3.2 Prediction of a New Datapoint x Using k Local SVR Models

The kSVR-model = \( \{(c_1, lsvr_1), (c_2, lsvr_2), \dots , (c_k, lsvr_k)\}\) is used to predict the target value of a new datapoint x as follows. The first step is to find the closest cluster based on the distance between x and the cluster centers:

$$\begin{aligned} c_{NN} = \mathop {\text {arg}\,\text {min}}\limits _c\ distance(x, c). \end{aligned}$$
(3)

And then, the target value of x is predicted by the local SVR model \(lsvr_{NN}\) (corresponding to \(c_{NN}\)):

$$\begin{aligned} predict(x, kSVR\ model) = predict(x, lsvr_{NN}). \end{aligned}$$
(4)

We can see that the prediction of a new datapoint x in k local SVR models depends not only on the clustering assignment but also on the prediction of the local SVR model at the final stage. It is the local SVR model that directly predicts the outcome of the new datapoint x.
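Continuing the illustrative sketch above (same hypothetical names), the prediction rule of Eqs. (3) and (4) can be written as follows; KMeans.predict returns the index of the closest cluster center, which plays the role of \(c_{NN}\).

# Sketch of kSVR prediction (Eqs. 3-4): nearest cluster centre, then the local SVR.
def predict_ksvr(kmeans, local_models, X_new):
    nearest = kmeans.predict(X_new)            # c_NN = argmin_c distance(x, c)
    return np.array([local_models[c].predict(x.reshape(1, -1))[0]
                     for c, x in zip(nearest, X_new)])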

We now analyse the performance in terms of algorithmic complexity and generalization capacity to justify the effectiveness of the kSVR learning algorithm.

3.3 Performance Analysis

Let’s start by analysing the algorithmic complexity of building k local SVR models with the parallel kSVR algorithm. The full dataset with m datapoints is partitioned into k balanced clusters, so the cluster size is about \(\frac{m}{k}\). According to [10], the computational cost of the SVR solution is quadratic in the number of training datapoints. Therefore, the training complexity of a local SVR is \(O(\frac{m^2}{k^2})\), and the algorithmic complexity of training the k local SVR models in parallel on a P-core processor is:

$$\begin{aligned} O(\frac{k}{P}\frac{m^2}{k^2}) = O(\frac{m^2}{k \cdot P}). \end{aligned}$$
(5)

This complexity analysis illustrates that the parallel learning of k local SVR models in the kSVR algorithm is kP times faster than building a global SVR model (whose complexity is at least \(O(m^2)\)).
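As a concrete illustration of the bound (5), with figures chosen purely for illustration (they are not taken from the experiments in Sect. 5): for \(m = 10^6\) datapoints, \(k = 5000\) clusters (cluster size about 200) and \(P = 4\) cores, the idealized speed-up over a global SVR is

$$\begin{aligned} \frac{O(m^2)}{O(\frac{m^2}{k \cdot P})} = k \cdot P = 5000 \times 4 = 20000. \end{aligned}$$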

Fig. 4. The regression task of SVR (left part) can be regarded as a binary classification problem of SVC (right part)

The generalization capacity of kSVR models trained by the kSVR algorithm can be explained in terms of the margin size of the binary support vector classification (SVC).

An intuitive geometric approach in [12] shows that the regression task of the SVR can be regarded as a classification problem of the SVC. Let x be the predictor variable and y the target variable. In the left part of Fig. 4, the SVR approach tries to find the optimal plane (\(w \cdot x - b = 0\)) that has at most \(\varepsilon \) deviation from the target values \(y_i\). This problem is transformed into a binary classification one, as illustrated in the right part of Fig. 4. The positive class \(D+\) is formed by increasing the target values \(y_i\) by \(\varepsilon \), and the negative class \(D-\) is formed by decreasing the target values \(y_i\) by \(\varepsilon \). The optimal plane found by the SVC approach for separating \(D+\) from \(D-\) is identical to the solution of the SVR approach. The biggest-margin solution (the largest separation boundary between the two classes) gives the safest prediction model. It means that we can explain the generalization capacity of the models trained by the kSVR algorithm in terms of the largest margin size of the binary SVC.
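The construction of the two classes is easy to state explicitly; the short sketch below (reusing the illustrative X, y and \(\varepsilon = 0.05\) from the earlier sketch) builds \(D+\) and \(D-\) by shifting the targets, which is all the transformation requires.

# Sketch of the geometric view in Fig. 4: turn the regression set into two classes.
eps = 0.05
D_plus  = np.column_stack([X[:, 0], y + eps])   # positive class: (x_i, y_i + eps)
D_minus = np.column_stack([X[:, 0], y - eps])   # negative class: (x_i, y_i - eps)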

The performance analysis in terms of the generalization capacity of such local models is presented in [13,14,15]. Recently, Do and colleagues [16, 17] showed that learning algorithms of local SVC models give a guarantee on the generalization capacity compared to the global SVC one.

To assess the generalization capacity of local SVR models, we start with Theorem 5.2 ([18] p. 139), which states that the generalization ability of the large-margin hyperplane is high. Given a training set with m datapoints separated by the maximal-margin hyperplane, the expectation of the probability of test error is bounded as follows:

$$\begin{aligned} EP_{error} \le E\left\{ min\left( \frac{sv}{m}, \frac{1}{m}\left[ \frac{R^2}{\varDelta } \right] , \frac{n}{m} \right) \right\} . \end{aligned}$$
(6)

where sv is the number of support vectors, R is the radius of the sphere containing the data, \(\varDelta \) is the size of the margin and n is the number of dimensions.

This means that the good generalization ability of the maximal-margin hyperplane is reflected in:

$$\begin{aligned} min \left( \frac{1}{m}\left[ \frac{R^2}{\varDelta } \right] \right) . \end{aligned}$$
(7)

In the training task, the kSVR algorithm splits the full dataset of m datapoints into k clusters (the cluster size \(m_k\) is about \(\frac{m}{k}\)). The generalization capacity of a local SVR model is then assessed by:

$$\begin{aligned} min\left( \frac{1}{m_k}\left[ \frac{R_k^2}{\varDelta _k} \right] \right) . \end{aligned}$$
(8)

The generalization analysis is based on the comparison between Eq. (7) for the global SVR model trained on the full dataset and Eq. (8) for a local SVR model learnt from a cluster (subset). According to Theorem 1 in [16] and Theorems 2 and 3 in [17], the inequality \(\varDelta _{X_k} \ge \varDelta _X\) holds. On the other hand, the use of a subset in the training task of kSVR leads to \(R_k \le R\) but also \(m_k \le m\). This illustrates that there is a compromise between the locality (the subset size and the radius of the sphere containing the data) and the generalization capacity (the margin size). Therefore, local SVR models trained by the kSVR algorithm can guarantee the prediction performance compared to the global SVR one.

The performance analysis in terms of the algorithmic complexity (5) and the generalization capacity (8) shows that the parameter k is used in the kSVR to give a trade-off between the generalization capacity and the computational cost. This can be understood as follows:

  • If k is large, then the kSVR algorithm significantly reduces the training time. However, the cluster size is then small; the locality is extreme and the generalization capacity is very low.

  • If k is small, then the kSVR algorithm yields only a small reduction in training time. However, the cluster size is large, which improves the generalization capacity.

This leads us to set k so that the cluster size is large enough (e.g. about 200 datapoints, as proposed by [14]).
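In practice this rule of thumb amounts to a one-line computation; the helper below is purely illustrative (its name is not from the paper).

# Choose k so that each cluster holds about 200 datapoints (rule of thumb following [14]).
def choose_k(m, target_cluster_size=200):
    return max(1, round(m / target_cluster_size))

# e.g. choose_k(400000) -> 2000 clusters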

Fig. 5. Training algorithm of T random kSVR models

Algorithm 2. Training T random kSVR models (krSVR)

4 Parallel Algorithm of T Random Local Support Vector Regression Models

The performance analysis of the kSVR algorithm shows that there is a trade-off between the algorithmic complexity and the generalization capacity: if the kSVR speeds up the training time over the learning algorithm of a global SVR model by increasing the number of clusters (the parameter k), then it reduces the generalization ability. To address this problem, we propose an ensemble-based learning algorithm of random local SVR models to improve the generalization capacity of the local one. The main idea is based on the bias-variance framework proposed by Breiman [19, 20]. The performance of a learning model depends on a bias term and a variance term. Bias is the systematic error, which does not depend on the learning sample. Variance is the error due to the variability of the learning model arising from the randomness of the learning sample. The main idea of ensemble-based learning algorithms [19,20,21] and [22] is to use randomization of the learning sample to reduce variance and/or bias, which improves the generalization capacity compared to using a single model. Therefore, we propose the ensemble learning algorithm krSVR to train T random kSVR models for reducing variance, so that krSVR improves the generalization ability of a single kSVR model.

4.1 Learning T Random Local SVR Models

The krSVR algorithm trains an ensemble of T random kSVR models using the kSVR algorithm (described in Algorithm 1 and Fig. 3). The \(t^{th}\) kSVR model is trained on the \(t^{th}\) bootstrap sample (drawn with replacement from the original dataset), using \(n'\) dimensions randomly sampled without replacement from the n original dimensions.

As described in Algorithm 2 and Fig. 5, the krSVR constructs the T random local SVR models independently, which allows the learning task to be parallelized with OpenMP on multi-core computers.
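A minimal sketch of this construction, continuing the illustrative Python setting used above (bootstrap sampling of datapoints, random sampling of \(n' \approx \sqrt{n}\) dimensions, then one kSVR model per replicate); in the paper the loop over the T replicates is the part parallelized with OpenMP, whereas here it is written sequentially for clarity.

# Sketch of krSVR training: T random kSVR models (Algorithm 2). Illustrative names only.
def train_krsvr(X, y, T, k, gamma, C, epsilon, n_jobs=4, seed=0):
    m, n = X.shape
    n_prime = max(1, int(round(np.sqrt(n))))                 # number of random dimensions
    rng = np.random.default_rng(seed)
    ensemble = []
    for t in range(T):
        sample_idx = rng.choice(m, size=m, replace=True)     # bootstrap sample
        dims = rng.choice(n, size=n_prime, replace=False)    # random dimensions
        kmeans, local_models = train_ksvr(X[np.ix_(sample_idx, dims)], y[sample_idx],
                                          k, gamma, C, epsilon, n_jobs)
        ensemble.append((dims, kmeans, local_models))
    return ensemble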

Thus, the complexity of parallel learning krSVR on a P-core processor is as follows:

$$\begin{aligned} O(\frac{T \cdot m^2}{k \cdot P}). \end{aligned}$$
(9)

4.2 Prediction of a New Datapoint x Using T Random Local SVR Models

The prediction for a new individual x is the average of the prediction results obtained by T random kSVR models.
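Under the same illustrative sketch, this averaging step is:

# Sketch of krSVR prediction: average of the T random kSVR model predictions.
def predict_krsvr(ensemble, X_new):
    preds = [predict_ksvr(kmeans, local_models, X_new[:, dims])
             for dims, kmeans, local_models in ensemble]
    return np.mean(preds, axis=0)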

5 Evaluation

We are interested in the performance of the new parallel algorithms of local SVR (kSVR, krSVR) for dealing with large datasets. Numerical test results in terms of training time and prediction correctness are needed to confirm the performance analysis of the algorithmic complexity and the generalization capacity in Sect. 3.

5.1 Software Programs

We have implemented the kSVR and krSVR algorithms in C/C++ with OpenMP [11], using the Automatically Tuned Linear Algebra Software (ATLAS) [23] and the highly efficient standard SVM library LibSVM [6].

For consistency with the performance analysis of the algorithmic complexity and the generalization capacity, our evaluation is reported in terms of training time and prediction correctness. We compare the regression results obtained by our proposed kSVR and krSVR (local SVR models) with those of LibSVM (global SVR models).

All experiments are run on a machine with Linux Fedora 20, an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores) and 32 GB of main memory.

5.2 Datasets

All experiments are conducted on five datasets from the UCI repository [8]. Table 1 describes the datasets; the evaluation protocol is given in the last column. The datasets are already divided into a training set (Trn) and a test set (Tst). We used the training data to build the SVR models and then predicted the test set using the resulting models.

Table 1. Description of datasets

5.3 Tuning Parameters

We use the RBF kernel in the training tasks of kSVR, krSVR and LibSVM for building SVR models because it is general and efficient [24]. A 2-fold cross-validation protocol is used to tune the regression tolerance \(\varepsilon \), the hyper-parameter \(\gamma \) of the RBF kernel (for two individuals \(x_i\), \(x_j\): \(K[i,j] = exp(-\gamma \Vert x_i - x_j\Vert ^2)\)) and the cost C (a trade-off between the margin size and the errors) to obtain good correctness. For the two largest datasets (Buzz in social media Twitter and YearPredictionMSD), we used a random subset of about 5% of the training data for tuning the hyper-parameters due to the expensive computational cost.
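For illustration, this 2-fold tuning of (\(\gamma \), C, \(\varepsilon \)) can be sketched with scikit-learn's GridSearchCV; the grid values below are placeholders and are not the values used in the paper.

# Sketch of 2-fold cross-validation for tuning gamma, C and epsilon of an RBF SVR.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {"gamma": [0.1, 1.0, 10.0],
              "C": [1e3, 1e4, 1e5],
              "epsilon": [0.01, 0.05, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=2, scoring="neg_mean_absolute_error")
# search.fit(X_sample, y_sample)   # e.g. a random 5% sample of the training set
# best_params = search.best_params_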

Our kSVR uses the parameter k, the number of local models (clusters). We propose to set k so that each cluster consists of about 200 individuals; this gives a trade-off between the generalization capacity [15] and the computational cost. Furthermore, the krSVR algorithm learns 20 random kSVR models (\(T = 20\)), with the number of random dimensions being the square root of the full set (\(n' = \sqrt{n}\), as recommended by [20]).

Table 2 presents the hyper-parameters of kSVR, krSVR and LibSVM in the regression.

Table 2. Hyper-parameters of kSVR, krSVR and LibSVM
Table 3. Regression results in terms of mean absolute error (MAE) and training time (minutes)

5.4 Regression Results

The regression results of LibSVM, kSVR and krSVR on the five datasets are given in Table 3, Figs. 6, 7 and 8.

Comparison in Training Time

As expected, our proposed algorithms of local SVR models (kSVR, krSVR) outperform LibSVM in terms of training time. The average training times of kSVR, krSVR and LibSVM are 8.45 min, 35.79 min and 1551.15 min, respectively. This means that kSVR is the fastest of the three learning algorithms: kSVR is 183.5 times faster than LibSVM, and krSVR is 43.34 times faster than LibSVM. The training time of krSVR is about 4.23 times longer than that of kSVR.

Fig. 6. Comparison of training time (minutes) for small datasets

Fig. 7. Comparison of training time (minutes) for large datasets

Fig. 8. Comparison of prediction results

For the three small datasets (Appliances energy prediction, Facebook comment volume, BlogFeedback), the training speed improvements of kSVR and krSVR versus LibSVM are 21.10 and 3.97 times, respectively.

For the two large datasets (Buzz in social media - Twitter and YearPredictionMSD), the training time improvements of kSVR and krSVR against LibSVM are more significant: 200.44 times and 48.63 times, respectively.

Comparison in Prediction Correctness

In terms of prediction correctness (measured by the mean absolute error, MAE), the average errors made by kSVR, krSVR and LibSVM are 23.50, 22.52 and 62.01, respectively. The comparison of prediction correctness, dataset by dataset, shows that:

  • kSVR has 4 wins and 1 defeat against LibSVM; kSVR is beaten only once (on the Appliances energy prediction dataset) by LibSVM;

  • krSVR is the most accurate, with 5/5 wins versus both LibSVM and kSVR.

The regression results show that our kSVR and krSVR are more accurate than LibSVM in prediction. It seems that local SVR models are well suited to large datasets: the data partition stage allows the local SVR algorithms to easily fit regression models to data subsets, which improves their prediction correctness. In particular, for a large dataset such as Buzz in social media - Twitter, the target domain is very wide ([0, 75724.5]), so the prediction error made by any algorithm is also high; our kSVR and krSVR reduce the prediction error by about a factor of 5 compared to LibSVM.

The numerical results demonstrate that kSVR and krSVR improve not only the training time but also the prediction correctness when dealing with large datasets. These regression results suggest that our proposed kSVR and krSVR are efficient for handling such large datasets.

6 Discussion on Related Works

Our proposal is related to large-scale SVM learning algorithms. Improvements of SVM training on very large datasets include effective heuristic methods for decomposing the original quadratic programming problem into a series of small problems [10, 25, 26] and [6]. Recent works [27, 28] proposed stochastic gradient descent methods for large-scale linear SVM solvers. Parallel and distributed algorithms [29,30,31] for linear classification improve the learning performance on large datasets by dividing the problem into sub-problems that execute on large numbers of networked PCs, grid computing or multi-core computers.

The review paper [32] provides a comprehensive survey of large-scale linear support vector classification. LIBLINEAR [33] and its extension [34] use the Newton method for the primal form of SVM and the coordinate descent approach for the dual form to deal with very large linear classification and regression. Parallel algorithms for logistic regression and linear SVM using Spark [35] are proposed in [36]. The distributed Newton algorithm for logistic regression [37] is implemented with the Message Passing Interface (MPI). The parallel dual coordinate descent method for linear classification [38] is implemented in multi-core environments using OpenMP. The incremental and decremental algorithms [39] aim at training linear classifiers on large data that cannot fit in memory. These algorithms efficiently handle large-scale linear classification tasks in very high-dimensional input spaces, but the computational cost of a non-linear SVM approach remains impractical. The work in [40] tries to automatically determine, for a given dataset, whether kernel classifiers perform strictly better than linear ones.

Our proposal is also related in some aspects to local SVM learning algorithms. A first approach classifies data with a hierarchical strategy: this kind of training algorithm performs the classification task in two main steps, first clustering the full dataset into homogeneous groups (clusters) and then learning local supervised classification models from the clusters. The paper [41] proposed to use the expectation-maximization (EM) clustering algorithm [42] for partitioning the training set into k clusters (the EM clustering algorithm makes a soft assignment based on the posterior probabilities [43]); for each cluster, a neural network (NN) is learnt to classify the individuals in the cluster. The parallel mixture of SVMs algorithm proposed by [44] constructs local SVM models instead of the NN ones in [41]. CSVM [45] uses the kmeans algorithm [7] to partition the full dataset into k disjoint clusters and then learns weighted local linear SVMs from the clusters. More recently, kSVM [46], krSVM [47] (a random ensemble of kSVM) and tSVM [16] train local non-linear SVMs in parallel instead of the weighted linear ones of CSVM. DTSVM [48, 49] uses a decision tree algorithm [50, 51] to split the full dataset into disjoint regions (tree leaves) and then builds local SVMs for classifying the individuals in the tree leaves. These algorithms aim at speeding up the learning time.

The second approach learns local supervised models from the k nearest neighbors (kNN) of a new testing individual. The first local learning algorithm, by Bottou & Vapnik [14], finds the kNN of a test individual, trains a neural network with only these k neighbors and applies the resulting network to the test individual. k-local hyperplane and convex distance nearest neighbor algorithms are also proposed in [52]. More recent local SVM algorithms use different methods for kNN retrieval, including SVM-kNN [53], which tries different metrics, ALH [54], which uses weighted distances and features, and FaLK-SVM [55], which speeds up the kNN retrieval with the cover tree [56].

A theoretical analysis of such local algorithms in [13] introduces the trade-off between the capacity of the learning system and the number of available individuals. The neighborhood size is used as an additional free parameter to control the generalization capacity against the locality of local learning algorithms.

7 Conclusion and Future Works

We presented new parallel algorithms of local SVR that achieve high performance for non-linear regression on large datasets. The training task of kSVR first splits the full training dataset into k clusters to reduce the data size, and then easily learns k non-linear SVR models in parallel on multi-core computers, with one SVR model trained on each cluster to predict the data locally. The krSVR learning algorithm improves the generalization capacity of a single kSVR by training an ensemble of random kSVR models. The performance analysis and the numerical test results on five datasets from the UCI repository showed that our proposed kSVR and krSVR are efficient in terms of training time and prediction correctness compared to the standard SVR. The training time improvements of kSVR and krSVR versus LibSVM are 183.5 and 43.3 times, and kSVR and krSVR improve the relative prediction correctness by 62.10% and 63.70%, respectively, compared to the standard SVR. An example of kSVR's effectiveness is the non-linear regression of the YearPredictionMSD dataset (400000 datapoints, 90 dimensions), trained in 6.33 min with a mean absolute error of 7.86 on the test set.

In the near future, we intend to provide more empirical tests on large benchmarks and comparisons with other algorithms. A promising avenue for future research is to automatically tune the hyper-parameters of kSVR and krSVR.