1 Introduction

Support vector machines (SVM), proposed by [1], and kernel-based methods have shown practical relevance for classification, regression and novelty detection. Successful applications are reported for face identification, text categorization and bioinformatics [2]. Nevertheless, SVM learning requires solving a quadratic programming (QP) problem, so the computational cost of an SVM approach is at least quadratic in the number of training datapoints, which makes SVMs impractical for large datasets. There is therefore a need to scale up SVM learning algorithms to deal with massive datasets.

In this paper, we propose a new parallel algorithm of local support vector regression (local SVR), called kSVR, for effectively handling non-linear regression on large datasets. Instead of building a global SVR model, which is very difficult for the classical algorithm when the dataset is large, the kSVR algorithm learns in parallel an ensemble of local models that are easily trained by the standard SVR algorithm. The kSVR algorithm performs the training task in two main steps. The first step uses the kmeans algorithm [3] to partition the training dataset into k clusters; the idea is to reduce the data size used for training the local non-linear SVR models in the second step. The algorithm then learns the k non-linear SVR models in parallel on a multi-core computer, where an SVR model is trained in each cluster to predict the data locally. Numerical test results on datasets from the UCI repository [4] show that our proposal is efficient compared to the standard SVR in terms of training time and prediction correctness. The kSVR algorithm is faster than the standard SVR for the non-linear regression of large datasets while maintaining high prediction correctness.

The paper is organized as follows. Section 2 briefly introduces the SVR algorithm. Section 3 presents our proposed parallel algorithm of local SVR for the non-linear regression of large datasets. Section 4 shows the experimental results. Section 5 discusses related works. We then conclude in Sect. 6.

2 Support Vector Regression

Given a training dataset of m datapoints \(x_i\) (\(i=1, \ldots ,m\)) in the n-dimensional input space \(R^n\) with corresponding targets \(y_i \in R\), support vector regression (SVR), proposed by [1], tries to find the best hyperplane (denoted by the normal vector \(w \in R^n\) and the scalar \(b \in R\)) that has at most \(\epsilon \) deviation from the targets \(y_i\). SVR pursues this goal by solving the quadratic programming problem (1).

$$\begin{aligned} \begin{array}{l} \min {(1/2) \displaystyle \sum _{i=1}^m \displaystyle \sum _{j=1}^m{(\alpha _i - {\alpha _i}^*)(\alpha _j - {\alpha _j}^*)K\langle x_i,x_j\rangle } - \displaystyle \sum _{i=1}^m(\alpha _i - {\alpha _i}^*)y_i + \epsilon \displaystyle \sum _{i=1}^m(\alpha _i + {\alpha _i}^*)} \\ \\ s.t.\left\{ \begin{array}{l} \displaystyle \sum _{i=1}^m(\alpha _i - {\alpha _i}^*) = 0 \\ 0 \le \alpha _i, {\alpha _i}^* \le C ~~\forall i = 1,2,\ldots ,m \end{array} \right. \end{array} \end{aligned}$$
(1)

where C is a positive constant used to tune the trade-off between the margin and the errors, and \(K\langle x_i,x_j\rangle \) is the linear kernel function, i.e. \(K\langle x_i,x_j\rangle = x_i \cdot x_j\).
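To make the roles of C and \(\epsilon \) concrete, the following minimal sketch fits a linear \(\epsilon \)-SVR on a toy dataset with scikit-learn; this is only an illustration under assumed data and parameter values, not the solver used in the paper.

```python
# Minimal sketch of fitting a linear epsilon-SVR (illustration only; the paper's
# experiments use LibSVM). The toy data and the values of C and epsilon are assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))             # 200 datapoints, 1 feature
y = 2.0 * X[:, 0] + 0.5 + 0.05 * rng.randn(200)  # noisy linear targets

# C tunes the margin/error trade-off; epsilon is the allowed deviation from the targets y_i.
model = SVR(kernel='linear', C=10.0, epsilon=0.05)
model.fit(X, y)
print(model.predict(X[:3]))
```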

Fig. 1. Linear support vector regression

The support vectors (datapoints for which \(\alpha _i > 0\) or \({\alpha _i}^* > 0\)) are given by the solution of the quadratic programming problem (1); the predictive hyperplane and the scalar b are then determined by these support vectors. The prediction of a new datapoint x based on the SVR model is as follows:

$$\begin{aligned} predict(x, SVR\ model) = \sum _{i=1}^{\sharp SV}{(\alpha _i - {\alpha _i}^*)K\langle x, x_i\rangle } - b \end{aligned}$$
(2)
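As an illustration of the prediction function (2), the short sketch below evaluates the kernel expansion over the support vectors; the argument names (support_vectors, alpha_diff, b, kernel) are invented for the example and are not taken from the paper's code.

```python
# Sketch of the prediction function (2): a kernel expansion over the #SV support vectors.
# support_vectors, alpha_diff (the values alpha_i - alpha_i*), b and kernel are assumed inputs.
def predict_svr(x, support_vectors, alpha_diff, b, kernel):
    # predict(x) = sum_{i=1}^{#SV} (alpha_i - alpha_i*) * K<x, x_i> - b
    return sum(a * kernel(x, sv) for a, sv in zip(alpha_diff, support_vectors)) - b
```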

Variations of the SVR algorithm use different prediction functions [5]: it suffices to replace the usual linear kernel function with another popular non-linear kernel function (a short code sketch of these kernels follows the list), including:

  • a polynomial function of degree \(d\): \(K\langle x_i,x_j\rangle = (x_i \cdot x_j + 1)^d\)

  • a RBF (Radial Basis Function): \(K\langle x_i,x_j\rangle = e^{-\gamma \Vert x_i - x_j\Vert ^2}\)
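These kernel functions can be written in a few lines, as sketched below; the polynomial form with the \(+1\) offset is one common convention and is given here as an assumption.

```python
# Common kernel functions (sketch). The "+1" offset in the polynomial kernel is one
# usual convention, given here as an assumption.
import numpy as np

def linear_kernel(xi, xj):
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, d=3):
    return (float(np.dot(xi, xj)) + 1.0) ** d

def rbf_kernel(xi, xj, gamma=10.0):
    return float(np.exp(-gamma * np.linalg.norm(np.asarray(xi) - np.asarray(xj)) ** 2))
```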

SVR models built with these kernels are accurate and of practical relevance, as shown by the many successful applications reported in [2].

Fig. 2. Training k local SVR models (kSVR)

3 Parallel Algorithm of Local Support Vector Regression

The study in [6] showed that the computational cost of solving the quadratic programming problem (1) is at least \(O(m^2)\) (where m is the number of training datapoints), making the standard SVR intractable for large datasets. Learning a global SVR model on a full massive dataset is therefore challenging due to the very high computational cost.

Learning k Local SVR Models

Our proposed kSVR algorithm learns an ensemble of local SVR models that are easily trained by the standard SVR algorithm. As illustrated in Fig. 2, kSVR handles the regression task in two main steps. The first step uses the kmeans algorithm [3] to partition the full training set into k clusters; the second step then trains an ensemble of local SVR models, where a non-linear SVR is learnt from each cluster to predict the data locally. Consider a simple regression task with a single target variable y and a single predictor variable x. Figure 3 compares a global SVR model (left part) with 3 local SVR models (right part) for this task, using a non-linear RBF kernel with \(\gamma = 10\), a positive constant \(C=10^5\) (i.e. the hyper-parameters \(\theta = \{\gamma , C\}\)) and a tolerance \(\epsilon = 0.05\).
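A comparison in the spirit of Fig. 3 can be reproduced with the sketch below; the sine-shaped toy data are an assumption (the data behind Fig. 3 are not given), while \(\gamma = 10\), \(C = 10^5\) and \(\epsilon = 0.05\) follow the values stated above.

```python
# Sketch of a Fig. 3-style comparison: one global RBF SVR vs. 3 local ones on toy 1-D data.
# The sine data are an assumption; gamma=10, C=1e5, epsilon=0.05 follow the text.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

rng = np.random.RandomState(1)
x = np.sort(rng.uniform(0, 1, 300)).reshape(-1, 1)
y = np.sin(4 * np.pi * x[:, 0]) + 0.05 * rng.randn(300)

global_svr = SVR(kernel='rbf', gamma=10, C=1e5, epsilon=0.05).fit(x, y)

km = KMeans(n_clusters=3, random_state=1).fit(x)
local_pred = np.empty_like(y)
for c in range(3):
    idx = (km.labels_ == c)
    local_pred[idx] = SVR(kernel='rbf', gamma=10, C=1e5, epsilon=0.05).fit(x[idx], y[idx]).predict(x[idx])

print(mean_absolute_error(y, global_svr.predict(x)),  # global model, error on the training data
      mean_absolute_error(y, local_pred))             # 3 local models, each on its own cluster
```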

Since a cluster is smaller than the full training set, the standard SVR can easily perform the training task on each data cluster. Furthermore, kSVR learns the k local models from the k clusters independently, so the training is easily parallelized to take advantage of high-performance computing platforms, e.g. multi-core computers or grids. The simplest implementation of the parallel kSVR algorithm is based on the shared-memory multiprocessing programming model OpenMP [7] on multi-core computers. The parallel training of kSVR is described in Algorithm 1.
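Algorithm 1 itself is not reproduced here, but the two training steps can be sketched as follows. This is an illustration under assumptions: it uses Python with scikit-learn and joblib for the parallel loop rather than the paper's C/C++/OpenMP/LibSVM implementation, and the function and parameter names are invented for the example.

```python
# Sketch of kSVR training (illustration only; the paper's implementation is C/C++ with
# OpenMP and LibSVM). Function names and defaults are assumptions.
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.svm import SVR

def _train_local_svr(Xc, yc, gamma, C, epsilon):
    # One local non-linear SVR trained on the datapoints of a single cluster.
    return SVR(kernel='rbf', gamma=gamma, C=C, epsilon=epsilon).fit(Xc, yc)

def train_ksvr(X, y, k, gamma, C, epsilon, n_jobs=4):
    # Step 1: partition the full training set into k clusters with kmeans.
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    # Step 2: train the k local SVR models in parallel (one task per cluster).
    local_svrs = Parallel(n_jobs=n_jobs)(
        delayed(_train_local_svr)(X[km.labels_ == c], y[km.labels_ == c], gamma, C, epsilon)
        for c in range(k))
    # The kSVR model is the list of (cluster center, local SVR) pairs.
    return list(zip(km.cluster_centers_, local_svrs))
```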

Fig. 3. Global SVR model (left part) versus 3 local SVR models (right part)

Performance Analysis

The performance analysis starts with the algorithmic complexity of building k local SVR models with the parallel kSVR algorithm. The full dataset with m datapoints is partitioned into k balanced clusters (each cluster contains about \(\frac{m}{k}\) datapoints). The training complexity of a local SVR is thus \(O((\frac{m}{k})^2)\). Therefore, the algorithmic complexity of training the k local SVR models in parallel on a P-core processor is \(O(\frac{k}{P}(\frac{m}{k})^2) = O(\frac{m^2}{kP})\). This complexity analysis shows that training the k local SVR models of the kSVR algorithm in parallel is kP times faster than building a global SVR model (whose complexity is at least \(O(m^2)\)).
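As a concrete illustration of this factor (using the 4-core machine of Sect. 4, the cluster size of about 200 datapoints recommended below, and the roughly 400000 training datapoints of the YearPredictionMSD dataset mentioned in Sect. 6), one would take \(k = 400000 / 200 = 2000\) clusters and \(P = 4\) cores, giving a theoretical speed-up of

$$\begin{aligned} kP = 2000 \times 4 = 8000 \end{aligned}$$

over training a single global SVR model; the measured speed-ups reported in Sect. 4 are smaller, since this count ignores, among others, the cost of the kmeans step.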

The performance analysis in terms of the generalization capacity of such local models is discussed in [8,9,10,11]. The parameter k gives kSVR a trade-off between the generalization capacity and the computational cost. This trade-off can be understood as follows:

  • If k is large, then the kSVR algorithm reduces the training time significantly, but the clusters are small; the models are extremely local and have a very low capacity.

  • If k is small, then the kSVR algorithm reduces the training time only slightly, but the clusters are large, which improves the capacity of the local models.

This leads to setting k so that the cluster size is large enough (e.g. about 200 datapoints, as proposed by [9]).
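In code, this rule of thumb is a one-liner; the helper name below is invented for the illustration.

```python
# Rule of thumb: choose k so that each cluster holds about 200 datapoints (helper name is illustrative).
def choose_k(m, target_cluster_size=200):
    return max(1, round(m / target_cluster_size))

# e.g. choose_k(400000) -> 2000 local models
```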

Algorithm 1. Parallel training of the kSVR algorithm

Prediction of a New Datapoint x Using k Local SVR Models

The kSVR model \(\{(c_1, lsvr_1), (c_2, lsvr_2), \dots , (c_k, lsvr_k)\}\) is used to predict the target value of a new datapoint x as follows. The first step is to find the closest cluster based on the distance between x and the cluster centers:

$$\begin{aligned} c_{NN} = \mathop {{{\mathrm{arg\,min}}}}\limits _c\ distance(x, c) \end{aligned}$$
(3)

Then, the target value of x is predicted by the local SVR model \(lsvr_{NN}\) corresponding to \(c_{NN}\):

$$\begin{aligned} predict(x, kSVR\ model) = predict(x, lsvr_{NN}) \end{aligned}$$
(4)
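The prediction steps (3) and (4) can be sketched as follows, assuming a kSVR model stored as the list of (cluster center, local SVR) pairs produced by the training sketch above; the names are illustrative.

```python
# Sketch of kSVR prediction: route x to the nearest cluster center, then apply that
# cluster's local SVR. Assumes ksvr_model = [(c_1, lsvr_1), ..., (c_k, lsvr_k)].
import numpy as np

def predict_ksvr(x, ksvr_model):
    centers = np.array([c for c, _ in ksvr_model])
    x = np.asarray(x, dtype=float)
    nn = int(np.argmin(np.linalg.norm(centers - x, axis=1)))  # equation (3): closest center c_NN
    return ksvr_model[nn][1].predict(x.reshape(1, -1))[0]     # equation (4): local prediction
```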

4 Evaluation

We are interested in the performance of the new parallel algorithm of local SVR (denoted by kSVR) for data regression. We implemented the kSVR algorithm in C/C++ with OpenMP [7], using the highly efficient standard SVM library LibSVM [12]. The performance is evaluated in terms of training time and prediction correctness, comparing the regression results obtained by our proposed kSVR with those of LibSVM.

All experiments are run on a machine running Linux Fedora 20 with an Intel(R) Core i7-4790 CPU (3.6 GHz, 4 cores) and 32 GB of main memory.

Datasets

Experiments are conducted on 5 datasets from the UCI repository [4]. Table 1 describes the datasets; the evaluation protocol is given in the last column. The datasets are already divided into a training set (Trn) and a test set (Tst). We used the training data to build the SVR models and then predicted the test set using the resulting models.

Table 1. Description of datasets

Tuning Parameters

We propose to use the RBF kernel in the kSVR and SVR models because it is general and efficient [13]. A 2-fold cross-validation protocol is used to tune the regression tolerance \(\epsilon \), the hyper-parameter \(\gamma \) of the RBF kernel (for two datapoints \(x_i\), \(x_j\), \(K[i,j] = exp(-\gamma \Vert x_i - x_j\Vert ^2)\)) and the cost C (a trade-off between the margin size and the errors) so as to obtain a good correctness. For the two largest datasets (Buzz in social media Twitter, YearPredictionMSD), we tuned the hyper-parameters on a randomly sampled subset of about 5% of the training data due to the expensive computational cost. Furthermore, our kSVR uses the parameter k (the number of local models, i.e. the number of clusters). We propose to set k so that each cluster contains about 200 datapoints; this gives a trade-off between the generalization capacity [10] and the computational cost. Table 2 presents the hyper-parameters of kSVR and SVR used in the regression.
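The tuning protocol can be sketched with scikit-learn as follows; the grid of candidate values and the helper name are assumptions, while the 2-fold protocol and the 5% subsampling follow the text.

```python
# Sketch of 2-fold cross-validation tuning of (gamma, C, epsilon) for an RBF SVR.
# The candidate grid is an assumption; the 5% subsample for large datasets follows the text.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def tune_hyperparameters(X, y, large=False, seed=0):
    if large:  # for the two largest datasets, tune on a ~5% random subsample
        rng = np.random.RandomState(seed)
        idx = rng.choice(len(X), size=max(1, len(X) // 20), replace=False)
        X, y = X[idx], y[idx]
    grid = {'gamma': [1e-3, 1e-2, 1e-1, 1.0, 10.0],
            'C': [1.0, 10.0, 1e2, 1e3, 1e5],
            'epsilon': [0.01, 0.05, 0.1]}
    search = GridSearchCV(SVR(kernel='rbf'), grid, cv=2,
                          scoring='neg_mean_absolute_error')
    search.fit(X, y)
    return search.best_params_
```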

Table 2. Hyper-parameters of kSVR and SVR
Table 3. Regression results in terms of mean absolute error (MAE) and training time (minutes)

Regression Results

The regression results of LibSVM and kSVR on the datasets are given in Table 3, Figs. 4, 5 and 6.

As expected, our kSVR algorithm outperforms LibSVM in terms of training time. The average training times of kSVR and LibSVM are 8.45 min and 1551.15 min, respectively, which means that our kSVR is 183.5 times faster than LibSVM.

Fig. 4. Comparison of training time (minutes) for small datasets

Fig. 5. Comparison of training time (minutes) for large datasets

Fig. 6. Comparison of prediction results

For the first 3 small datasets, the training speed-up of kSVR over LibSVM is 21.10 times. On the two large datasets (Buzz in social media - Twitter and YearPredictionMSD), the training time improvement of kSVR over LibSVM is more significant (about 200.44 times).

In terms of prediction correctness (mean absolute error, MAE), the average errors made by kSVR and LibSVM are 23.50 and 62.01, respectively. A dataset-by-dataset comparison of the prediction correctness shows that kSVR is beaten by LibSVM only once (on the Appliances energy prediction dataset), giving 4 wins and 1 defeat. This illustrates that our kSVR is more accurate than LibSVM for prediction.

kSVR thus improves not only the training time but also the prediction correctness when dealing with large datasets. The regression results suggest that our proposed kSVR is efficient for handling such data volumes.

5 Discussion on Related Works

Our proposal is related to large-scale SVM learning algorithms. Improvements of SVM training on very large datasets include effective heuristic methods that decompose the original quadratic programming problem into a series of small problems [6, 12, 14, 15]. Recent works [16, 17] proposed stochastic gradient descent methods for large-scale linear SVM solvers. Parallel and distributed algorithms for linear classification [18,19,20] improve learning performance on large datasets by dividing the problem into sub-problems that execute on large numbers of networked PCs, grid computing environments or multi-core computers.

The review paper [21] provides a comprehensive survey on large-scale linear support vector classification. LIBLINEAR [22] and its extension [23] use the Newton method for the primal form of SVM and the coordinate descent approach for the dual form to deal with very large linear classification and regression. Parallel algorithms for logistic regression and linear SVM using Spark [24] are proposed in [25]. The distributed Newton algorithm for logistic regression [26] is implemented with the Message Passing Interface (MPI). The parallel dual coordinate descent method for linear classification [27] is implemented in multi-core environments using OpenMP. The incremental and decremental algorithms [28] aim at training linear classifiers on large data that cannot fit in memory. These algorithms efficiently handle large-scale linear classification tasks in a very-high-dimensional input space, but the computational cost of a non-linear SVM approach remains impractical. The work in [29] tries to automatically determine when kernel classifiers perform strictly better than linear ones for a given dataset.

Our proposal is in some aspects related to local SVM learning algorithms. The first approach classifies data with a hierarchical strategy. This kind of training algorithm performs the classification task in two main steps: the first one clusters the full dataset into homogeneous groups (clusters), and the second one learns local supervised classification models from the clusters. The paper [30] proposed to use the expectation-maximization (EM) clustering algorithm [31] to partition the training set into k joint clusters (the EM clustering algorithm makes a soft assignment based on the posterior probabilities [32]); for each cluster, a neural network (NN) is learnt to classify the individuals in the cluster. The parallel mixture of SVMs algorithm proposed by [33] constructs local SVM models instead of the NN ones of [30]. CSVM [34] uses the kmeans algorithm [3] to partition the full dataset into k disjoint clusters and then learns weighted local linear SVMs from the clusters. More recently, kSVM [35], krSVM [36] (a random ensemble of kSVM) and tSVM [11] train local non-linear SVMs in parallel instead of the weighted linear ones of CSVM. DTSVM [37, 38] uses a decision tree algorithm [39, 40] to split the full dataset into disjoint regions (tree leaves) and then builds local SVMs to classify the individuals in the tree leaves. These algorithms aim at speeding up the training time.

The second approach learns local supervised models from the k nearest neighbors (kNN) of a new test individual. The first local learning algorithm, by Bottou and Vapnik [9], finds the kNN of a test individual, trains a neural network on only these k neighbors and applies the resulting network to the test individual. k-local hyperplane and convex distance nearest neighbor algorithms were also proposed in [41]. More recent local SVM algorithms use different methods for the kNN retrieval, including SVM-kNN [42], which tries different metrics, ALH [43], which uses weighted distances and features, and FaLK-SVM [44], which speeds up the kNN retrieval with the cover tree [45].

A theoretical analysis of such local algorithms, discussed in [8], introduces the trade-off between the capacity of the learning system and the number of available individuals. The size of the neighborhood is used as an additional free parameter to control the generalization capacity against the locality of local learning algorithms.

6 Conclusion and Future Works

We presented a new parallel algorithm of local SVR that achieves high performance for the non-linear regression of large datasets. The training task of kSVR first partitions the full training dataset into k clusters in order to reduce the data size used to train the local SVR models, and then easily learns k non-linear SVR models in parallel on a multi-core computer, where an SVR model is trained in each cluster to predict the data locally. The numerical test results on datasets from the UCI repository showed that our proposed kSVR is efficient in terms of training time and prediction correctness compared to the standard LibSVM. An example of its effectiveness is the non-linear regression of the YearPredictionMSD dataset (400000 datapoints, 90 dimensions), trained in 6.33 min with a mean absolute error of 7.86 on the prediction of the test set.

In the near future, we intend to provide more empirical tests on large benchmarks and comparisons with other algorithms. A promising avenue for future research is to improve the prediction correctness and to automatically tune the hyper-parameters of kSVR.