1 Introduction

The Support Vector Machine (SVM; [5]) is a widespread standard machine learning method, in particular for binary classification problems. Being a kernel method, it employs a linear algorithm in an implicitly defined kernel-induced feature space [24]. SVMs yield high predictive accuracy in many applications [6, 15, 16, 19, 28]. They are supported by strong learning theoretical guarantees [1, 9, 12, 17].

When facing large-scale learning, the applicability of support vector machines (and many other learning machines) is limited by their computational demands. Given n training points, training an SVM with standard dual solvers takes quadratic to cubic time in n [1]. Steinwart [23] established that the number of support vectors grows linearly with n, and so do the storage complexity of the model and the time complexity of each of its predictions. This quickly becomes prohibitive for large n, e.g., when learning from millions of data points.

Due to the prominence of the problem, a large number of solutions has been developed. Parallelization can help [29, 33], but it does not reduce the complexity of the training problem. One promising route is to solve the SVM problem only locally, usually involving some type of clustering [14, 30] or a hierarchical divide-and-conquer strategy [8, 11]. An alternative approach is to leverage the progress in the domain of linear SVM solvers [10, 13, 32], which scale well to large data sets. To this end, kernel-induced feature representations are approximated by low-rank approaches [7, 20, 27, 31], either a priori using random Fourier features, or in a data-dependent way using Nyström sampling.

Budget methods, introducing an a priori limit \(B \ll n\) on the number of support vectors [18, 25], go one step further by letting the optimizer adapt the feature space approximation to its needs during operation, which promises a comparatively low approximation error. The usual strategy is to merge support vectors as needed, which effectively enables the solver to move support vectors around in input space. Merging decisions greedily minimize the approximation error.

In this paper we propose an effective computational improvement of this scheme. Finding the best merge partners, i.e., support vectors that induce the lowest approximation error when merged, is a rather costly operation. Usually, \(\mathcal {O}(B)\) candidate pairs of vectors are considered, and for each pair an optimization problem is solved with an iterative strategy. By modelling the low-dimensional space of (solutions of the) optimization problems explicitly, we can remove the iterative process entirely, and replace it with a simple and fast lookup.

Our results show that merging-based budget maintenance can account for more than half of the total training time. Therefore reducing the merging time is a promising approach to speeding up training. The speed-up can be significant; on our largest data set we reduce the merging time by \(65\%\), which corresponds to a reduction of the total training time by \(44\%\). At the same time, our lookup method is at least as accurate as the original iterative procedure, resulting in nearly identical merging decisions and no loss of prediction accuracy.

The remainder of this paper is organized as follows. In the next section we introduce SVMs and stochastic gradient training on a budget. Then we analyze the computational bottleneck of the solver and develop a lookup smoothed with bilinear interpolation as a remedy. In Sect. 4 we benchmark the new algorithm against “standard” BSGD, and we investigate the influence of the algorithmic simplification on different budget sizes. Our results demonstrate systematic improvements in training time at no cost in terms of solution quality.

2 Support Vector Machine Training

In this section we introduce the necessary background: SVMs for binary classification, and training with stochastic gradient descent (SGD) on a budget, i.e., with an a priori limited number of support vectors.

Support Vector Machines. An SVM classifier is a supervised machine learning algorithm. In its simplest form it linearly separates two classes with a large margin. When applying a kernel function \(k : X \times X \rightarrow \mathbb {R}\) over the input space X, the separation happens in a reproducing kernel Hilbert space (RKHS). For labeled data \(\big ((x_1, y_1), \dots , (x_n, y_n)\big ) \in (X \times \{-1, +1\})^n\), the prediction on \(x \in X\) is computed as

$$\begin{aligned} x \mapsto \text {sign} \Big ( \big \langle w, \phi (x) \big \rangle + b \Big ), \end{aligned}$$

with \(w = \sum _{j=1}^n \alpha _j \phi (x_j)\), where \(\phi (x)\) is an only implicitly defined feature map (due to Mercer’s theorem, see also [24]) corresponding to the kernel function fulfilling \(k(x, x') = \langle \phi (x), \phi (x') \rangle \). Training points \(x_j\) with non-zero coefficients \(\alpha _j \not = 0\) are called support vectors; the summation in the predictor can obviously be restricted to this subset. The SVM model is obtained by minimizing the following (primal) objective function:

$$\begin{aligned} P(w, b) = \frac{\lambda }{2} \Vert w\Vert ^2 + \frac{1}{n} \sum _{i=1}^n L \Big ( y_i, \big \langle w, \phi (x_i) \big \rangle + b \Big ). \end{aligned}$$
(1)

Here, \(\lambda > 0\) is a user-defined regularization parameter and \(L(y, \mu ) = \max \{0, 1 - y \cdot \mu \}\) denotes the hinge loss, which is a prototypical large margin loss, aiming to separate the classes with a functional margin \(y \cdot \mu \) of at least one. Incorporating other loss functions generalizes SVMs to other tasks like multi-class classification, regression, and ranking.

Primal Training. Problem (1) is a convex optimization problem without constraints. It has an equivalent dual representation as a quadratic program (QP), which is solved by several state-of-the-art “exact” solvers like LIBSVM [4] and ThunderSVM [26]. The main challenge is the high dimensionality of the problem, which coincides with the training set size n and can hence easily grow into the millions.

A simple method is to solve problem (1) directly with stochastic gradient descent (SGD), similar to neural network training. When presenting one training point at a time, as done in Pegasos [22], the objective function P(w, b) is approximated by the unbiased estimate

$$\begin{aligned} P_i(w, b) = \frac{\lambda }{2} \Vert w\Vert ^2 + L \Big ( y_i, \big \langle w, \phi (x_i) \big \rangle + b \Big ), \end{aligned}$$

where the index \(i \in \{1, \dots , n\}\) follows a uniform distribution. The stochastic gradient \(\nabla P_i(w, b)\) is an unbiased estimate of the “batch” gradient \(\nabla P(w, b)\) but faster to compute by a factor of n, since it involves only a single training point. Starting from \((w, b) = (0, 0)\), SGD updates the weights according to

$$\begin{aligned} (w, b) \leftarrow (w, b) - \eta _t \cdot \nabla P_{i_t}(w, b), \end{aligned}$$

where t is the iteration counter. With a learning rate \(\eta _t \in \varTheta (1/t)\) it is guaranteed to converge to the optimum of the convex training problem [2].

With a sparse representation \(w = \sum _{(\alpha , \tilde{x}) \in M} \alpha \cdot \phi (\tilde{x})\) the SGD update decomposes into the following algorithmic steps. We scale down all coefficients \(\alpha \) uniformly by the factor \(1 - \lambda \cdot \eta _t\). If the margin \(y_i (\langle w, \phi (x_i)\rangle + b)\) happens to be less than one, then we add a new point \(\tilde{x} = x_i\) with coefficient \(\alpha = \eta _t \cdot y_i\) to the model M. With a dense representation holding one coefficient \(\alpha _i\) per data point \((x_i, y_i)\) we would add the above value to \(\alpha _i\). The most costly step is the computation of \(\langle w, \phi (x_i) \rangle \), which is linear in the number of support vectors (SVs), and hence generally linear in n [23].
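
For concreteness, the following minimal C++ sketch implements one such SGD step on the sparse representation. It is an illustration only: the SupportVector and Model types, the kernel function, and the handling of the bias are simplified assumptions on our part and are not taken from the reference implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sparse model: one coefficient alpha per stored point x_tilde.
struct SupportVector { std::vector<double> x; double alpha; };
struct Model { std::vector<SupportVector> sv; double b = 0.0; };

// Gaussian kernel k(a, c) = exp(-gamma * ||a - c||^2).
double kernel(const std::vector<double>& a, const std::vector<double>& c, double gamma) {
    double d2 = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) { double d = a[k] - c[k]; d2 += d * d; }
    return std::exp(-gamma * d2);
}

// One SGD step on training point (x_i, y_i) with learning rate eta_t.
void sgdStep(Model& m, const std::vector<double>& x_i, double y_i,
             double lambda, double eta_t, double gamma) {
    // Margin y_i * (<w, phi(x_i)> + b) at the current iterate; linear in the number of SVs.
    double margin = m.b;
    for (const auto& s : m.sv) margin += s.alpha * kernel(s.x, x_i, gamma);
    margin *= y_i;

    // Regularizer gradient: scale all coefficients down by (1 - lambda * eta_t).
    for (auto& s : m.sv) s.alpha *= (1.0 - lambda * eta_t);

    // Hinge loss active: add x_i as a new support vector with coefficient eta_t * y_i.
    if (margin < 1.0) {
        m.sv.push_back({x_i, eta_t * y_i});
        m.b += eta_t * y_i;   // sub-gradient step on the (unregularized) bias
    }
}
```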

SVM Training on a Budget. Budgeted Stochastic Gradient Descent (BSGD) breaks the unlimited growth in model size and update time for large data streams by bounding the number of support vectors during training. The upper bound \(B \ll n\) is the budget size. Per SGD step the algorithm can add at most one new support vector; this happens exactly if \((x_i, y_i)\) does not meet the target margin of one (and \(\alpha _i\) changes from zero to a non-zero value). After \(B+1\) such steps, the budget constraint is violated and a dedicated budget maintenance algorithm is triggered to reduce the number of support vectors to at most B. The goal of budget maintenance is to fulfill the budget constraint with the smallest possible change of the model, measured by \(\Vert \varDelta \Vert ^2 = \Vert w' - w\Vert ^2\), where w is the weight vector before and \(w'\) is the weight vector after budget maintenance. \(\varDelta = w' - w\) is referred to as the weight degradation.

Budget maintenance strategies are investigated in detail in [25]. It turns out that merging of two support vectors into a single new point is superior to alternatives like removal of a point and projection of the solution onto the remaining support vectors. Merging was first proposed in [18] as a way to efficiently reduce the complexity of an already trained SVM. With merging, the complexity of budget maintenance is governed by the search for suitable merge partners, which is \(\mathcal {O}(B^2)\) for all pairs, while it is common to apply the \(\mathcal {O}(B)\) heuristic resulting from fixing the point with the smallest coefficient \(|\alpha _i|\) as the first merge partner.

When merging two support vectors \(x_i\) and \(x_j\), we aim to approximate \(\alpha _i \cdot \phi (x_i) + \alpha _j \cdot \phi (x_j)\) with a new term \(\alpha _z \cdot \phi (z)\) involving only a single point z. Since the kernel-induced feature map is usually not surjective, the pre-image of \(\alpha _i \phi (x_i) + \alpha _j \phi (x_j)\) under \(\phi \) is empty [3, 21] and no exact match z exists. Therefore the weight degradation \(\varDelta = \alpha _i \phi (x_i) + \alpha _j \phi (x_j) - \alpha _z \phi (z)\) is non-zero. For the Gaussian kernel \(k(x, x') = \exp (-\gamma \Vert x - x'\Vert ^2)\), due to its symmetries, the point z minimizing \(\Vert \varDelta \Vert ^2\) lies on the line connecting \(x_i\) and \(x_j\) and is hence of the form \(z = h x_i + (1 - h) x_j\). For \(y_i = y_j\) we obtain a convex combination \(0< h < 1\), otherwise we have \(h < 0\) or \(h > 1\). In this paper we merge only vectors of equal label. For each choice of z, the optimal value of \(\alpha _z\) is obtained in closed form: \(\alpha _z = \alpha _i k(x_i, z) + \alpha _j k(x_j, z)\). This turns minimization of \(\Vert \varDelta \Vert ^2 = \alpha _i^2 + \alpha _j^2 - \alpha _z^2 + 2 \alpha _i \alpha _j k(x_i, x_j)\) into a one-dimensional non-linear optimization problem, which is solved in [25] with golden section line search. The calculations are further simplified by the relations \(k(x_i, z) = k(x_i, x_j)^{(1-h)^2}\) and \(k(x_j, z) = k(x_i, x_j)^{h^2}\), which save costly kernel function evaluations.
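
The following sketch shows how such a golden section search over h and the associated weight degradation could be implemented. It is a simplified illustration based on the formulas above; the function names and the bracketing interval [0, 1] for equal-label merging are our assumptions, not the reference code.

```cpp
#include <cmath>

// s_{m,kappa}(h) = m * kappa^{(1-h)^2} + (1-m) * kappa^{h^2}; maximizing s over h
// minimizes the weight degradation of merging the candidate pair.
double s(double m, double kappa, double h) {
    return m * std::pow(kappa, (1.0 - h) * (1.0 - h)) + (1.0 - m) * std::pow(kappa, h * h);
}

// Golden section search for the maximizer of s on [0, 1] (equal-label merging).
// A production version would reuse one function evaluation per iteration.
double goldenSectionH(double m, double kappa, double eps = 0.01) {
    const double invphi = (std::sqrt(5.0) - 1.0) / 2.0;   // ~0.618
    double a = 0.0, b = 1.0;
    while (b - a > eps) {
        double c = b - invphi * (b - a);
        double d = a + invphi * (b - a);
        if (s(m, kappa, c) > s(m, kappa, d)) b = d; else a = c;
    }
    return 0.5 * (a + b);
}

// Weight degradation ||Delta||^2 for merging (alpha_i, x_i) and (alpha_j, x_j) at h,
// using kappa = k(x_i, x_j) and the shortcuts k(x_i,z) = kappa^{(1-h)^2}, k(x_j,z) = kappa^{h^2}.
double weightDegradation(double alpha_i, double alpha_j, double kappa, double h) {
    double alpha_z = alpha_i * std::pow(kappa, (1.0 - h) * (1.0 - h))
                   + alpha_j * std::pow(kappa, h * h);
    return alpha_i * alpha_i + alpha_j * alpha_j
         + 2.0 * alpha_i * alpha_j * kappa - alpha_z * alpha_z;
}
```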

Budget maintenance in BSGD usually works in the following sequence of steps, see Algorithm 1: First, \(x_i\) is fixed to the support vector with minimal coefficient \(|\alpha _i|\). Then the best merge partner \(x_j\) is determined by testing B pairs \((x_i, x_j)\), \(j \in \{1, \dots , B+1\} \setminus \{i\}\). Golden section search is run for each of these candidate pairs to determine h to fixed precision \(\varepsilon = 0.01\). The weight degradation is computed using the shortcuts mentioned above. Finally, the candidate with minimal weight degradation is selected and the vectors are merged. Hence, although a single golden section search is fast, the need to run it many times per SGD iteration turns it into a rather costly operation.

[Algorithm 1: listing omitted]
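
Putting the pieces together, a sketch of the maintenance step could look as follows. It reuses the hypothetical Model and kernel types and the goldenSectionH / weightDegradation helpers from the sketches above; details such as the restriction to equal-label partners and tie-breaking are omitted and may differ from the actual Algorithm 1.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reduce the model by merging the SV with smallest |alpha| into its best partner
// (the candidate pair with minimal weight degradation).
void budgetMaintenance(Model& m, double gamma) {
    // Step 1: fix x_i as the support vector with minimal |alpha_i|.
    std::size_t i = 0;
    for (std::size_t k = 1; k < m.sv.size(); ++k)
        if (std::fabs(m.sv[k].alpha) < std::fabs(m.sv[i].alpha)) i = k;

    // Step 2: test the B candidate pairs (x_i, x_j) and keep the one with minimal WD.
    std::size_t bestJ = 0;
    double bestWD = std::numeric_limits<double>::max(), bestH = 0.5;
    for (std::size_t j = 0; j < m.sv.size(); ++j) {
        if (j == i) continue;
        double ai = m.sv[i].alpha, aj = m.sv[j].alpha;
        double kappa = kernel(m.sv[i].x, m.sv[j].x, gamma);
        double mm = ai / (ai + aj);                      // valid for equal-sign coefficients
        double h = goldenSectionH(mm, kappa);            // or a table lookup, see Sect. 3
        double wd = weightDegradation(ai, aj, kappa, h);
        if (wd < bestWD) { bestWD = wd; bestJ = j; bestH = h; }
    }

    // Step 3: merge into z = h * x_i + (1 - h) * x_j with coefficient alpha_z.
    SupportVector& si = m.sv[i];
    SupportVector& sj = m.sv[bestJ];
    std::vector<double> z(si.x.size());
    for (std::size_t k = 0; k < z.size(); ++k)
        z[k] = bestH * si.x[k] + (1.0 - bestH) * sj.x[k];
    double alpha_z = si.alpha * kernel(si.x, z, gamma) + sj.alpha * kernel(sj.x, z, gamma);
    sj.x = z;
    sj.alpha = alpha_z;
    m.sv.erase(m.sv.begin() + static_cast<std::ptrdiff_t>(i));   // drop the merged-away SV
}
```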

A theoretical analysis of BSGD is provided by [25]. Their Theorem 1 establishes a bound on the error induced by the budget, ensuring that asymptotically the error is governed only by the (unavoidable) weight degradation.

Fig. 1. The merging problem. [figure omitted]

3 Precomputing the Merging Problem

The merging problem for given support vectors \(x_i\) and \(x_j\) with coefficients \(\alpha _i\) and \(\alpha _j\) is illustrated in Fig. 1. Our central observation is that the geometry depends only on the (cosine of the) angle between \(\alpha _i \phi (x_i)\) and \(\alpha _j \phi (x_j)\), and on the relative lengths of the two vectors. These two quantities are captured by the parameters

  • relative length \(m = \alpha _i / (\alpha _i + \alpha _j)\)

  • cosine of the angle \(\kappa = k(x_i, x_j)\),

both of which take values in the unit interval. The optimal merging coefficient h is a function of m and \(\kappa \), and so is the resulting weight degradation \(W\!D = \Vert \varDelta \Vert ^2\). Therefore we can express h and \(W\!D\) as functions of m and \(\kappa \), denoted as \(h(m, \kappa )\) and \(W\!D(m, \kappa )\) in the following. The functions can be evaluated to any given target precision by running the golden section search. Their graphs are plotted in Figs. 2a and b.
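
To make the reduction to the unit square explicit, one can normalize the weight degradation by the squared coefficient sum (this reformulation is ours, derived from the formulas in Sect. 2); up to the scale factor \((\alpha _i + \alpha _j)^2\), both the objective and its minimizer depend only on m and \(\kappa \):

$$\begin{aligned} \frac{\Vert \varDelta \Vert ^2}{(\alpha _i + \alpha _j)^2} = m^2 + (1-m)^2 + 2 m (1-m) \kappa - \Big ( m \, \kappa ^{(1-h)^2} + (1-m) \, \kappa ^{h^2} \Big )^2 . \end{aligned}$$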

If the functions h or \(W\!D\) can be approximated efficiently then there is no need to run a potentially costly iterative procedure like golden section search. This is our core technique for speeding up the BSGD method.

The functions blend between different budget maintenance strategies. While for \(\kappa \gg 0\) and for \(m \approx 1/2\) it is beneficial to merge the two support vectors, resulting in \(h \in (0, 1)\), this is not the case for \(\kappa \ll 1\) and \(m \approx 0\) or \(m \approx 1\), resulting in \(h \approx 0\) or \(h \approx 1\), which is equivalent to removal of the support vector with smaller coefficient. This means that in order to obtain a close fit that works well in both regimes we may need a quite flexible function class like a kernel method or a neural network, while a simple polynomial function can give poor fits, with large errors close to the boundaries.

A much simpler and computationally very cheap approach is to pre-compute the function on a grid covering the domain \([0, 1] \times [0, 1]\). The values need to be pre-computed only once, and here we can afford to apply golden section search with high precision; we use \(\varepsilon = 10^{-10}\). Then, given two merge candidates, we can look up an approximate solution by rounding m and \(\kappa \) to the nearest grid point. The approximation quality can be improved significantly through bilinear interpolation. On modern PC hardware we can easily afford a large grid with millions of points; however, this is not even necessary to obtain excellent results. In our experiments we use a grid of size \(400 \times 400\).
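
The following sketch illustrates the table construction and the bilinear lookup. The grid layout, the class name, and the reuse of the goldenSectionH helper from the earlier sketch are our assumptions, not the reference implementation; the same structure can be used to tabulate \(W\!D(m, \kappa )\) instead of \(h(m, \kappa )\).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double goldenSectionH(double m, double kappa, double eps);   // from the earlier sketch

// Lookup table for h(m, kappa) on a regular grid over [0,1] x [0,1],
// queried with bilinear interpolation.
class MergeLookup {
public:
    explicit MergeLookup(std::size_t gridSize = 400) : n(gridSize), table(gridSize * gridSize) {
        // Precompute once with high-precision golden section search (eps = 1e-10).
        for (std::size_t a = 0; a < n; ++a)
            for (std::size_t b = 0; b < n; ++b) {
                double m     = double(a) / double(n - 1);
                double kappa = double(b) / double(n - 1);
                table[a * n + b] = goldenSectionH(m, kappa, 1e-10);
            }
    }

    // Bilinear interpolation of the tabulated value at (m, kappa) in [0,1] x [0,1].
    double operator()(double m, double kappa) const {
        double u = m * double(n - 1), v = kappa * double(n - 1);
        std::size_t a = std::min<std::size_t>(std::size_t(u), n - 2);
        std::size_t b = std::min<std::size_t>(std::size_t(v), n - 2);
        double fu = u - double(a), fv = v - double(b);   // position within the grid cell
        double v00 = table[a * n + b],       v01 = table[a * n + b + 1];
        double v10 = table[(a + 1) * n + b], v11 = table[(a + 1) * n + b + 1];
        return (1.0 - fu) * ((1.0 - fv) * v00 + fv * v01)
             +        fu  * ((1.0 - fv) * v10 + fv * v11);
    }

private:
    std::size_t n;                 // grid resolution per axis (assumed >= 2)
    std::vector<double> table;     // row-major: table[a * n + b] = h(m_a, kappa_b)
};
```

With such a table, the per-pair search in the budget maintenance loop reduces to a constant-time lookup.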

Bilinear interpolation is fast, and moreover it is easy to implement. When looking up \(h(m, \kappa )\) this way, we obtain a plug-in replacement for golden section search in BSGD. However, we can equally well look up \(W\!D(m, \kappa )\) instead to save additional computation steps. Another benefit of \(W\!D\) over h is regularity, see Figs. 2a and b and the following lemma.

Fig. 2. Graphs of the functions \(h(m, \kappa )\) (a) and \(W\!D(m, \kappa )\) (b). The latter uses a log scale on the value axis. [figure omitted]

Lemma 1

The functions h and \(W\!D\) are smooth for \(\kappa > e^{-2}\). The function h is continuous outside the set \(Z = \{1/2\} \times [0, e^{-2}] \subset [0, 1]^2\) and discontinuous on Z. The function \(W\!D\) is everywhere continuous.

Proof

The function \(s_{m,\kappa }(h') = m \kappa ^{(1-h')^2} + (1-m) \kappa ^{h'^2}\) used in line 7 of Algorithm 1 inside the \(\arg \max \) expression is a weighted sum of two Gaussian kernels. Depending on the parameters m and \(\kappa \), it can have one or two modes. It has two modes for parameters in Z, as can be seen from an elementary calculation yielding \(s''_{1/2,\kappa }(1/2) > 0 \Leftrightarrow \kappa < e^{-2}\). Due to symmetry, the dominant mode switches at \(m=1/2\). The inverse function theorem applied to branches of \(s_{m, \kappa }\) implies that \(h(m, \kappa ) = \arg \max _{h'}\{s_{m,\kappa }(h')\}\) and \(W\!D(m, \kappa ) = (\alpha _i + \alpha _j)^2 \cdot \big (m^2 + (1-m)^2 - [s_{m,\kappa }(h(m,\kappa ))]^2 + 2m(1-m)\kappa \big )\) vary smoothly with their parameters as long as the same mode is active. The maximum operation is continuous, and so is \(W\!D\). For each m there is a critical value of \(\kappa \le e^{-2}\) where \(s_{m,\kappa }\) switches from one to two modes. We collect these parameter configurations in the set N. On N (in contrast to Z), h is continuous. With the same argument as above, h and \(W\!D\) are smooth outside \(N \cup Z\).    \(\square \)
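
For reference, the elementary calculation behind the mode criterion (our own expansion of the remark in the proof): writing \(\kappa ^{h'^2} = e^{h'^2 \ln \kappa }\) and differentiating twice, at \(m = 1/2\) and \(h' = 1/2\) one obtains

$$\begin{aligned} s''_{1/2,\kappa }(1/2) = \ln \kappa \, \big ( 2 + \ln \kappa \big ) \, \kappa ^{1/4}, \end{aligned}$$

which is positive (so that \(h' = 1/2\) is a local minimum between two modes) exactly when \(\ln \kappa < -2\), that is, \(\kappa < e^{-2}\).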

Bilinear interpolation is well justified if the function is continuous, and differentiable within each grid cell. The above lemma ensures this property for \(\kappa > e^{-2}\), and it furthermore indicates that for its continuity, interpolating \(W\!D\) is preferable over interpolating h. The regime \(\kappa < e^{-2}\) corresponds to merging two points in a distance of more than two “standard deviations” of the Gaussian kernel. This is anyway undesirable, since it can result in a large weight degradation. In fact, if \(s_{m,\kappa }\) has two modes, then the optimal merge is close to the removal of one of the points, which is known to give poor results [25].

4 Experimental Evaluation

In this section we evaluate our method empirically, with the aim of investigating its properties more closely and demonstrating its practical value. To this end, we would like to answer the following questions:

  1. Which speed-up is achievable?

  2. Do we pay for speed-ups with reduced test accuracy?

  3. How do results depend on the budget size?

  4. How much do merging decisions differ from the original method?

To answer these questions we compare our algorithm to “standard” BSGD with merging based on golden section search. We have implemented both algorithms in C++; the implementation is available from the first author’s homepage.Footnote 1 We train SVM models on the binary classification problems SUSY, SKIN, IJCNN, ADULT, WEB, and PHISHING, covering a range of different sizes. The regularization parameter \(C = \frac{1}{n \cdot \lambda }\) and the kernel parameter \(\gamma \) were tuned on a grid of the form \(\log _2(C), \log _2(\gamma ) \in \mathbb {Z}\) using 10-fold cross-validation. The data sets are summarized in Table 1. SVMs were trained with 20 passes through the data, except for the huge SUSY data, where we used a single pass.

Table 1. Data sets used in this study, hyperparameter settings, and test accuracy of the exact SVM model found by LIBSVM.

To answer the first question, we trained SVM models with BSGD, comparing golden section search (GSS) with our new algorithms looking up \(h(m, \kappa )\) (Lookup-h) or \(W\!D(m, \kappa )\) (Lookup-WD). For reference, we also ran golden section search with precision \(\varepsilon = 10^{-10}\) (GSS-precise). We used two different budget sizes for each problem.

Table 2. Test accuracy achieved by the different methods, averaged over 5 runs at different budget sizes.

All methods found SVM models with comparable accuracy as shown in Table 2; in fact, in most cases the systematic differences are below one standard deviation of the variability between different runs.Footnote 2 In contrast, the time spent on budget maintenance differs significantly between the methods. In Fig. 3 we provide a detailed breakdown of the merging time, obtained with a profiler.

Fig. 3. Breakdown of the merging time in seconds for GSS-precise, GSS, Lookup-h and Lookup-WD. Section A represents the time invested to compute h using either golden section search or lookup; for the Lookup-WD method the same bar represents the lookup of \(W\!D(m,\kappa )\). Section B summarizes all other operations like loop overheads, the computation of \(\alpha _z\), and the construction of the final merge vector z. The numbers on top of the columns for the lookup methods indicate the saving over GSS. [figure omitted]

Lookup-WD and Lookup-h are faster than GSS, which is (unsurprisingly) faster than GSS-precise. The results are very systematic, see Table 3 and Fig. 3. The greatest savings of about \(44\%\) of the total training time are observed for the rather large SUSY data set. Although the speed-up can also be insignificant, like for the WEB data, lookup is never slower than GSS. The actual saving depends on the cost of kernel computations and on the fraction of SGD iterations in which merging occurs. The latter quantity, which we refer to as the merging frequency, is provided in Table 3. We observe that the savings shown in Fig. 3 nicely correlate with the merging frequency.

The profiler results provide a more detailed understanding of the differences: replacing GSS with Lookup-h significantly reduces the time for computing \(h(m, \kappa )\). Replacing Lookup-h with Lookup-WD removes further steps in the calculation of \(W\!D(m, \kappa )\), but practically speaking the difference is hardly noticeable.

Overall, our method offers a systematic speed-up. The speed-up does not come at any cost in terms of solution precision. This answers the first two questions.

If the budget size is chosen so large that merging is never needed then all tested methods coincide, however, this defeats the purpose of using a budget in the first place. We find that the merging frequency is nearly independent of the budget size as long as the budget is significantly smaller than the number of support vectors of the full kernel SVM model, and hence the fraction of runtime saved is independent of the budget size. The results in Fig. 3 are in line with this expectation, answering the third question.

In the next experiment we have a closer look at the impact of lookup-based merging decisions by investigating the behavior in single iterations, as follows. During a run of BSGD we execute GSS and Lookup-WD in parallel. We count the number of iterations in which the merging decisions differ, and if so, we also record the difference between the weight degradation values. The results are presented in Table 3. They show that the decisions of the two methods agree most of the time, for some problems in more than \(99\%\) of all budget maintenance events.

Finally, we investigate the precision with which the weight degradation is estimated by the different methods. While GSS can solve the problem to arbitrary precision, the reference implementation determines \(h(m, \kappa )\) only to a rather loose precision of \(\varepsilon = 0.01\) in order to save computation time. In contrast, we ran GSS to high precision \(\varepsilon = 10^{-10}\) when precomputing the lookup table, however, we may lose some precision due to bilinear interpolation. This loss shrinks as the grid size grows, which comes at added storage cost, but without any runtime cost. We investigate the precision of GSS and Lookup-WD by comparing them to GSS-precise, which is considered a reasonable approximation of the exact minimum of \(\Vert \varDelta \Vert ^2\). For both methods we record the factor by which their squared weight degradations exceed the minimum, see Table 3. All factors are very close to one, hence none of the algorithms is wasteful in terms of weight degradation, and indeed Lookup-WD with a grid size of \(400 \times 400\) is more precise on all 6 data sets. This answers our last question.

Table 3. Relative improvement of the total training time with respect to golden section search, averaged over 5 runs (Lookup-h vs. GSS and Lookup-WD vs. GSS), the fraction of merging events for budget size 100, and statistics on the quality of merging decisions (refer to the text for details).

5 Conclusion

We have proposed a fast lookup as a plug-in replacement for the iterative golden section search procedure required when merging support vectors in large-scale kernel SVM training. The new method compares favorably to the iterative baseline in terms of training time: it offers a systematic speed-up, with computational savings of up to \(65\%\) of the merging time and up to \(44\%\) of the total training time, while the training time is never increased. With our method, nearly all of the computation time is spent on actual SGD steps, while the fraction of effort spent on budget maintenance is reduced significantly. We have demonstrated that our approach results in virtually indistinguishable and even slightly more precise merging decisions. For this reason, the speed-up comes at no cost in terms of predictive accuracy.