1 Introduction

The Support Vector Machine (SVM; [5]) is a widespread standard machine learning method, in particular for binary classification problems. Being a kernel method, it employs a linear algorithm in an implicitly defined kernel-induced feature space [24]. SVMs yield high predictive accuracy in many applications [6, 15, 16, 19, 28]. They are supported by strong learning theoretical guarantees [1, 9, 12, 17].

When facing large-scale learning, the applicability of support vector machines (and many other learning machines) is limited by their computational demands. Given n training points, training an SVM with standard dual solvers takes quadratic to cubic time in n [1]. Steinwart [23] established that the number of support vectors grows linearly with n, and so do the storage complexity of the model and the time complexity of each of its predictions. This quickly becomes prohibitive for large n, e.g., when learning from millions of data points.

Due to the prominence of the problem, a large number of solutions has been developed. Parallelization can help [29, 33], but it does not reduce the complexity of the training problem. One promising route is to solve the SVM problem only locally, usually involving some type of clustering [14, 30] or a hierarchical divide-and-conquer strategy [8, 11]. An alternative approach is to leverage the progress in the domain of linear SVM solvers [10, 13, 32], which scale well to large data sets. To this end, kernel-induced feature representations are approximated by low-rank approaches [7, 20, 27, 31], either a priori using random Fourier features, or in a data-dependent way using Nyström sampling.

Budget methods, introducing an a priori limit \(B \ll n\) on the number of support vectors [18, 25], go one step further by letting the optimizer adapt the feature space approximation to its needs during operation, which promises a comparatively low approximation error. The usual strategy is to merge support vectors as needed, which effectively enables the solver to move support vectors around in input space. Merging decisions greedily minimize the approximation error.

In this paper we propose an effective computational improvement of this scheme. Finding the best merge partners, i.e., support vectors that induce the lowest approximation error when merged, is a rather costly operation. Usually, \(\mathcal {O}(B)\) candidate pairs of vectors are considered, and for each pair an optimization problem is solved with an iterative strategy. By modelling the low-dimensional space of (solutions of the) optimization problems explicitly, we can remove the iterative process entirely, and replace it with a simple and fast lookup.

Our results show that merging-based budget maintenance can account for more than half of the total training time. Therefore reducing the merging time is a promising approach to speeding up training. The speed-up can be significant; on our largest data set we reduce the merging time by \(65\%\), which corresponds to a reduction of the total training time by \(44\%\). At the same time, our lookup method is at least as accurate as the original iterative procedure, resulting in nearly identical merging decisions and no loss of prediction accuracy.

The remainder of this paper is organized as follows. In the next section we introduce SVMs and stochastic gradient training on a budget. Then we analyze the computational bottleneck of the solver and develop a lookup smoothed with bilinear interpolation as a remedy. In Sect. 4 we benchmark the new algorithm against “standard” BSGD, and we investigate the influence of the algorithmic simplification on different budget sizes. Our results demonstrate systematic improvements in training time at no cost in terms of solution quality.

2 Support Vector Machine Training

In this section we introduce the necessary background: SVMs for binary classification, and training with stochastic gradient descent (SGD) on a budget, i.e., with an a priori limited number of support vectors.

Support Vector Machines. An SVM classifier is a supervised machine learning algorithm. In its simplest form it linearly separates two classes with a large margin. When applying a kernel function \(k : X \times X \rightarrow \mathbb {R}\) over the input space X, the separation happens in a reproducing kernel Hilbert space (RKHS). For labeled data \(\big ((x_1, y_1), \dots , (x_n, y_n)\big ) \in (X \times \{-1, +1\})^n\), the prediction on \(x \in X\) is computed as

$$\begin{aligned} x \mapsto \text {sign} \Big ( \big \langle w, \phi (x) \big \rangle + b \Big ), \end{aligned}$$

with \(w = \sum _{j=1}^n \alpha _j \phi (x_j)\), where \(\phi (x)\) is an only implicitly defined feature map (due to Mercer’s theorem, see also [24]) corresponding to the kernel function fulfilling \(k(x, x') = \langle \phi (x), \phi (x') \rangle \). Training points \(x_j\) with non-zero coefficients \(\alpha _j \not = 0\) are called support vectors; the summation in the predictor can obviously be restricted to this subset. The SVM model is obtained by minimizing the following (primal) objective function:

$$\begin{aligned} P(w, b) = \frac{\lambda }{2} \Vert w\Vert ^2 + \frac{1}{n} \sum _{i=1}^n L \Big ( y_i, \big \langle w, \phi (x_i) \big \rangle + b \Big ). \end{aligned}$$
(1)

Here, \(\lambda > 0\) is a user-defined regularization parameter and \(L(y, \mu ) = \max \{0, 1 - y \cdot \mu \}\) denotes the hinge loss, which is a prototypical large margin loss, aiming to separate the classes with a functional margin \(y \cdot \mu \) of at least one. Incorporating other loss functions generalizes SVMs to other tasks like multi-class classification, regression, and ranking.

Primal Training. Problem (1) is a convex optimization problem without constraints. It has an equivalent dual representation as a quadratic program (QP), which is solved by several state-of-the-art “exact” solvers like LIBSVM [4] and ThunderSVM [26]. The main challenge is the high dimensionality of the problem, which coincides with the training set size n and can hence easily grow into the millions.

A simple method is to solve problem (1) directly with stochastic gradient descent (SGD), similar to neural network training. When presenting one training point at a time, as done in Pegasos [22], the objective function P(w, b) is approximated by the unbiased estimate

$$\begin{aligned} P_i(w, b) = \frac{\lambda }{2} \Vert w\Vert ^2 + L \Big ( y_i, \big \langle w, \phi (x_i) \big \rangle + b \Big ), \end{aligned}$$

where the index \(i \in \{1, \dots , n\}\) follows a uniform distribution. The stochastic gradient \(\nabla P_i(w, b)\) is an unbiased estimate of the “batch” gradient \(\nabla P(w, b)\) but faster to compute by a factor of n, since it involves only a single training point. Starting from \((w, b) = (0, 0)\), SGD updates the weights according to

$$\begin{aligned} (w, b) \leftarrow (w, b) - \eta _t \cdot \nabla P_{i_t}(w, b), \end{aligned}$$

where t is the iteration counter. With a learning rate \(\eta _t \in \varTheta (1/t)\) it is guaranteed to converge to the optimum of the convex training problem [2].

With a sparse representation \(w = \sum _{(\alpha , \tilde{x}) \in M} \alpha \cdot \phi (\tilde{x})\) the SGD update decomposes into the following algorithmic steps. We scale down all coefficients \(\alpha \) uniformly by the factor \(1 - \lambda \cdot \eta _t\). If the margin \(y_i (\langle w, \phi (x_i)\rangle + b)\) happens to be less than one, then we add a new point \(\tilde{x} = x_i\) with coefficient \(\alpha = \eta _t \cdot y_i\) to the model M. With a dense representation holding one coefficient \(\alpha _i\) per data point \((x_i, y_i)\) we would add the above value to \(\alpha _i\). The most costly step is the computation of \(\langle w, \phi (x_i) \rangle \), which is linear in the number of support vectors (SVs), and hence generally linear in n [23].
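
For concreteness, the following minimal C++ sketch implements one such SGD step on the sparse representation. It is an illustration only: the SupportVector and Model types, the kernel function, and the handling of the bias are simplified assumptions on our part and are not taken from the reference implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sparse model: one coefficient alpha per stored point x_tilde.
struct SupportVector { std::vector<double> x; double alpha; };
struct Model { std::vector<SupportVector> sv; double b = 0.0; };

// Gaussian kernel k(a, c) = exp(-gamma * ||a - c||^2).
double kernel(const std::vector<double>& a, const std::vector<double>& c, double gamma) {
    double d2 = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) { double d = a[k] - c[k]; d2 += d * d; }
    return std::exp(-gamma * d2);
}

// One SGD step on training point (x_i, y_i) with learning rate eta_t.
void sgdStep(Model& m, const std::vector<double>& x_i, double y_i,
             double lambda, double eta_t, double gamma) {
    // Margin y_i * (<w, phi(x_i)> + b) at the current iterate; linear in the number of SVs.
    double margin = m.b;
    for (const auto& s : m.sv) margin += s.alpha * kernel(s.x, x_i, gamma);
    margin *= y_i;

    // Regularizer gradient: scale all coefficients down by (1 - lambda * eta_t).
    for (auto& s : m.sv) s.alpha *= (1.0 - lambda * eta_t);

    // Hinge loss active: add x_i as a new support vector with coefficient eta_t * y_i.
    if (margin < 1.0) {
        m.sv.push_back({x_i, eta_t * y_i});
        m.b += eta_t * y_i;   // sub-gradient step on the (unregularized) bias
    }
}
```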

SVM Training on a Budget. Budgeted Stochastic Gradient Descent (BSGD) breaks the unlimited growth in model size and update time for large data streams by bounding the number of support vectors during training. The upper bound \(B \ll n\) is the budget size. Per SGD step the algorithm can add at most one new support vector; this happens exactly if \((x_i, y_i)\) does not meet the target margin of one (and \(\alpha _i\) changes from zero to a non-zero value). After \(B+1\) such steps, the budget constraint is violated and a dedicated budget maintenance algorithm is triggered to reduce the number of support vectors to at most B. The goal of budget maintenance is to fulfill the budget constraint with the smallest possible change of the model, measured by \(\Vert \varDelta \Vert ^2 = \Vert w' - w\Vert ^2\), where w is the weight vector before and \(w'\) is the weight vector after budget maintenance. \(\varDelta = w' - w\) is referred to as the weight degradation.

Budget maintenance strategies are investigated in detail in [25]. It turns out that merging of two support vectors into a single new point is superior to alternatives like removal of a point and projection of the solution onto the remaining support vectors. Merging was first proposed in [18] as a way to efficiently reduce the complexity of an already trained SVM. With merging, the complexity of budget maintenance is governed by the search for suitable merge partners, which is \(\mathcal {O}(B^2)\) for all pairs, while it is common to apply the \(\mathcal {O}(B)\) heuristic resulting from fixing the point with the smallest coefficient \(|\alpha _i|\) as the first merge partner.

When merging two support vectors \(x_i\) and \(x_j\), we aim to approximate \(\alpha _i \cdot \phi (x_i) + \alpha _j \cdot \phi (x_j)\) with a new term \(\alpha _z \cdot \phi (z)\) involving only a single point z. Since the kernel-induced feature map is usually not surjective, the pre-image of \(\alpha _i \phi (x_i) + \alpha _j \phi (x_j)\) under \(\phi \) is empty [3, 21] and no exact match z exists. Therefore the weight degradation \(\varDelta = \alpha _i \phi (x_i) + \alpha _j \phi (x_j) - \alpha _z \phi (z)\) is non-zero. For the Gaussian kernel \(k(x, x') = \exp (-\gamma \Vert x - x'\Vert ^2)\), due to its symmetries, the point z minimizing \(\Vert \varDelta \Vert ^2\) lies on the line connecting \(x_i\) and \(x_j\) and is hence of the form \(z = h x_i + (1 - h) x_j\). For \(y_i = y_j\) we obtain a convex combination \(0< h < 1\), otherwise we have \(h < 0\) or \(h > 1\). In this paper we merge only vectors of equal label. For each choice of z, the optimal value of \(\alpha _z\) is obtained in closed form: \(\alpha _z = \alpha _i k(x_i, z) + \alpha _j k(x_j, z)\). This turns minimization of \(\Vert \varDelta \Vert ^2 = \alpha _i^2 + \alpha _j^2 - \alpha _z^2 + 2 \alpha _i \alpha _j k(x_i, x_j)\) into a one-dimensional non-linear optimization problem, which is solved in [25] with golden section line search. The calculations are further simplified by the relations \(k(x_i, z) = k(x_i, x_j)^{(1-h)^2}\) and \(k(x_j, z) = k(x_i, x_j)^{h^2}\), which save costly kernel function evaluations.
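
The following sketch shows how such a golden section search over h and the associated weight degradation could be implemented. It is a simplified illustration based on the formulas above; the function names and the bracketing interval [0, 1] for equal-label merging are our assumptions, not the reference code.

```cpp
#include <cmath>

// s_{m,kappa}(h) = m * kappa^{(1-h)^2} + (1-m) * kappa^{h^2}; maximizing s over h
// minimizes the weight degradation of merging the candidate pair.
double s(double m, double kappa, double h) {
    return m * std::pow(kappa, (1.0 - h) * (1.0 - h)) + (1.0 - m) * std::pow(kappa, h * h);
}

// Golden section search for the maximizer of s on [0, 1] (equal-label merging).
// A production version would reuse one function evaluation per iteration.
double goldenSectionH(double m, double kappa, double eps = 0.01) {
    const double invphi = (std::sqrt(5.0) - 1.0) / 2.0;   // ~0.618
    double a = 0.0, b = 1.0;
    while (b - a > eps) {
        double c = b - invphi * (b - a);
        double d = a + invphi * (b - a);
        if (s(m, kappa, c) > s(m, kappa, d)) b = d; else a = c;
    }
    return 0.5 * (a + b);
}

// Weight degradation ||Delta||^2 for merging (alpha_i, x_i) and (alpha_j, x_j) at h,
// using kappa = k(x_i, x_j) and the shortcuts k(x_i,z) = kappa^{(1-h)^2}, k(x_j,z) = kappa^{h^2}.
double weightDegradation(double alpha_i, double alpha_j, double kappa, double h) {
    double alpha_z = alpha_i * std::pow(kappa, (1.0 - h) * (1.0 - h))
                   + alpha_j * std::pow(kappa, h * h);
    return alpha_i * alpha_i + alpha_j * alpha_j
         + 2.0 * alpha_i * alpha_j * kappa - alpha_z * alpha_z;
}
```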

Budget maintenance in BSGD usually works in the following sequence of steps, see Algorithm 1: First, \(x_i\) is fixed to the support vector with minimal coefficient \(|\alpha _i|\). Then the best merge partner \(x_j\) is determined by testing B pairs \((x_i, x_j)\), \(j \in \{1, \dots , B+1\} \setminus \{i\}\). Golden section search is run for each of these candidate pairs to determine h to fixed precision \(\varepsilon = 0.01\). The weight degradation is computed using the shortcuts mentioned above. Finally, the candidate with minimal weight degradation is selected and the vectors are merged. Hence, although a single golden section search is fast, the need to run it many times per SGD iteration turns it into a rather costly operation.

[Algorithm 1: listing omitted]
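
Putting the pieces together, a sketch of the maintenance step could look as follows. It reuses the hypothetical Model and kernel types and the goldenSectionH / weightDegradation helpers from the sketches above; details such as the restriction to equal-label partners and tie-breaking are omitted and may differ from the actual Algorithm 1.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reduce the model by merging the SV with smallest |alpha| into its best partner
// (the candidate pair with minimal weight degradation).
void budgetMaintenance(Model& m, double gamma) {
    // Step 1: fix x_i as the support vector with minimal |alpha_i|.
    std::size_t i = 0;
    for (std::size_t k = 1; k < m.sv.size(); ++k)
        if (std::fabs(m.sv[k].alpha) < std::fabs(m.sv[i].alpha)) i = k;

    // Step 2: test the B candidate pairs (x_i, x_j) and keep the one with minimal WD.
    std::size_t bestJ = 0;
    double bestWD = std::numeric_limits<double>::max(), bestH = 0.5;
    for (std::size_t j = 0; j < m.sv.size(); ++j) {
        if (j == i) continue;
        double ai = m.sv[i].alpha, aj = m.sv[j].alpha;
        double kappa = kernel(m.sv[i].x, m.sv[j].x, gamma);
        double mm = ai / (ai + aj);                      // valid for equal-sign coefficients
        double h = goldenSectionH(mm, kappa);            // or a table lookup, see Sect. 3
        double wd = weightDegradation(ai, aj, kappa, h);
        if (wd < bestWD) { bestWD = wd; bestJ = j; bestH = h; }
    }

    // Step 3: merge into z = h * x_i + (1 - h) * x_j with coefficient alpha_z.
    SupportVector& si = m.sv[i];
    SupportVector& sj = m.sv[bestJ];
    std::vector<double> z(si.x.size());
    for (std::size_t k = 0; k < z.size(); ++k)
        z[k] = bestH * si.x[k] + (1.0 - bestH) * sj.x[k];
    double alpha_z = si.alpha * kernel(si.x, z, gamma) + sj.alpha * kernel(sj.x, z, gamma);
    sj.x = z;
    sj.alpha = alpha_z;
    m.sv.erase(m.sv.begin() + static_cast<std::ptrdiff_t>(i));   // drop the merged-away SV
}
```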

A theoretical analysis of BSGD is provided by [25]. Their Theorem 1 establishes a bound on the error induced by the budget, ensuring that asymptotically the error is governed only by the (unavoidable) weight degradation.

Fig. 1. The merging problem. [figure omitted]

3 Precomputing the Merging Problem

The merging problem for given support vectors \(x_i\) and \(x_j\) with coefficients \(\alpha _i\) and \(\alpha _j\) is illustrated in Fig. 1. Our central observation is that the geometry depends only on the (cosine of the) angle between \(\alpha _i \phi (x_i)\) and \(\alpha _j \phi (x_j)\), and on the relative lengths of the two vectors. These two quantities are captured by the parameters

  • relative length \(m = \alpha _i / (\alpha _i + \alpha _j)\)

  • cosine of the angle \(\kappa = k(x_i, x_j)\),

both of which take values in the unit interval. The optimal merging coefficient h is a function of m and \(\kappa \), and so is the resulting weight degradation \(W\!D = \Vert \varDelta \Vert ^2\). Therefore we can express h and \(W\!D\) as functions of m and \(\kappa \), denoted as \(h(m, \kappa )\) and \(W\!D(m, \kappa )\) in the following. The functions can be evaluated to any given target precision by running the golden section search. Their graphs are plotted in Figs. 2a and b.
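
To make the reduction to the unit square explicit, one can normalize the weight degradation by the squared coefficient sum (this reformulation is ours, derived from the formulas in Sect. 2); up to the scale factor \((\alpha _i + \alpha _j)^2\), both the objective and its minimizer depend only on m and \(\kappa \):

$$\begin{aligned} \frac{\Vert \varDelta \Vert ^2}{(\alpha _i + \alpha _j)^2} = m^2 + (1-m)^2 + 2 m (1-m) \kappa - \Big ( m \, \kappa ^{(1-h)^2} + (1-m) \, \kappa ^{h^2} \Big )^2 . \end{aligned}$$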

If the functions h or \(W\!D\) can be approximated efficiently then there is no need to run a potentially costly iterative procedure like golden section search. This is our core technique for speeding up the BSGD method.

The functions blend between different budget maintenance strategies. While for \(\kappa \gg 0\) and for \(m \approx 1/2\) it is beneficial to merge the two support vectors, resulting in \(h \in (0, 1)\), this is not the case for \(\kappa \ll 1\) and \(m \approx 0\) or \(m \approx 1\), resulting in \(h \approx 0\) or \(h \approx 1\), which is equivalent to removal of the support vector with smaller coefficient. This means that in order to obtain a close fit that works well in both regimes we may need a quite flexible function class like a kernel method or a neural network, while a simple polynomial function can give poor fits, with large errors close to the boundaries.

A much simpler and computationally very cheap approach is to pre-compute the function on a grid covering the domain \([0, 1] \times [0, 1]\). The values need to be pre-computed only once, and here we can afford to apply golden section search with high precision; we use \(\varepsilon = 10^{-10}\). Then, given two merge candidates, we can look up an approximate solution by rounding m and \(\kappa \) to the nearest grid point. The approximation quality can be improved significantly through bilinear interpolation. On modern PC hardware we can easily afford a large grid with millions of points; however, this is not even necessary to obtain excellent results. In our experiments we use a grid of size \(400 \times 400\).
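
The following sketch illustrates the table construction and the bilinear lookup. The grid layout, the class name, and the reuse of the goldenSectionH helper from the earlier sketch are our assumptions, not the reference implementation; the same structure can be used to tabulate \(W\!D(m, \kappa )\) instead of \(h(m, \kappa )\).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double goldenSectionH(double m, double kappa, double eps);   // from the earlier sketch

// Lookup table for h(m, kappa) on a regular grid over [0,1] x [0,1],
// queried with bilinear interpolation.
class MergeLookup {
public:
    explicit MergeLookup(std::size_t gridSize = 400) : n(gridSize), table(gridSize * gridSize) {
        // Precompute once with high-precision golden section search (eps = 1e-10).
        for (std::size_t a = 0; a < n; ++a)
            for (std::size_t b = 0; b < n; ++b) {
                double m     = double(a) / double(n - 1);
                double kappa = double(b) / double(n - 1);
                table[a * n + b] = goldenSectionH(m, kappa, 1e-10);
            }
    }

    // Bilinear interpolation of the tabulated value at (m, kappa) in [0,1] x [0,1].
    double operator()(double m, double kappa) const {
        double u = m * double(n - 1), v = kappa * double(n - 1);
        std::size_t a = std::min<std::size_t>(std::size_t(u), n - 2);
        std::size_t b = std::min<std::size_t>(std::size_t(v), n - 2);
        double fu = u - double(a), fv = v - double(b);   // position within the grid cell
        double v00 = table[a * n + b],       v01 = table[a * n + b + 1];
        double v10 = table[(a + 1) * n + b], v11 = table[(a + 1) * n + b + 1];
        return (1.0 - fu) * ((1.0 - fv) * v00 + fv * v01)
             +        fu  * ((1.0 - fv) * v10 + fv * v11);
    }

private:
    std::size_t n;                 // grid resolution per axis (assumed >= 2)
    std::vector<double> table;     // row-major: table[a * n + b] = h(m_a, kappa_b)
};
```

With such a table, the per-pair search in the budget maintenance loop reduces to a constant-time lookup.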

Bilinear interpolation is fast, and moreover it is easy to implement. When looking up \(h(m, \kappa )\) this way, we obtain a plug-in replacement for golden section search in BSGD. However, we can equally well look up \(W\!D(m, \kappa )\) instead to save additional computation steps. Another benefit of \(W\!D\) over h is regularity, see Figs. 2a and b and the following lemma.

Fig. 2. Graphs of the functions \(h(m, \kappa )\) (a) and \(W\!D(m, \kappa )\) (b). The latter uses a log scale on the value axis. [figure omitted]

Lemma 1

The functions h and \(W\!D\) are smooth for \(\kappa > e^{-2}\). The function h is continuous outside the set \(Z = \{1/2\} \times [0, e^{-2}] \subset [0, 1]^2\) and discontinuous on Z. The function \(W\!D\) is everywhere continuous.

Proof

The function \(s_{m,\kappa }(h') = m \kappa ^{(1-h')^2} + (1-m) \kappa ^{h'^2}\) used in line 7 of Algorithm 1 inside the \(\arg \max \) expression is a weighted sum of two Gaussian kernels. Depending on the parameters m and \(\kappa \), it can have one or two modes. It has two modes for parameters in Z, as can be seen from an elementary calculation yielding \(s''_{1/2,\kappa }(1/2) > 0 \Leftrightarrow \kappa < e^{-2}\). Due to symmetry, the dominant mode switches at \(m=1/2\). The inverse function theorem applied to branches of \(s_{m, \kappa }\) implies that \(h(m, \kappa ) = \arg \max _{h'}\{s_{m,\kappa }(h')\}\) and \(W\!D(m, \kappa ) = (\alpha _i + \alpha _j)^2 \cdot \big (m^2 + (1-m)^2 - [s_{m,\kappa }(h(m,\kappa ))]^2 + 2m(1-m)\kappa \big )\) vary smoothly with their parameters as long as the same mode is active. The maximum operation is continuous, and so is \(W\!D\). For each m there is a critical value of \(\kappa \le e^{-2}\) where \(s_{m,\kappa }\) switches from one to two modes. We collect these parameter configurations in the set N. On N (in contrast to Z), h is continuous. With the same argument as above, h and \(W\!D\) are smooth outside \(N \cup Z\).    \(\square \)
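
For reference, the elementary calculation behind the mode criterion (our own expansion of the remark in the proof): writing \(\kappa ^{h'^2} = e^{h'^2 \ln \kappa }\) and differentiating twice, at \(m = 1/2\) and \(h' = 1/2\) one obtains

$$\begin{aligned} s''_{1/2,\kappa }(1/2) = \ln \kappa \, \big ( 2 + \ln \kappa \big ) \, \kappa ^{1/4}, \end{aligned}$$

which is positive (so that \(h' = 1/2\) is a local minimum between two modes) exactly when \(\ln \kappa < -2\), that is, \(\kappa < e^{-2}\).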

Bilinear interpolation is well justified if the function is continuous, and differentiable within each grid cell. The above lemma ensures this property for \(\kappa > e^{-2}\), and it furthermore indicates that for its continuity, interpolating \(W\!D\) is preferable over interpolating h. The regime \(\kappa < e^{-2}\) corresponds to merging two points in a distance of more than two “standard deviations” of the Gaussian kernel. This is anyway undesirable, since it can result in a large weight degradation. In fact, if \(s_{m,\kappa }\) has two modes, then the optimal merge is close to the removal of one of the points, which is known to give poor results [25].

4 Experimental Evaluation

In this section we evaluate our method empirically, with the aim of investigating its properties more closely and demonstrating its practical value. To this end, we would like to answer the following questions:

  1. Which speed-up is achievable?

  2. Do we pay for speed-ups with reduced test accuracy?

  3. How do results depend on the budget size?

  4. How much do merging decisions differ from the original method?

To answer these questions we compare our algorithm to “standard” BSGD with merging based on golden section search. We have implemented both algorithms in C++; the implementation is available from the first author’s homepage.Footnote 1 We train SVM models on the binary classification problems SUSY, SKIN, IJCNN, ADULT, WEB, and PHISHING, covering a range of different sizes. The regularization parameter \(C = \frac{1}{n \cdot \lambda }\) and the kernel parameter \(\gamma \) were tuned on a grid of the form \(\log _2(C), \log _2(\gamma ) \in \mathbb {Z}\) using 10-fold cross-validation. The data sets are summarized in Table 1. SVMs were trained with 20 passes through the data, except for the huge SUSY data, where we used a single pass.

Table 1. Data sets used in this study, hyperparameter settings, and test accuracy of the exact SVM model found by LIBSVM.

To answer the first question, we trained SVM models with BSGD, comparing golden section search (GSS) with our new algorithms looking up \(h(m, \kappa )\) (Lookup-h) or \(W\!D(m, \kappa )\) (Lookup-WD). For reference, we also ran golden section search with precision \(\varepsilon = 10^{-10}\) (GSS-precise). We used two different budget sizes for each problem.

Table 2. Test accuracy achieved by the different methods, averaged over 5 runs at different budget sizes.

All methods found SVM models with comparable accuracy as shown in Table 2; in fact, in most cases the systematic differences are below one standard deviation of the variability between different runs.Footnote 2 In contrast, the time spent on budget maintenance differs significantly between the methods. In Fig. 3 we provide a detailed breakdown of the merging time, obtained with a profiler.

Fig. 3. Breakdown of the merging time in seconds for GSS-precise, GSS, Lookup-h and Lookup-WD. Section A represents the time invested to compute h using either golden section search or lookup; for the Lookup-WD method the same bar represents the lookup of \(W\!D(m,\kappa )\). Section B summarizes all other operations like loop overheads, the computation of \(\alpha _z\), and the construction of the final merge vector z. The numbers on top of the columns for the lookup methods indicate the saving over GSS. [figure omitted]

Lookup-WD and Lookup-h are faster than GSS, which is (unsurprisingly) faster than GSS-precise. The results are very systematic, see Table 3 and Fig. 3. The greatest savings of about \(44\%\) of the total training time are observed for the rather large SUSY data set. Although the speed-up can also be insignificant, like for the WEB data, lookup is never slower than GSS. The actual saving depends on the cost of kernel computations and on the fraction of SGD iterations in which merging occurs. The latter quantity, which we refer to as the merging frequency, is provided in Table 3. We observe that the savings shown in Fig. 3 nicely correlate with the merging frequency.

The profiler results provide a more detailed understanding of the differences: replacing GSS with Lookup-h significantly reduces the time for computing \(h(m, \kappa )\). Replacing Lookup-h with Lookup-WD removes further steps in the calculation of \(W\!D(m, \kappa )\), but practically speaking the difference is hardly noticeable.

Overall, our method offers a systematic speed-up. The speed-up does not come at any cost in terms of solution precision. This answers the first two questions.

If the budget size is chosen so large that merging is never needed then all tested methods coincide, however, this defeats the purpose of using a budget in the first place. We find that the merging frequency is nearly independent of the budget size as long as the budget is significantly smaller than the number of support vectors of the full kernel SVM model, and hence the fraction of runtime saved is independent of the budget size. The results in Fig. 3 are in line with this expectation, answering the third question.

In the next experiment we have a closer look at the impact of lookup-based merging decisions by investigating the behavior in single iterations, as follows. During a run of BSGD we execute GSS and Lookup-WD in parallel. We count the number of iterations in which the merging decisions differ, and if so, we also record the difference between the weight degradation values. The results are presented in Table 3. They show that the decisions of the two methods agree most of the time, for some problems in more than \(99\%\) of all budget maintenance events.

Finally, we investigate the precision with which the weight degradation is estimated by the different methods. While GSS can solve the problem to arbitrary precision, the reference implementation determines \(h(m, \kappa )\) only to a rather loose precision of \(\varepsilon = 0.01\) in order to save computation time. In contrast, we ran GSS to high precision \(\varepsilon = 10^{-10}\) when precomputing the lookup table, however, we may lose some precision due to bilinear interpolation. This loss shrinks as the grid size grows, which comes at added storage cost, but without any runtime cost. We investigate the precision of GSS and Lookup-WD by comparing them to GSS-precise, which is considered a reasonable approximation of the exact minimum of \(\Vert \varDelta \Vert ^2\). For both methods we record the factor by which their squared weight degradations exceed the minimum, see Table 3. All factors are very close to one, hence none of the algorithms is wasteful in terms of weight degradation, and indeed Lookup-WD with a grid size of \(400 \times 400\) is more precise on all 6 data sets. This answers our last question.

Table 3. Relative improvement of the total training time with respect to golden section search, averaged over 5 runs (Lookup-h vs. GSS and Lookup-WD vs. GSS), the fraction of merging events for budget size 100, and statistics on the quality of merging decisions (refer to the text for details).

5 Conclusion

We have proposed a fast lookup as a plug-in replacement for the iterative golden section search procedure required when merging support vectors in large-scale kernel SVM training. The new method compares favorably to the iterative baseline in terms of training time: it offers a systematic speed-up, with computational savings of up to \(65\%\) of the merging time and up to \(44\%\) of the total training time, while the training time is never increased. With our method, nearly all of the computation time is spent on actual SGD steps, while the fraction of effort spent on budget maintenance is reduced significantly. We have demonstrated that our approach results in virtually indistinguishable and even slightly more precise merging decisions. For this reason, the speed-up comes at no cost in terms of predictive accuracy.