
1 Introduction

In the last two decades, Support Vector Machines (SVMs) [7], thanks to their accuracy and their insensitivity to the dimensionality of the data [18], have become a popular machine learning technique with applications including genetics [5], image processing [9], and weather forecasting [16]. In this paper, we are interested only in SVMs for classification. SVMs belong to the family of supervised learning algorithms, i.e. algorithms that build a decision model from labelled training samples (the training dataset). The SVM decision model is represented by the maximal-margin hyperplane, i.e. the hyperplane that separates the training dataset into two classes with the largest possible gap between the hyperplane and both classes. Building the SVM decision model leads to solving a quadratic programming (QP) problem. A brief description of SVMs is given in Sect. 2.

In Sect. 3, PermonSVM [10] is introduced. PermonSVM is one of the few open-source SVM implementations for distributed environments (it is parallelised using MPI) and focuses on solving large SVM problems on supercomputers. PermonSVM is built on top of PETSc [4] and PermonQP [11]. PermonQP is a PETSc-based package for the solution of large-scale QP problems; it includes implementations of several QP solvers and can also use any of the KSP [17] and TAO [14] solvers.

Section 4 describes the MPRGP [8] algorithm used for the solution of the QP arising from the SVM formulation. MPRGP is implemented in PermonQP.

Finally, numerical results are presented in Sect. 5. We investigate the convergence of the SVM training by looking, in each MPRGP iteration, at the training rate (the percentage of correctly classified samples in the training dataset), the hyperplane margin, the value of the QP cost function, and the norm of the projected gradient (used in the stopping criterion of MPRGP). The scalability of our approach is also demonstrated.

2 Support Vector Machines for Classification

SVM is a supervised binary classifier, i.e. a classifier that decides, by means of a model, whether a sample falls into Class A (label 1) or Class B (label \(-1\)). The model is determined from the already categorised training samples in the training phase of the classifier. Unless otherwise stated, let us assume that the training samples are linearly separable, i.e. it is possible to separate the Class A samples from the Class B samples by a hyperplane. The essential idea of SVM classifier training is to find the maximal-margin hyperplane that divides the Class A samples from the Class B samples by the widest possible empty strip, called the margin. The samples contributing to the definition of such a hyperplane are called the support vectors – see the circled samples lying on the dashed hyperplanes depicted in Fig. 1.

Fig. 1. An example of a two-class classification problem solved by the linear hard-margin SVM.

Let us denote the training samples as a set of ordered pairs such that

$$\begin{aligned} T := \{ \left( \varvec{x}_1, y_1\right) , \ \left( \varvec{x}_2, y_2\right) , \ \dots , \ \left( \varvec{x}_m, y_m\right) \}, \end{aligned}$$

where m is the number of samples, \(\varvec{x}_i \in \mathbb {R}^n\) (\(n \in \mathbb {N}\) is the number of attributes) is the i-th sample, and \(y_i \in \{-1, 1\}\) denotes the label of the i-th sample, \(i \in \{1, \ 2, \ \dots , m\}\). Let H be the maximal-margin hyperplane \(\varvec{w}^T \varvec{x} - b = 0\), where \(\varvec{w}\) is its normal vector and \(\frac{b}{\Vert \varvec{w}\Vert }\) determines the offset of the hyperplane H from the origin along \(\varvec{w}\). The problem of finding the hyperplane H can be formulated as the following constrained optimization problem, called the hard-margin primal SVM formulation:

$$\begin{aligned} \min _{\varvec{w}, \ b} \ \frac{1}{2} \varvec{w}^T \varvec{w} \ \ \ \text {s.t.} \ \ y_i\left( \varvec{w}^T\varvec{x}_i - b\right) \ge 1, \ \ i \in \{1, \ 2, \ \dots , m\}. \end{aligned}$$
(1)

For training samples that are not perfectly linearly separable, the soft-margin SVM was designed. To reduce the sensitivity of the SVM classifier to possible outliers, we introduce the slack variables \(\xi _1, \xi _2, \dots , \xi _m\) and modify the hard-margin primal SVM formulation (1) into the soft-margin primal SVM formulation

$$\begin{aligned} \min _{\varvec{w}, \ b, \ \xi _i} \ \frac{1}{2} \varvec{w}^T \varvec{w} + C \sum _{i = 1}^m \xi _i \ \ \ \text {s.t.} \ \ {\left\{ \begin{array}{ll} \ y_i\left( \varvec{w}^T\varvec{x}_i - b\right) \ge 1 - \xi _i, \\ \ \xi _i \ge 0, \end{array}\right. } \end{aligned}$$
(2)

where C is a user-specified penalty. A higher value of C puts more weight on penalising the slack variables, i.e. on satisfying the margin constraint for as many samples as possible, at the expense of minimising \(\Vert \varvec{w} \Vert \) (equivalent to maximising the margin). Let us refer here to several monographs mentioning the significance of C:

  • “In the support-vector networks algorithm one can control the trade-off between complexity of decision rule and frequency of error by changing the parameter C,...” [7]

  • “The parameter C controls the trade off between errors of the SVM on training data and margin maximization (C = \(\infty \) leads to hard-margin SVM)” [15, p. 82].

  • “...the coefficient C affects the trade-off between complexity and proportion of nonseparable samples and must be selected by the user” [6, p. 366].

We can observe that if \(0 < \xi _i \le 1\), the i-th sample is still correctly classified, but it lies between the separating hyperplane and the margin hyperplane of its class (illustrated in Fig. 2); if \(\xi _i > 1\), the i-th sample is misclassified (illustrated in Fig. 3).
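To make this classification of the slack values concrete, the following C snippet evaluates \(\xi _i = \max \left( 0, \ 1 - y_i\left( \varvec{w}^T\varvec{x}_i - b\right) \right) \), which is the value the slack variables take at the optimum of (2), for a few made-up samples and a fixed, made-up hyperplane; it is a minimal illustration only and is unrelated to the PermonSVM code.

#include <stdio.h>

int main(void)
{
  /* Illustrative values only: a fixed hyperplane (w, b) and three samples of Class A. */
  const double w[2] = {1.0, 0.0}, b = 0.0;
  const double x[3][2] = { {2.0, 1.0}, {0.5, -1.0}, {-0.5, 2.0} };
  const double y[3] = {1.0, 1.0, 1.0};

  for (int i = 0; i < 3; i++) {
    double f  = w[0]*x[i][0] + w[1]*x[i][1] - b;  /* w^T x_i - b              */
    double xi = 1.0 - y[i]*f;                     /* margin violation         */
    if (xi < 0.0) xi = 0.0;                       /* xi_i = max(0, 1 - y_i f) */
    if (xi == 0.0)
      printf("sample %d: on or outside the margin (xi = 0)\n", i);
    else if (xi <= 1.0)
      printf("sample %d: inside the margin, still correctly classified (xi = %g)\n", i, xi);
    else
      printf("sample %d: misclassified (xi = %g)\n", i, xi);
  }
  return 0;
}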

Fig. 2. Soft-margin SVM example: the encircled samples are correctly classified, but are on the wrong side of their respective margin hyperplane.

Fig. 3. Soft-margin SVM example: the encircled samples are misclassified.

The primal formulation of the soft-margin SVM (2) can be simplified by exploiting Lagrange duality with the Lagrange multipliers \(\varvec{\alpha } = \left[ \alpha _1, \ \alpha _2, \ \dots , \alpha _m \right] ^T\) and \(\varvec{\beta } = \left[ \beta _1, \ \beta _2, \ \dots , \ \beta _m \right] ^T\). Evaluating the Karush-Kuhn-Tucker conditions, eliminating \(\varvec{\beta }\), and applying further modifications, we arrive at the dual formulation with an inequality (box) constraint [7]:

$$\begin{aligned} \min _{\varvec{\alpha }} \ \frac{1}{2} \varvec{\alpha }^T \varvec{Y}^T \varvec{K} \varvec{Y} \varvec{\alpha } - \varvec{\alpha }^T \varvec{e} \ \text {s.t.} \ \ \varvec{o} \le \varvec{\alpha } \le C \varvec{e}, \end{aligned}$$
(3)

where \(\varvec{e} = \left[ 1, 1, \dots , 1\right] ^T\), \(\varvec{o} = \left[ 0, \ 0, \ \dots , \ 0\right] ^T\), \(\varvec{X} = \left[ \varvec{x}_1, \ \varvec{x}_2, \ \dots , \varvec{x}_m \right] \), \(\varvec{y} = \left[ y_1, \ y_2, \ \dots , y_m\right] ^T\), \(\varvec{Y} = diag(\varvec{y})\), and \(\varvec{K} \in \mathbb {R}^{m \times m}\) is a symmetric positive semi-definite (SPS) matrix such that \(\varvec{K} := \varvec{X}^T \varvec{X}\). In the machine learning community, \(\varvec{K}\) is called the Gram matrix or the kernel matrix; in the QP terminology, it is the Hessian.

Further, we introduce the dual-to-primal reconstruction formulas for the normal vector

$$\begin{aligned} \varvec{w} = \varvec{X}\varvec{Y}\varvec{\alpha }, \end{aligned}$$
(4)

and the bias

$$\begin{aligned} b = \frac{1}{\left| I^{SV} \right| } \sum _{i \in I^{SV}} \left( \varvec{x}_i^T \varvec{w} - y_i\right) , \end{aligned}$$
(5)

where \(I^{SV}\) denotes the support vector index set, i.e. \(I^{SV} := \{ i \ | \ \alpha _i > 0, \ i = 1, 2, \ \dots , m \}\), and \(\left| I^{SV} \right| \) is the cardinality of \(I^{SV}\). From the normal vector \(\varvec{w}\) and bias b, we can easily set up the decision rule

$$\begin{aligned} \text {If} \ \varvec{w}^T \varvec{x} - b \ge 0, \ \text {then} \ \varvec{x} \ \text {belongs to Class A, else} \ \varvec{x} \ \text {belongs to Class B}. \end{aligned}$$
(6)

The decision rule (6) with concrete \(\varvec{w}\) and b is also called the SVM model for linearly separable problems.
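The reconstruction formulas (4)–(5) and the decision rule (6) are straightforward to evaluate. The following minimal C sketch does so for a tiny made-up dataset and a made-up dual solution \(\varvec{\alpha }\) (all numbers are illustrative only, and the code is independent of PermonSVM):

#include <stdio.h>

#define M 4  /* number of training samples      */
#define N 2  /* number of attributes per sample */

int main(void)
{
  /* Toy data (illustrative values only); column i of X is the sample x_i. */
  const double X[N][M] = { {1.0, 2.0, -1.0, -2.0},
                           {1.0, 2.0, -1.0, -2.0} };
  const double y[M]     = { 1.0, 1.0, -1.0, -1.0 };
  /* A dual solution alpha as a QP solver could return it (made up here).  */
  const double alpha[M] = { 0.25, 0.0, 0.25, 0.0 };

  /* (4): w = X Y alpha */
  double w[N] = {0.0, 0.0};
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      w[j] += X[j][i] * y[i] * alpha[i];

  /* (5): b averaged over the support vectors (alpha_i > 0) */
  double b = 0.0;
  int nsv = 0;
  for (int i = 0; i < M; i++) {
    if (alpha[i] > 0.0) {
      double xw = 0.0;
      for (int j = 0; j < N; j++) xw += X[j][i] * w[j];
      b += xw - y[i];
      nsv++;
    }
  }
  b /= nsv;

  /* (6): classify a previously unseen sample */
  const double x_new[N] = {1.5, 0.5};
  double f = -b;
  for (int j = 0; j < N; j++) f += w[j] * x_new[j];
  printf("w = (%g, %g), b = %g, decision value = %g -> Class %s\n",
         w[0], w[1], b, f, (f >= 0.0) ? "A" : "B");
  return 0;
}

For this toy data, the sketch yields \(\varvec{w} = (0.5, 0.5)^T\) and b = 0, so the new sample is assigned to Class A.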

3 PermonSVM: SVM Implementation Based on PETSc

PermonSVM is a new SVM tool designed to run in parallel, including on large supercomputers. It is written on top of PETSc [4] and PermonQP [11]. The parallelism is provided by the MPI-based distribution of matrices and vectors through the PETSc framework.

PermonSVM provides an implementation of the two-class classification via soft-margin SVM. It implements a scalable training procedure based on a linear kernel. In the training procedure, PermonSVM takes advantage of the scalable matrix-vector product of PETSc matrices and vectors and an implicit representation of the Gram matrix (i.e. the matrix product \(\varvec{X}^T \varvec{X}\) is not formed), which saves memory and CPU time.
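The implicit treatment of the Gram matrix amounts to applying the dual Hessian \(\varvec{Y}^T \varvec{K} \varvec{Y} = \varvec{Y}^T \varvec{X}^T \varvec{X} \varvec{Y}\) as two successive matrix-vector products with \(\varvec{X}\). The following C function is a serial sketch of this idea using plain arrays (the function name and data layout are ours for illustration; PermonSVM realises the same idea with distributed PETSc Mat and Vec objects):

#include <stdlib.h>

/* Apply v = Y^T X^T X Y a without forming the m-by-m Gram matrix K = X^T X.
 * Xt is m-by-n in row-major order, its i-th row being the sample x_i^T;
 * y holds the labels, a the current dual iterate, v the result.
 * Cost: O(m*n) operations and O(n) extra storage per application.        */
int hessian_apply(int m, int n, const double *Xt,
                  const double *y, const double *a, double *v)
{
  double *s = calloc((size_t)n, sizeof(double));   /* s = X Y a = sum_i y_i a_i x_i */
  if (!s) return -1;
  for (int i = 0; i < m; i++) {
    const double t = y[i] * a[i];
    for (int j = 0; j < n; j++)
      s[j] += t * Xt[(size_t)i*n + j];
  }
  for (int i = 0; i < m; i++) {                    /* v_i = y_i * (x_i^T s)         */
    double d = 0.0;
    for (int j = 0; j < n; j++)
      d += Xt[(size_t)i*n + j] * s[j];
    v[i] = y[i] * d;
  }
  free(s);
  return 0;
}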

The resulting QP problem with an implicit Hessian matrix is solved by the scalable QP solvers implemented in the PermonQP package.

Additional features include fast, load-balanced cross-validation and grid search for parameter tuning, L1 and L2 loss functions, and a LIBSVM data parser. PermonSVM provides an executable for SVM classification as well as a C API designed to be PETSc-like. Its typical usage is presented in Code 1.

Code 1. Typical usage of PermonSVM (listing not reproduced).

4 MPRGP Algorithm

MPRGP (Modified Proportioning and Reduced Gradient Projection) [8] represents an efficient algorithm for the solution of convex QP with box constraints, i.e. for

$$\begin{aligned} \min \frac{1}{2}\varvec{x}^{T}\varvec{A}\varvec{x}-\varvec{x}^{T}\varvec{b}\quad \text {s.t.}\quad \varvec{l} \le \varvec{x} \le \varvec{u}, \end{aligned}$$
(7)

where \(\varvec{A} \in \mathbb {R}^{n\times n}\) is SPS, \(\varvec{x}\) is the solution, \(\varvec{b}\) is the right-hand side, and \(\varvec{l}\) and \(\varvec{u}\) are the lower and upper bounds, respectively. The basic version can be considered a modification of the Polyak algorithm; MPRGP combines the proportioning algorithm with gradient projections.

Let \(\varvec{g}=\varvec{A}\varvec{x}-\varvec{b}\) be the gradient. Then we can define the component-wise (for \(j \in \{1,2,\dots ,n\}\)) gradient splitting, which is computed after each gradient evaluation. The free gradient is defined as

$$\begin{aligned} g^{f}_{j} := {\left\{ \begin{array}{ll} g_{j} & \text {if } l_{j}< x_{j} < u_{j}, \\ 0 & \text {otherwise}. \end{array}\right. } \end{aligned}$$

The reduced free gradient is

$$\begin{aligned} \widetilde{g}^{f}_{j} := {\left\{ \begin{array}{ll} \min \left( \frac{x_{j} - l_{j}}{\overline{\alpha }}, \ g_{j}\right) & \text {if } l_{j}< x_{j} < u_{j} \text { and } g_{j} > 0, \\ \max \left( \frac{x_{j} - u_{j}}{\overline{\alpha }}, \ g_{j}\right) & \text {if } l_{j}< x_{j} < u_{j} \text { and } g_{j} \le 0, \\ 0 & \text {otherwise}, \end{array}\right. } \end{aligned}$$

where \(\overline{\alpha } \in (0,2||\varvec{A}||^{-1}]\) is used as a step length in the expansion step. The definition of the chopped gradient is

$$\begin{aligned} g^{c}_{j} := {\left\{ \begin{array}{ll} \min \left( g_{j}, 0\right) & \text {if } x_{j} = l_{j}, \\ \max \left( g_{j}, 0\right) & \text {if } x_{j} = u_{j}, \\ 0 & \text {otherwise}. \end{array}\right. } \end{aligned}$$

Finally, the projected gradient is defined as \(\varvec{g}^{P} = \varvec{g}^f + \varvec{g}^c\). The decrease of its norm provides the natural stopping criterion of the algorithm.

Let the projection onto the feasible set \(\varOmega = \{\varvec{x}: \varvec{l} \le \varvec{x} \le \varvec{u} \}\) be defined as

$$\begin{aligned} P_{\varOmega }(\varvec{x})_j = \min (u_j,\max (l_j,x_j)). \end{aligned}$$
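Both the gradient splitting and the projection are cheap component-wise operations. The following C fragment is a serial sketch of the formulas above with plain arrays (the routine names are ours and abar stands for \(\overline{\alpha }\)); it only illustrates the definitions and does not mirror the PermonQP implementation:

#include <math.h>

/* Component-wise splitting of the gradient g at the iterate x subject to l <= x <= u. */
void split_gradient(int n, const double *x, const double *g,
                    const double *l, const double *u, double abar,
                    double *gf, double *gr, double *gc)
{
  for (int j = 0; j < n; j++) {
    const int is_free = (x[j] > l[j]) && (x[j] < u[j]);
    /* free gradient: the gradient on the free set, zero on the active set */
    gf[j] = is_free ? g[j] : 0.0;
    /* chopped gradient: the part of g violating the KKT conditions on the active set */
    if (x[j] <= l[j])      gc[j] = fmin(g[j], 0.0);
    else if (x[j] >= u[j]) gc[j] = fmax(g[j], 0.0);
    else                   gc[j] = 0.0;
    /* reduced free gradient: g^f clipped so that x - abar*gr stays feasible */
    if (!is_free)          gr[j] = 0.0;
    else if (g[j] > 0.0)   gr[j] = fmin((x[j] - l[j]) / abar, g[j]);
    else                   gr[j] = fmax((x[j] - u[j]) / abar, g[j]);
  }
  /* the projected gradient is simply g^P = g^f + g^c */
}

/* Projection of x onto the feasible set Omega = { l <= x <= u }. */
void project(int n, const double *l, const double *u, double *x)
{
  for (int j = 0; j < n; j++)
    x[j] = fmin(u[j], fmax(l[j], x[j]));
}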

Now we have all the necessary ingredients to summarise MPRGP in Algorithm 1.

Algorithm 1. MPRGP (listing not reproduced).
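As an illustration only, the following C routine is a simplified serial sketch of the MPRGP iteration along the lines of [8], reusing the split_gradient() and project() helpers sketched above; the exact form of the expansion step and other details may differ from Algorithm 1 and certainly differ from the distributed PermonQP implementation.

#include <float.h>
#include <math.h>
#include <stdlib.h>

/* Helpers from the preceding sketch. */
void split_gradient(int n, const double *x, const double *g,
                    const double *l, const double *u, double abar,
                    double *gf, double *gr, double *gc);
void project(int n, const double *l, const double *u, double *x);

/* Dense matrix-vector product y = A*x and dot product for an n-by-n matrix A. */
static void matvec(int n, const double *A, const double *x, double *y)
{
  for (int i = 0; i < n; i++) {
    double s = 0.0;
    for (int j = 0; j < n; j++) s += A[(size_t)i*n + j] * x[j];
    y[i] = s;
  }
}
static double dot(int n, const double *x, const double *y)
{
  double s = 0.0;
  for (int i = 0; i < n; i++) s += x[i] * y[i];
  return s;
}

/* Simplified MPRGP sketch: minimise 1/2 x^T A x - x^T b subject to l <= x <= u.
 * A is dense SPS, x holds a feasible initial guess and is overwritten by the
 * approximate solution; returns the number of iterations (or -1 on failure).  */
int mprgp(int n, const double *A, const double *b, const double *l, const double *u,
          double abar, double Gamma, double rtol, int maxit, double *x)
{
  double *g = malloc((size_t)n*sizeof(double)), *gf = malloc((size_t)n*sizeof(double));
  double *gr = malloc((size_t)n*sizeof(double)), *gc = malloc((size_t)n*sizeof(double));
  double *p = malloc((size_t)n*sizeof(double)), *Ap = malloc((size_t)n*sizeof(double));
  if (!g || !gf || !gr || !gc || !p || !Ap) return -1;

  const double bnorm = sqrt(dot(n, b, b));
  matvec(n, A, x, g);                                       /* g = A x - b        */
  for (int j = 0; j < n; j++) g[j] -= b[j];
  split_gradient(n, x, g, l, u, abar, gf, gr, gc);
  for (int j = 0; j < n; j++) p[j] = gf[j];                 /* p = g^f            */

  int it;
  for (it = 0; it < maxit; it++) {
    double gP = 0.0;                                        /* ||g^P||^2          */
    for (int j = 0; j < n; j++) gP += (gf[j]+gc[j])*(gf[j]+gc[j]);
    if (sqrt(gP) <= rtol*bnorm) break;                      /* relative stopping criterion */

    if (dot(n, gc, gc) <= Gamma*Gamma*dot(n, gr, gf)) {     /* proportional iterate */
      matvec(n, A, p, Ap);
      double pAp = dot(n, p, Ap);
      if (pAp <= 0.0) break;                                /* safeguard (A is only SPS) */
      double acg = dot(n, g, p)/pAp;                        /* unconstrained CG step length */
      double af = DBL_MAX;                                  /* max feasible step along -p   */
      for (int j = 0; j < n; j++) {
        if (p[j] > 0.0)      { double a = (x[j]-l[j])/p[j]; if (a < af) af = a; }
        else if (p[j] < 0.0) { double a = (x[j]-u[j])/p[j]; if (a < af) af = a; }
      }
      if (acg <= af) {                                      /* conjugate gradient step */
        for (int j = 0; j < n; j++) { x[j] -= acg*p[j]; g[j] -= acg*Ap[j]; }
        split_gradient(n, x, g, l, u, abar, gf, gr, gc);
        double beta = dot(n, gf, Ap)/pAp;
        for (int j = 0; j < n; j++) p[j] = gf[j] - beta*p[j];
      } else {                                              /* expansion step */
        for (int j = 0; j < n; j++) { x[j] -= af*p[j]; g[j] -= af*Ap[j]; }
        split_gradient(n, x, g, l, u, abar, gf, gr, gc);
        for (int j = 0; j < n; j++) x[j] -= abar*gf[j];     /* fixed step along g^f */
        project(n, l, u, x);
        matvec(n, A, x, g);                                 /* recompute g = A x - b */
        for (int j = 0; j < n; j++) g[j] -= b[j];
        split_gradient(n, x, g, l, u, abar, gf, gr, gc);
        for (int j = 0; j < n; j++) p[j] = gf[j];
      }
    } else {                                                /* proportioning step */
      matvec(n, A, gc, Ap);
      double dAd = dot(n, gc, Ap);
      if (dAd <= 0.0) break;
      double acg = dot(n, g, gc)/dAd;
      for (int j = 0; j < n; j++) { x[j] -= acg*gc[j]; g[j] -= acg*Ap[j]; }
      split_gradient(n, x, g, l, u, abar, gf, gr, gc);
      for (int j = 0; j < n; j++) p[j] = gf[j];
    }
  }
  free(g); free(gf); free(gr); free(gc); free(p); free(Ap);
  return it;
}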

5 Numerical Experiments

In this section, we show the scalability of our approach as well as the relations between the training rate (percentage of correctly classified samples), the hyperplane margin, the value of the dual functional, and the norm of the projected gradient. The hyperplane (given by \(\varvec{w}\) and b) is computed in each iteration of MPRGP. Using the computed hyperplane, we can evaluate the training rate and the margin (\(2/||\varvec{w}||\)). The value of the dual functional is trivially computed from the gradient, which is available in every MPRGP iteration. The computation of these metrics is relatively expensive; therefore, it is disabled by default, but it can be enabled by a command line switch. The decrease of the projected gradient norm is the natural, and also the default, stopping criterion of MPRGP.
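For completeness, the computation of the dual functional from the gradient follows directly from (3): writing \(\varvec{H} := \varvec{Y}^T \varvec{K} \varvec{Y}\) for the Hessian and \(\varvec{g} = \varvec{H}\varvec{\alpha } - \varvec{e}\) for the gradient maintained by MPRGP, we have

$$\begin{aligned} f\left( \varvec{\alpha }\right) = \frac{1}{2} \varvec{\alpha }^T \varvec{H} \varvec{\alpha } - \varvec{\alpha }^T \varvec{e} = \frac{1}{2} \varvec{\alpha }^T \left( \varvec{g} - \varvec{e}\right) , \end{aligned}$$

so the functional value costs only one extra dot product per iteration.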

The relations mentioned above are demonstrated on a dataset from the ExCAPE project [1] and on the URL dataset [13]. The aim of the ExCAPE project is to predict compound bioactivity for the pharmaceutical industry; the tested dataset is related to the Pfam protein database and contains 226.1 thousand samples with 2048 attributes. The URL dataset is related to the detection of malicious websites involved in criminal scams; it contains 2.4 million samples with 3.23 million attributes and is publicly available on the LIBSVM datasets website [2] in the LIBSVM format.

The experiments were run on the Salomon supercomputer [3] at IT4Innovations. Salomon consists of 1008 compute nodes; each compute node contains two 2.5 GHz, 12-core Intel Xeon E5-2680v3 (Haswell) processors and 128 GB of memory. The compute nodes are interconnected by InfiniBand FDR56. Salomon has a peak performance of 2 petaFLOPS.

The initial guess was set to the zero vector. The relative norm of the projected gradient (i.e. the ratio of the projected gradient norm and the right-hand side norm) being smaller than 1e−1 was used as the stopping criterion in all numerical experiments. In our experience, although this tolerance is exceptionally loose, it is more than adequate for finding a good solution, as illustrated by the following results.

In Tables 1 and 2 and in the accompanying Figs. 4 and 5, the impact of the parameter C is shown. We report the maximal achieved training rate (Max rate) and the training rate at solver convergence (Converged rate), as well as the numbers of iterations needed to reach these rates.

Looking at the results for the ExCAPE dataset (Table 1 and Fig. 4), except for C = 1e−5, the difference between the maximal rate and the converged rate ranges from 0.4 to 0.63%. Also note that the best rate is achieved after relatively few iterations: to actually satisfy the convergence criterion, between 2.6 and 8 times as many iterations are needed as to reach the maximal rate.

Table 1. ExCAPE dataset: comparison of the maximal achieved training rate (and the iteration in which it occurred) with the training rate obtained after the solver converged (and the iteration in which this occurred).
Fig. 4. ExCAPE dataset: comparison of the maximal achieved training rate (and the iteration in which it occurred) with the training rate obtained after the solver converged (and the iteration in which this occurred).

The differences are much smaller for the URL dataset (Table 2 and Fig. 5). Again, we ignore in the following discussion the results for the smallest parameter C, because the solution is not good enough. The rate attained after convergence is between 0.04 and 0.13% lower than the maximal rate. However, to reach the best rate, only 50 to 70% of the number of iterations needed to achieve convergence is required.

Table 2. URL dataset: comparison of the maximal achieved training rate (and the iteration in which it occurred) with the training rate obtained after the solver converged (and the iteration in which this occurred).
Fig. 5. URL dataset: comparison of the maximal achieved training rate (and the iteration in which it occurred) with the training rate obtained after the solver converged (and the iteration in which this occurred).

Further, we analyse the training rate, the margin, the value of the dual functional, and the norm of the projected gradient on a per-iteration basis. The results are shown for the ExCAPE dataset in Figs. 6 and 7 for C = 1e−3, and for the URL dataset in Figs. 8 and 9 for C = 1e−5.

Fig. 6. ExCAPE dataset, C = 1e−3: the dependence of the training rate and the margin on the iteration number. The iteration number given in bold is where the maximum training rate was reached.

Fig. 7. ExCAPE dataset, C = 1e−3: the dependence of the value of the dual functional and the norm of the projected gradient on the iteration number. The iteration number given in bold is where the maximum training rate was reached.

Fig. 8. URL dataset, C = 1e−5: the dependence of the training rate and the margin on the iteration number. The iteration number given in bold is where the maximum training rate was reached.

Fig. 9. URL dataset, C = 1e−5: the dependence of the value of the dual functional and the norm of the projected gradient on the iteration number. The iteration number given in bold is where the maximum training rate was reached.

The MPRGP algorithm guarantees the decrease of the functional value in every iteration. In these examples, the norm of the projected gradient decreases monotonically as well. However, this is not guaranteed; in fact, we observed large fluctuations for the ExCAPE dataset with larger values of the parameter C.

More interestingly, the training rate peaks after a relatively small number of iterations. The training rate also oscillates; this is barely noticeable in these examples, but we observed severe oscillations for the ExCAPE dataset with larger values of the parameter C, where the rate difference between consecutive iterations was sometimes over 17%. Also notice that the hyperplane margin starts to decrease after relatively few iterations.

Since the Hessian is positive semi-definite and \(\varvec{\alpha }\) is non-negative, the decreasing (and negative) value of the dual functional implies that the dual solution \(\varvec{\alpha }\) increases on the whole, which means that the satisfaction of the first constraint in (2) improves. The margin generally has a decreasing tendency, i.e. the norm of \(\varvec{w}\) increases, suggesting (from (2)) that the sum of the distances of the margin-violating samples from their respective margin hyperplanes decreases as well. Note that this does not tell us anything about the training rate; in fact, we can see that improving this sum can lead to a decrease in the training rate.

The default stopping criterion of MPRGP, based on the norm of the projected gradient, seems ill-suited for SVM: despite the large tolerance, the problems appear to be solved unnecessarily accurately. However, it is relatively easy to implement and use the stopping criteria commonly employed in SVM solvers. Looking only at the training rate, it seems that MPRGP can obtain a reasonable solution very quickly, within a few iterations.

Table 3. ExCAPE dataset, C = 1e−3: MPRGP strong parallel scalability

Finally, we demonstrate the strong scalability of our solver. The big advantage of PermonSVM is that it can run in a distributed environment. Moreover, the MPRGP algorithm has been proven to be scalable; for example, it can solve problems of mechanics with more than a billion unknowns on tens of thousands of cores [12]. The scalability results for the ExCAPE dataset are summarised in Table 3 and Fig. 10; the results for the URL dataset are presented in Table 4 and Fig. 11. The scalability is essentially the same as the scalability of the sparse matrix-vector product. This operation is memory bound, as illustrated by the “Time on half nodes” results for the URL dataset, where only half of the cores on each node are used (6 cores on each socket). This MPI rank placement significantly increases the memory throughput available per core. Thanks to this, the scaling is almost perfect up to 48 cores, after which the distributed dataset becomes too small to utilise the cores fully.

Fig. 10. ExCAPE dataset, C = 1e−3: MPRGP strong parallel scalability

Table 4. URL dataset, C = 1e−5: MPRGP strong parallel scalability
Fig. 11. URL dataset, C = 1e−5: MPRGP strong parallel scalability

6 Conclusion

We have introduced PermonSVM, a novel open-source machine learning tool employing the scalable quadratic programming algorithms implemented in the PermonQP module. PermonSVM provides an implementation of two-class classification via the soft-margin SVM. Currently, it supports only the linear kernel. By default, it uses the MPRGP algorithm for the solution of the QP obtained from the dual SVM formulation.

We demonstrated the behaviour of the MPRGP algorithm on a dataset from the ExCAPE project as well as on the URL dataset. We analysed the relations between the training rate, the hyperplane margin, the value of the dual functional, and the norm of the projected gradient on a per-iteration basis. We note that the algorithm achieves a good training rate after relatively few iterations. The scalability of our approach was also demonstrated.

Further work will include implementation of a better stopping criterion and nonlinear kernels.