13.1 The Problem

The more dimensions a feature space has, the more computing power is needed to classify. The main advantages of support vector machines (SVMs) are (1) their effectiveness in high-dimensional spaces, including cases where the number of dimensions is higher than the number of instances in the dataset, and (2) their low use of memory and hence their memory efficiency.

The aim of the SVM algorithm is to find the best hyperplane (a line in a two-dimensional space, a plane in a three-dimensional space) that divides a dataset into two (or more) classes.

To understand how SVM works, we will take a binary classification problem (Fig. 13.1). Since there could be many lines that separate the two classes (Fig. 13.1 left), SVM looks for the instances in the dataset (points on the graph) that are closest to the dividing line; these closest points are called the support vectors. The chosen optimal classification line is the one that maximizes the distance between the support vectors of the two classes. This is called maximal margin classification.

Fig. 13.1
Two scatterplots plot y versus x. Various lines separate the two datasets in the left scatterplot. The scatterplot on the right has a line that falls along it and is labeled optimal hyperplane.

Two datasets green (circle) and blue (square) to be classified. Each point has two features x and y

Of course, the instances may not be perfectly separable by a line; in that case, we need to determine how much we should relax the constraint related to maximizing the margin. This is called soft margin classification.

The other problem to solve arises when the classes are not linearly separable; in this case, we need a non-straight line to separate the instances. In SVM, this is done using a kernel. A linear kernel separates linearly separable classes, a polynomial kernel allows the use of a curved line to separate the classes, and a radial kernel uses a radial basis function (RBF) to solve complex separations (for example, a boundary resembling a polygon in a two-dimensional space).
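As a quick illustration of this idea, the short sketch below (not part of the chapter's lab) trains one SVM with a linear kernel and one with an RBF kernel on a synthetic dataset in which one class encircles the other; the RBF kernel should score noticeably higher. The dataset generator make_circles and the SVC class are taken from scikit-learn, and all parameter values are arbitrary choices made for illustration only.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A two-class dataset that no straight line can separate (one class encircles the other)
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "accuracy:", model.score(X_test, y_test))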

13.2 The Algorithm

SVMs are a collection of similar supervised learning algorithms that are used for classification and regression [1,2,3,4]. The most effective way to grasp the fundamentals of support vector machines and how they function is to use a simple example (Fig. 13.1). Consider the following scenario: we have two tags, green (circle shape) and blue (square shape), and our data contains two features, x and y. We are looking for a classifier that, when given a pair of (x, y) coordinates, outputs whether the pair is green or blue. On a plane, we plot the training data that has previously been labeled:

When given these data points, a support vector machine will produce the hyperplane (in two dimensions, a hyperplane is simply a line) that will optimally divide the tags. The hyperplane is the decision border. In 2D, each side of the line will be considered a class (i.e., blue class and green class).

But, more specifically, what is the finest hyperplane? In the case of SVM, it is the one that maximizes the margins from both tags. The hyperplane (remember, it is a line in this case) with the greatest distance to the nearest element of each tag is known as the maximum distance hyperplane.

The SVM’s goal is to find the best hyperplane (or decision boundary) [5] that divides two different classes while also maximizing the distance between data points from both classes. There could be several hyperplanes to divide the two classes; our aim is to find the hyperplane that is at the greatest distance between data points from both classes (i.e., the greatest margin) [6]. Maximizing the margin distance allows subsequent data points to be categorized with more certainty.

Obviously, the number of features dictates the hyperplane’s dimension; in Fig. 13.2, we have two features x and y, so the hyperplane is a straight line [5]. If the number of features is three, then the hyperplane becomes a 2D plane. Beyond three features, we cannot visualize the hyperplane (Fig. 13.3).

Fig. 13.2
A scatterplot plots y versus x. Two rising lines separate the two datasets. Both lines intersect each other at the center and are labeled as the best hyperplane and poor hyperplane.

The hyperplane (remember, it is a line in this case) with the greatest distance to the nearest element of each tag

Fig. 13.3
A 2 D scatterplot with a falling hyperplane that separates the two datasets. On the left, the line is labeled as hyperplane, in R squared, and in 3 D scatterplot the line is labeled as hyperplane in R cubed.

A line hyperplane in a 2D space (left) vs. a two-dimensional hyperplane in a 3D space (right)

Support vector machines are widely used in machine learning research all around the world, particularly in the United States. When SVMs were used in a handwriting recognition test, they gained popularity since they achieved performance equivalent to that of complex neural networks with elaborated features when employing pixel maps as input [2, 7].

13.2.1 Important Concepts

Support Vectors

Support vectors are the data points that are closest to the hyperplane and are used to calculate the margin. The dividing line is drawn with the aid of these data points. It is possible to demonstrate that the optimal hyperplane is derived from the function class with the lowest capacity, that is, the smallest number of independent features or parameters that can be tuned [8]. In other words, support vectors are the data points that are closest to the decision surface (or hyperplane). They are also the data points that are most difficult to classify, and they have a direct bearing on the optimal location of the decision surface.

Hyperplane

As we can see in the diagrams above, a hyperplane is a decision plane or surface that partitions a collection of objects belonging to distinct classes. In two dimensions, the hyperplane can be represented by the following equation. It is identical to an affine combination, except that the bias b has been included [9].

$$ {\beta}_1{x}_1+{\beta}_2{x}_2+b $$

For d-dimensional space, we may generalize this and express it in vectorized form.

$$ h(x)={\beta}_1{x}_1+\cdots +{\beta}_d{x}_d+b=\left(\sum \limits_{i=1}^d{\beta}_i{x}_i\right)+b={\beta}^Tx+b $$

For any point \( X=\left({x}_1,\dots, {x}_d\right) \), if h(X) = 0, then X lies on the hyperplane; otherwise h(X) < 0 or h(X) > 0, which implies that X falls to one side of the hyperplane. To see what the coefficient weight vector β represents, assume that x1 and x2 are two arbitrary points that lie on the hyperplane; we may then write:

$$ h\left({x}_1\right)={\beta}^T{x}_1+b=0 $$
$$ h\left({x}_2\right)={\beta}^T{x}_2+b=0 $$

Hence,

$$ {\beta}^T{x}_1+b={\beta}^T{x}_2+b $$

and

$$ {\beta}^T\left({x}_1-{x}_2\right)=0 $$

If the dot product of two vectors is 0, the vectors are orthogonal to one another, and vice versa. The weight vector β is therefore orthogonal to (x1 − x2). Since (x1 − x2) lies on the hyperplane, it follows that the weight vector β is orthogonal to the hyperplane itself; that is, β points in the direction normal to the hyperplane. The bias b expresses the offset of the hyperplane in d-dimensional space [9] (Fig. 13.4).

Fig. 13.4
A hyperplane falls from the top left to the bottom right. The left and right side of the hyperplane is labeled x subscript 2 and x subscript 1. An arrow labeled beta rises from the center of the hyperplane.

The weight vector β points in the direction normal to the hyperplane
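The orthogonality argument above can be checked numerically. In the minimal sketch below, β and b are arbitrary values chosen only for illustration; two points x1 and x2 are constructed to satisfy h(x) = 0, and the dot product β · (x1 − x2) indeed comes out to zero.

import numpy as np

beta = np.array([2.0, 3.0])   # illustrative weight vector
b = -5.0                      # illustrative bias

def h(x):
    return beta @ x + b       # h(x) = beta^T x + b

# Two points lying on the hyperplane 2*x1 + 3*x2 - 5 = 0
x1 = np.array([1.0, 1.0])
x2 = np.array([2.5, 0.0])
print(h(x1), h(x2))            # both evaluate to 0
print(beta @ (x1 - x2))        # 0: beta is orthogonal to (x1 - x2)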

13.2.2 Margin

The margin can be described as the distance between two parallel lines drawn through the closest data points of the distinct classes. Equivalently, the margin of a given separating hyperplane is established by the minimal (perpendicular) distance between the observations and that hyperplane. We will see below how the margin can be used to determine the best hyperplane. It may be computed by taking the perpendicular distance between the two lines passing through the support vectors and dividing it by two.

A large margin is seen as a good margin, while a small margin is regarded as a bad margin. The size of the margin determines the confidence level of the classifier; as a result, the largest possible margin should be used. Let us compare two hyperplanes based on their margins (Fig. 13.5). The one to the left has a significantly larger margin than the one to the right, and as a result, the first hyperplane is better than the second one.

Fig. 13.5
Two scatterplots with a hyperplane that separates two datasets. Both scatterplots have support vectors. The vectors on the left scatterplot are distant, and the vectors on the right are close to the hyperplane.

Large (left) vs. small (right) margin

We may conclude that, in the maximal margin classifier, we classify the data using the separating hyperplane whose smallest (minimum) distance to the observations is the greatest (maximum). Let us keep in mind that the margin will still be used to pick the ideal separating hyperplane. Furthermore, margins are divided into two categories: the functional margin and the geometric margin [9]. They are both summarized below.

13.2.2.1 Functional Margin

To define the theoretical side of the margin, the term “functional margin” is employed. In the presence of a training example (xi, yi), the functional margin of (β, b) with regard to the training example will be as follows:

$$ \hat{\gamma_i}={y}_i\left({\beta}^T{x}_i+b\right) $$

Rather than merely requiring this quantity to be larger than 0, we have assigned a value, \( \hat{\gamma_i} \), to the margin. Thus, the following requirements may be established:

$$ \mathrm{if}\ {y}_i=+1,\ \mathrm{then}\ \hat{\gamma_i}>0\ \mathrm{requires}\ {\beta}^T{x}_i+b>0 $$
$$ \mathrm{if}\ {y}_i=-1,\ \mathrm{then}\ \hat{\gamma_i}>0\ \mathrm{requires}\ {\beta}^T{x}_i+b<0 $$

But there is a problem with the functional margin: its value depends on the scale of β and b. The equation of the hyperplane remains the same when β and b are scaled (multiplied by some scalar s), but the functional margin is multiplied by s. The following two equations represent the same hyperplane, yet they yield different functional margins (Fig. 13.6).

Fig. 13.6
A line graph with a hyperplane and two equations. The equation 20 x subscript 1 plus 30 x subscript 2 minus 50 equals 0 is above the hyperplane, and the equation 2 x subscript 1 plus 3 x subscript 2 minus 5 equals 0 is below the hyperplane.

The same hyperplane representing two equations

$$ 2{x}_1+3{x}_2-5=0 $$
$$ 20{x}_1+30{x}_2-50=0 $$
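The scale dependence of the functional margin can be verified numerically. In the sketch below (the point and its label are chosen only for illustration), the same point is evaluated against the two equivalent hyperplanes above; the functional margin is multiplied by the same factor of 10 even though the hyperplane itself is unchanged.

import numpy as np

x = np.array([3.0, 2.0])   # an arbitrary point, assumed to have label y = +1
y = 1

beta1, b1 = np.array([2.0, 3.0]), -5.0      # 2*x1 + 3*x2 - 5 = 0
beta2, b2 = np.array([20.0, 30.0]), -50.0   # the same hyperplane, scaled by 10

print(y * (beta1 @ x + b1))   # functional margin: 7
print(y * (beta2 @ x + b2))   # functional margin: 70 (10 times larger)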

13.2.2.2 Geometric Margin

Let us make considerations regarding the visuals below (Fig. 13.7):

Fig. 13.7
A scatterplot with a hyperplane labeled B separates two datasets. An arrow rises from the hyperplane labeled Y superscript i, and points to a plot labeled A. Another arrow labeled w rises from the hyperplane.

Two datasets separated by a hyperplane with weight vector w and a decision boundary (w, b)

Along with the vector w, the decision boundary corresponding to (w, b) is depicted in Fig. 13.7. It should be noted that w is orthogonal (i.e., at 90°) to the separation hyperplane.

You can convince yourself that this is a fact. Consider the point A, which represents the input x(i) of a training example with the label y(i) = 1. Its distance to the decision boundary, denoted by γ(i), is given by the line segment AB, where B is the projection of A onto the decision boundary.

How can we determine the value of γ(i)? Note that the unit-length vector w/‖w‖ points in the same direction as w. Since A represents x(i), the point B is given by x(i) − γ(i) × w/‖w‖. This point lies on the decision boundary, and all points x on the decision boundary satisfy the equation wTx + b = 0; therefore:

$$ {w}^T\left({x}^{(i)}-{\gamma}^{(i)}\frac{w}{\left\Vert w\right\Vert}\right)+b=0 $$

And solving for γ(i) yields:

$$ {\gamma}^{(i)}=\frac{w^T{x}^{(i)}+b}{\left\Vert w\right\Vert }={\left(\frac{w}{\left\Vert w\right\Vert}\right)}^T{x}^{(i)}+\frac{b}{\left\Vert w\right\Vert } $$

Specifically, this was calculated for the case of a positive training example at A in Fig. 13.7, where being on the “positive” side of the decision boundary corresponds to a positive margin.

Furthermore, we define the geometric margin of (w, b) regarding a training example (x(i), y(i)) as follows:

$$ {\gamma}^{(i)}={y}^{(i)}\left({\left(\frac{w}{\left\Vert w\right\Vert}\right)}^T{x}^{(i)}+\frac{b}{\left\Vert w\right\Vert}\right) $$

It is important to note that if ‖w‖ = 1, then the functional margin equals the geometric margin; this provides a means of connecting these two notions of margin. Moreover, the geometric margin is invariant to rescaling of the parameters: if we replace w with 2w and b with 2b, the geometric margin remains unchanged. Because of this invariance to scaling, we can impose any arbitrary scaling constraint on w without changing anything important [6]; for example, we can demand that ‖w‖ = 1, or that ∣w1 + b∣ + ∣w2∣ = 2, and any such constraint can be satisfied by simply rescaling w and b.
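The invariance of the geometric margin can be verified with the same illustrative numbers used above. The minimal sketch below computes γ(i) from the formula above for (w, b) and for (10w, 10b); the two values coincide.

import numpy as np

x_i = np.array([3.0, 2.0])   # illustrative point, assumed label y_i = +1
y_i = 1

def geometric_margin(w, b, x, y):
    norm = np.linalg.norm(w)
    return y * ((w / norm) @ x + b / norm)

w, b = np.array([2.0, 3.0]), -5.0
print(geometric_margin(w, b, x_i, y_i))            # about 1.94
print(geometric_margin(10 * w, 10 * b, x_i, y_i))  # the same value: rescaling has no effect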

13.2.3 Types of Support Vector Machines

Support vector machines are generally classified into only two types. They are both detailed below:

13.2.3.1 Linear Support Vector Machine

This type works with data that can be divided into two categories by a single straight line; in that case the dataset is considered linearly separable, and the linear SVM classifier is used. Linear SVMs are further divided into two types, visually displayed below.

13.2.3.2 Soft Margin Classifier

A soft margin classifier is an SVM in which the margin is allowed to tolerate a limited number of misclassifications, so that new data instances can still be classified correctly [10]. The well-known cross-validation technique can be used to determine how many misclassifications to tolerate (Fig. 13.8).

Fig. 13.8
A scatterplot with a hyperplane that separates two datasets. The hyperplane rises from the bottom left to the top right.

Linear SVM—soft margin classifier

In a real-world scenario, it is unlikely that a perfectly distinct line would be drawn between the data points included inside the space [11]. Furthermore, we might have a curved decision boundary. It is possible to have a hyperplane that precisely separates the data; however, this may not be desired if the data contains noise. Jakkula agrees it is preferable for the smooth border to disregard a small number of data points rather than being curved or going in loops around outliers [2].

The assumption that the dataset is perfectly linearly separable has been made up to this point. This assumption does not hold up to scrutiny when dealing with a real-world dataset. As a result, let us look at a slightly more challenging scenario. We still use a linear SVM; however, this time some of the classes overlap in such a way that a perfect separation is unattainable, even though a linear boundary remains a reasonable choice [9]. Consider a dataset with two dimensions, shown in Fig. 13.9. There are two primary possibilities:

  • A single outlier might push the decision boundary significantly, resulting in an extremely tight margin.

  • The data may not be separable using a straight line at all, even though a linear decision boundary would categorize most of the instances correctly (there is no perfectly clean boundary).

Fig. 13.9
A scatterplot with a hyperplane H of X equals 0 that separates two datasets. The distance between the two support vectors and the hyperplane is 1 over beta inside open and closed double vertical bars.

A dataset with two dimensions

In other words, the hard margin classifier that is visualized in Fig. 13.10 would not work owing to the inequality restriction yi(βTxi + b) ≥ 1.

Fig. 13.10
A scatterplot with a hyperplane that separates two datasets. H of X equals 0 denotes the hyperplane and it falls from the top left to bottom right. Two support vectors are on either side of the hyperplane.

Linear SVM—hard margin classifier

13.2.3.2.1 Hard Margin Classifier

As previously stated, the idea of SVM is to perform an affine discrimination of the observations with the greatest possible margin, that is, to identify an element w ∈ X with the lowest norm and a real value b such that yi(⟨w, xi⟩ + b) ≥ 1 for all i. To do so, we must solve the quadratic programming problem described below:

$$ \begin{array}{ll}\underset{w,b}{\min } & \left\langle w,w\right\rangle \\ \mathrm{subject}\ \mathrm{to} & {y}_i\left(\left\langle w,{x}_i\right\rangle +b\right)\ge 1,\kern1em 1\le i\le N\end{array} $$

The classification rule associated with (w, b) is simply f(x) = sign(⟨w, x⟩ + b). In this circumstance (referred to as the hard margin SVM), we require that the rule have zero error on the learning set (Fig. 13.10).
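scikit-learn's SVC has no explicit hard margin mode; as a rough sketch, a very large value of the regularization parameter C approximates the hard margin classifier by heavily penalizing every margin violation, while a small C yields a soft margin. The toy dataset and parameter values below are arbitrary choices made for illustration.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so a (near) hard margin is attainable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

hard_like = SVC(kernel="linear", C=1e6).fit(X, y)   # approximates a hard margin
soft = SVC(kernel="linear", C=0.01).fit(X, y)       # soft margin

# A softer margin typically retains more support vectors on or inside the margin
print("support vectors per class (C=1e6): ", hard_like.n_support_)
print("support vectors per class (C=0.01):", soft.n_support_)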

13.2.3.3 Nonlinear Support Vector Machine

This classifier is used for nonlinearly separable data: if a dataset cannot be classified using a straight line, it is considered nonlinear data, and the nonlinear SVM classifier is used. A graphical representation is shown below (Fig. 13.11) [11].

Fig. 13.11
A scatterplot with a circle that separates two datasets for a nonlinear support vector machine.

Representation of a nonlinear Support Vector Machine

Mathematical examples of kernels used in nonlinear support vector machines are given below:

$$ K\left(x,y\right)={\left(x\cdot y+1\right)}^p $$
$$ K\left(x,y\right)=\exp \left(-{\left\Vert x-y\right\Vert}^2/2{\sigma}^2\right) $$
$$ K\left(x,y\right)=\tanh \left(kx\cdot y-\delta \right) $$

The first equation is a polynomial kernel, the second is a radial basis function (Gaussian) kernel, and the third is a sigmoid (neural net activation function) kernel [8]. Some of these are illustrated just below.
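For concreteness, the three kernels above can be written directly as small Python functions; the sketch below is illustrative only, and the parameter values p, σ, k, and δ are arbitrary.

import numpy as np

def polynomial_kernel(x, y, p=3):
    return (np.dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, k=0.5, delta=0.1):
    return np.tanh(k * np.dot(x, y) - delta)

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(a, b), rbf_kernel(a, b), sigmoid_kernel(a, b))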

13.2.4 Classification

SVM is a useful approach to data classification. Although neural networks are often regarded as more user-friendly than SVMs, they can sometimes produce disappointing results [12]. Training and testing data for a classification task each consist of a number of data instances [2]. Each instance in the training set contains a target value and a number of features. The SVM model allows us to predict the target values of the instances in the testing dataset [13].

The classification process of SVM is a form of supervised learning. Known labels help determine whether the system is operating in the proper manner. This information either points to a desired response, thus verifying the correctness of the system, or can be used to help the system learn to behave correctly. One phase in SVM classification [2, 13] is the identification of features that are closely related to the known classes. This is referred to as feature selection or feature extraction. Even when the prediction of unknown samples is not required, the combination of feature selection with SVM classification can be beneficial: it can be used to identify the key feature sets that are involved in distinguishing the classes.
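One common way to combine feature selection with an SVM, sketched below under the assumption that scikit-learn is used, is to fit a sparse (L1-penalized) linear SVM and keep only the features with non-zero coefficients; the dataset, penalty strength, and other values are illustrative choices, not the chapter's method.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# The L1 penalty drives the coefficients of uninformative features to zero
sparse_svm = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=10000).fit(X, y)
selector = SelectFromModel(sparse_svm, prefit=True)

X_reduced = selector.transform(X)
print("features before:", X.shape[1], "after selection:", X_reduced.shape[1])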

13.2.5 Regression

Through the use of an alternative loss function, it is possible to apply SVMs to regression problems [13, 14]. The loss function must be modified to include a distance measure. There are two types of regression: linear and nonlinear. Linear models rely mostly on the loss functions listed below: the ε-insensitive loss function, the quadratic loss function, and the Huber loss function.

It is common for nonlinear models to be required for data modeling challenges, just as it is for classification problems. A technique similar to the nonlinear SVC approach, nonlinear mapping, may be used to map the data into a high-dimensional feature space, where linear regression can then be performed.

When it comes to dealing with the curse of dimensionality [15], the kernel technique is once again used. Considerations based on past knowledge of the problem and the distribution of the noise are taken into account while employing the regression approach. The robust loss function of Huber has been proved to be a good substitute in the absence of such information [13].
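As a minimal sketch of support vector regression (not the chapter's lab code), scikit-learn's SVR with an RBF kernel is fitted below to a noisy synthetic curve; the epsilon parameter corresponds to the width of the ε-insensitive loss mentioned above, and all values are illustrative.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)   # noisy sine curve

# epsilon defines the tube within which errors are ignored (epsilon-insensitive loss)
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("R^2 on the training data:", model.score(X, y))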

13.2.6 Tuning Parameters

13.2.6.1 Regularization

The regularization parameter (known as the C parameter in Python's sklearn library) tells the support vector machine how much misclassification to tolerate on the training dataset. With larger values of C, the optimizer will choose a smaller-margin hyperplane if that hyperplane succeeds in separating and classifying all the training data points correctly. Conversely, with very small values of C, the algorithm will seek a larger-margin hyperplane, even if that hyperplane misclassifies some data points.

13.2.6.2 Gamma

This tuning parameter defines how far the influence of a single training example reaches. Low gamma values mean that points far from the plausible decision boundary are taken into account when computing the separation line, whereas high gamma values mean that only points close to the plausible decision boundary are considered.
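The effect of gamma can be observed directly. In the sketch below (the gamma values are illustrative, not recommendations), RBF-kernel SVMs are trained on the Iris data with increasing gamma; very high values tend to fit the training data closely, at the risk of overfitting.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in (0.01, 0.1, 1, 10, 100):
    model = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
    print(f"gamma={gamma}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")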

13.2.6.3 Margins

The margin is the last, but not least important, characteristic. It is also a critical parameter for fine-tuning and a vital characteristic of a support vector machine classifier. The margin, as previously established, is the distance between the line and the closest data points from the classes. When using the support vector approach, it is critical to have a good and appropriate margin: the wider the separation between the two groups of data, the better the margin. A sufficient margin ensures that the individual data points remain inside their respective classes and do not cross over into another class.

13.2.7 Kernel

When using SVM, a kernel transforms the input data space into the required form. SVM employs the kernel trick to turn a low-dimensional input space into a higher-dimensional space. In other words, the kernel adds new dimensions to a problem that would otherwise be impossible to separate linearly.

Generally speaking, it is most useful in nonlinear separation situations. Simply said, the kernel performs a number of incredibly sophisticated data transformations before determining the best method of separating the data depending on the labels or outputs that have been established.

As a result, SVM gains higher scalability, adaptability, and accuracy. Kernels utilized by SVM include those listed below.

13.2.7.1 Linear Kernel

A linear kernel can be used as a dot product between any two observations. Here is the equation for a linear kernel:

$$ K\left(x,{x}_i\right)=\mathrm{sum}\left(x\times {x}_i\right) $$

In the formula above, the product between two vectors x and xi is the sum of the products of each pair of input values.

13.2.7.2 Polynomial Kernel

Curved or nonlinear input spaces can be distinguished using this generalized linear kernel. A polynomial kernel may be expressed using the following formula:

$$ K\left(x,{x}_i\right)=1+\mathrm{sum}{\left(x\times {x}_i\right)}^d $$

Here, d is the degree of the polynomial, which we must manually enter into the learning algorithm.

13.2.7.3 Radial Basis Function (RBF) Kernel

The RBF kernel implicitly maps the input space into a space of infinite dimension. It is widely used in SVM classification tasks. The following formula describes it mathematically:

$$ K\left(x,{x}_i\right)=\exp \left(-\mathrm{gamma}\times \mathrm{sum}\left({\left(x-{x}_i\right)}^2\right)\right) $$

In this case, gamma takes values between 0 and 1. We must explicitly define it in the learning algorithm; a commonly used default value of gamma is 0.1.
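In scikit-learn, these kernels are selected through the kernel, degree, and gamma arguments of SVC; the sketch below shows the corresponding calls on the Iris data, with illustrative parameter values.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for clf in (SVC(kernel="linear"),
            SVC(kernel="poly", degree=3),    # d = 3
            SVC(kernel="rbf", gamma=0.1)):   # gamma = 0.1
    print(clf.kernel, clf.fit(X, y).score(X, y))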

13.3 Advantages, Disadvantages, and Best Practices

The SVM's greatest benefit is the kernel technique, which allows it to classify extremely nonlinear situations by creating complicated boundary shapes rather than relying on simple classification rules [16]. These qualities have enabled the SVM to find widespread use in a variety of disciplines over the previous few years. SVM has been utilized for fault diagnostics [17], quality improvement [18], and quality assessment [19].

SVMs have been used in the field of computer vision for a variety of tasks, such as face detection, picture categorization, hand gesture recognition, and background removal. SVMs have been utilized in finance for purposes including financial time series forecasting and the prediction of bankruptcy. Other applications of SVM include hydrology, forecasting of solar and wind resources, prediction of atmospheric temperature, bioinformatics, speaker recognition, agricultural forecasting, and electrical design. The quality of the datasets, on the other hand, has an impact on the performance of basic support vector machines (SVMs). Noise may typically be found in real-world datasets. Noise is defined as anything that obscures the link between the attributes of an instance and the characteristics of its class. Noise may express itself as feature-noise (or feature uncertainty), which has the effect of altering the observed value of the corresponding feature. Such uncertainties may arise as a result of the constraints of observational material, as well as the restricted resources available for data collection, storage, transformation, and analysis.

Overall, the training of an SVM is quite simple, which is one of its key advantages. It scales rather well to large amounts of high-dimensional data, and the trade-off between classifier complexity and error can be carefully adjusted. The need for a good kernel function is one of its weaknesses [13, 20]. Overall, it is a good idea to standardize the features to avoid the optimal hyperplane being influenced by their scale.
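One convenient way to follow the standardization advice above, assuming scikit-learn is used, is to wrap the scaler and the classifier in a single Pipeline so that the scaling parameters are learned from the training folds only; the dataset and parameter values below are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())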

13.4 Key Terms

  1. Statistical learning theory
  2. Hyperplane
  3. Structural risk minimization
  4. Support vectors
  5. Coefficient weight vector
  6. Functional margin
  7. Geometric margin
  8. Soft margin classifier
  9. Hard margin classifier
  10. Curse of dimensionality
  11. Gamma

13.5 Test Your Understanding

  1. What are support vectors?
  2. How do support vector machines function?
  3. When have we achieved a maximum distance hyperplane?
  4. What is a hyperplane? Highlight its purpose(s).
  5. Explain the structural risk minimization concept.
  6. Describe an optimization theory-based learning algorithm.
  7. What is the difference between the functional margin and the geometric margin?
  8. List the two types of support vector machines.
  9. Distinguish between soft and hard margin classifiers.
  10. Describe the maximal margin classifier.
  11. Why do SVMs use the kernel trick?
  12. Highlight the tuning parameters of a support vector machine.

13.6 Read More

  1. K. R. Song et al., “Resting-state connectome-based support-vector-machine predictive modeling of internet gaming disorder,” Addict Biol, vol. 26, no. 4, p. e12969, Jul 2021, doi: 10.1111/adb.12969.
  2. A. Fleury, N. Noury, and M. Vacher, “Supervised classification of Activities of Daily Living in Health Smart Homes using SVM,” Annu Int Conf IEEE Eng Med Biol Soc, vol. 2009, pp. 6099–102, 2009, doi: 10.1109/iembs.2009.5334931.
  3. L. Squarcina et al., “Automatic classification of autism spectrum disorder in children using cortical thickness and support vector machine,” Brain Behav, vol. 11, no. 8, p. e2238, Aug 2021, doi: 10.1002/brb3.2238.
  4. P. Unnikrishnan, D. K. Kumar, S. Poosapadi Arjunan, H. Kumar, P. Mitchell, and R. Kawasaki, “Development of Health Parameter Model for Risk Prediction of CVD Using SVM,” Comput Math Methods Med, vol. 2016, p. 3016245, 2016, doi: 10.1155/2016/3016245.
  5. A. Fleury, M. Vacher, and N. Noury, “SVM-based multimodal classification of activities of daily living in Health Smart Homes: sensors, algorithms, and first experimental results,” IEEE Trans Inf Technol Biomed, vol. 14, no. 2, pp. 274–83, Mar 2010, doi: 10.1109/titb.2009.2037317.
  6. S. Wang et al., “Abnormal regional homogeneity as a potential imaging biomarker for adolescent-onset schizophrenia: A resting-state fMRI study and support vector machine analysis,” Schizophr Res, vol. 192, pp. 179–184, Feb 2018, doi: 10.1016/j.schres.2017.05.038.
  7. S. Wang, G. Wang, H. Lv, R. Wu, J. Zhao, and W. Guo, “Abnormal regional homogeneity as potential imaging biomarker for psychosis risk syndrome: a resting-state fMRI study and support vector machine analysis,” Sci Rep, vol. 6, p. 27619, Jun 8 2016, doi: 10.1038/srep27619.
  8. C. Cavaliere et al., “Computer-Aided Diagnosis of Multiple Sclerosis Using a Support Vector Machine and Optical Coherence Tomography Features,” Sensors (Basel), vol. 19, no. 23, Dec 3 2019, doi: 10.3390/s19235323.
  9. M. Kang, S. Shin, G. Zhang, J. Jung, and Y. T. Kim, “Mental Stress Classification Based on a Support Vector Machine and Naive Bayes Using Electrocardiogram Signals,” Sensors (Basel), vol. 21, no. 23, Nov 27 2021, doi: 10.3390/s21237916.
  10. G. Cohen and R. Meyer, “Optimal asymmetrical SVM using pattern search. A health care application,” Stud Health Technol Inform, vol. 169, pp. 554–8, 2011.

13.7 Lab

13.7.1 Working Example in Python

In this section, we will create a support vector machine classifier model, test it, and optimize it. Start by downloading the Iris dataset from the following link: https://www.kaggle.com/datasets/arshid/iris-flower-dataset. Alternatively, you can use the following code to load the dataset directly into your code.

from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, :4]
y = iris.target

The Iris dataset describes the properties of flowers. It includes three iris species, with 50 samples each. This dataset includes the following columns:

  • Petal length: petal length for the Iris

  • Petal width: petal width for the Iris

  • Sepal length: sepal length for the Iris

  • Sepal width: sepal width for the Iris

  • Species: class of the iris (there are three species in the dataset)

13.7.1.1 Loading Iris Dataset

Start by importing the required libraries and loading the dataset (Fig. 13.12).

Fig. 13.12
An algorithm to load the Iris dataset into pandas and a table of 6 columns and 11 rows. The algorithm has codes for the following. Import required libraries and load the Iris dataset.

Loading the Iris dataset into pandas
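Since the code in Fig. 13.12 appears only as an image, a minimal sketch of equivalent loading code is given below; the variable names are assumptions and may differ from the figure.

import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = [iris.target_names[t] for t in iris.target]
print(df.head(10))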

13.7.1.1.1 Visualize Iris Dataset

Visualizing the dataset can be done in many ways; one is demonstrated in Fig. 13.13.

Fig. 13.13
An algorithm to load the Iris dataset with a table of 4 columns and 4 rows. The algorithm has the codes for the following: Import required libraries, and load the Iris dataset.

Visualizing iris dataset
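As the code in Fig. 13.13 is also shown as an image, one possible visualization (a pair plot colored by species, continuing the sketch above) is outlined below; the exact plot used in the figure may differ.

import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, hue="species")   # df comes from the previous sketch
plt.show()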

13.7.1.2 Preprocess and Scale Data

We need to replace the categorical target with numeric values, split the dataset into training and testing datasets, and standardize both sets (Fig. 13.14).

Fig. 13.14
An algorithm to visualize the Iris dataset. The algorithm has codes for the following: choosing the features and target columns, scaling data.

Preprocess and scale iris dataset
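A minimal sketch of the preprocessing step (continuing the previous sketches; the split ratio, random seed, and numeric encoding are assumptions) might look as follows.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Replace the categorical target with numeric codes
x = df[iris.feature_names].values
y = df["species"].map({"setosa": 0, "versicolor": 1, "virginica": 2}).values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(x_train)   # fit the scaler on the training data only
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)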

13.7.1.3 Dimension Reduction

We can now create a support vector machine (SVM) model using an RBF kernel and C=100. The dimension of the feature matrix is low (i.e., 4); however, for illustration purposes, we will use principal component analysis (PCA) to reduce the number of features to 2. Once the PCA is created, we apply it to x_train and x_test. A two-dimensional feature matrix will allow us to plot the SVM results in two dimensions, which clarifies the end result. Instead of using PCA, and for illustration purposes, you could have opted to choose two of the four dimensions, such as sepal width and petal width (Fig. 13.15).

Fig. 13.15
An algorithm to create a support vector machine. The code reduces the features' dimensions, for illustration purposes only (the dimensions are already small: 4).

Creating support vector machine
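A sketch of the dimension reduction and model creation described above (continuing the previous code; variable names are assumptions) follows.

from sklearn.decomposition import PCA
from sklearn.svm import SVC

pca = PCA(n_components=2).fit(x_train)   # keep two principal components
x_train_2d = pca.transform(x_train)
x_test_2d = pca.transform(x_test)

svm = SVC(kernel="rbf", C=100).fit(x_train_2d, y_train)   # RBF kernel and C=100, as described in the text
print("test accuracy:", svm.score(x_test_2d, y_test))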

13.7.1.4 Hyperparameter Tuning and Performance Measurements

Using GridSearch, we can now tune the hyperparameters of the SVC. Once the optimal model is found, we fit it to the training dataset and make predictions on the testing dataset to display the classification report and the AUC (Fig. 13.16).

Fig. 13.16
An algorithm with codes for model optimization, prepare the S V C algorithm, and so on. A classification report with a table of 5 columns and 6 rows displays the AUC value.

Decision plot for iris species
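A sketch of the hyperparameter tuning step (continuing the previous code; the grid values are illustrative, and probability=True is assumed so that the multi-class AUC can be computed) follows.

from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}
grid = GridSearchCV(SVC(probability=True), param_grid, cv=5).fit(x_train_2d, y_train)

best = grid.best_estimator_
y_pred = best.predict(x_test_2d)
print(classification_report(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, best.predict_proba(x_test_2d), multi_class="ovr"))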

Optionally, we can display the confusion matrix (Fig. 13.17).

Fig. 13.17
An algorithm to visualize the data of the confusion matrix for the optimal support vector model using the testing dataset.

Confusion matrix resulting from the optimal model
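A minimal sketch of the confusion matrix display (assuming a recent scikit-learn version and the fitted model best from the previous sketch) is given below.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(best, x_test_2d, y_test)
plt.show()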

13.7.1.5 Plot the Decision Boundaries

Finally, we can plot the decision boundaries between classes (Figs. 13.18 and 13.19).

Fig. 13.18
An algorithm to plot the decision boundaries between three datasets in a 2 D scatterplot. The scatterplot of S V M classification results on the Iris dataset plots component 2 versus component 1 for Iris setosa, Iris versicolor, and Iris virginica.

Plotting the decision boundaries in 2D

Fig. 13.19
An algorithm with codes to compute and print accuracy score with a table of 5 columns and 6 rows. The column headers are blank, precision, recall, f 1 score, and support.

Calculating accuracy, recall, and precision metrics for SVM using testing dataset
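A sketch of plotting the decision boundaries in the two-dimensional PCA space (continuing the previous code; grid resolution, colors, and labels are assumptions) follows.

import numpy as np
import matplotlib.pyplot as plt

# Build a grid covering the 2D (PCA) feature space
x_min, x_max = x_train_2d[:, 0].min() - 1, x_train_2d[:, 0].max() + 1
y_min, y_max = x_train_2d[:, 1].min() - 1, x_train_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))

# Predict the class for every grid point and color the regions accordingly
zz = best.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(x_test_2d[:, 0], x_test_2d[:, 1], c=y_test, edgecolors="k")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("SVM classification results on the Iris dataset")
plt.show()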

13.7.2 Do It Yourself

13.7.2.1 The Iris Dataset Revisited

In Sect. 13.7.1 above, we applied PCA to reduce the number of features to 2.

  1. Instead of PCA, drop the petal length and sepal length and check how the SVM performance changes.
  2. Instead of PCA, drop the petal width and sepal width and check how the SVM performance changes.
  3. Which one leads to better results? Can you tell in advance which choice is more likely to lead to good performance by looking at the pair plots? We did not display the pair plots, so display them and check visually to see if you can gain an insight into the better choice.

13.7.2.2 Breast Cancer

Use the breast cancer dataset that can be downloaded from the following link: https://www.kaggle.com/code/buddhiniw/breast-cancer-prediction/data.

  1. Create an SVM model to solve this classification problem.
  2. Now that you know several classifiers, create a lab where you use three classifiers, including an SVM, and compare their performance. Conclude by choosing the best performing classifier. Always give a rationale for your choices.

13.7.2.3 Wine Classification

Use the wine dataset that can be downloaded from the following link: https://archive.ics.uci.edu/ml/datasets/wine

You can also consider using the built-in load_wine function:

from sklearn.datasets import load_wine
wine_data = load_wine()

There are three types of wine, so this is a multi-class problem. Create a model to predict the wine type using SVM (hint: use the SVC with decision_function_shape='ovr' and degree=3).

13.7.2.4 Face Recognition

You might need to install the Python image library called pillow: pip install pillow

  1. Load the images dataset using the following code:

from sklearn.datasets import fetch_lfw_people
data = fetch_lfw_people(min_faces_per_person=50)  # read only the people with 50 images or more

  2. Display on the screen the number of instances in each target class.
  3. Do you notice any imbalance in the classes? Clarify.
  4. Try to plot a few images on the screen.
  5. Split the dataset into training and testing datasets.
  6. Create an SVM variable (if there is an imbalance in the classes, use the class_weight parameter).
  7. Grid search for the optimal model.
  8. Display the best parameters and the best model.
  9. Fit the optimal model on the training dataset.
  10. Use the fitted model to predict using the testing dataset.
  11. Display a classification report (use classification_report from sklearn).
  12. Let's take one further step. PCA might help you boost the model's performance. Apply PCA before rerunning the SVM grid search and check if the performance is better.

13.7.2.5 SVM Regressor: Predict House Prices with SVR

Support vector machines can be used not only as classifiers but as regressors too. Create a support vector regressor to predict house prices, using the housing dataset that can be downloaded from the following link: https://www.kaggle.com/datasets/huyngohoang/housingcsv.

The housing dataset provides the sale price of houses across the United States. This dataset includes the following columns:

  • Avg. Area Income: the average income in the area where the house is located.

  • Avg. Area House Age: the average house age in the area where the house is located.

  • Avg. Area Number of Rooms: the average number of rooms for a house in the area where the house is located.

  • Avg. Area Number of Bedrooms: the average number of bedrooms for a house in the area where it is located.

  • Area Population: the population in the area where the house is located.

  • Price: the sale price of the house.

  • Address: the house address.

Hint: explore the SVR following this link: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html

13.7.2.6 SVM Regressor: Predict Diabetes with SVR

  1. Load the diabetes dataset using the following code:

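The code itself does not appear in this copy; a minimal sketch, assuming scikit-learn's built-in diabetes dataset is intended, is:

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
x = diabetes.data
y = diabetes.target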

  2. Store at least 30 samples in a testing dataset.
  3. Proceed with GridSearch using the following values:
    (a) alpha: [1e-7, 1e-6, 1e-5, 1e-4]
    (b) penalty: [None, 'l2']
    (c) eta0: [0.03, 0.04, 0.05, 0.1]
    (d) max_iter: [500, 1000]
  4. After fitting the optimal model, display the best parameters and the best estimator.
  5. Make predictions and display the optimal model performance (i.e., MAE, MSE, R2).

13.7.2.7 Unsupervised SVM

Support vector machines can be used not only for supervised learning but for unsupervised learning too.

Create an unsupervised support vector machine model (one-class SVM) using the housing dataset.

Hint: explore the One class SVM following this link: https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html

13.7.3 Do More Yourself

Use the following datasets and create linear and nonlinear support vector machines to solve the classification problems associated with these datasets. Also, try several algorithms to solve each and choose the best model.