
4.1 Introduction

We examined the roles of centrality and diversity in search and representation earlier, and in that discussion we also considered their roles in clustering and classification. In this chapter, we look at the roles of centrality and diversity in clustering and classification in more detail.

4.2 Clustering

We have observed that a unifying representation  of both hard and soft clustering is through matrix factorization. Specifically, if we represent the set of n l-dimensional points,

$$\begin{aligned} {\mathcal X} = \{x_1, x_2, \ldots , x_n \}, \end{aligned}$$

to be clustered as the rows of a matrix \(A_{n \times l}\), then we can factorize it into the product of matrices \(B_{n \times K}\) and \(C_{K \times l}\) where

  • \(B_{n \times K}\) is the cluster/topic assignment matrix, \(B_{ik}\) is the membership or importance of cluster k to pattern i for \(i = 1, \ldots , n\) and \(k = 1, \ldots , K\).

  • \(C_{K \times l}\) is the cluster/topic description matrix  where \(C_{kj}\) indicates the importance of feature j to cluster k, \(j= 1, \ldots , l\) and \(k = 1, \ldots , K\).

An observation about such a representation is that any data matrix \(A_{n \times l}\) is an element of \(\mathfrak {R}^{n \times l}\). Further, any \(n \times l\) matrix A has its

$$\begin{aligned} \text{rank}(A) = \text{row-rank}(A) = \text{column-rank}(A) \end{aligned}$$

where the row rank is the number of linearly independent rows of A and the column rank is the number of linearly independent columns of A. Because clustering groups the rows (the n patterns) and dimensionality reduction deals with the columns (the l features), the equality of these ranks means that, from the linear independence view, the number of clusters and the number of features coincide.
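As a quick numerical check of this rank view, here is a minimal Python sketch (assuming NumPy is available; the matrix used is the one from Example 4.1 below).

import numpy as np

# Data matrix of Example 4.1: rows are patterns, columns are features.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# rank(A) = row-rank(A) = column-rank(A); for this A all three equal 2.
print(np.linalg.matrix_rank(A))      # 2
print(np.linalg.matrix_rank(A.T))    # 2, unchanged under transposition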

We examine one representative each from clustering, feature selection, and feature extraction with the help of an example data set.

Example 4.1

Let \(A_{4 \times 3}\) be a matrix representing 4 patterns in a 3-dimensional space given by

$$\begin{aligned} A = \begin{bmatrix}1&0&1\\ 0&1&0\\ 1&0&1\\ 0&1&0 \end{bmatrix} \end{aligned}$$

For the sake of simplicity, rows are replicated: rows 1 and 3 are identical, and rows 2 and 4 are identical. We will examine clustering first using matrix A.

4.2.1 Clustering-Based Matrix Factorization

Clustering the 4 rows of A into \(K =2\) clusters gives us \(\{ A_1, A_3\}\) and \(\{A_2, A_4 \}\), where \(A_i\) is the ith row of A. This is obtained by selecting the first two rows, which are diverse, as the initial cluster centers and assigning the remaining two points (the third and fourth rows) based on nearness to the selected centers. The centroids of clusters \(c_1\) and \(c_2\) are (1, 0, 1) and (0, 1, 0) respectively. This gives us

  • The assignment matrix \(B_{4 \times 2}\) to be

    $$\begin{aligned} B = \begin{bmatrix} 1&0\\ 0&1\\ 1&0\\ 0&1 \end{bmatrix} \end{aligned}$$
  • The cluster description matrix \(C_{2 \times 3}\) has the 2 centroids as its rows given by

    $$\begin{aligned} C = \begin{bmatrix}1&0&1\\ 0&1&0 \end{bmatrix} \end{aligned}$$
  • Note that

    $$\begin{aligned} A= \begin{bmatrix}1&0&1\\ 0&1&0\\ 1&0&1\\ 0&1&0 \end{bmatrix} = \begin{bmatrix} 1&0\\ 0&1\\ 1&0\\ 0&1 \end{bmatrix} \begin{bmatrix}1&0&1\\ 0&1&0 \end{bmatrix} = BC \end{aligned}$$
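The construction above can be reproduced with a minimal Python sketch (assuming NumPy is available); it builds B from the cluster assignment, C from the centroids, and checks that BC reproduces A.

import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Partition obtained above: rows 1 and 3 in cluster 0, rows 2 and 4 in cluster 1.
labels = np.array([0, 1, 0, 1])
K = 2

# B: n x K hard assignment matrix; C: K x l matrix whose rows are the centroids.
B = np.zeros((A.shape[0], K))
B[np.arange(A.shape[0]), labels] = 1.0
C = np.array([A[labels == k].mean(axis=0) for k in range(K)])

print(np.allclose(A, B @ C))    # True for this simple data set, since A = BC exactly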

Observe that, because of the simplicity of the data, any clustering algorithm will lead to the same partition, and if centroids of clusters (or other representatives) are used, we again get the same C matrix. However, some important observations are

  • In general \(A \approx BC\). In this example, \(A= BC\) because each centroid coincides with 2 out of the 4 patterns.

  • In most of the practical applications A will have elements from \(\mathfrak {R}^+ \cup \{0\}\). The factorization is called non-negative matrix factorization (NMF) if elements of B and C are nonnegative reals.

  • It is known that in such an NMF, if any two out of A, B, and C are given, then obtaining the third is simple. In K-means algorithm (KMA) based clustering, given A, obtaining the centroids, and hence the C matrix, is reasonably straightforward.

  • In NMF, in general, we are given A and finding B and C is posed as the optimization problem

    $$\begin{aligned} \min _{B,C} ||A-BC||_F \;\; \text{s.t.} \;\; B \ge 0, \; C \ge 0 \end{aligned}$$

    where \(||A-BC ||_F\) is the squared Frobenius norm, that is, the sum of squared element-wise differences between the \(n \times l\) matrices A and BC.
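In practice this optimization is solved iteratively. A minimal sketch (assuming NumPy) using the standard multiplicative update rules of Lee and Seung, with simplified initialization and stopping, is given below.

import numpy as np

def nmf(A, K, n_iter=500, eps=1e-9, seed=0):
    # Approximate the nonnegative n x l matrix A as B (n x K) times C (K x l), with B, C >= 0.
    rng = np.random.default_rng(seed)
    n, l = A.shape
    B = rng.random((n, K))
    C = rng.random((K, l))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates; eps guards against division by zero.
        C *= (B.T @ A) / (B.T @ B @ C + eps)
        B *= (A @ C.T) / (B @ C @ C.T + eps)
    return B, C

A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
B, C = nmf(A, K=2)
print(np.round(B @ C, 3))                        # close to A
print(np.linalg.norm(A - B @ C, 'fro') ** 2)     # squared Frobenius error, close to 0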

4.2.2 Feature Selection

It is easy to observe that columns 1 and 3 of matrix A are identical. So, grouping the columns and identifying diverse columns leads to using either columns 1 and 2 or columns 2 and 3. Suppose we use columns 1 and 2 to form matrix B; then

$$ B = \begin{bmatrix} 1&0\\ 0&1\\ 1&0\\ 0&1 \end{bmatrix}.$$

Consequently, we get the same C matrix as earlier, given by

$$ C = \begin{bmatrix}1&0&1\\ 0&1&0 \end{bmatrix}.$$

This simple example is ideally suited to illustrate the equivalence between features and clusters via feature selection. Further, all the matrices involved are nonnegative, so this is also an example of NMF.
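A minimal sketch (assuming NumPy) of this column-based view: duplicate columns are detected and one representative per distinct column is retained, giving the same B matrix as above.

import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Keep the first occurrence of each distinct column; here columns 1 and 2 survive.
_, keep = np.unique(A, axis=1, return_index=True)
keep = np.sort(keep)

B = A[:, keep]
print(keep + 1)    # [1 2]: the selected feature (column) numbers
print(B)           # identical to the assignment matrix B of the clustering view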

4.2.3 Principal Component Analysis (PCA)

Principal components (PCs) are popular linear feature extractors. Suppose the data is represented in an l-dimensional space using features \(f_1, f_2, \ldots , f_l\). An extracted feature f is a linear combination of the given l features, that is, \(f = \sum _{i=1}^l \alpha _if_i\), where \(\alpha _i\) is the weight or importance associated with the given feature \(f_i\). In general, we can extract features using nonlinear combinations also, but that may be time consuming.

In PCA, the features extracted are the eigenvectors of the covariance matrix of the data, popularly called the principal components (PCs). There can be up to l PCs when A is an \(n \times l\) matrix, and they are ordered in decreasing order of the respective eigenvalues. Some properties of PCA are

  1.

    Because the underlying matrix is the covariance matrix, these eigenvalues are variances in the direction of the respective PCs. So, the first PC is in the maximum variance direction of the data.

  2.

    The covariance matrix is a symmetric matrix. So, the eigenvectors (PCs) are orthogonal to each other when the corresponding eigenvalues are distinct.

  3.

    If we take the first K out of the l possible PCs to represent the data, it corresponds to optimizing a criterion function that captures the average deviation between the given patterns in the l-dimensional space and their representations in a K-dimensional space. This minimization leads to the K PCs as the optimal new features, which are linear combinations of the given features.

  4.

    These PCs provide uncorrelated directions: when the PCs are chosen to be orthonormal, the projections of the data onto distinct PCs are uncorrelated.

Considering the data matrix \(A_{4 \times 3}\), the corresponding sample covariance matrix is obtained by first computing the zero-mean normalized matrix \(A^n\), which is

$$\begin{aligned} A^n = \begin{bmatrix}\frac{1}{2}&-\frac{1}{2}&\frac{1}{2}\\ - \frac{1}{2}&\frac{1}{2}&-\frac{1}{2}\\ \frac{1}{2}&-\frac{1}{2}&\frac{1}{2}\\ - \frac{1}{2}&\frac{1}{2}&-\frac{1}{2} \end{bmatrix} \end{aligned}$$

and then the covariance matrix \(\varSigma \) given by \(\frac{1}{4}{A^n}^tA^n\), which is

$$\begin{aligned} \varSigma = \frac{1}{4}\begin{bmatrix}1&-1&1\\ -1&1&-1\\ 1&-1&1 \end{bmatrix}. \end{aligned}$$

The eigenvalues of \(\varSigma \) are \(\frac{3}{4}\), 0, and 0. Since the eigenvalue 0 is repeated, the second eigenvector is not unique; the top two eigenvectors may be taken as \((1,-1,1)^t\) and \((1,2,1)^t\), which are orthogonal. To make them orthonormal, we normalize them to unit norm vectors to get the two PCs to be

$$\begin{aligned} \left( \frac{1}{\sqrt{3}}, -\frac{1}{\sqrt{3}},\frac{1}{\sqrt{3}} \right) ^t, \left( \frac{1}{\sqrt{6}}, \frac{2}{\sqrt{6}}, \frac{1}{\sqrt{6}} \right) ^t \end{aligned}$$

So, \(C_{pc}\) matrix is given by

$$\begin{aligned} C_{pc} = \begin{bmatrix} \frac{1}{\sqrt{3}}&-\frac{1}{\sqrt{3}}&\frac{1}{\sqrt{3}}\\ \frac{1}{\sqrt{6}}&\frac{2}{\sqrt{6}}&\frac{1}{\sqrt{6}}.\end{bmatrix} \end{aligned}$$

This gives us the \(B_{pc}\) matrix to be

$$\begin{aligned} B_{pc} = \begin{bmatrix} \frac{2}{\sqrt{3}}&\frac{2}{\sqrt{6}}\\ - \frac{1}{\sqrt{3}}&\frac{2}{\sqrt{6}}\\ \frac{2}{\sqrt{3}}&\frac{2}{\sqrt{6}}\\ - \frac{1}{\sqrt{3}}&\frac{2}{\sqrt{6}}. \end{bmatrix} \end{aligned}$$

Note that the 4 rows of \(B_{pc}\) are obtained by projecting the 4 patterns onto these two PCs. Projecting the first row (pattern) of A, that is (1, 0, 1) gives us \((\frac{2}{\sqrt{3}}, \frac{2}{\sqrt{6}})\). The second row projection gives us \((-\frac{1}{\sqrt{3}}, \frac{2}{\sqrt{6}})\). Putting them all together, we have \(A = B_{pc}C_{pc}\) given by

$$\begin{aligned} A= \begin{bmatrix}1&0&1\\ 0&1&0\\ 1&0&1\\ 0&1&0 \end{bmatrix}= \begin{bmatrix} \frac{2}{\sqrt{3}}&\frac{2}{\sqrt{6}}\\ - \frac{1}{\sqrt{3}}&\frac{2}{\sqrt{6}}\\ \frac{2}{\sqrt{3}}&\frac{2}{\sqrt{6}}\\ - \frac{1}{\sqrt{3}}&\frac{2}{\sqrt{6}} \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{3}}&- \frac{1}{\sqrt{3}}&\frac{1}{\sqrt{3}}\\ \frac{1}{\sqrt{6}}&\frac{2}{\sqrt{6}}&\frac{1}{\sqrt{6}}.\end{bmatrix} \end{aligned}$$

This factorization indicates how the 3-dimensional points are represented in the 2-dimensional PC space. When the rank of the matrix A is 2, which is the case here, we can represent it exactly using 2 orthogonal basis vectors, as indicated by the equality between A and \(B_{pc}C_{pc}\). Also, this is not an NMF, as there are negative elements in both \(B_{pc}\) and \(C_{pc}\).

However, the second eigenvalue of \(\varSigma \) is 0, so the variance is captured by the first PC itself. In such a case, using only the first PC we get the approximation

$$A= \begin{bmatrix}1&0&1\\ 0&1&0\\ 1&0&1\\ 0&1&0 \end{bmatrix}\approx \begin{bmatrix} \frac{2}{\sqrt{3}}\\ -\frac{1}{\sqrt{3}}\\ \frac{2}{\sqrt{3}}\\ -\frac{1}{\sqrt{3}} \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{3}}&-\frac{1}{\sqrt{3}}&\frac{1}{\sqrt{3}}. \end{bmatrix} $$

Here \(B_{pc}\) is the projection of the 4 rows of A onto the first PC.

This approximation amounts to \(||A-B_{pc}C_{pc}||_F = \frac{8}{3}\), that is, each pattern is approximated with a squared error of \(\frac{2}{3}\). However, the 1-dimensional representation is still able to discriminate patterns 1 and 3 from patterns 2 and 4. Other rank-one approximations are also possible; for example, the following one has a squared Frobenius norm of 4.

$$A = \begin{bmatrix}1&0&1\\ 0&1&0\\ 1&0&1\\ 0&1&0 \end{bmatrix}\approx \begin{bmatrix} \sqrt{3}\\ 0\\ \sqrt{3}\\ 0 \end{bmatrix}\begin{bmatrix} \frac{1}{\sqrt{3}}&-\frac{1}{\sqrt{3}}&\frac{1}{\sqrt{3}} \end{bmatrix}= \begin{bmatrix}1&-1&1\\ 0&0&0\\ 1&-1&1\\ 0&0&0 \end{bmatrix}.$$
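The computations in this PCA example can be reproduced with the following Python sketch (assuming NumPy); it uses the two PCs given above, verifies the exact rank-2 factorization, and computes the squared Frobenius errors of the two rank-one approximations.

import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

An = A - A.mean(axis=0)                    # zero-mean normalized matrix A^n
Sigma = (An.T @ An) / A.shape[0]           # covariance matrix with the 1/n factor
print(np.round(np.linalg.eigvalsh(Sigma), 3))   # eigenvalues 0, 0 and 0.75

# The two PCs used above, as the rows of C_pc.
C_pc = np.vstack([np.array([1.0, -1.0, 1.0]) / np.sqrt(3),
                  np.array([1.0,  2.0, 1.0]) / np.sqrt(6)])
B_pc = A @ C_pc.T                          # projections of the 4 patterns onto the PCs
print(np.allclose(A, B_pc @ C_pc))         # True: A has rank 2 and lies in the plane of the PCs

# Rank-one approximation using only the first PC.
v1 = C_pc[:1].T
A1 = (A @ v1) @ v1.T
print(np.linalg.norm(A - A1, 'fro') ** 2)  # 8/3, i.e., 2/3 per pattern

# The alternative rank-one approximation displayed above.
B_alt = np.array([[np.sqrt(3)], [0.0], [np.sqrt(3)], [0.0]])
C_alt = np.array([[1.0, -1.0, 1.0]]) / np.sqrt(3)
print(np.linalg.norm(A - B_alt @ C_alt, 'fro') ** 2)   # 4.0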

Even though some discrimination between elements of the two clusters is exhibited in the PC space, in general the first K PCs may not be able to retain the discrimination present in the l-dimensional space. The reason is that the underlying optimization is designed to reduce the expected squared deviation between the patterns in the l-dimensional space and their representations in the K-dimensional space, specified by the minimization of

$$\begin{aligned} E[(x^l-x^K)^t(x^l-x^K)], \end{aligned}$$

where \(x^l\) is the original pattern, \(x^K\) is its approximation represented in the \(K (<l)\)-dimensional space, and E is the expectation operator. The following high-level summary of properties links the above criterion function to the PCs.

  • Note that \(x^l\) is a vector in an l-dimensional space. So, it can be uniquely represented using l orthonormal basis vectors \(v_1, \ldots , v_l\). Specifically,

    $$\begin{aligned} x^l = \sum _{i=1}^l d_iv_i \end{aligned}$$

    where \(d_i\)s are some real numbers, for \(i=1, \ldots , l\).

  • Now \(x^K\) may be viewed as coming from the K-dimensional subspace spanned by \(v_1, \ldots , v_K\), so that

    $$\begin{aligned} x^K=\sum _{i=1}^K d_iv_i \end{aligned}$$
  • The error, on substituting these expansions and exploiting the orthonormality of \(v_1, v_2, \ldots , v_l\) (with \(d_i = v_i^tx^l\) and zero-mean data), reduces to

    $$\begin{aligned} error = E[(x^l-x^K)^t(x^l-x^K)]= \sum _{i={K+1}}^l v_i^t \varSigma v_i= \sum _{i={K+1}}^l v_i^tv_i \lambda _i= \sum _{i={K+1}}^l \lambda _i \end{aligned}$$

    where \(v_i\) and \(\lambda _i\) are an eigenvector and the respective eigenvalue of \(\varSigma \).

  • This error is minimized when \(\lambda _{K+1}, \lambda _{K+2}, \ldots , \lambda _l\) are the smallest eigenvalues. This indicates that \(\lambda _1, \lambda _2, \ldots , \lambda _K\) need to be the K largest eigenvalues. Correspondingly, \(v_1, v_2, \ldots , v_K\) are the eigenvectors that characterize \(x^K\).

  • So, the first K PCs are the eigenvectors of \(\varSigma \) corresponding to the K largest eigenvalues, and they characterize the projection of each pattern onto the K-dimensional subspace.

So, the error criterion considered is intuitively appealing, as it minimizes the average error between the patterns in the l-dimensional space and their respective projections in the space of the K PCs. This optimization is reconstruction friendly, and the basis vectors of the K-dimensional space capture the variance in the data to the best possible extent. However, there is no guarantee that the K PCs retain the discrimination present in the patterns.
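As a numerical illustration of this property, the following sketch (assuming NumPy) checks on a synthetic, randomly generated data set that the average squared reconstruction error using the first K PCs equals the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
n, l, K = 1000, 5, 2
X = rng.normal(size=(n, l)) @ rng.normal(size=(l, l))   # synthetic correlated data
Xc = X - X.mean(axis=0)                                  # zero-mean data

Sigma = (Xc.T @ Xc) / n                                  # covariance matrix (1/n convention)
vals, vecs = np.linalg.eigh(Sigma)                       # eigenvalues in ascending order
vals, vecs = vals[::-1], vecs[:, ::-1]                   # reorder to decreasing eigenvalues

V_K = vecs[:, :K]                                        # the first K PCs
Xhat = Xc @ V_K @ V_K.T                                  # reconstruction in the K-PC subspace

avg_error = np.mean(np.sum((Xc - Xhat) ** 2, axis=1))    # estimate of E[(x^l - x^K)^t (x^l - x^K)]
print(np.isclose(avg_error, vals[K:].sum()))             # True: equals the sum of discarded eigenvalues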

4.2.4 Singular Value Decomposition (SVD)

A more general factorization of \(A_{n \times l}\) may be viewed as \(A_{n \times l} = B_{n \times n} D_{n \times l} C_{l \times l}\), where D is a diagonal matrix with \(n-l\) zero rows if \(n > l\) or with \(l-n\) zero columns if \(n < l\). In the earlier cases, where \(A = BC\), D may be viewed as having the identity matrix I in its diagonal portion.

SVD may be obtained as follows:

  • The orthonormal eigenvectors of the symmetric matrix \(AA^t\) form the columns of B.

  • The orthonormal eigenvectors of the symmetric matrix \(A^tA\) form the rows of C.

  • The square roots of the eigenvalues of \(AA^t\) or \(A^tA\) (whichever is smaller, based on whether \(n < l\) or \(l < n\) respectively; the nonzero eigenvalues of the two coincide) form the diagonal entries of D, with the remaining elements being 0. These diagonal entries are called the singular values of A.

  • Importantly, SVD always gives B, D, and C such that \(A = BDC\), an exact deterministic factorization of any matrix A.

For the matrix A given in the example, we have

$$ A^tA = \begin{bmatrix}2&0&2\\ 0&2&0\\ 2&0&2\end{bmatrix}. $$

The eigenvalues of \(A^tA\) are 4, 2, and 0, and the respective eigenvectors are \((1,0,1)^t, (0,1,0)^t\), and \((1,0,-1)^t\). They are orthogonal, and by normalizing them to unit norm vectors, we get the C matrix as

$$\begin{aligned} C = \begin{bmatrix} \frac{1}{\sqrt{2}}&0&\frac{1}{\sqrt{2}}\\ 0&1&0\\ \frac{1}{\sqrt{2}}&0&-\frac{1}{\sqrt{2}} \end{bmatrix} \end{aligned}$$

Similarly, the eigenvalues of \(AA^t\) are 4, 2, 0, and 0, and the respective orthonormal eigenvectors, used as the columns of B, give B as

$$\begin{aligned} B = \begin{bmatrix} \frac{1}{\sqrt{2}}&0&\frac{1}{2}&-\frac{1}{2}\\ 0&\frac{1}{\sqrt{2}}&-\frac{1}{2}&-\frac{1}{2}\\ \frac{1}{\sqrt{2}}&0&-\frac{1}{2}&\frac{1}{2}\\ 0&\frac{1}{\sqrt{2}}&\frac{1}{2}&\frac{1}{2}\end{bmatrix}. \end{aligned}$$

The matrix \(D_{4 \times 3}\) is given by

$$\begin{aligned} D = \begin{bmatrix} 2&0&0\\ 0&\sqrt{2}&0\\ 0&0&0\\ 0&0&0 \end{bmatrix}, \end{aligned}$$

where the nonzero entries \(\sqrt{4} = 2\) and \(\sqrt{2}\) are the singular values, that is, the positive square roots of the nonzero eigenvalues of either \(AA^t\) or \(A^tA\). Note that

$$\begin{aligned} A = \begin{bmatrix}1&0&1\\ 0&1&0\\ 1&0&1\\ 0&1&0 \end{bmatrix}= \begin{bmatrix} \frac{1}{\sqrt{2}}&0&\frac{1}{2}&-\frac{1}{2}\\ 0&\frac{1}{\sqrt{2}}&-\frac{1}{2}&-\frac{1}{2}\\ \frac{1}{\sqrt{2}}&0&-\frac{1}{2}&\frac{1}{2}\\ 0&\frac{1}{\sqrt{2}}&\frac{1}{2}&\frac{1}{2}\end{bmatrix}\begin{bmatrix} 2&0&0\\ 0&\sqrt{2}&0\\ 0&0&0\\ 0&0&0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}}&0&\frac{1}{\sqrt{2}}\\ 0&1&0\\ \frac{1}{\sqrt{2}}&0&-\frac{1}{\sqrt{2}} \end{bmatrix}= BDC. \end{aligned}$$

This is an exact factorization, which can be obtained for any matrix \(A_{n \times l}\). We can obtain an approximation by retaining the largest singular values and ignoring (setting to 0) the smaller ones. For example, here, if we ignore \(\sqrt{2}\), that is, approximate D by

$$\begin{aligned} D = \begin{bmatrix} 2&0&0\\ 0&0&0\\ 0&0&0\\ 0&0&0 \end{bmatrix}, \end{aligned}$$

then the resulting approximation to A based on the largest singular value is \(A_1\) where

$$\begin{aligned} A_1 = \begin{bmatrix}1&0&1\\ 0&0&0\\ 1&0&1\\ 0&0&0 \end{bmatrix}= \begin{bmatrix} \frac{1}{\sqrt{2}}&0&\frac{1}{2}&-\frac{1}{2}\\ 0&\frac{1}{\sqrt{2}}&-\frac{1}{2}&-\frac{1}{2}\\ \frac{1}{\sqrt{2}}&0&-\frac{1}{2}&\frac{1}{2}\\ 0&\frac{1}{\sqrt{2}}&\frac{1}{2}&\frac{1}{2}\end{bmatrix}\begin{bmatrix} 2&0&0\\ 0&0&0\\ 0&0&0\\ 0&0&0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}}&0&\frac{1}{\sqrt{2}}\\ 0&1&0\\ \frac{1}{\sqrt{2}}&0&-\frac{1}{\sqrt{2}} \end{bmatrix} \end{aligned}$$

Note that the squared Frobenius norm \(||A-A_1||_F\) is 2, or the Frobenius norm is \(\sqrt{2}\), which is the singular value that is ignored. This is not a coincidence. In general, if a matrix A is approximated by \(A_K\) using the top K singular values in D, then \(||A-A_K||_F = \sigma _{K+1}^2 + \sigma _{K+2}^2 + \cdots \), the sum of the squares of the ignored singular values, where \(\sigma _{K+1}\) is the largest of them. This helps in monitoring the possible error in approximating A by \(A_K\) for both dimensionality reduction and clustering. A popular application is in document representation, clustering, and classification under latent semantic analysis (LSA).
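A minimal sketch (assuming NumPy) of the same computation using np.linalg.svd: it reconstructs A exactly and confirms that the squared Frobenius error of the rank-K truncation equals the sum of squares of the ignored singular values.

import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Full SVD: A = B D C with B (4 x 4), C (3 x 3) orthogonal and D (4 x 3) "diagonal".
B, s, C = np.linalg.svd(A)              # s holds the singular values, approximately [2, sqrt(2), 0]
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, B @ D @ C))        # True: an exact factorization

# Rank-1 truncation: keep only the largest singular value.
K = 1
A_K = B[:, :K] @ np.diag(s[:K]) @ C[:K, :]
err = np.linalg.norm(A - A_K, 'fro') ** 2
print(err, (s[K:] ** 2).sum())          # both are 2, the sum of squares of the ignored singular values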

Under matrix factorization, one can characterize any linear feature extraction scheme, including feature selection, hard and soft clustering, and even classification. Note that even nonlinear problems may be viewed as linear in an appropriate high-dimensional space. So, linear algebra in general, and matrix factorization in particular, is important in several of these topics.

Even probabilistic variants like probabilistic latent semantic analysis (PLSA) have been shown to be equivalent to deterministic factorization approaches like NMF and the KMA. This happens because, in practice, both approaches depend on empirical schemes based on the given data set; in a more general sense, statistics is responsible for the equivalence. An important idea underlying matrix factorization is a criterion function that is optimized, with additional constraints that regularize, or reduce the diversity of, the solution space. We summarize the optimization-related properties associated with clustering and dimensionality reduction in Table 4.1.

Table 4.1 Optimization in clustering and dimensionality reduction

4.2.5 Diversified Clustering

Conventionally in clustering, the points in each cluster are similar to each other and points in different clusters are dissimilar. However, there are applications where each cluster needs to have diverse elements while any pair of clusters is highly similar. In other words, there is a higher within-cluster entropy and a lower between-cluster entropy.

Some of these applications are in

  • Peer Learning: If a collection of students, selected based on some qualifying score, is to be grouped, then conventional clustering will lead to stratified grouping, in which all the students with similar qualifying scores are put together. This reduces the chance for peer learning. It is better if each group has diverse students, that is, students with varying qualifying scores. Further, to avoid discrimination between groups, different groups should have similar collective behavior. This means that round-robin allotment of students to groups is a better deal than stratified grouping; a small sketch contrasting the two appears after this list.

  • Team Formation: When different soccer teams are to be selected to participate in a cup, there will be diversity in terms of the special roles of players like the goalkeeper, wing, center forward, full back, etc. This means there will be diversity in terms of these special roles within each team. Further, every team requires a goalkeeper, two wings, etc., which means any pair of teams is structurally similar. Beyond sports, this kind of grouping is required in the formation of committees and many other team formation scenarios.

  • Groups Based on a Standard: UG programmes offered by various computer science departments typically follow the ACM curriculum. So, the similarity between different UG programmes exists because of a standard like the ACM curriculum. At the same time, each programme needs to show enough diversity in terms of the representation of theoretical CS, computer systems, and other topics like ML, AI, DBMS, graphics, etc. There are other standards, for example, the Dewey Decimal Classification and the Library of Congress Classification, which are followed by libraries across the globe.
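The following sketch (using only the Python standard library) contrasts stratified grouping with the round-robin allotment mentioned in the peer learning item above; the scores and the number of groups are illustrative.

def stratified_groups(scores, num_groups):
    # Sort by score and cut into contiguous blocks: each group is homogeneous.
    ranked = sorted(scores, reverse=True)
    size = len(ranked) // num_groups
    return [ranked[i * size:(i + 1) * size] for i in range(num_groups)]

def round_robin_groups(scores, num_groups):
    # Sort by score and deal like cards: each group gets a spread of scores.
    ranked = sorted(scores, reverse=True)
    groups = [[] for _ in range(num_groups)]
    for i, score in enumerate(ranked):
        groups[i % num_groups].append(score)
    return groups

scores = [95, 90, 85, 80, 75, 70, 65, 60]   # illustrative qualifying scores
print(stratified_groups(scores, 2))   # [[95, 90, 85, 80], [75, 70, 65, 60]]: homogeneous groups
print(round_robin_groups(scores, 2))  # [[95, 85, 75, 65], [90, 80, 70, 60]]: diverse within, similar across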

4.3 Classification

We have seen in earlier chapters how search and representation impact classifiers. Knowledge is used in the form of prior densities and in the selection of representation schemes for patterns and classes. Further, knowledge can be exploited in modeling, in selecting the right model, and even in selecting the values of the hyperparameters. Search takes different forms, including searching for a solution to an optimization problem subject to some constraints. In this section, we will examine how optimization can be used in modeling and selecting classifier models.

A good number of classifiers are explicitly modeled as, or can be interpreted as, solutions to some intuitively appealing and convenient optimization problems. We will look into some of these classifiers.

4.3.1 Perceptron

The perceptron may be viewed as minimizing the sum of the violations by the training patterns, that is, their distances from the wrong side of the decision boundary under the current weight vector w of the perceptron; such violations occur because w misclassifies some training patterns. Noting that each such pattern x satisfies \(w^tx < 0\), the perceptron criterion function based on w, PCF(w), is

$$\begin{aligned} PCF(w) = -\sum _{x: w^tx <0} w^tx. \end{aligned}$$

The magnitude of \(w^tx\) captures the extent of violation of x under w. Because \(w^tx < 0\) for such an x, we minimize \(-w^tx\) for every x that is misclassified by w, so that the sum of the extents of violation is minimized.

If we consider the gradient \(\nabla _w PCF(w)\), we get \(-\sum _{x: w^tx <0} x\). So, if we use the gradient descent method to minimize PCF(w), then the updates to w are given, using the negative of the gradient with a suitable scaling factor \(\eta \), by

$$\begin{aligned} w_{k+1} = w_k + \eta \sum _{x: w_k^tx <0} x \end{aligned}$$
(4.1)

This update rule is called batch mode update. There are several simplifications to this equation.

  1.

    One variant is to use \(\eta = 1\) and consider one x that is misclassified at a time rather than the sum of all the patterns x that are misclassified by \(w_k\). This is popularly called the fixed increment rule that we discussed in the previous chapter.

  2.

    Another variant is to insist that the w obtained is simple, with a small norm and correspondingly small entries, so that it can be used effectively for classification in high-dimensional spaces. This is specified by

    $$\begin{aligned} PCF(w) = -\sum _{x: w^tx <0} w^tx + \lambda ^\prime w^tw, \end{aligned}$$
    (4.2)

    so that while minimizing the sum of violations, we keep the entries of w small as well; \(\lambda ^\prime \) is a scaling factor. Noting that the gradient of \(w^tw\) is 2w, we have the corresponding incremental update rule, one pattern at a time, to be

    $$\begin{aligned} w_{k+1} = w_k + \eta x^k - \lambda w_k \rightarrow w_{k+1} = (1-\lambda ) w_k + \eta x^k \end{aligned}$$

    where \(\lambda = 2 \eta \lambda ^\prime \) and \(x^k\) is the first pattern misclassified by \(w_k\).

Note that both these variants are constraining or regularizing the optimization solution, w.
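A minimal sketch (assuming NumPy) of the incremental update corresponding to Eq. (4.2); the patterns are assumed to be in augmented and normalized form, so a correctly classified x satisfies \(w^tx > 0\), and the data below is purely illustrative.

import numpy as np

def perceptron(X, eta=1.0, lam=0.0, max_epochs=100):
    # Incremental perceptron on normalized patterns X (each row x should end up with w^t x > 0).
    # lam = 0 gives the fixed increment rule; lam > 0 gives the regularized update
    # w <- (1 - lam) w + eta x, applied to the first misclassified pattern in each pass.
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        misclassified = [x for x in X if w @ x <= 0]
        if not misclassified:
            return w                       # every pattern is on the correct side
        w = (1.0 - lam) * w + eta * misclassified[0]
    return w

# Illustrative two-class data in augmented form (a trailing 1 for the bias term);
# class 2 patterns are multiplied by -1 so that w^t x > 0 means "correctly classified".
X_pos = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0]])
X_neg = np.array([[-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
X = np.vstack([X_pos, -X_neg])

print(perceptron(X))               # fixed increment rule
print(perceptron(X, lam=0.05))     # regularized variant, with the weights shrunk at every update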

4.3.2 Support Vector Machine (SVM)

In SVMs, the criterion function considered is the margin between the two classes, which is maximized. This may be detailed using Fig. 4.1. In the figure, the negative class patterns are on the left side and are labeled using −, and the positive class patterns are on the right side and are labeled using \(+\).

Fig. 4.1 Margin between the two classes

In this two-dimensional case, there are two parallel lines (in higher dimensions they will be parallel hyperplanes) called support lines, and the respective class boundary patterns are located on them. The negative class patterns satisfy the property that \(w^tx + b \le -1\), where the boundary vectors x, or support vectors (SVs), of the negative class satisfy \(w^tx + b = -1\). Similarly, the positive class patterns satisfy \(w^tx + b \ge 1\), with the respective SVs satisfying \(w^tx + b = +1\).

The decision boundary between the two classes is characterized by points x such that \(w^tx +b =0\). Points to its right are from the positive class and points to its left are from the negative class. If \(x_1\) and \(x_2\) are two points on the decision boundary, then

$$\begin{aligned} w^tx_1+b = w^tx_2 +b =0 \Rightarrow w^t (x_1-x_2) = 0. \end{aligned}$$

This means that the vector w is orthogonal to \(x_1-x_2\), and hence to the line on which they are located, which is the decision boundary itself. So, w is orthogonal to the decision boundary as shown in the figure.

Another property is that w points towards the positive side. Consider a problem where the origin is on the decision boundary. So, \(w^t0 + b = 0 \Rightarrow b=0\). Now if we consider a point \(x_1 \in c_+\) the positive class, then \(w^tx_1 > 0\). The cosine of the angle, \(\theta \), between w and \(x_1\) is given by

$$\begin{aligned} cos \theta = \frac{w^tx_1}{||w|| ||x_1||}. \end{aligned}$$

The denominator terms are positive here and the numerator is positive as \(x_1 \in c_+\). So, \(cos \theta > 0 \Rightarrow \) the angle between w and \(x_1\) is acute. So, w points towards the positive side.

Any point \(x \in c_+\) may be written as \(x = x_d + p \frac{w}{||w||}\), where \(x_d\) is the point at which the normal projection of x onto the decision boundary meets it. If the distance between x and \(x_d\) is p units, then the corresponding displacement vector is \(p \frac{w}{||w||}\), because w is orthogonal (normal) to the decision boundary. But as \(x \in c_+\),

$$\begin{aligned} w^tx + b = w^t(x_d + p \frac{w}{||w||}) + b = w^tx_d + b + p||w|| = p||w|| >0 \end{aligned}$$

as \(w^tx_d + b = 0\), where \(\frac{w}{||w||}\) is a unit vector in the direction of w. So,

$$\begin{aligned} w^tx + b = p||w|| \Rightarrow p = \frac{w^tx +b}{||w||}. \end{aligned}$$

So, the normal distance between any point x on the positive support line and the decision boundary is \(\frac{w^tx + b}{||w||} = \frac{1}{||w||}\). Similarly, the magnitude of the distance from any point x on the negative support line to the decision boundary is again \(\frac{1}{||w||}\). So,

$$\begin{aligned} margin = \frac{1}{||w||} + \frac{1}{||w||} = \frac{2}{||w||}. \end{aligned}$$

In SVM, we find the w that maximizes the margin; equivalently, we minimize \(\frac{1}{2}||w||^2\). The constraints are \(y_i(w^tx_i + b) \ge 1\), where \(y_i\) is the class label of \(x_i\): \(y_i = 1\) or \(-1\) based on whether \(x_i \in c_+\) or \(x_i \in c_-\) respectively.

We can express the corresponding Lagrangian by taking into account the constraints as

$$\begin{aligned} \mathcal {L}(w,b,\alpha ) = \frac{1}{2}||w||^2 + \sum _{i=1}^n \alpha _i(1-y_i(w^tx_i + b)), \end{aligned}$$

where we would like to find the vector w, the multipliers \(\alpha = \{\alpha _1, \ldots , \alpha _n\}\), and the scalar b. Optimal values of these variables can be obtained by equating the gradients to zero, which gives

$$\begin{aligned} \nabla _w {\mathcal L} = w - \sum _{i=1}^n \alpha _i y_i x_i = 0 \Rightarrow w = \sum _{i=1}^n \alpha _i y_i x_i. \end{aligned}$$
$$\begin{aligned} \nabla _b {\mathcal L} = \sum _{i=1}^n \alpha _i y_i = 0. \end{aligned}$$
$$\begin{aligned} \nabla _{\alpha _i} {\mathcal L}= 1-y_i(w^tx_i + b) = 0 \; \text{(for the SVs, with } \alpha _i > 0\text{)} \Rightarrow y_i(w^tx_i + b) = 1 \Rightarrow w^tx_i + b = y_i \end{aligned}$$

There are other Karush–Kuhn–Tucker (KKT) conditions also, including \(\alpha _i \ge 0\) and the complementary slackness condition \(\alpha _i(1-y_i(w^tx_i + b)) = 0\).

By using these equations, all the variables w, \(\alpha \), and b can be determined using different approaches. Here, the optimization problem was chosen to be well behaved, guaranteeing a globally optimal solution to the minimization. However, we face a difficulty when there is no margin between the two classes. This can happen, for example in Fig. 4.1, if a point from the positive class falls to the left of the decision boundary or, equivalently, a point from the negative class falls to the right of it. Such points are called violators, and they occur in most real-world problems.

To overcome this problem, a popularly used solution is to formulate it as a soft margin problem. This is achieved by associating with each violator a nonnegative slack that measures the extent of its violation and weighing the total violation by a parameter C. If we do not want to permit any violators, then \(C \rightarrow \infty \), and the soft margin formulation converges to the hard margin formulation. At the other extreme, \(C=0\) means every point can be a violator, which will not solve the problem in practice.

Typically, a finite positive value is used for C to accommodate some violators. The corresponding problem is

$$\begin{aligned} \min _{w} \frac{1}{2} ||w||^2 + C \sum _{i=1}^n \xi _i \end{aligned}$$

\(s.t. \; y_i(w^tx_i +b) \ge 1 -\xi _i\) and \(\xi _i \ge 0\). It is seen that there is no change in the form of the solution; the only change is that while in the hard margin formulation \(\alpha _i \ge 0\), in the soft margin case \(0 \le \alpha _i \le C\).

An important practical consideration is the right value of C. This shifts the attention from obtaining the globally optimal solution to obtaining the right value of C, so tuning the hyperparameter C takes center stage in practice. Professionally developed software packages have helped in realizing this practically.
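A minimal sketch (assuming NumPy and scikit-learn are available) of the soft margin formulation: a linear SVC is fit for two values of C on a small illustrative two-class data set, and the resulting margin \(2/||w||\) and the number of support vectors are reported.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Illustrative data: two overlapping 2-D Gaussian clouds labeled -1 and +1.
X_neg = rng.normal(loc=[-1.0, 0.0], scale=0.8, size=(50, 2))
X_pos = rng.normal(loc=[+1.0, 0.0], scale=0.8, size=(50, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_.ravel()                   # the learned w for the linear kernel
    print(C, 2.0 / np.linalg.norm(w),       # margin = 2 / ||w||; smaller C typically widens it
          len(clf.support_))                # number of support vectors (more violators allowed for small C)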

4.3.3 Summary

In this chapter, we have seen the role of optimization in dimensionality reduction, clustering, and classification. We have considered only some of the algorithms; there is potentially a large variety of other machine learning platforms, like neural networks. In a sense, optimization based solutions exhibit diversity, which is controlled using regularization to provide more central or lower variance solutions.

Note the following about optimization. The set of constraints specifies the feasible region, which typically characterizes a potentially infinite set of diverse possible solutions. The criterion function being optimized forces the selection of one or more of these diverse points in the solution space, thereby increasing centrality. A regularizer shrinks this collection of possible solutions further.

Consider, for example, a data set of the following four patterns drawn from 2 classes, as shown in Table 4.2. Let patterns 1 and 2 be training points from class 1, and let the other two, that is, patterns 3 and 4, be from class 2.

Table 4.2 4-dimensional data from two classes

Let a classifier give two weight vectors:

\(w^1\) = \((1,0,1,0,-2)\)

\(w^2\) = \((0.5,0.6,0.5,0.4,-2)\)

One can verify that both these weight vectors classify all four patterns correctly. For example, using \(w^1\) on pattern 2 gives us \(0.5+0+0.7+0 -2 <0\), and \(w^2\) on pattern 4 gives us \(0.65+0.78+0.7+0.9 - 2>0\). The other patterns can be verified similarly.

Among these two weight vectors, if we require a sparse vector, then \(w^1\) will be selected and \(w^2\) will be left out.
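This check can be mechanized with a small sketch (assuming NumPy). The patterns of Table 4.2 are not reproduced here, so the augmented pattern below (four feature values followed by 1 for the bias term) is a hypothetical placeholder used only to show the mechanics.

import numpy as np

def classifies_correctly(w, X_aug, y):
    # Class 2 patterns should give w^t x > 0 and class 1 patterns w^t x < 0 (labels +1 and -1).
    return np.all(np.sign(X_aug @ w) == y)

w1 = np.array([1.0, 0.0, 1.0, 0.0, -2.0])
w2 = np.array([0.5, 0.6, 0.5, 0.4, -2.0])

# Hypothetical augmented pattern (not from Table 4.2), labeled +1, for illustration only.
X_aug = np.array([[1.3, 1.3, 1.4, 2.0, 1.0]])
y = np.array([+1])
print(classifies_correctly(w1, X_aug, y), classifies_correctly(w2, X_aug, y))

# Sparsity comparison: the number of nonzero entries in each weight vector.
print(np.count_nonzero(w1), np.count_nonzero(w2))   # 3 versus 5, so w1 is the sparser choice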