
8.1 Introduction

Clustering and classification are two central techniques of multivariate analysis. In clustering, we try to discover an unknown number of homogeneous, inherent groups in a data set, together with the structure of those groups. In classification, by contrast, the basic problem is to discriminate objects into groups that are already known. One of the most basic abilities of living creatures is the grouping of similar objects to produce a classification, and classification is fundamental to most branches of science.

Cluster analysis has a variety of objectives. It focuses on segmenting a collection of items (also called observations, individuals, cases, or data rows) into subsets such that items within each cluster are more closely related to one another than to objects assigned to different clusters. The central notion in cluster analysis is the degree of similarity (or dissimilarity) among the individual objects being clustered. The two major approaches are hierarchical clustering and k-means clustering. Most clustering methods are exploratory in nature and do not require any model assumption.

Different statistical techniques are available for clustering and classification (Fraix-Burnet et al. 2015; De et al. 2013 and references therein). But depending on the nature of the data, several problems often arise, and in some cases a proper solution is still not available.

Sometimes the data set under consideration has a known distributional form (usually normal), and sometimes it is of non-normal nature. One therefore has to justify which clustering or classification technique should be used so that it reflects the true nature of the data set at hand. This issue is more relevant for classification, as most classification methods are model based. Most clustering methods, on the other hand, are nonparametric in nature, so the problem is less serious there. Even so, the basic assumption is that the variables under study are continuous, whereas in practical situations they may be categorical, such as binary, nominal, ordinal, or even directional (particularly for environmental and astronomical data). Under such situations, standard similarity/dissimilarity measures will not work.

Clustering techniques that require an inherent model assumption are known as model-based methods, whereas techniques that need no modeling assumption or distributional form may be termed non-model-based methods. Hence, based on the nature of the data set, one has to decide which of the two types of techniques is appropriate.

At present, big data issues related to data size are quite common. In statistical terms, such problems arise through both the number of observations and the number of variables considered. Many standard clustering techniques fail to deal with such big data sets. Thus, some dimension reduction method may be applied first, and clustering may then be performed on the reduced data set. Some data mining techniques are very helpful in such situations.

Finally and most importantly, after all these considerations, the similarity of grouping of objects obtained from different methods should be checked in terms of some physical properties.

8.2 Hierarchical Clustering Technique

There are two major methods of clustering, viz. hierarchical clustering and k-means clustering. In hierarchical clustering, the items are not partitioned into clusters in a single step. Instead, a series of partitions takes place, running from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into smaller groups. Agglomerative techniques are more commonly used. Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage of the analysis.

8.2.1 Agglomerative Methods

An agglomerative hierarchical clustering procedure produces a series of partitions of the data, \(G_n, G_{n - 1}, \ldots , G_1\). The first, \(G_n\), consists of n single-object ‘clusters,’ and the last, \(G_1\), consists of a single group containing all n cases. The structure of the groups is not unique and depends on several factors, such as the choice of the dissimilarity/similarity measure and the choice of the linkage measure.

At each particular stage, the method merges the two clusters that are most similar. At the first stage, we join the two objects that are closest together, since initially each cluster contains only one object. Differences between methods arise because of the different ways of defining dissimilarity or similarity between clusters.

Hierarchical clustering is largely dependent on the selection of such a measure. A simple measure is the Manhattan distance, equal to the sum of the absolute differences for each variable. The name comes from the fact that in a two-variable case the variables can be plotted on a grid comparable to city streets, and the distance between two points is the number of blocks a person would walk.

The most popular measure is the Euclidean distance, computed by squaring the difference in each variable, summing the squares, and taking the square root of that sum. In the two-variable case, this is analogous to finding the length of the hypotenuse of a right triangle. Besides the Manhattan and Euclidean distances, there are other dissimilarity measures based on the correlation coefficient between two observations over the variables.

Alternatively, one may use a similarity measure, which is complementary in nature; under a normalized setup, it may be obtained by subtracting the dissimilarity measure from one.
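As a minimal illustration (assuming NumPy and SciPy are available, with made-up values), the following sketch computes the Manhattan and Euclidean distances between two observations and converts a normalized dissimilarity into a similarity:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean

# Two illustrative observations measured on three continuous variables
x = np.array([1.0, 4.0, 2.5])
y = np.array([2.0, 1.0, 3.5])

manhattan = cityblock(x, y)   # sum of absolute differences
euclid = euclidean(x, y)      # square root of the sum of squared differences

# A dissimilarity normalized to [0, 1] can be turned into a similarity
# by subtracting it from one, as described above.
d_max = 10.0                  # assumed normalizing constant
similarity = 1.0 - euclid / d_max

print(manhattan, euclid, similarity)
```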

8.2.2 Similarity for Any Type of Data

The dissimilarity/similarity measures mentioned above are applicable to continuous-type data only. But we generally work with mixed-type data sets that include different types of variables, such as continuous, discrete, binary, nominal, and ordinal. Gower (1971) proposed a general measure as follows:

Gower’s Coefficient of Similarity:

Two individuals i and j may be compared on a character k and assigned a score \(s_{ijk}\). There are many ways of calculating \(s_{ijk}\), some of which are described below.

Corresponding to n individuals and p variables, Gower’s similarity index \(S_{ij}\) is defined as

$$ S_{ij} = \sum ^p_{k = 1} s_{ijk} \Big/ \sum _{k = 1}^p \delta _{ijk} \quad (i, j = 1, 2, \ldots , n) $$
$$\begin{aligned} \text {where}~\delta _{ijk}&=1~\text {when character}~k~\text {can be compared} \nonumber \\&{\qquad }\text {for observations}~i~\text {and}~j \nonumber \\&=0~\text {otherwise} \end{aligned}$$

For continuous (quantitative) variables with values \(x_{1k}, x_{2k}, \ldots , x_{n k}\) for the kth variable

$$ s_{ijk} = 1 - \mid x_{i k} - x_{j k} \mid / R_k $$

where \(R_k\) is the range of the variable k and may be the total range in population or the range in the sample.

For a categorical (qualitative) character with m categories (\(m = 2\) for binary variable)

$$\begin{aligned} s_{i j k}&=0~\text {if}~i~\text {and}~j~\text {are totally different} \nonumber \\&=q~\text {(positive fraction) if there is some degree of agreement} \nonumber \\&=1~\text {when}~i~\text {and}~j~\text {are same} \end{aligned}$$
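A small sketch of Gower’s coefficient for a single pair of mixed-type observations might look as follows; the function name, the example values, and the treatment of missing values via NaN are illustrative assumptions, not part of Gower (1971):

```python
import numpy as np

def gower_similarity(xi, xj, is_numeric, ranges):
    """Gower's S_ij for two observations with mixed variable types.

    xi, xj     : sequences of values (np.nan marks a missing value)
    is_numeric : boolean flags, True for continuous variables
    ranges     : R_k for each continuous variable (ignored for categorical)
    """
    num, den = 0.0, 0.0
    for a, b, numeric, r in zip(xi, xj, is_numeric, ranges):
        # delta_ijk = 0 when the character cannot be compared (missing value)
        if (isinstance(a, float) and np.isnan(a)) or (isinstance(b, float) and np.isnan(b)):
            continue
        if numeric:
            s = 1.0 - abs(a - b) / r          # s_ijk = 1 - |x_ik - x_jk| / R_k
        else:
            s = 1.0 if a == b else 0.0        # simple matching for categorical characters
        num += s
        den += 1.0
    return num / den if den > 0 else np.nan

# Illustrative mixed record: (sepal length in cm, colour, presence flag)
print(gower_similarity([5.1, "blue", 1], [6.0, "blue", 0],
                       is_numeric=[True, False, False],
                       ranges=[3.6, None, None]))
```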

8.2.3 Linkage Measures

To calculate the distance between two clusters, it is required to define two representative points from the two clusters (Chattopadhyay and Chattopadhyay 2014). Different methods have been proposed for this purpose. Some of them are listed below.

Single linkage: One of the simplest methods is single linkage, also known as the nearest neighbor technique. The defining feature of the method is that distance between clusters is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each cluster are considered.

In the single linkage method, \(d_{rs}\) is computed as \(d_{rs} = \min d_{ij}\), where object i is in cluster r, object j is in cluster s, and \(d_{ij}\) is the distance between objects i and j. The distance between every possible object pair (i, j), with i in cluster r and j in cluster s, is computed, and the minimum of these distances is taken as the distance between clusters r and s. In other words, the distance between two clusters is given by the length of the shortest link between them. At each stage of hierarchical clustering, the clusters r and s for which \(d_{rs}\) is minimum are merged.

Complete linkage: The complete linkage (also called farthest neighbor) clustering method is the opposite of single linkage. The distance between clusters is now defined as the distance between the most distant pair of objects, one from each cluster. In the complete linkage method, \(d_{rs}\) is computed as \(d_{rs} = \max d_{ij}\), where object i is in cluster r and object j is in cluster s. The distance between every possible object pair (i, j), with i in cluster r and j in cluster s, is computed, and the maximum of these distances is taken as the distance between clusters r and s. In other words, the distance between two clusters is given by the largest distance between their members. At each stage of hierarchical clustering, the clusters r and s for which \(d_{rs}\) is minimum are merged.

Average linkage: Here the distance between two clusters is defined as the average of the distances between all pairs of observations, where each pair consists of one object from each cluster. In the average linkage method, \(d_{rs}\) is computed as \(d_{rs} = T_{rs} / (N_r \times N_s)\), where \(T_{rs}\) is the sum of all pairwise distances between cluster r and cluster s, and \(N_r\) and \(N_s\) are the sizes of clusters r and s, respectively. At each stage of hierarchical clustering, the clusters r and s for which \(d_{rs}\) is minimum are merged.
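The three classical linkage rules above can be tried, for instance, with SciPy’s hierarchical clustering routines; the data below are synthetic and only meant to show the calls:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))          # illustrative data: 20 objects, 4 variables

D = pdist(X, metric="euclidean")      # condensed matrix of pairwise dissimilarities

# Agglomerative clustering under the three classical linkage rules
for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree at 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree for the last fit
```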

Minimax linkage: This was introduced by Bien and Tibshirani (2011). For any point x and cluster G, define

$$ d_{\max } (x, G) = \text{ max }_{y \in G} \, d(x, y) $$

as the distance to the farthest point in G from x. Define the minimax radius of the cluster G as

$$ r(G) = \text{ min }_{x \in G} \, d_{\max } \, (x, G) $$

that is, find the point \(x \in G\) from which all points in G are as close as possible. This minimizing point is called the prototype for G. It may be noted that a closed ball of radius r(G) centered at the prototype covers all of G. Finally, we define the minimax linkage between two clusters G and H as

$$ d(G, H) = r(G \cup H) $$

that is, we measure the distance between clusters G and H by the minimax radius of the resulting merged cluster.
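A direct (and deliberately naive) sketch of the minimax linkage between two clusters, following the definitions above, could be written as follows; an efficient implementation would avoid recomputing all pairwise distances at every merge:

```python
import numpy as np
from scipy.spatial.distance import cdist

def minimax_linkage(G, H):
    """Naive minimax linkage d(G, H) = r(G union H) for two clusters of points.

    r(.) is the minimax radius: the smallest maximum distance from a
    candidate prototype in the merged cluster to all of its points.
    """
    merged = np.vstack([G, H])
    D = cdist(merged, merged)           # all pairwise distances within G union H
    radii = D.max(axis=1)               # d_max(x, G union H) for every candidate x
    prototype = merged[radii.argmin()]  # point whose closed ball covers the merged cluster
    return radii.min(), prototype

rng = np.random.default_rng(1)
G = rng.normal(loc=0.0, size=(5, 2))
H = rng.normal(loc=3.0, size=(5, 2))
r, proto = minimax_linkage(G, H)
print(r, proto)
```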

8.2.4 Optimum Number of Clusters

Usually, the number of clusters is determined from the dendrogram and validated by the physical properties. We draw a horizontal line at a particular similarity/dissimilarity value, and the clusters below this line are taken as the optimum grouping. Some mathematical (thumb) rules are also available, based on the between-cluster and within-cluster sums of squares. If we denote by k the number of clusters and by W(k) the sum of the within-cluster sums of squares for k clusters, then W(k) gradually decreases as k increases, and that k may be taken as optimum at which W(k) stabilizes. For a detailed discussion, one may follow the link http://www.cc.gatech.edu/~hpark/papers/cluster_JOGO.pdf.
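One way to examine W(k), sketched below under the assumption that the hierarchical tree is simply cut at successive values of k, is to compute the within-cluster sum of squares for each cut; the synthetic data have three well-separated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(30, 3)) for c in (0.0, 4.0, 8.0)])

Z = linkage(X, method="average")      # agglomerative tree on the raw observations

def within_ss(X, labels):
    # W(k): sum over clusters of squared distances to the cluster mean
    return sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
               for g in np.unique(labels))

for k in range(1, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, round(within_ss(X, labels), 1))
# W(k) drops sharply up to the true number of clusters (here 3) and then levels off
```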

8.2.5 Clustering of Variables

The hierarchical clustering method can also be used for clustering variables on the basis of the observations. Here, instead of the distance matrix, one may start with the correlation matrix (higher correlation indicating greater similarity of variables). The linkage measures listed in the previous section are not applicable to variable clustering. To measure the similarity/dissimilarity between two clusters of variables, one may use either the correlation between the first principal components of the two clusters or the canonical correlations.
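A simple sketch of variable clustering is given below; it uses one minus the absolute correlation as the dissimilarity between variables, which is only one convenient choice and not the principal-component or canonical-correlation linkage described above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
n = 200
# Illustrative data: variables 0-1 are strongly correlated, as are variables 2-3
z1, z2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([z1, z1 + 0.1 * rng.normal(size=n),
                     z2, z2 + 0.1 * rng.normal(size=n)])

R = np.corrcoef(X, rowvar=False)      # correlation matrix of the variables
D = 1.0 - np.abs(R)                   # higher |correlation| -> smaller dissimilarity
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))   # expected grouping: [1 1 2 2] (up to labels)
```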

8.3 Partitioning Clustering: k-Means Method

The k-means algorithm (MacQueen 1967) assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster, that is, its coordinates are the arithmetic means, computed separately for each dimension, over all points in the cluster. This method can be used for clustering objects, not variables.

This method starts with a chosen value of k; we discuss the selection of k later. We then randomly form k clusters and determine their centers, or directly generate k seed points as cluster centers. Each point is assigned to the nearest cluster center in terms of Euclidean distance, the cluster centers are recomputed, and the procedure is repeated until some convergence criterion is met, i.e., there is no reassignment. The main advantages of this algorithm are its simplicity and speed, which allow it to run on large data sets. Its disadvantage is that it is highly dependent on the initial choice of clusters: it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. The method maximizes inter-cluster variance and minimizes intra-cluster variance.
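For instance, with scikit-learn the algorithm can be run from several random starts so that the dependence on the initial centers is reduced (the data here are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0)])

# n_init repeats the algorithm from several random starts and keeps the best
# solution, which mitigates the dependence on the initial centers noted above.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)      # centroids (coordinate-wise means of each cluster)
print(km.inertia_)              # total within-cluster sum of squares
print(np.bincount(km.labels_))  # cluster sizes
```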

The advantages of partitioning methods are as follows (Chattopadhyay and Chattopadhyay 2014):

(a) A partitioning method tries to select the best clustering with k groups, which is not the goal of a hierarchical method.

(b) A hierarchical method can never repair what was done in previous steps.

(c) Partitioning methods are designed to group items rather than variables into a collection of k clusters.

(d) Since a matrix of distances (similarities) does not have to be determined and the basic data do not have to be stored during the computer run, partitioning methods can be applied to much larger data sets.

For k-means algorithms, the optimum value of k can be obtained in different ways.

Following the method proposed by Sugar and James (2003), we first use the k-means algorithm to determine the cluster structure for varying numbers of clusters, taking \(k = 2, 3, 4\), etc. For each such clustering, we compute the value of a distortion measure

$$ d_k = \frac{1}{p} \, \min _{c_1, \ldots , c_k} E[(x_k - c_k)^\prime (x_k - c_k)] $$

which is defined as the (per-dimension) distance of the vector \(x_k\) (the values of the parameters) from its cluster center \(c_k\) (estimated as the mean value), where p is the order of the \(x_k\) vector. The algorithm for determining the optimum number of clusters is then as follows. Let us denote by \(d_k^\prime \) the estimate of \(d_k\) for k clusters, which in practice is the sum of the within-cluster sums of squares over all k clusters. Then \(d_k^\prime \) is the minimum achievable distortion associated with fitting k centers to the data. A natural way of choosing the number of clusters is to plot \(d_k^\prime \) versus k and examine the resulting distortion curve. This curve is always monotonically decreasing. One would expect much smaller drops, i.e., a leveling off, for k greater than the true number of clusters, because past this point adding more centers simply partitions within groups rather than between groups.

According to Sugar and James (2003), for a large number of items the distortion curve, when transformed to an appropriate negative power, will exhibit a sharp “jump” (if we plot the transformed \(d_k^\prime \) versus k). We then calculate the jumps in the transformed distortion as

$$ J_k = (d_k^{\prime - (p/2)} - d_{k - 1}^{\prime - (p/2)}) $$

Another way of choosing the number of clusters is to plot \(J_k\) versus k and examine the resulting jump curve. The optimum number of clusters is the value of k at which the distortion curve levels off, which is also the value associated with the largest jump.
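A rough sketch of the jump method, using the k-means within-cluster sum of squares (scaled by n and p) as the estimated distortion and the transformation power p/2 as suggested in the text, might be:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
p = 4                                                   # dimension of the x_k vector
X = np.vstack([rng.normal(loc=c, size=(60, p)) for c in (0.0, 3.0, 6.0)])

ks = range(1, 9)
# d'_k: average (per observation, per dimension) within-cluster sum of squares
distortion = np.array([KMeans(n_clusters=k, n_init=10, random_state=0)
                       .fit(X).inertia_ / (X.shape[0] * p) for k in ks])

transformed = distortion ** (-p / 2.0)                  # negative-power transform
jumps = np.diff(np.concatenate([[0.0], transformed]))   # J_k = d'_k^{-p/2} - d'_{k-1}^{-p/2}
best_k = ks[int(np.argmax(jumps))]
print(best_k)                                           # expected to pick k = 3 here
```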

The k-means clustering technique depends on the choice of the initial cluster centers (Chattopadhyay et al. 2012). This effect can be minimized if one chooses the cluster centers through the group average method (Milligan 1980). As a result, the formation of the final groups does not depend heavily on the initial choice and remains almost the same, in terms of physical properties, irrespective of the initial centers. In the MINITAB package, the k-means method is almost free from the effect of the initial choice of centers, as it uses the group average method.

8.4 Classification and Discrimination

Discriminant analysis and classification are multivariate techniques concerned with separating distinct sets of objects and with allocating new objects to previously defined groups. Once the optimum clustering is obtained by applying the methods discussed in the previous sections, one can verify the acceptability of the classification by computing classification/misclassification probabilities for the different objects. Although the k-means clustering method is purely a data-analytic method, for classification it may be necessary to assume that the underlying distribution is multivariate normal. The method is illustrated below for two populations (clusters); it can easily be generalized to more than two underlying populations.

Let \(f_1(x)\) and \(f_2(x)\) be the probability density functions associated with the \(p \times 1\) random vector X for the populations \(\pi _1\) and \(\pi _2\), respectively. Let \(\Omega \) be the sample space, i.e., the collection of all objects, and let x denote the observed value of X. Let \(R_1\) be the set of x values for which we classify objects as \(\pi _1\), and \(R_2 = \Omega - R_1\) the remaining x values, for which we classify objects as \(\pi _2\). Since every object must be assigned to one and only one of the two groups, the sets \(R_1\) and \(R_2\) are disjoint and exhaustive. The conditional probability of classifying an object as \(\pi _2\) when in fact it is from \(\pi _1\) (an error probability) is

$$ P(2 \mid 1) = P[X \in R_2 \mid \pi _1] = \int _{R_2} f_1 (x)\, dx $$

Similarly, the other error probability can be defined. Let \(p_1\) and \(p_2\) be the prior probabilities of \(\pi _1\) and \(\pi _2\), respectively \((p_1 + p_2 = 1)\). Then the overall probabilities of correctly and incorrectly classifying objects can be derived as

P (correctly classified as \(\pi _1\)) = P (observation actually comes from \(\pi _1\) and is correctly classified as \(\pi _1) = P[X \in R_1 \mid \pi _1] p_1\).

P (misclassified as \(\pi _1) = P[X \in R_1 \mid \pi _2] p_2\).

The associated cost of misclassification can be defined by a cost matrix

 

                           Classified as
True population            \(\pi _1\)              \(\pi _2\)
\(\pi _1\)                  0                       \(C(2 \mid 1)\)
\(\pi _2\)                  \(C(1 \mid 2)\)         0

For any rule, the average or Expected Cost of Misclassification (ECM) is given by

$$ ECM = C(2 \mid 1) P(2 \mid 1) p_1 + C(1 \mid 2) P(1 \mid 2) p_2 $$

A reasonable classification rule should have ECM as small as possible.

Rule: The regions \(R_1\) and \(R_2\) that minimize the ECM are defined by the values of x for which the following inequalities hold:

$$ R_1 : \frac{f_1(x)}{f_2(x)} > \frac{C(1 \mid 2)p_2}{C(2 \mid 1)p_1} $$
$$ R_2 : \frac{f_1(x)}{f_2(x)} < \frac{C(1 \mid 2)p_2}{C(2 \mid 1)p_1} $$

If we assume that \(f_1(x)\) and \(f_2(x)\) are multivariate normal with mean vectors \(\mu _1\) and \(\mu _2\) and covariance matrices \(\Sigma _1\) and \(\Sigma _2\), respectively, then a particular object with observation vector \(x_0\) may be classified according to the following rule (under the assumption \(\Sigma _1 = \Sigma _2 = \Sigma \)):

Allocate \(x_0\) to \(\pi _1\) if

$$ (\mu _1 - \mu _2)^\prime \Sigma ^{- 1} x_0 - \frac{1}{2} (\mu _1 - \mu _2)^\prime \Sigma ^{- 1} (\mu _1 + \mu _2) \ge \ln \left[ \frac{C(1 \mid 2)p_2}{C(2 \mid 1)p_1} \right] $$

allocate \(x_0\) to \(\pi _2\) otherwise.

If we choose \(C(1 \mid 2) = C(2 \mid 1)\) and \(p_1 = p_2\), then the estimated minimum ECM rule for two Normal populations will be as follows:

Allocate \(x_0\) to \(\pi _1\) if

$$ (m_1 - m_2)^\prime S_{\text{pooled}}^{- 1} x_0 - \frac{1}{2} (m_1- m_2)^\prime S_{\text{pooled}}^{- 1} (m_1 + m_2) \ge 0 $$

where \(m_1\) and \(m_2\) are the sample mean vectors of the two populations and \(S_{\text {pooled}}\) is the pooled (combined) sample covariance matrix; allocate \(x_0\) to \(\pi _2\) otherwise. The left-hand side is known as the linear discriminant function. One can easily generalize the method to more than two groups.
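A minimal sketch of this estimated rule for two groups, assuming equal costs and priors and using synthetic normal samples, is given below; the function name is illustrative:

```python
import numpy as np

def fit_lda_two_groups(X1, X2):
    """Estimated minimum-ECM rule with equal costs and equal priors."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    n1, n2 = len(X1), len(X2)
    S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    a = np.linalg.solve(S_pooled, m1 - m2)   # S_pooled^{-1} (m1 - m2)
    cutoff = 0.5 * a @ (m1 + m2)
    # Allocate x0 to population 1 if a'x0 - cutoff >= 0, else to population 2
    return lambda x0: 1 if a @ x0 - cutoff >= 0 else 2

rng = np.random.default_rng(6)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=50)
X2 = rng.multivariate_normal([2, 2], np.eye(2), size=50)
classify = fit_lda_two_groups(X1, X2)
print(classify(np.array([0.2, -0.1])), classify(np.array([2.1, 1.8])))  # expected: 1 2
```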

8.5 Data

Example 8.5.1

The Fisher’s Iris data set is a multivariate data set introduced by Fisher (1936). It is also known as Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa (type-3), Iris versicolor (type-2), and Iris virginica (type-1)). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters (Table 8.1).

Table 8.1 Results of k-means clustering for Iris data

We have performed k-means clustering of the data on the basis of the first four variables, viz. sepal length, sepal width, petal length, and petal width. Choosing \(k = 3\), we have divided the 150 observations into three groups in order to verify whether we can recover the three species. From columns 6 and 7 of Table 8.1, it is clear that the k-means method has correctly identified the Iris setosa (type-3) species in all 50 cases, whereas there are some errors for types 1 and 2: three cases of type 2 and fourteen cases of type 1 were wrongly identified. The summary result of the k-means clustering is given below:

Number of clusters: 3

 

          Number of      Within cluster    Average distance    Maximum distance
          observations   sum of squares    from centroid       from centroid
Cluster1  39             25.414            0.732               1.552
Cluster2  61             38.291            0.731               1.647
Cluster3  50             15.151            0.482               1.248
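The clustering can be reproduced approximately with scikit-learn’s built-in copy of the Iris data, as sketched below; the cluster numbering is arbitrary and the exact counts may differ slightly from the output quoted above, which was produced with a different package:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X, species = iris.data, iris.target            # 150 x 4 measurements, 3 species

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cross-tabulate cluster labels against the true species; the cluster
# numbering is arbitrary, so the correspondence must be read off the table.
table = np.zeros((3, 3), dtype=int)
for c, s in zip(km.labels_, species):
    table[c, s] += 1
print(table)
```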

We have also performed linear discriminant analysis by considering types as the true groups.

Linear Method for Response: Type

Predictors: Sepal le Sepal wi Petal le Petal wi

Summary of Classification

Put into        ....True Group....
Group           1        2        3
1               49       2        0
2               1        48       0
3               0        0        50
Total N         50       50       50

Summary of Classification with Cross-validation

Put into        ....True Group....
Group           1        2        3
1               49       2        0
2               1        48       0
3               0        0        50
Total N         50       50       50
N Correct       49       48       50
Proportion      0.980    0.960    1.000

N \(=\) 150 N Correct \(=\) 147 Proportion Correct \(=\) 0.980

Squared Distance Between Groups

 

        1          2          3
1       0.000      17.201     179.385
2       17.201     0.000      89.864
3       179.385    89.864     0.000

Linear Discriminant Function for Group

 

            1          2          3
Constant    −103.27    −71.75     −85.21
Sepal le    12.45      15.70      23.54
Sepal wi    3.69       7.07       23.59
Petal le    12.77      5.21       −16.43
Petal wi    21.08      6.43       −17.40

Variable Pooled Means for Group

 

Variable    Mean      1         2         3
Sepal le    5.8433    6.5880    5.9360    5.0060
Sepal wi    3.0573    2.9740    2.7700    3.4280
Petal le    3.7580    5.5520    4.2600    1.4620
Petal wi    1.1993    2.0260    1.3260    0.2460

Variable Pooled StDev for Group

 

Variable    StDev     1         2         3
Sepal le    0.5148    0.6359    0.5162    0.3525
Sepal wi    0.3397    0.3225    0.3138    0.3791
Petal le    0.4303    0.5519    0.4699    0.1737
Petal wi    0.2047    0.2747    0.1978    0.1054

Pooled Covariance Matrix

            Sepal le   Sepal wi   Petal le   Petal wi
Sepal le    0.26501
Sepal wi    0.09272    0.11539
Petal le    0.16751    0.05524    0.18519
Petal wi    0.03840    0.03271    0.04267    0.04188

Here we see that only three observations are wrongly classified. The corresponding probabilities are given by

Observation   True Group   Pred Group   Group   Predicted Probability
71 **         2            1            1       0.75
                                        2       0.25
                                        3       0.00
84 **         2            1            1       0.86
                                        2       0.14
                                        3       0.00
134 **        1            2            1       0.27
                                        2       0.73
                                        3       0.00
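Posterior group probabilities of this kind can be obtained, for example, from scikit-learn’s linear discriminant analysis, as sketched below; note that observations are indexed from 0 and the species are coded differently from the output above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
lda = LinearDiscriminantAnalysis().fit(iris.data, iris.target)

post = lda.predict_proba(iris.data)        # posterior probability of each group
pred = lda.predict(iris.data)
wrong = np.where(pred != iris.target)[0]   # indices of misclassified observations
for i in wrong:
    print(i, iris.target[i], pred[i], np.round(post[i], 2))
```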

Example 8.5.2

The following data relate to a survey of environmental pollution levels. Four variables, namely ozone content, radiation, temperature, and wind speed, were observed in suitable units at 111 selected places. We have performed hierarchical clustering with Euclidean distance and single linkage. The data set as well as the cluster memberships are shown in the following table.

The summary of results and the dendrogram are given below the table. Considering a similarity level of 93, six clusters were found, of which three (clusters 4, 5, and 6, containing 2, 1, and 1 observations, respectively) may be omitted as outliers. Hence clusters 1, 2, and 3 are the main clusters. The table lists the radiation, temperature, wind speed, ozone content, and hierarchical cluster number for the 111 places.

Table 8.2 Results of hierarchical clustering for pollution data
Fig. 8.1 Dendrogram of pollution data

Number of main clusters: 3

 

          Number of      Within cluster    Average distance    Maximum distance
          observations   sum of squares    from centroid       from centroid
Cluster1  71             202337.219        48.851              101.003
Cluster2  12             5151.429          18.929              35.732
Cluster3  24             26269.208         30.505              58.654

Cluster Centroids

Variable    Cluster1    Cluster2    Cluster3    Grand centroid
Radiatio    240.7606    123.9167    46.6250     184.8018
Temperat    80.1831     73.5833     71.9167     77.7928
Wind spe    9.6577      10.2583     11.5292     9.9387
Ozone Co    49.2535     22.1667     17.7500     42.0991

The dendrogram of the pollution data is shown in Fig. 8.1. The centroids of the three main clusters are widely separated with respect to all the variables; the 24 places falling in cluster 3 may be considered the least polluted, whereas the 71 places falling in cluster 1 are the most polluted.
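The analysis of this example could be reproduced along the following lines; since the pollution data are not listed here, the array below is only a stand-in, and the conversion of the 93% similarity level into a distance threshold assumes the convention similarity = 100(1 - d/d_max):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# `pollution` is assumed to be a 111 x 4 array with columns radiation,
# temperature, wind speed and ozone content (Table 8.2); stand-in values here.
rng = np.random.default_rng(7)
pollution = rng.normal(size=(111, 4)) * np.array([90, 9, 3, 30]) + np.array([185, 78, 10, 42])

D = pdist(pollution, metric="euclidean")
Z = linkage(D, method="single")            # single linkage, Euclidean distance

# A similarity level of 93 corresponds to cutting the tree at a distance of
# (1 - 0.93) times the maximum pairwise distance, under the assumed scaling.
threshold = 0.07 * D.max()
labels = fcluster(Z, t=threshold, criterion="distance")
print(np.unique(labels).size, np.bincount(labels)[1:])   # number and sizes of clusters
```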