12.1 The Problem

So far, we have covered supervised learning algorithms for which the classes are predefined and the class of each instance of the training dataset is known in advance. The problem was to find a model that correctly classifies instances into their appropriate classes with a minimal cost (i.e., minimal error rate).

The problem in unsupervised learning is different: we have a dataset, but neither the classes nor the class of each instance in the training dataset is known in advance, i.e., the samples in the dataset are not labeled. For example, one might have data about residents of a city and want to cluster them geographically based on their assumed support for an election candidate, or a medical image and want to cluster its pixels into similar regions to perform image segmentation, or a dataset of houses’ characteristics (e.g., number of bedrooms, number of bathrooms, area, price, postal code) and want to cluster the houses by their value. Hospitals would be interested in uncovering clusters of patients who are high users of their services. Businesses might want to perform customer segmentation, i.e., cluster customers based on criteria such as their purchases and their website activities; the businesses can then recommend products to customers in the same clusters [1].

Our aim is to build a model that uncovers the clusters of data latent in the dataset based on feature similarities and classifies the instances into these clusters with a minimal error rate. How to label these clusters is the job of the data analyst.

12.2 A Practical Example

Let us consider the iris dataset and set aside our knowledge of the class attribute. The instances are now unlabeled, and we are facing a clustering problem. We would like to use K-means to cluster the data into K clusters/categories. In this problem, we know that there are three clusters (Iris setosa, Iris versicolor, and Iris virginica); however, K is usually suggested by a domain expert who understands the reality that the data describes.

The dataset contains the length and width of the sepal and petal of the three types of irises. The dataset is labeled (i.e., it has a class attribute); we can either delete the label or, if we are using Weka, ask the K-means algorithm to ignore it. Figure 12.1 shows how to delete the label in Weka, while Fig. 12.6 shows how to ignore it when executing K-means in Weka; as usual, in the lab, you will be using Python and R.

Fig. 12.1
A screenshot of the Weka window. The screenshot has sections for current relation and attributes. Arrows point to the class option under the attributes section, and a remove button below.

Remove the class attribute

We can start by exploring the dataset to gain an understanding of the data trends and the problem. Figure 12.2 shows the histograms for each attribute by class. The dataset contains 50 instances for each type of iris. The petallength and petalwidth histograms indicate that the petal length and width of one of the iris types (Iris setosa) are clearly distinct from those of the other two types; hence these two attributes will help us identify Iris setosa. Iris versicolor and Iris virginica have overlapping petal lengths and widths, and the sepal length and width do not provide enough information to distinguish a separate class, as all three types of flowers have overlapping values for these two attributes. We can already conclude that we will encounter clustering errors, especially between Iris versicolor and Iris virginica.

Fig. 12.2
Five histograms depict the sepal length, sepal width, petal length, petal width, and class.

Histograms for the different dataset features

When we explore the scatterplot matrix for the dataset features (Fig. 12.3), each plot shows one feature on the X-axis (i.e., the columns) against another feature on the Y-axis (i.e., the rows). Again, we notice that Iris setosa is separated from the other two species in each of the scatterplots. Iris versicolor and Iris virginica have many of their instances intermixed and are hard to distinguish in all the scatterplots (e.g., sepal width vs. sepal length).

Fig. 12.3
A screenshot of the Weka explorer. The visualize tab is selected on the top panel. Scatter plots show class, petal width, petal length, sepal width, and sepal length plotted against one another.

Scatterplots for the dataset features

Given these observations, let us apply the K-means algorithm to the dataset and explore the result (Fig. 12.4). Click on the algorithm’s name to open the list of its parameters; since we know that we are seeking to detect three clusters, we will change the numClusters (i.e., K) parameter to 3 (Fig. 12.5); the default distance is Euclidean, which fits our problem (i.e., lengths and widths). Finally, we choose to ignore the class attribute, as the label is not part of the features that we would use in a clustering problem (Fig. 12.6), and we run the algorithm.

Fig. 12.4
A screenshot of the dialog box with the cluster tab selected. Under the clusters and Weka folders section, simpleKmeans is selected. 2 arrows point to cluster and simpleKmeans.

Choose the SimpleKMeans algorithm from the Cluster tab in Weka

Fig. 12.5
A screenshot of Weka’s simpleKmeans dialog box. It has an about section with text that reads “cluster data using the k means algorithm.” Arrows point to the distance function and numClusters fields.

Set the numClusters (i.e., K) parameter of the K-means algorithm to 3

Fig. 12.6
A screenshot of the Weka Explorer window. The cluster tab is selected at the top. It has a section clusterer, cluster mode, and result list. Arrows point to use training set and ignore attributes.

Choose to ignore the class label before executing K-means
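
For readers who prefer to follow along in Python rather than Weka, a minimal scikit-learn sketch of the same run (K = 3, Euclidean distance, class label left out) might look as follows; the use of scikit-learn’s bundled copy of the iris data is an assumption:

# A minimal sketch, not the Weka run itself; scikit-learn's KMeans uses Euclidean distance.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data  # the four measurements only; the class label is left out

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # K = 3
kmeans.fit(X)
print(kmeans.cluster_centers_)  # final centroid positions
print(kmeans.inertia_)          # within-cluster sum of squared errors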

The resulting window (Fig. 12.7) displays important information. We can notice that the algorithm converged after three iterations. The clusters are numbered from 0 to K – 1, and Weka displays the initial starting points and the final centroid positions of the clusters (e.g., Cluster 0: 6.1, 2.9, 4.7, 1.4). The within-cluster sum of squared errors that we are trying to minimize is 6.99, and the numbers of instances assigned to the three clusters are 61, 50, and 39. Since we know the labels in this particular example, we know that some instances were assigned to the wrong cluster, but this was already expected given the intermixing of instances that we noticed during the data visualization phase.

Fig. 12.7
A screenshot of the clusterer output window. Arrows point to texts within-cluster sum of squared errors, the starting point of clusters 0, 1 and 2, final cluster attributes, and clustered instance.

K-means clustering result

Since the dataset contains the labels, we can ask K-means to consider the labels instead of ignoring them to let us know how many instances were assigned to the wrong cluster. In Weka, we can do so by choosing the “classes to clusters evaluation” and selecting the class from the list of features (Fig. 12.8). Clicking on Start this time shows an extra output (Fig. 12.9), the incorrectly clustered instances. We notice that the K-means algorithm incorrectly clustered 17 instances: 14 Iris virginica were clustered with the Iris versicolor, and three Iris versicolor were clustered with the Iris virginica; the 50 instances of Iris setosa were all correctly clustered. As expected, the errors were related to the distinction between the Iris versicolor and Iris virginica species.

Fig. 12.8
A screenshot of the clusterer dialog box. Arrows point to the drop-down option (Nom) class under classes to clusters evaluation, and to the start button under the cluster mode section.

Analyzing the incorrectly clustered instances

Fig. 12.9
A screenshot of the clusterer output window with the starting point of clusters and final cluster centroids. Arrows point to the texts that read Iris setosa, and incorrectly clustered instances.

Seventeen instances were incorrectly clustered
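
Python has no direct equivalent of Weka’s “classes to clusters evaluation” mode, but a similar check can be sketched by cross-tabulating the cluster assignments against the known labels (reusing the kmeans model fitted in the sketch above; for learning purposes only, since a real clustering task has no labels):

import pandas as pd

# Rows are the clusters found by K-means, columns are the true species.
print(pd.crosstab(kmeans.labels_, iris.target,
                  rownames=["cluster"], colnames=["true class"]))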

We can always visualize the cluster assignments by right-clicking on the name of the algorithm in the Result list and clicking on “Visualize Cluster Assignments” (Fig. 12.10). A new window opens and displays a scatterplot of the data on the X- and Y-axes. To display a scatterplot showing the clusters on the Y-axis, click on the Y dropdown list and choose the Cluster attribute. The scatterplot shows us how the instances were assigned: we can clearly see that cluster 1 is distinct, that 14 instances are assigned to cluster 0 while they clearly belong with cluster 2, and that three instances are assigned to cluster 2 while they seem to belong with cluster 0 (Fig. 12.11).

Fig. 12.10
A right-click menu for the result list with options such as view in main window, load model, etcetera. An arrow points to visualize cluster assignments.

Visualize cluster assignments

Fig. 12.11
A screenshot of the scatterplot of the clustered iris data. An arrow points to the Y: Cluster selection at the top.

Cluster visualization
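
The same kind of visualization can be sketched in Python with matplotlib (the choice of petal length and petal width as the two axes is an assumption; any pair of features works):

import matplotlib.pyplot as plt

# Color the points by their cluster assignment and overlay the centroids.
plt.scatter(X[:, 2], X[:, 3], c=kmeans.labels_, s=20)
plt.scatter(kmeans.cluster_centers_[:, 2], kmeans.cluster_centers_[:, 3],
            c="red", marker="x", s=100, label="centroids")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()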

We can check the cluster to which each instance was assigned by using a filter called AddCluster (Fig. 12.12).

Fig. 12.12
A Weka Explorer window, with the Preprocess tab indicated by an arrow. Under the Filter section, AddCluster is selected and indicated by an arrow.

AddCluster filter

We can click on the filter to choose the clustering algorithm and set its parameters. In our case, we would like to set K to 3 (Fig. 12.13).

Fig. 12.13
A window depicts the simpleKmeans parameters; an arrow points to the numClusters field on the right side.

The K-means parameters in the AddCluster filter

When we apply the filter, a new attribute called “cluster” is created; it indicates the cluster associated with each instance. To display the data, it is enough to click on the Edit button (Fig. 12.14).

Fig. 12.14
A screenshot of the Weka Explorer window. Preprocess tab is selected on the top panel. Arrows point to the edit button on the top, apply button under the filter section, and weight 50.0 of row 2 in a table.

Display dataset values
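
In Python, the effect of the AddCluster filter can be approximated by appending the cluster assignments as a new column of a data frame; a minimal sketch, reusing the fitted kmeans model from above:

import pandas as pd

data = pd.DataFrame(X, columns=iris.feature_names)
data["cluster"] = kmeans.labels_   # the new "cluster" attribute
print(data.head())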

12.3 The Algorithm

The K-means algorithm involves the following steps:

  • Specify the number of clusters K.

  • Initialize the centroids by randomly selecting K samples from the training dataset.

  • Assign samples to the clusters based on the closest centroid. The closeness is determined using a distance (e.g., Euclidean, Manhattan).

  • Update the centroid for each cluster by recalculating the center point for each cluster. The latter is the mean of the cluster’s samples, hence the name K-means.

  • Repeat assigning samples and updating centroids until model convergence, i.e., there is little change in the centroids, or a certain number of iterations is completed.

After convergence, the model is represented by the centroids. Each new sample is assigned to the closest centroid; that is, each new sample is assigned to one of the clusters 1 to K [2].

We can represent the dataset with a matrix of N instances with n features, the instances being the rows and the features, the columns of the matrix.

$$ X=\left[\begin{array}{ccc}{x}_1^{(1)}& \cdots & {x}_n^{(1)}\\ {}\vdots & \ddots & \vdots \\ {}{x}_1^{(N)}& \cdots & {x}_n^{(N)}\end{array}\right] $$

Each instance is a vector \( {x}^{(i)}={\left[{x}_1^{(i)}\ {x}_2^{(i)}\dots {x}_n^{(i)}\right]}^T \), and \( X={\left[{x}^{(1)}\ {x}^{(2)}\dots {x}^{(N)}\right]}^T \).

The K-means algorithm computes the distance between each vector x(i) and the centroid μk of each cluster k, k = 1, …, K. If Nk is the number of instances in cluster k, then the centroid μk is calculated as follows:

$$ {\mu}_k=\frac{1}{N_k}\sum \limits_{i=1}^{N_k}{x}^{(i)} $$

Using the Euclidean distance (note that other distances can be used) to measure similarities between instances [3], the distance between an instance x(i) and a centroid μk can be computed as follows:

$$ {d}_{ik}=\sqrt{\sum \limits_{j=1}^n\ {\left({x}_j^{(i)}-{\mu}_{kj}\right)}^2\ } $$

If we use the matrix notation, we can rewrite the same as follows:

$$ {d}_{ik}={\left({\left[{x}^{(i)}-{\mu}_k\right]}^T\left[{x}^{(i)}-{\mu}_k\right]\right)}^{1/2}=\left\Vert {x}^{(i)}-{\mu}_k\right\Vert $$

Each x(i) is considered as belonging to cluster k if it is closest to that cluster’s centroid and hence satisfies the following condition:

$$ \left\Vert {x}^{(i)}-{\mu}_k\right\Vert <\left\Vert {x}^{(i)}-{\mu}_j\right\Vert; j=1,\dots, K,j\ne k $$

The algorithm stops when the change in the number of instances belonging to the clusters is minimal (less than a certain constant ε) or when the change in the clusters’ centers’ locations is minimal; the stopping criterion is

either

$$ \sum \limits_{k=1}^K\left|{N}_k^{t+1}-{N}_k^t\right|<\varepsilon $$

or

$$ \sum \limits_{k=1}^K\left\Vert {\mu}_k^{t+1}-{\mu}_k^t\right\Vert <\varepsilon $$

The algorithm can be summarized as follows (a minimal Python sketch is given after the summary):

  1. 1.

    Initialize K and the centroids \( {\mu}_1^{\left({t}_0\right)}\ \mathrm{to}\ {\mu}_K^{\left({t}_0\right)} \), and set the time t = 1.

  2. 2.

    Classify the N instances according to the nearest \( {\mu}_k^{\left(t-1\right)} \):

    $$ \left\Vert {x}^{(i)}-{\mu}_k^{\left(t-1\right)}\right\Vert <\left\Vert {x}^{(i)}-{\mu}_j^{\left(t-1\right)}\right\Vert; j=1,\dots, K,j\ne k $$
  3. 3.

    For each cluster k, k = 1, …, K, containing \( {N}_k^{\left(t-1\right)} \) instances, recompute:

    $$ {\mu}_k=\frac{1}{N_k^{\left(t-1\right)}}\sum \limits_{i=1}^{N_k^{\left(t-1\right)}}{x}^{(i)} $$
  4. 4.

    Stop when the stopping criteria are satisfied; otherwise, increment t: t = t + 1 and repeat from step 2.

  5. 5.

    Return the result \( {\mu}_1^{(t)}\ \mathrm{to}\ {\mu}_K^{(t)} \).
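
To make the summary above concrete, here is a minimal NumPy sketch of the algorithm (random initialization from the data, Euclidean distance, and a centroid-movement stopping criterion; an illustration rather than a production implementation):

import numpy as np

def kmeans_fit(X, K, max_iter=100, eps=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centroids with K randomly chosen instances.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for t in range(max_iter):
        # Step 2: assign each instance to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (N, K)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the instances assigned to it.
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        # Step 4: stop when the centroids barely move; otherwise repeat from step 2.
        if np.abs(new_mu - mu).sum() < eps:
            mu = new_mu
            break
        mu = new_mu
    return mu, labels

Calling kmeans_fit(X, 3) on the iris features, for example, returns the three final centroids and the cluster label of each instance.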

12.4 Inertia

Once the learning is done (using .fit() in Python), the algorithm can always assign a new instance to the closest centroid (using .predict()), which is called hard clustering.

As an alternative to hard clustering, we can perform soft clustering, which assigns scores to the new instance; each score corresponds to the distance of the new instance to one of the centroids (.transform() in Python). In this way, each data instance can be represented by its distances to the K centroids; if K is lower than the dataset’s dimensionality, the result of clustering is a reduction of the data dimensions, so K-means can also be used for dimensionality reduction.

K-means is sensitive to the initial centroids and can provide different solutions if you start with different initial centroids. So how do we choose the best of many solutions? What is the cost function to minimize? Since this is unsupervised learning, we do not have labels to measure the performance of the algorithm; what we saw in the Weka example above was for learning purposes only. However, there is a performance measure for K-means called inertia: the within-cluster sum of squares, i.e., the sum of squared distances of each instance to its centroid. Inertia measures the internal coherence of the clusters; a lower inertia means a better clustering solution. We can run K-means multiple times (in Python, this is controlled by the KMeans n_init hyperparameter), and K-means will use inertia to keep the best clustering solution.
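
The calls mentioned above can be illustrated with a short, hedged scikit-learn sketch (X stands for any numeric feature matrix, e.g., the iris features used in Sect. 12.2):

from sklearn.cluster import KMeans

# n_init controls how many times K-means is run with different initial centroids;
# the run with the lowest inertia is kept.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.inertia_)          # within-cluster sum of squared distances (lower is better)
print(km.predict(X[:5]))    # hard clustering: one cluster index per instance
print(km.transform(X[:5]))  # soft clustering: distance of each instance to each centroid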

12.5 Minibatch K-Means

The original K-means algorithm was proposed by Stuart Lloyd in 1957 [4]. To accelerate the execution of K-means, a faster algorithm was proposed in 2003 by Charles Elkan [5]; both versions can be selected through the KMeans algorithm hyperparameter in Python, the default being Lloyd’s. An even faster version of K-means was proposed in 2010 by David Sculley [6]; it uses mini-batches of the dataset instead of the full dataset at each iteration and is implemented under the name MiniBatchKMeans in Python.
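
A hedged sketch of both options in scikit-learn (hyperparameter names as in recent scikit-learn versions; X is again any numeric feature matrix):

from sklearn.cluster import KMeans, MiniBatchKMeans

# Elkan's accelerated variant of Lloyd's algorithm (same result, often faster on dense data).
km_elkan = KMeans(n_clusters=3, algorithm="elkan", n_init=10, random_state=0).fit(X)

# Sculley's mini-batch variant: each iteration uses only a small random batch of instances.
mb_km = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=10, random_state=0).fit(X)

print(km_elkan.inertia_, mb_km.inertia_)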

12.6 Final Notes: Advantages, Disadvantages, and Best Practices

K-means models are widely used because of their simplicity. They scale easily, generalize to different data sizes and shapes, and adapt easily to new scenarios. As convergence is key for accuracy and optimal decisions, it is worth noting that K-means has shown good convergence behavior. Choosing the number of centroids (K) is a challenge, and we will see a method to find the optimal K in Sect. 12.10 below.

Like any ML model, K-means has its own disadvantages, including the manual choice of the K value and the dependence on the initial centroids. One of the key issues of K-means is the impact of outliers on cluster generation; if the data is not well preprocessed, outliers can end up forming their own clusters. Lastly, highly dimensional data can give rise to the curse of dimensionality, thus affecting the convergence of the model. Using feature engineering techniques can mitigate this issue.

When working with K-means, consider trying several values of K, as well as several clustering algorithms (e.g., K-means, DBSCAN), and test which setup provides better results. You may also want to take special care to avoid overfitting; in particular, you may need to increase the value of K so that noisy instances end up in their own clusters.
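
Because no labels are available, such comparisons usually rely on an internal metric; here is a hedged sketch using the silhouette score (also mentioned in the lab) to compare a few K values and a DBSCAN run:

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print("K-means, K =", k, "silhouette =", silhouette_score(X, labels))

# DBSCAN needs no K, but its eps and min_samples hyperparameters must be tuned instead.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
if len(set(db_labels)) > 1:                       # silhouette needs at least two groups
    print("DBSCAN silhouette =", silhouette_score(X, db_labels))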

12.7 Key Terms

  1. 1.

K-means

  2. 2.

    Clustering

  3. 3.

    Cluster

  4. 4.

    Centroids

  5. 5.

    Matrix

  6. 6.

    Segmentation

  7. 7.

    Image segmentation

  8. 8.

    Customer segmentation

  9. 9.

    Hard clustering

  10. 10.

    Soft clustering

12.8 Test Your Understanding

  1. 1.

    Give examples of situations from different fields where we can use clustering.

  2. 2.

    How do you decide which cluster a data instance belongs to?

  3. 3.

    How do you decide the centroid of each cluster?

  4. 4.

    How do you determine the centroids of the first clusters?

  5. 5.

    How do you determine the number of clusters K?

  6. 6.

    What is the stopping criterion for K-means?

  7. 7.

    There is a method called “elbow” that allows us to compute the optimum number of clusters in a dataset. Explore this method; you will be using it in the lab.

  8. 8.

    Cite some of the hyperparameters of K-means.

  9. 9.

    What is inertia in K-Means? What is it used for?

12.9 Read More

  1. 1.

    Aris, T. A., Nasir, A. S. A., & Mohamed, Z. (2021). A Robust Segmentation of Malaria Parasites Detection using Fast K-Means and Enhanced K-Means Clustering Algorithms. 2021 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 128–133. doi: 10.1109/ICSIPA52582.2021.9576799

  2. 2.

    Armstrong, J. J., Zhu, M., Hirdes, J. P., & Stolee, P. (2012). K-means cluster analysis of rehabilitation service users in the Home Health Care System of Ontario: examining the heterogeneity of a complex geriatric population. Arch Phys Med Rehabil, 93(12), 2198–2205. doi: 10.1016/j.apmr.2012.05.026

  3. 3.

    Fahim, A. (2021). K and starting means for K-means algorithm. Journal of Computational Science, 55(Complete). doi: 10.1016/j.jocs.2021.101445

  4. 4.

    Gesicho, M. B., Were, M. C., & Babic, A. (2021). Evaluating performance of health care facilities at meeting HIV-indicator reporting requirements in Kenya: an application of K-means clustering algorithm. BMC Med Inform Decis Mak, 21(1), 6. doi: 10.1186/s12911-020-01367-9

  5. 5.

    Grant, R. W., McCloskey, J., Hatfield, M., Uratsu, C., Ralston, J. D., Bayliss, E., & Kennedy, C. J. (2020). Use of Latent Class Analysis and K-Means Clustering to Identify Complex Patient Profiles. JAMA Netw Open, 3(12), e2029068. doi: 10.1001/jamanetworkopen.2020.29068

  6. 6.

    Jabi, M., Pedersoli, M., Mitiche, A., & Ayed, I. B. (2021). Deep Clustering: On the Link Between Discriminative Models and K-Means. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6), 1887–1896. doi: 10.1109/TPAMI.2019.2962683

  7. 7.

    Khan, A. R., Khan, S., Harouni, M., Abbasi, R., Iqbal, S., & Mehmood, Z. (2021). Brain tumor segmentation using K-means clustering and deep learning with synthetic data augmentation for classification. Microsc Res Tech, 84(7), 1389–1399. doi: 10.1002/jemt.23694

  8. 8.

    Kwedlo, W., & Lubowicz, M. (2021). Accelerated K-Means Algorithms for Low-Dimensional Data on Parallel Shared-Memory Systems. IEEE Access, 9, 74286–74301. doi: 10.1109/ACCESS.2021.3080821

  9. 9.

    Li, Y., Zhang, Y., Tang, Q., Huang, W., Jiang, Y., & Xia, S.-T. (2021). t-k-means: A ROBUST AND STABLE K-means VARIANT. ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3120–3124. doi: 10.1109/ICASSP39728.2021.9414687

  10. 10.

    Sinaga, K. P., Hussain, I., & Yang, M.-S. (2021). Entropy K-Means Clustering With Feature Reduction Under Unknown Number of Clusters. IEEE Access, 9, 67736–67751. doi: 10.1109/ACCESS.2021.3077622

  11. 11.

    Tavallali, P., Tavallali, P., & Singhal, M. (2021). K-means tree: an optimal clustering tree for unsupervised learning. The Journal of Supercomputing, 77(5), 5239–5266. doi: 10.1007/s11227-020-03436-2

  12. 12.

    Yuan, R. (2021). An improved K-means clustering algorithm for global earthquake catalogs and earthquake magnitude prediction. Journal of Seismology, 25(3), 1005–1020. doi: 10.1007/s10950-021-09999-8

  13. 13.

    Zukotynski, K., Black, S. E., Kuo, P. H., Bhan, A., Adamo, S., Scott, C. J. M., . . . Gaudet, V. (2021). Exploratory Assessment of K-means Clustering to Classify 18F-Flutemetamol Brain PET as Positive or Negative. Clin Nucl Med, 46(8), 616–620. doi: 10.1097/rlu.0000000000003668

12.10 Lab

12.10.1 Working Example in Python

Download the dataset using the following link: https://www.kaggle.com/datasets/uciml/adult-census-income

This dataset includes demographics and an income level (above 50K, or less than or equal to 50K). It includes the following columns:

  • Age: person’s age

  • Workclass: work class type the person belongs to

  • Fnlwgt: final weight estimate for the specified socioeconomic characteristics of the population

  • Education: education level

  • Education.num: education level as a number

  • Marital.status: person’s marital status

  • Occupation: person’s occupation

  • Relationship: person’s relationship status

  • Race: person’s race

  • Sex: person’s gender

  • Capital.gain: an increase in the person’s profit

  • Capital.loss: a decrease in the person’s profit

  • Hours.per.week: number of working hours per week for the person

  • Native.country: the person’s original country

  • Income: the person’s salary

12.10.1.1 Load Person’s Demographics

After downloading the adult census dataset, load the dataset (Fig. 12.15).

Fig. 12.15
A code snippet that loads the person’s census income data; it refers to workclass, unknown, and occupation, and includes a comment about dropping two irrelevant features.

Load persons’ census income dataset into pandas
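
A minimal sketch of this loading step (the file name adult.csv is an assumption about where the Kaggle download was saved):

import pandas as pd

df = pd.read_csv("adult.csv")   # assumed local name of the downloaded file
print(df.shape)
df.head()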

12.10.1.2 Data Visualization and Cleaning

You can explore the data visually; we will content ourselves with one plot (Fig. 12.16).

Fig. 12.16
A code snippet for data visualization and a bar chart that plots frequency versus a person’s income; many people earn less than 50K.

Visualizing persons’ census income data in histogram
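
The bar chart of Fig. 12.16 can be reproduced with something like the following sketch:

import matplotlib.pyplot as plt

df["income"].value_counts().plot(kind="bar")
plt.xlabel("person's income")
plt.ylabel("frequency")
plt.show()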

12.10.1.3 Data Preprocessing

The “workclass” and “occupation” features contain question marks (“?”); we will start by replacing the “?” with the word “unknown.” The “native.country” feature is irrelevant for our work, and we will drop it altogether (Fig. 12.17).

Fig. 12.17
An algorithm to replace question marks with unknown and to drop native country, and a dataset table with a total of 22 columns.

Replacing “?” with “unknown” and dropping the “native.country”
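
A hedged sketch of this cleaning step:

# Replace the "?" placeholders and drop the irrelevant column.
df["workclass"] = df["workclass"].replace("?", "unknown")
df["occupation"] = df["occupation"].replace("?", "unknown")
df = df.drop(columns=["native.country"])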

Many features (workclass, marital.status, occupation, relationship, race, sex), as well as the target (income), are categorical and hence need one-hot encoding. A one-hot-encoded feature is split into several new “dummy” features. If we were working with a linear regression algorithm, keeping all of the dummies would be a problem, because the value of any one of them can be deduced from the values of all the others (if all the other dummies are zero, then the remaining one must be 1, and if any of the others is 1, then it must be zero); in other words, each dummy is perfectly correlated with the rest, so one dummy per feature is usually dropped. With K-means, we do not need to do so. We show how to split the categorical features into new features in Fig. 12.18.

Fig. 12.18
An algorithm to split the categorical features into new features. It also lists the 22 data columns with non-null count and data type.

One-hot encoding workclass categorical variable
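
One way to split the categorical features into dummy columns is pandas’ get_dummies; since K-means does not require dropping one dummy per feature, the default (keep all dummies) is used in this sketch:

categorical = ["workclass", "marital.status", "occupation", "relationship", "race", "sex"]
df = pd.get_dummies(df, columns=categorical)   # one new 0/1 column per category value
df.info()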

The education feature is ordinal in nature, so we will change the string values to numeric ordered values (Fig. 12.19).

Fig. 12.19
An algorithm to change the string values to numeric ordered values with the help of the education feature and replace keywords.

Mapping ordinal categorical values into numeric values
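
A hedged sketch of the mapping (the category strings and their exact order are assumptions for illustration; the dataset also ships an equivalent education.num column):

# Map the ordinal education levels to increasing integers (order assumed for illustration).
education_order = {"Preschool": 1, "1st-4th": 2, "5th-6th": 3, "7th-8th": 4, "9th": 5,
                   "10th": 6, "11th": 7, "12th": 8, "HS-grad": 9, "Some-college": 10,
                   "Assoc-voc": 11, "Assoc-acdm": 12, "Bachelors": 13, "Masters": 14,
                   "Prof-school": 15, "Doctorate": 16}
df["education"] = df["education"].replace(education_order)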

12.10.1.4 Choosing Features and Scaling Data

The feature and target vectors are prepared, and the data is normalized (check how the clustering is affected if you do not normalize the data) (Fig. 12.20).

Fig. 12.20
An algorithm to analyze the data and check if the clustering can be affected or not by scaling the data.

Preparing the data for analysis
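
A sketch of the preparation and scaling step (the income column is set aside only so the clusters can be inspected later; the choice of StandardScaler is an assumption):

from sklearn.preprocessing import StandardScaler

y = df["income"]                    # kept aside only to inspect the clusters afterwards
x = df.drop(columns=["income"])     # features used for clustering

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)  # compare the clustering with and without this step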

12.10.1.5 Finding the Best K for the K-Means Model

To find the best K for the K-means model, we will proceed manually using a method called the “elbow method,” and we will also use the usual grid search cross-validation method.

The elbow method consists of plotting the inertia cost for each candidate K (Fig. 12.21). We can notice that the cost declines significantly at the beginning, from K = 1 to K = 2; afterwards, the decline is less pronounced. It seems that K = 2 is the optimal number of clusters. This indeed reflects the actual data, where we have two groups: people who earn more than 50K and those who earn 50K or less.

Fig. 12.21
A code snippet and a line graph for optimizing K using the elbow method, plotting cost versus the number of clusters.

Finding the optimal K using elbow method
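
A sketch of the elbow computation (x_scaled is the scaled feature matrix from the previous step; the range of K values is an assumption):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x_scaled)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("cost (inertia)")
plt.title("Optimizing K using the elbow method")
plt.show()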

Using grid search cross-validation, we can also find the K for the optimal model (Fig. 12.22). It is important to note that the GridSearch varied the K parameter between 2 and 15 and resulted in K = 2 for the optimal K-Means.

Fig. 12.22
A code snippet that optimizes K-means hyperparameters using grid search cross-validation; it includes the line optimal_model = grid_cv.best_estimator_.

Optimizing K-Means hyperparameters using grid search cross-validation
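
A hedged sketch of the grid search (KMeans exposes a score method, the negative inertia, which GridSearchCV uses when no scorer is supplied; the cross-validation setting is an assumption):

from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

param_grid = {"n_clusters": list(range(2, 16))}   # K varied between 2 and 15
grid_cv = GridSearchCV(KMeans(n_init=10, random_state=0), param_grid, cv=3)
grid_cv.fit(x_scaled)

optimal_model = grid_cv.best_estimator_
print(grid_cv.best_params_)                       # the text above reports K = 2 for this dataset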

Then you can predict the clusters of the feature vector x using the optimal model found by GridSearch. As mentioned above, .predict() performs hard clustering, and you can display for each data instance the cluster to which it is assigned. In contrast, .transform() performs soft clustering and, in our case, assigns to each data point two scores representing the distances between the instance and the two centroids. Finally, .labels_ and .cluster_centers_ allow you to display the label of each point as well as the coordinates of the cluster centers (Fig. 12.23).

Fig. 12.23
A code snippet with a comment about finding the best K from GridSearchCV; it displays the labels of each point as well as the coordinates of the cluster centers.

Using the optimal K-means model found by grid search cross-validation

You can think of plotting the datapoints and the centers. You can try that in Sect. 12.10.2.

12.10.2 Do It Yourself

12.10.2.1 The Iris Dataset Revisited

In this section, create K-means models with different values of K (try K = 2, 5, 10, 15, 20, and 25) on the iris dataset.

You can download the Iris dataset using the following code

import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

Note that the variable “iris” is a dictionary-like object (a Bunch), not a data frame; you should by now know how to convert it to a data frame, but that is not necessary. To learn more about the other available datasets, visit the following link: https://scikit-learn.org/stable/datasets.html.

You can always view the feature names using “feature_names” and the target names using “target_names.”

iris.feature_names
iris.target_names

If you want to convert the data to a data frame, you can write the following code:

df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['class'])
df.info()   # display the data frame information
df.head()   # display the first few rows

Note: other than the elbow method explained above, there is the silhouette score, which allows you to uncover the best K value. Read about silhouette_score and use it to choose the best K for the iris dataset.

12.10.2.2 K-Means for Dimension Reduction

Download the digits dataset using the following code

from sklearn.datasets import load_digits

x, y = load_digits(return_X_y=True)

  1. 1.

    Phase 1: basic multiclass classification using logistic regression

    1. (a)

      Split the data into training and testing dataset.

    2. (b)

      Use logistic regression for multiclass classification (i.e., multi_class=“ovr”) and consider max_iter=5000.

    3. (c)

      Fit the model to the training dataset.

    4. (d)

      Compute the algorithm score using the testing dataset.

  2. 2.

    Phase 2: seek enhancement using K-Means for preprocessing

    1. (a)

      Create a pipeline for K-means followed by logistic regression. Since the digits are handwritten and can be written in many different ways, choose a number of clusters for K-means much larger than 10; try 50. The logistic regression parameters are the same as in phase 1.

    2. (b)

      Fit the pipeline on the training dataset.

    3. (c)

      Compute the pipeline score on the testing dataset.

    4. (d)

      Compared to the previous score, was this one better or worse? Discuss the possible reasons behind the new score.

  3. 3.

    Phase 3: search for an optimal K-Means and logistic regression.

    1. (a)

      Find through GridSearch the optimal model for the pipeline. Fine-tune only one hyperparameter for K-means: the number of clusters; vary it between 2 and 100. Note that the execution might take around 20 min depending on your computer configuration.

    2. (b)

      Fit the pipeline and check the new score.

    3. (c)

      Compared to the previous score, was this one better or worse? Discuss the possible reasons behind the new score.

    4. (d)

      What was the best parameter found by GridSearch?

12.10.3 Do More Yourself

Create K-means models to solve the problems presented by the following datasets: