1 Introduction

During the last decades, data mining has emerged as a rapidly growing interdisciplinary field that merges databases, statistics, machine learning and other related areas in order to extract useful knowledge from data (Han and Kamber 2001). Clustering is one of the fundamental tasks in data mining and machine learning; it aims at partitioning a set of data objects into multiple clusters so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. Dissimilarities and similarities between objects are assessed based on the attribute values describing the objects and often involve distance measures. Clustering has been widely applied in a variety of fields (Tellaroli et al. 2016; Kushwaha and Pant 2018), ranging from medical sciences, economics, computer science, and engineering to social sciences and earth sciences. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize clustering methods for a large number of applications (Berkhin 2002; Fahad 2014; Xu and Wunsch 2005; Xu and Tian 2015). However, clustering massive data sets with categorical and mixed-type data is still a challenge in many applications of big data mining.

Additionally, according to LeCun et al. (2015), unsupervised learning will become far more important in the longer term, because labeling data is costly and time consuming, and sometimes impossible, especially in the context of big data. As an important branch of unsupervised learning, clustering has recently reemerged as an active research topic in big data mining. In particular, it can be considered an important tool for analyzing the massive volume of data generated by modern applications: it groups big data into a compact format that is still an informative version of the entire data (Fahad 2014). Applying clustering techniques to big data is difficult, however, because of the new challenges that big data raises (Shirkhorshidi et al. 2014). In particular, big data mining adds to clustering the complications of very large datasets with many attributes of different types (Berkhin 2002).

Typically, in clustering, objects can be considered as vectors in an n-dimensional space, where n is the number of features. When objects are described by numerical features, similarity between objects can be defined based on distance measures such as the Euclidean or Manhattan distance. However, such distance measures are not applicable to categorical data, whose values come from domains such as gender or location. Clustering data with categorical attributes has gained considerable attention in the research community (Ganti et al. 1999; Gibson et al. 2000; Guha et al. 1998, 2000; Huang 1997, 1998; Sumangali and Aswani Kumar 2019). For categorical data, the simple matching measure, based on equality comparison of categorical values, is often used to define similarity between objects (Huang 1997). However, this metric does not distinguish between the different values taken by an attribute, since it only measures the equality between pairs of values, as argued in Ienco et al. (2009).

The classical k-means algorithm (MacQueen 1967) is probably the most popular and widely used clustering technique. Numerous k-means-like algorithms have also been proposed in the literature, each of which typically uses a different similarity measure (Kogan et al. 2005). The k-means-like algorithms are easy to implement, have time complexity linear in the size of the data, and almost surely converge to local optima (Selim and Ismail 1984). Consequently, they are very efficient for handling large data sets. However, working only on numerical data prevents them from being used for clustering categorical data.

So far, numerous attempts have been made to overcome the numerical-only limitation of the k-means algorithm and make it applicable to categorical data clustering, such as the k-modes algorithm (Huang 1998), the k-representative algorithm (San et al. 2004), and the k-centers algorithm (Chen and Wang 2013). In particular, the k-modes algorithm (Huang 1998) uses the simple matching similarity measure to compute the distance between categorical objects, and uses “modes” instead of means as cluster centers. The mode of a cluster is a data point in which the value of each attribute is the most frequent value of that attribute within the cluster. Furthermore, Huang (1998) also combined the k-modes algorithm with the k-means algorithm in order to deal with mixed numerical and categorical databases. It is worth noting, however, that in the k-modes algorithm a cluster can have more than one mode, and therefore the performance of the k-modes algorithm depends strongly on the selection of modes during the clustering process. This observation motivated San et al. (2004) to introduce a new notion of “cluster centers” called representatives for categorical objects and to develop the so-called k-representative algorithm for clustering categorical data. In particular, the representative of a cluster is defined by making use of the distributions of the categorical values appearing in the cluster. The dissimilarity between a categorical object and the representative of a cluster is then defined based on the relative frequencies of categorical values within the cluster and the simple matching measure between categorical values. In this way, the k-representative algorithm is formulated in a fashion similar to the k-means algorithm. In fact, it has been shown that the k-representative algorithm is very effective for clustering categorical data (Ng et al. 2007). More recently, Chen and Wang (2013) proposed a new kernel density based method for defining cluster centers in central clustering of categorical data, and developed the so-called k-centers algorithm, which incorporates this formulation of cluster centers and an attribute weight calculation scheme. Their experimental results showed that the k-centers algorithm performs well, especially for the task of recognizing biological concepts in DNA sequences.

While the above-mentioned k-means based algorithms use a clustering procedure similar to that of the k-means algorithm, they differ in the way they define the “cluster center” or the “similarity measure” for categorical data. In this paper, we propose a new extension of the k-means algorithm for clustering categorical data. In particular, to measure dissimilarity between categorical objects, we make use of the information theoretic definition of similarity proposed in Lin (1998), which is intuitively defined based on the amount of information contained in the statement of commonality between values in the domain set of a categorical attribute. On the other hand, the definition of cluster centers is generalized using kernel-based density estimates for categorical clusters, as similarly considered in Chen and Wang (2013), instead of the frequency estimates originally used in San et al. (2004). We then develop a new clustering method by incorporating a feature weighting scheme that automatically measures the contribution of individual attributes to the formation of the clusters.

The rest of this paper is organized as follows. Section 2 briefly recalls the k-means algorithm and then presents its existing extensions for clustering categorical data. The proposed method is discussed in Sect. 3, and the experimental results are presented in Sect. 4. Finally, Sect. 5 wraps up the paper with some conclusions.

2 k-Means algorithm and its extensions for categorical data

Assume that DB is a data set consisting of N objects, each of which is characterized by D attributes with finite domains \(O_1,\ldots ,O_D\), respectively. That is, each object in DB is represented by a tuple in \(O_1\times \cdots \times O_D\), and the dth attribute takes \(|O_d|(> 1)\) discrete values. In addition, the categories in \(O_d\) will be denoted by \(o_{dl}\), for \(l=1,\ldots ,\vert O_d\vert \), and each data object in DB will be denoted by X, with subscript if necessary, which is represented as a tuple \(X=(x_1,\ldots ,x_D) \in O_1\times \cdots \times O_D\). Let \({\mathcal {C}}=\{C_1,\ldots ,C_k\}\) be the set of k clusters of DB, i.e. we have

$$\begin{aligned} C_{j} \cap C_{j'} = \emptyset {\text { if }} j\not = j' {\text { and }} DB = \bigcup _{j=1}^k C_j \end{aligned}$$

Regarding the clustering problem discussed in this paper, we consider two types of data: numerical and categorical. The domain of a numerical attribute consists of continuous real values, so a distance measure such as the Euclidean or Manhattan distance can be used. A domain \(O_d\) is defined as categorical if it is finite and unordered, so that only a comparison operation is allowed in \(O_d\); that is, for any \(x,y \in O_d\), we have either \(x=y\) or \(x \ne y\).

2.1 k-Means algorithm

The k-means algorithm (MacQueen 1967) is one of the most popular partitional (non-hierarchical) clustering algorithms. Given a set DB of N numerical data objects, a natural number \(k \le N\), and a distance measure \({\text {dis}}(\cdot ,\cdot )\), the k-means algorithm searches for a partition of DB into k non-empty disjoint clusters that minimizes the overall sum of the squared distances between data objects and their cluster centers. Mathematically, the problem can be formulated as the following optimization problem:

Minimize

$$\begin{aligned} {P(U,{\mathcal {V}})=\sum _{j=1}^{k}\sum _{i=1}^{N}u_{i,j}{\text {dis}}(X_i,V_j)} \end{aligned}$$
(1)

subject to

$$\begin{aligned}&\sum _{j=1}^{k} u_{i,j}=1, \quad 1\le i \le N, \nonumber \\&u_{i,j} \in \{0,1\}, \quad 1 \le i \le N, \quad 1 \le j \le k, \end{aligned}$$
(2)

where \(U = [u_{i,j}]_{N\times k}\) is a partition matrix (\(u_{i,j}\) takes value 1 if object \(X_i\) is in cluster \(C_j\), and 0 otherwise), \({\mathcal {V}} =\{V_1,\dots ,V_k\}\) is a set of cluster centers, and \({\text {dis}}(\cdot ,\cdot )\) is the squared Euclidean distance between two objects.

The problem P can be solved by iteratively solving two problems:

  • Fix \({\mathcal {V}} = \hat{{\mathcal {V}}}\) then solve the reduced problem \(P(U,\hat{{\mathcal {V}}})\) to find \({{\hat{U}}}\).

  • Fix \(U={{\hat{U}}}\) then solve the reduced problem \(P({{\hat{U}}}, {\mathcal {V}})\) to find \(\hat{{\mathcal {V}}}\).

Basically, the k-means algorithm iterates through a three-step process until \(P(U, {\mathcal {V}})\) converges to some local minimum:

  1. Select an initial \({\mathcal {V}}^{(0)} = \{V_1^{(0)},\dots , V_k^{(0)}\}\), and set \(t=0\).

  2. Keep \({\mathcal {V}}^{(t)}\) fixed and solve \(P(U, {{\mathcal {V}}^{(t)}})\) to obtain \(U^{(t)}\). That is, having the cluster centers, we assign each object to the cluster of its nearest cluster center.

  3. Keep \(U^{(t)}\) fixed and generate \({\mathcal {V}}^{(t+1)}\) such that \(P(U^{(t)}, {\mathcal {V}}^{(t+1)})\) is minimized. That is, construct new cluster centers according to the current partition.

  4. In the case of convergence, or if a given stopping criterion is fulfilled, output the result and stop. Otherwise, set \(t=t+1\) and go to step 2.

In numerical clustering problems, the Euclidean norm is often chosen as a natural distance measure in the k-means algorithm. With this distance measure, the partition matrix in step 2 is updated according to (3) below, and each cluster center is computed as the mean of the objects in its cluster.

$$\begin{aligned}&{\text {if }} {\text {dis}}(X_i, V_j) \le {\text {dis}}(X_i, V_p) {\text { then}} \nonumber \\&u_{i,j}=1, {\text {and }} u_{i,p}=0, \quad {\text {for }} 1 \le p \le k, p \ne j \end{aligned}$$
(3)
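As a concrete illustration of the alternating procedure above, the following minimal NumPy sketch implements the standard k-means iteration; the function and variable names are ours, and the random initialization is only one of many possible choices.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means; X is an (N, n) array of numerical objects."""
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centers at random from the data.
    V = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: fix the centers and assign each object to its nearest center, as in (3).
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean
        labels = d2.argmin(axis=1)
        # Step 3: fix the partition and recompute each center as its cluster mean.
        new_V = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else V[j]
                          for j in range(k)])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_V, V):
            break
        V = new_V
    return labels, V

X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 8.0], [8.2, 7.9]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```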

2.2 Extensions of k-means for categorical data

2.2.1 k-Modes algorithm

It was shown in Huang (1997) that the k-means method can be extended to categorical data by using a simple matching distance measure for categorical objects and using the most frequent values to define the “cluster centers”, called modes. Let \(X_1\) and \(X_2\) be two categorical objects in DB, with \(X_1 = (x_{11},\dots ,x_{1D})\) and \(X_2 = (x_{21},\dots ,x_{2D})\). The dissimilarity between \(X_1\) and \(X_2\) can be computed as the total number of mismatches between the corresponding attribute values of the two objects. Formally,

$$\begin{aligned} {\text {dis}}(X_1, X_2)=\sum _{d=1}^{D}\delta (x_{1d},x_{2d}) \end{aligned}$$
(4)

where

$$\begin{aligned} \delta (x_{1d},x_{2d}) = \left\{ \begin{array}{l l} 0 &{} {\text {if }} x_{1d} = x_{2d},\\ 1 &{} {\text {if }} x_{1d} \ne x_{2d}. \end{array} \right. \end{aligned}$$

Given a cluster \(\{X_1, \dots ,X_p\}\) of categorical objects, with \(X_i = (x_{i1},\dots ,x_{iD})\), \(1 \le i \le p\), its mode \(V = (o_1,\dots ,o_D)\) is defined by assigning to \(o_d\), \(1 \le d \le D\), the value that appears most frequently in \(\{x_{1d} , \dots , x_{pd}\}\). With these modifications, Huang (1998) developed the k-modes algorithm, which mimics the k-means method to cluster categorical data. However, as mentioned previously, the mode of a cluster is in general not unique. This makes the algorithm unstable, since its result depends on the selection of modes during the clustering process.
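These two ingredients, the simple matching distance of (4) and the mode of a cluster, can be sketched as follows; the function names and the toy cluster are ours, and ties between equally frequent categories are broken arbitrarily, which is precisely the source of the non-uniqueness discussed above.

```python
from collections import Counter

def matching_distance(x1, x2):
    """Simple matching dissimilarity (4): number of mismatching attributes."""
    return sum(1 for a, b in zip(x1, x2) if a != b)

def mode_of_cluster(cluster):
    """Mode of a cluster: per attribute, one most frequent category.
    Ties are broken arbitrarily, so the mode is not unique in general."""
    D = len(cluster[0])
    return tuple(Counter(x[d] for x in cluster).most_common(1)[0][0] for d in range(D))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of_cluster(cluster))                                 # ('red', 'small')
print(matching_distance(("red", "small"), ("blue", "small")))   # 1
```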

2.2.2 k-Representative algorithm

Instead of using modes for cluster centers as in Huang (1997), San et al. (2004) proposed the notion of representatives for clusters, defined as follows.

Again, let \(C=\{X_1,\ldots ,X_p\}\) be a cluster of categorical objects and

$$\begin{aligned} X_i=(x_{i1},\ldots ,x_{iD}), \quad 1\le i\le p. \end{aligned}$$

For each \(d=1,\ldots ,D\), let \(O^{C}_{d}\) denote the set formed from the categorical values \(x_{1d},\ldots ,x_{pd}\). Then the representative of C is defined by \(V_C = (v^C_{1}, \ldots , v^C_{D})\), with

$$\begin{aligned} v^C_d=\{(o_{dl},f_C({o_{dl}}))\,\vert \, o_{dl}\in O^{C}_{d}\}, \end{aligned}$$
(5)

where \(f_C({o_{dl}})\) is the relative frequency of category \(o_{dl}\) within C,  i.e.

$$\begin{aligned} f_C({o_{dl}})=\frac{\#_C(o_{dl})}{p} \end{aligned}$$
(6)

where \(\#_C(o_{dl})\) is the number of objects in C having the category \(o_{dl}\) at the dth attribute. More formally, each \(v^C_d\) is a distribution on \(O^{C}_{d}\) defined by the relative frequencies of the categorical values appearing within the cluster.

Then, the dissimilarity between object \(X=(x_1,\ldots ,x_D)\) and representative \(V_C\) is defined based on the simple matching measure \(\delta \) by

$$\begin{aligned} {\text {dis}}(X,V_C)=\sum _{d=1}^{D}\sum _{o_{dl}\in O^{C}_{d}}f_C({o_{dl}})\cdot \delta (x_d,o_{dl}) \end{aligned}$$
(7)

As such, the dissimilarity \({\text {dis}}(X,V_C)\) depends mainly on the relative frequencies of the categorical values within the cluster and on the simple matching between categorical values.
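For illustration, the representative of (5)–(6) and the dissimilarity of (7) can be computed as in the following sketch; the function names and the toy cluster are ours, and a full k-representative run would embed these two steps in the k-means-style loop of Sect. 2.1.

```python
from collections import Counter

def representative(cluster):
    """Cluster representative (5)-(6): per attribute, the relative
    frequencies of the categories appearing in the cluster."""
    p, D = len(cluster), len(cluster[0])
    return [{o: c / p for o, c in Counter(x[d] for x in cluster).items()}
            for d in range(D)]

def dis_to_representative(x, rep):
    """Dissimilarity between object x and a representative (7): per attribute,
    the total relative frequency of the categories that do not match x_d."""
    return sum(f for d in range(len(x))
               for o, f in rep[d].items() if o != x[d])

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
rep = representative(cluster)
print(rep)                                            # [{'red': 0.67, 'blue': 0.33}, ...] (approx.)
print(dis_to_representative(("red", "small"), rep))   # 1/3 + 1/3 = 2/3
```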

2.2.3 k-Centers algorithm

More generally, Chen and Wang (2013) have recently proposed a generalized definition for centers of categorical clusters as follows. The center of a cluster \(C_j\) is defined as

$$\begin{aligned} V_j = [{\varvec{\nu }}_{j1}, \ldots , {\varvec{\nu }}_{jD}] \end{aligned}$$
(8)

in which the dth element \({\varvec{\nu }}_{jd}\) is a probability distribution on \(O_d\) estimated by a kernel density estimation method (Aitchison and Aitken 1976). More particularly, let \(X_d\) denote a random variable associated with the observations \(x_{id}\), for \(i=1,\ldots , \vert C_j\vert \), appearing in \(C_j\) at the dth attribute, and \(p(X_d)\) its probability density. Let \(O_{jd}\) be the set formed from the categorical values \(\{x_{id}\}_{i=1}^{\vert C_j\vert }\). Then the kernel density based estimate of \(p(X_d)\), denoted by \({\hat{p}}(X_d,\lambda _j\vert C_j)\), is of the following form (see, e.g., Titterington 1980):

$$\begin{aligned} {\hat{p}}(X_d,\lambda _j\vert C_j) = \sum _{o_{dl}\in O_{jd}} f_j(o_{dl})K(X_d,o_{dl}\vert \lambda _j) \end{aligned}$$
(9)

where \(K(\cdot ,o_{dl}\vert \lambda _j)\) is a so-called kernel function, \(\lambda _j\in [0,1]\) is a smoothing parameter called the bandwidth, and \(f_j\) is the frequency estimator for \(C_j\), i.e.

$$\begin{aligned} f_j(o_{dl})=\frac{\#_j(o_{dl})}{|C_j|} \end{aligned}$$
(10)

with \(\#_j(o_{dl})\) being the number of \(o_{dl}\) appearing in \(C_j\). Note that another equivalent form of (9) was used in Chen and Wang (2013) for defining a kernel density estimate of \(p(X_d)\).

Also, Chen and Wang (2013) used a variation of Aitchison and Aitken’s kernel function (Aitchison and Aitken 1976) defined by

$$\begin{aligned} K(X_d,o_{dl}\vert \lambda _j)=\left\{ \begin{array}{ll} 1-\frac{\vert O_d\vert -1}{\vert O_d\vert }\lambda _j&{} \text{ if } X_d = o_{dl}\\ \frac{1}{\vert O_d\vert }\lambda _j &{} \text{ if } X_d \not = o_{dl} \end{array}\right. \end{aligned}$$
(11)

to derive the estimate \({\hat{p}}(X_d,\lambda _j\vert C_j)\), which is then used to define \({\varvec{\nu }}_{jd}\).

It is worth noting here that the kernel function \(K(X_d,o_{dl}\vert \lambda _j)\) is defined in terms of the cardinality of the whole domain \(O_d\) but not in terms of the cardinality of the subdomain \(O_{jd}\) of the given cluster \(C_j\).

From (9)–(11), it easily follows that \({\varvec{\nu }}_{jd}\) can be represented as

$$\begin{aligned} {\varvec{\nu }}_{jd} = [ P_{jd}(o_{d1}), \dots , P_{jd}(o_{dl}), \dots , P_{jd}(o_{d|O_d|}) ] \end{aligned}$$

where

$$\begin{aligned} P_{jd}(o_{dl})=\lambda _j\frac{1}{|O_d|} + (1-\lambda _j)f_j(o_{dl}) \end{aligned}$$
(12)

and \(\lambda _j \in [0,1]\) is the bandwidth for \(C_j\).

When \(\lambda _j = 0\), the center degenerates to the pure frequency estimator, which is the estimator originally used in the k-representative algorithm to define the center of a categorical cluster.

To measure the dissimilarity between a data object and its center, each data object \(X_i\) is represented by a set of vectors \(\{y_{id}\}^D_{d=1}\), with

$$\begin{aligned} y_{id} = [ I(x_{id} = o_{d1}),\dots ,I(x_{id} = o_{dl}),\dots ,I(x_{id} = o_{d|O_d|}) ] \end{aligned}$$

Here \(I(\cdot )\) is an indicator function whose value is either 1 or 0, indicating whether \(x_{id}\) is the same as \(o_{dl} \in O_d\) or not. The dissimilarity on the dth dimension is then measured by

$$\begin{aligned} {\text {dis}}_d(X_i, V_j)=||y_{id}-{\varvec{\nu }}_{jd}||_2 \end{aligned}$$
(13)

We can see that k-centers calculates the dissimilarities between objects and cluster centers in a different way, but the comparison of two categorical values is still based on the simple matching method (represented by the indicator function \(I(\cdot )\)). The remainder of the k-centers algorithm mimics the k-means algorithm.
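A minimal sketch of these two ingredients of the k-centers algorithm, the smoothed center of (12) and the per-dimension distance of (13), is given below; the function names, the toy domain and the fixed bandwidth are ours (in Chen and Wang (2013) the bandwidth \(\lambda _j\) is estimated from the data).

```python
import numpy as np
from collections import Counter

def smoothed_center_dim(cluster_vals, domain, lam):
    """Kernel-smoothed center on one attribute (12):
    P_jd(o) = lam / |O_d| + (1 - lam) * f_j(o), over the whole domain O_d."""
    n = len(cluster_vals)
    freq = Counter(cluster_vals)
    return np.array([lam / len(domain) + (1 - lam) * freq.get(o, 0) / n
                     for o in domain])

def dis_dim(x_val, center_dim, domain):
    """Per-dimension dissimilarity (13): Euclidean norm between the
    one-hot encoding of x_d and the smoothed center distribution."""
    y = np.array([1.0 if x_val == o else 0.0 for o in domain])
    return np.linalg.norm(y - center_dim)

domain = ["red", "blue", "green"]        # the whole attribute domain O_d
cluster_vals = ["red", "red", "blue"]    # values of the d-th attribute within C_j
nu_jd = smoothed_center_dim(cluster_vals, domain, lam=0.3)
print(nu_jd, dis_dim("red", nu_jd, domain))
```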

3 The proposed clustering method

Basically, the clustering method proposed in this paper follows a general procedure for clustering categorical data (Fig. 1) outlined as follows.

  1. Based on kernel smoothing techniques (Aitchison and Aitken 1976; Titterington 1980), we will develop a kernel-based method for representing cluster centers of categorical objects. Such a kernel-based approach can provide an interpretation of cluster centers that is consistent with the statistical interpretation of the cluster means for numerical data.

  2. Taking the underlying distribution of categorical attributes into consideration, we will define an information theoretic-based measure of dissimilarity for categorical data. This dissimilarity measure will be further extended for computing the distance between categorical objects and cluster centers.

  3. Under the above considerations, we can formulate the problem of clustering categorical data in a fashion similar to partitioning-based clustering. We also investigate incorporating the weighting scheme introduced in Huang et al. (2005) into partitioning-based clustering for categorical and mixed data.

Fig. 1 The general procedure for clustering categorical data

In particular, we now introduce a new extension of the k-means clustering for categorical data by combining a kernel-based estimation method of cluster centers and an information theoretic based dissimilarity measure for categorical objects.

3.1 Representation of cluster centers

As in the k-centers algorithm (Chen and Wang 2013), for each cluster \(C_j\) we define the center of \(C_j\) as

$$\begin{aligned} V_j = [{\varvec{\nu }}_{j1}, \ldots , {\varvec{\nu }}_{jD}] \end{aligned}$$
(14)

where \({\varvec{\nu }}_{jd}\) is a probability distribution on \(O_d\) estimated by a kernel density estimation method.

As our aim is to derive a kernel density based estimate \({\hat{p}}(X_d,\lambda _j\vert C_j)\) for the dth attribute of cluster \(C_j\), instead of directly using Chen and Wang’s kernel function as above, we use another version of the kernel function defined in terms of the cardinality of the subdomain \(O_{jd}\) as follows.

For any \(o_{dl}\in O_d\), if \(o_{dl}\in O_{jd}\) then we define

$$\begin{aligned} K(X_d,o_{dl}\vert \lambda _j)=\left\{ \begin{array}{ll} 1-\frac{\vert O_{jd}\vert -1}{\vert O_{jd}\vert }\lambda _j&{} \text{ if } X_d = o_{dl}\\ \frac{1}{\vert O_{jd}\vert }\lambda _j &{} \text{ if } X_d \not = o_{dl} \end{array}\right. \end{aligned}$$
(15)

otherwise, i.e. \(o_{dl}\not \in O_{jd}\), we let \(K(X_d,o_{dl}\vert \lambda _j)=0.\) Then, from (9), (10) and (15) it easily follows that \({\varvec{\nu }}_{jd}\) can be obtained as

$$\begin{aligned} {\varvec{\nu }}_{jd} = \left[ P_{jd}(o_{d1}), \dots , P_{jd}(o_{dl}), \dots , P_{jd}(o_{d|O_d|})\right] \end{aligned}$$
(16)

where

$$\begin{aligned} P_{jd}(o_{dl})=\left\{ \begin{array}{ll} \lambda _j\frac{1}{|O_{jd}|} + (1-\lambda _j)f_j(o_{dl}) &{} \text{ if } o_{dl}\in O_{jd}\\ 0 &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(17)

and \(\lambda _j \in [0,1]\) is the smoothing parameter for \(C_j\).

The parameter \(\lambda _j\) is selected using the least squares cross validation (LSCV) as done in Chen and Wang (2013), which is based on the principle of selecting a bandwidth that minimizes the total error of the resulting estimation over all the data objects. Specifically, the optimal \(\lambda _j^*\) is determined by the following equation:

$$\begin{aligned} \lambda _j^*= \frac{1}{\vert C_j\vert -1} \frac{\sum _{d=1}^{D}\left( 1-\sum _{o_{dl}\in O_{jd}}[f_j(o_{dl})]^2\right) }{{\sum _{d=1}^{D}}\left( \sum _{o_{dl}\in O_{jd}}[f_j(o_{dl})]^2- \frac{1}{|O_{jd}|}\right) } \end{aligned}$$
(18)
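As a sketch (with our own function names and a toy cluster), the center of (17) and the LSCV bandwidth of (18) can be computed as follows; the only deviation from the formulas is that the computed bandwidth is clipped to [0, 1], the range stated above for \(\lambda _j\).

```python
from collections import Counter

def center_dim(cluster_vals, domain, lam):
    """Proposed smoothed center on one attribute (17): the smoothing mass is
    spread only over the subdomain O_jd of categories observed in the cluster."""
    n = len(cluster_vals)
    freq = Counter(cluster_vals)
    return {o: (lam / len(freq) + (1 - lam) * freq[o] / n) if o in freq else 0.0
            for o in domain}

def lscv_bandwidth(cluster):
    """LSCV-optimal bandwidth for a cluster (18), clipped to [0, 1]."""
    p, D = len(cluster), len(cluster[0])
    num = den = 0.0
    for d in range(D):
        freq = Counter(x[d] for x in cluster)
        s = sum((c / p) ** 2 for c in freq.values())   # sum of squared frequencies
        num += 1.0 - s
        den += s - 1.0 / len(freq)                     # len(freq) = |O_jd|
    lam = num / ((p - 1) * den) if p > 1 and den != 0 else 0.0
    return min(max(lam, 0.0), 1.0)                     # lambda_j lies in [0, 1]

cluster = [("red", "small")] * 5 + [("red", "large")] * 2 + [("blue", "small")] * 3
lam = lscv_bandwidth(cluster)                          # about 0.32 for this cluster
print(lam, center_dim([x[0] for x in cluster], ["red", "blue", "green"], lam))
```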

3.2 An information theoretic-based dissimilarity measure

Instead of using the simple matching measure as in Huang (1997) and San et al. (2004), or the Euclidean norm as in Chen and Wang (2013), we first introduce a dissimilarity measure for categorical values that makes use of the information-theoretic definition of similarity proposed by Lin (1998). We then propose a new method for computing the distance between categorical objects and cluster centers, which combines the kernel density based definition of centers with this information-theoretic dissimilarity measure for categorical data.

In Lin (1998), the author developed an information-theoretic framework for similarity within which a formal definition of similarity can be derived from a set of underlying assumptions. Basically, Lin’s definition of similarity is stated in information theoretic terms, as quoted “the similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are.” Formally, the similarity between A and B is generally defined as

$$\begin{aligned} \mathrm {sim}(A,B) = \frac{\log P(\mathrm {common}(A,B))}{\log P(\mathrm {description}(A,B))} \end{aligned}$$
(19)

where P(s) is the probability of a statement s, and the information contained in s is measured by \([- \log P(s)]\). To show the universality of the information-theoretic definition of similarity, Lin (1998) also discussed it in different settings, including ordinal domain, string similarity, word similarity and semantic similarity.

Boriah et al. (2008) applied Lin’s framework to the categorical setting and proposed a similarity measure for categorical data as follows. Let DB be a data set consisting of objects defined over a set of D categorical attributes with finite domains denoted by \(O_1,\ldots ,O_D\), respectively. For each \(d=1,\ldots ,D\), the similarity between two categorical values \(o_{dl},o_{dl'}\in O_d\) is defined by

$$\begin{aligned} \mathrm {sim}_d(o_{dl},o_{dl'}) =\left\{ \begin{array}{ll} 2\log f_d(o_{dl}) &{} \text{ if } o_{dl}=o_{dl'}\\ 2\log (f_d(o_{dl})+f_d(o_{dl'})) &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(20)

where

$$\begin{aligned} f_d(x) = \frac{\#(x)}{\vert DB\vert } \end{aligned}$$

with \(\#(x)\) being the number of objects in DB having the category x at dth attribute. In fact, Boriah et al. (2008) also proposed another similarity measure derived from Lin’s framework and conducted an experimental evaluation of many different similarity measures for categorical data in the context of outlier detection.

It should be emphasized here that the similarity measure \(\mathrm {sim}_d(\cdot ,\cdot )\) does not satisfy Assumption 4 of Lin's framework (Lin 1998), which states that the similarity between a pair of identical objects is 1. In particular, the range of \(\mathrm {sim}_d(o_{dl},o_{dl'})\) for \(o_{dl}=o_{dl'}\) is \(\left[ -2\log \vert DB\vert ,0\right] \), with the minimum attained when \(o_{dl}\) occurs only once and the maximum attained when \(O_d=\{o_{dl}\}\). Similarly, the range of \(\mathrm {sim}_d(o_{dl},o_{dl'})\) for \(o_{dl}\not =o_{dl'}\) is \(\left[ -2 \log \frac{\vert DB\vert }{2}, 0\right] \), with the minimum attained when \(o_{dl}\) and \(o_{dl'}\) each occur only once, and the maximum attained when \(o_{dl}\) and \(o_{dl'}\) each occur \(\frac{\vert DB\vert }{2}\) times, as pointed out in Boriah et al. (2008).

Based on the general definition of similarity given in (19) and its application to similarity between ordinal values briefly discussed in Lin (1998), we introduce another similarity measure for categorical values as follows.

For any two categorical values \(o_{dl},o_{dl'}\in O_d\), their similarity, denoted by \(\mathrm {sim}^*_d(o_{dl},o_{dl'})\), is defined by

$$\begin{aligned} \mathrm {sim}^*_d(o_{dl},o_{dl'}) =\frac{2\log f_d(\{o_{dl},o_{dl'}\})}{\log f_d(o_{dl})+ \log f_d(o_{dl'})} \end{aligned}$$
(21)

where

$$\begin{aligned} f_d(\{o_{dl},o_{dl'}\}) = \frac{\#(\{o_{dl},o_{dl'}\})}{\vert DB\vert } \end{aligned}$$

with \(\#(\{o_{dl},o_{dl'}\})\) being the number of categorical objects in DB that take a value belonging to \(\{o_{dl},o_{dl'}\}\) at the dth attribute. Clearly, we have \(\mathrm {sim}^*_d(o_{dl},o_{dl'}) = 1\) if \(o_{dl}\) and \(o_{dl'}\) are identical, which satisfies Assumption 4 stated above.

Then, the dissimilarity measure between two categorical values \(o_{dl},o_{dl'}\in O_d\) can be defined by

$$\begin{aligned} {\text {dis}}^*_d(o_{dl},o_{dl'}) = 1 -\mathrm {sim}^*_d(o_{dl},o_{dl'}) =1- \frac{2\log f_d(\{o_{dl},o_{dl'}\})}{\log f_d(o_{dl})+ \log f_d(o_{dl'})} \end{aligned}$$
(22)

Let \(X_i = [x_{i1},x_{i2},\dots ,x_{iD}]\in DB\) and \(V_j = [{\varvec{\nu }}_{j1}, \ldots , \varvec{\nu }_{jD}]\) be the center of cluster \(C_j\). We are now able to extend the dissimilarity between categorical values of \(O_d\) to the dissimilarity on the dth attribute between \(X_i\) and \(V_j\), i.e. the dissimilarity between the dth component \(x_{id}\in O_d\) of \(X_i\) and the dth component \({\varvec{\nu }}_{jd}\) of the center \(V_j\), as follows. Without danger of confusion, we shall also use \({\text {dis}}^*_d\) to denote this dissimilarity and

$$\begin{aligned} {\text {dis}}^*_d(X_i,V_j)= \sum _{o_{dl} \in O_{jd}} P_{jd}(o_{dl}) {\text {dis}}^*_d(x_{id},o_{dl}) \end{aligned}$$
(23)
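The following sketch (function names and toy data are ours) illustrates (21)–(23): the frequencies \(f_d\) are estimated from the whole dataset DB, while the weights \(P_{jd}\) come from the kernel-smoothed center of the cluster and are passed in here as a plain dictionary.

```python
import math
from collections import Counter

def value_dissimilarity(o1, o2, db_column):
    """Information-theoretic dissimilarity between two categories (21)-(22).
    `db_column` holds the values of the d-th attribute over the whole dataset DB."""
    if o1 == o2:
        return 0.0                     # sim* = 1 for identical categories
    N = len(db_column)
    freq = Counter(db_column)
    f1, f2 = freq[o1] / N, freq[o2] / N
    sim = 2 * math.log(f1 + f2) / (math.log(f1) + math.log(f2))
    return 1.0 - sim

def object_center_dissimilarity_dim(x_d, center_dim, db_column):
    """Dissimilarity on one attribute between an object and a cluster center (23):
    the P_jd-weighted sum of the value dissimilarities over the subdomain O_jd."""
    return sum(p * value_dissimilarity(x_d, o, db_column)
               for o, p in center_dim.items() if p > 0)

db_column = ["red", "red", "blue", "green", "red", "blue"]   # d-th attribute of DB
center_dim = {"red": 0.6, "blue": 0.4, "green": 0.0}         # P_jd, e.g. from (17)
print(value_dissimilarity("red", "blue", db_column))          # about 0.80
print(object_center_dissimilarity_dim("red", center_dim, db_column))
```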

3.3 The clustering algorithm

With the modifications made above, we are now ready to formulate the problem of clustering categorical data in a way similar to k-means clustering. Adapting Huang's W-k-means algorithm (Huang et al. 2005), we also use a weighting vector \(W=[w_1, w_2, \dots ,w_D]\) for the D attributes and a parameter \(\beta \) for the attribute weights, where \(0 \le w_d \le 1\) and \(\sum _d w_d = 1\). The principle of attribute weighting is to assign a larger weight to an attribute that has a smaller sum of within-cluster distances and a smaller weight to an attribute that has a larger sum of within-cluster distances. More details of this weighting scheme can be found in Huang et al. (2005). The weighted dissimilarity between data object \(X_i\) and cluster center \(V_j\), denoted by \({\text {dis}}^*(X_i,V_j)\), is then defined by

$$\begin{aligned} {\text {dis}}^*(X_i,V_j)= \sum _{d=1}^{D} w_d^\beta {\text {dis}}^*_d(X_i,V_j)= \sum _{d=1}^{D} w_d^\beta \sum _{o_{dl} \in O_{jd}} P_{jd}(o_{dl}) {\text {dis}}^*_d(x_{id},o_{dl}) \end{aligned}$$
(24)

Based on these definitions, the clustering algorithm now aims to minimize the following objective function:

$$\begin{aligned} J(U, {\mathcal {V}}, W) = \sum _{j=1}^{k} \sum _{i=1}^{N} \sum _{d=1}^{D} u_{i,j}w_d^\beta {\text {dis}}^*_d(X_i,V_j) \end{aligned}$$
(25)

subject to

$$\begin{aligned} \begin{array}{lr} \sum _{j=1}^k u_{i,j} = 1, &{} 1 \le i \le N\\ u_{i,j} \in \{0,1\}, &{} 1 \le i \le N, 1 \le j \le k \\ \sum _{d=1}^D w_d = 1, &{} 0 \le w_d \le 1 \end{array} \end{aligned}$$

where \(U = [u_{i,j}]_{N\times k}\) is a partition matrix.

The proposed clustering algorithm is then formulated as below.

Algorithm 1 The proposed clustering algorithm
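To make the overall flow of Algorithm 1 concrete, the following Python sketch puts the pieces together under stated assumptions: the initialization, the handling of empty clusters, the stopping rule, and the attribute-weight update (taken here to be the W-k-means-style update commonly attributed to Huang et al. 2005) are our own choices and are not prescribed by the text above; all function names are ours, and readers should consult Huang et al. (2005) for the exact weighting scheme.

```python
import math
from collections import Counter

def column(data, d):
    """Values of the d-th attribute over a list of tuples."""
    return [x[d] for x in data]

def lscv_bandwidth(cluster):
    """Bandwidth of (18), clipped to [0, 1]."""
    p, D = len(cluster), len(cluster[0])
    num = den = 0.0
    for d in range(D):
        freq = Counter(column(cluster, d))
        s = sum((c / p) ** 2 for c in freq.values())
        num += 1.0 - s
        den += s - 1.0 / len(freq)
    lam = num / ((p - 1) * den) if p > 1 and den != 0 else 0.0
    return min(max(lam, 0.0), 1.0)

def center(cluster):
    """Cluster center of (17): one smoothed distribution per attribute,
    supported on the subdomain O_jd of observed categories."""
    lam, p = lscv_bandwidth(cluster), len(cluster)
    V = []
    for d in range(len(cluster[0])):
        freq = Counter(column(cluster, d))
        V.append({o: lam / len(freq) + (1 - lam) * c / p for o, c in freq.items()})
    return V

def dis_value(o1, o2, f):
    """(21)-(22); f maps each category of the attribute to its frequency in DB."""
    if o1 == o2:
        return 0.0
    return 1.0 - 2 * math.log(f[o1] + f[o2]) / (math.log(f[o1]) + math.log(f[o2]))

def dis_weighted(x, V, w, beta, freqs):
    """Weighted object-to-center dissimilarity of (24)."""
    return sum((w[d] ** beta) * sum(p * dis_value(x[d], o, freqs[d])
                                    for o, p in V[d].items())
               for d in range(len(x)))

def cluster_categorical(data, k, beta=8.0, max_iter=50):
    """A sketch of the overall iteration of Algorithm 1 (assumed structure)."""
    N, D = len(data), len(data[0])
    freqs = [{o: c / N for o, c in Counter(column(data, d)).items()} for d in range(D)]
    w = [1.0 / D] * D                    # equal initial attribute weights (assumption)
    labels = [i % k for i in range(N)]   # naive initial partition (assumption)
    for _ in range(max_iter):
        groups = [[x for x, l in zip(data, labels) if l == j] for j in range(k)]
        # Empty clusters fall back to the center of the whole dataset (assumption).
        centers = [center(g) if g else center(data) for g in groups]
        # Assignment step: nearest center under the weighted dissimilarity (24).
        new_labels = [min(range(k), key=lambda j: dis_weighted(x, centers[j], w, beta, freqs))
                      for x in data]
        # Weight update in the W-k-means style of Huang et al. (2005), assumed here:
        # attributes with a smaller within-cluster distance sum get a larger weight.
        Dd = [sum(sum(p * dis_value(x[d], o, freqs[d]) for o, p in centers[l][d].items())
                  for x, l in zip(data, new_labels)) for d in range(D)]
        w = [0.0 if Dd[d] == 0 else
             1.0 / sum((Dd[d] / Dd[t]) ** (1.0 / (beta - 1)) for t in range(D) if Dd[t] != 0)
             for d in range(D)]
        if new_labels == labels:         # stop when the partition no longer changes
            break
        labels = new_labels
    return labels

data = [("red", "small"), ("red", "large"), ("blue", "small"),
        ("green", "large"), ("green", "large"), ("blue", "small")]
print(cluster_categorical(data, k=2))
```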

4 Experimental studies

In this section, we report experiments conducted on ten datasets obtained from the UCI Machine Learning Repository (Blake and Merz 1998) to compare the clustering performance of k-modes, k-representatives, and the proposed clustering method (denoted by New). In addition, in order to assess the effectiveness and efficiency of the new dissimilarity measure and of our modification of the kernel-based concept of cluster centers, we also conduct experiments with modified versions of k-representatives and of the proposed clustering method. In the modified version of k-representatives (denoted by M-k-representatives), we simply replace the simple matching dissimilarity measure used in k-representatives with the information theoretic-based dissimilarity measure defined by (22). In the modified version of the proposed clustering method (denoted by M-k-Centers), we use the information theoretic-based dissimilarity measure defined by (22) in combination with the concept of cluster centers defined by Chen and Wang (2013). That is, M-k-Centers is the same as Algorithm 1 formulated above, but with the cluster centers updated by (12) instead of (17).

4.1 Datasets

The main characteristics of the datasets are summarized in Table 1. These datasets are chosen to test our algorithm because they are publicly available and all their attributes can be treated as categorical.

The Balance-Scale dataset was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The Breast Cancer dataset was obtained from the University of Wisconsin Hospitals. Each instance has one of two possible classes: benign or malignant. The Car Evaluation dataset was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, 1990). Each instance belongs to one of four classes: unacc, acc, good and vgood. The Lenses dataset concerns the fitting of contact lenses; all instances are classified into three classes: hard contact lenses, soft contact lenses, and no contact lenses. The Mushroom dataset consists of records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). It includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is identified as edible or poisonous.

The Nursery dataset was derived from a hierarchical decision model originally developed to rank applications for nursery schools. All instances in this dataset belong to five classes: not_recom, recommend, very_recom, priority, spec_prior. The Soybean (Large) dataset comes from the research of R. S. Michalski and R. L. Chilausky in 1980; it contains 307 instances with 35 attributes, classified into 19 classes. The Soybean (Small) dataset is a small subset of the original soybean database; it contains 47 instances classified into 4 classes. The Tictactoe dataset encodes the complete set of possible board configurations at the end of tic-tac-toe games, where “x” is assumed to have played first. It contains 958 instances that belong to two classes: positive and negative. The Tae dataset consists of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant assignments at the Statistics Department of the University of Wisconsin-Madison. All instances belong to three classes: low, medium and high.

Table 1 Categorical datasets for the experiment

4.2 Clustering quality evaluation

Evaluating clustering quality is often a hard and subjective task (Ienco et al. 2009). Generally, objective functions in clustering are purposely designed to achieve high intra-cluster similarity and low inter-cluster similarity. This can be viewed as an internal criterion for the quality of a clustering (Manning et al. 2008). However, as observed in the literature, good scores on an internal criterion do not necessarily translate into good effectiveness in an application. Here, in the same way as in Ienco et al. (2012), we use three external criteria to evaluate the results: purity, normalized mutual information (NMI) and adjusted Rand index (ARI). These criteria compare the original class information of each object with the cluster to which the object has been assigned, in order to evaluate how well the clustering result matches the original classes.

We denote by \(C = \{C_1, \dots , C_J \}\) the partition of the dataset built by the clustering algorithm, and by \(P = \{P_1, \dots , P_I \}\) the partition inferred by the original classification. J and I are respectively the number of clusters |C| and the number of classes |P|. We denote by N the total number of objects.

4.2.1 Purity metric

Purity is a simple and transparent evaluation measure. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned objects and dividing by the total number of objects in the dataset. High purity is easy to achieve when the number of clusters is large; thus, purity cannot be used to trade off the quality of the clustering against the number of clusters.

$$\begin{aligned} Purity(C,P)= \frac{1}{N} \sum _{j}\max _i |C_j \cap P_i| \end{aligned}$$
(26)

4.2.2 NMI metric

The second metric, NMI, provides a measure that is independent of the number of clusters (Strehl and Ghosh 2003). It takes its maximum value when the clustering partition completely matches the original partition. NMI is computed as the mutual information between the cluster partition and the class partition, normalized by the geometric mean of their entropies:

$$\begin{aligned} NMI(C,P)=\frac{\sum _{i=1}^{I} \sum _{j=1}^{J}|C_j \cap P_i| \log \frac{N |C_j \cap P_i|}{|C_j||P_i|}}{\sqrt{\sum _{j=1}^{J}|C_j| \log \frac{|C_j|}{N} \sum _{i=1}^{I}|P_i| \log \frac{|P_i|}{N}}} \end{aligned}$$
(27)

4.2.3 ARI metric

The third metric is the adjusted Rand index (Hubert and Arabie 1995). Let a be the number of object pairs that belong to the same cluster in C and to the same class in P. This metric captures the deviation of a from its expected value under the hypothesis that C and P are two random, independent partitions.

The expected value of a denoted by E[a] is computed as follows:

$$\begin{aligned} E[a]= \frac{\pi (C)\pi (P)}{N(N-1)/2} \end{aligned}$$
(28)

where \(\pi (C)\) and \(\pi (P)\) denote respectively the number of object pairs from the same clusters in C and from the same class in P. The maximum value for a is defined as:

$$\begin{aligned} \max (a)= \frac{1}{2} (\pi (C)+\pi (P)) \end{aligned}$$
(29)

The agreement between C and P can be estimated by the adjusted rand index as follows:

$$\begin{aligned} ARI(C, P) = \frac{a-E[a]}{\max (a)-E[a]} \end{aligned}$$
(30)

When \(ARI(C, P) = 1\), the two partitions are identical.
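For reference, the three criteria in (26)–(30) can be computed directly from the cluster labels and the class labels, as in the sketch below; the function and variable names are ours, and equivalent implementations are available in standard libraries.

```python
import math
from collections import Counter

def pairs(n):
    return n * (n - 1) // 2

def purity(clusters, classes):
    """(26): each cluster votes for its majority class."""
    N = len(clusters)
    contingency = Counter(zip(clusters, classes))
    return sum(max(c for (j, _), c in contingency.items() if j == cj)
               for cj in set(clusters)) / N

def nmi(clusters, classes):
    """(27): mutual information normalized by the geometric mean of entropies."""
    N = len(clusters)
    cj, pi = Counter(clusters), Counter(classes)
    joint = Counter(zip(clusters, classes))
    num = sum(n * math.log(N * n / (cj[j] * pi[i])) for (j, i), n in joint.items())
    den = math.sqrt(sum(n * math.log(n / N) for n in cj.values())
                    * sum(n * math.log(n / N) for n in pi.values()))
    return num / den

def ari(clusters, classes):
    """(28)-(30): adjusted Rand index."""
    N = len(clusters)
    a = sum(pairs(n) for n in Counter(zip(clusters, classes)).values())
    pi_c = sum(pairs(n) for n in Counter(clusters).values())
    pi_p = sum(pairs(n) for n in Counter(classes).values())
    expected = pi_c * pi_p / pairs(N)       # (28)
    max_a = 0.5 * (pi_c + pi_p)             # (29)
    return (a - expected) / (max_a - expected)

clusters = [0, 0, 1, 1, 1, 2]               # labels produced by a clustering algorithm
classes  = ['a', 'a', 'b', 'b', 'c', 'c']   # original class labels
print(purity(clusters, classes), nmi(clusters, classes), ari(clusters, classes))
```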

In many previous studies, only the purity metric has been used to analyze the performance of clustering algorithms. However, high purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each data object gets its own cluster. Besides, many different partitions can have the same purity while differing, for example, in the number of objects in each cluster and in which objects constitute the clusters. Therefore, we need the other two metrics to obtain an overall picture of how well our clustering results match the original classes.

4.3 Experimental results

The experiments were run on a Mac with a 3.66 GHz Intel quad-core processor and 8 GB of RAM, running Mac OS X 10.10. For each categorical dataset, we ran each algorithm 300 times. The parameter k was set equal to the number of classes in each dataset. The three evaluation metrics are reported as averages over the 300 runs. The weighting exponent \(\beta \) was set to 8, as experimentally recommended in Huang et al. (2005).

As we can see from Tables 2, 3 and 4, the proposed clustering method produces the best results on eight out of ten datasets in terms of purity and NMI, and on seven out of ten datasets in terms of the adjusted Rand index. Comparing the performance of k-representatives and M-k-representatives, we can see that the new dissimilarity measure yields better results, especially in NMI and the adjusted Rand index. Similarly, comparing the performance of M-k-Centers and the proposed algorithm shows that our kernel-based concept of cluster centers significantly improves the clustering performance over the kernel-based concept of cluster centers previously defined in Chen and Wang (2013). In conclusion, the new approach has been shown to enhance the performance of previously developed k-means like algorithms for clustering categorical data.

Table 2 Purity results of categorical dataset algorithms
Table 3 NMI results of categorical dataset algorithms
Table 4 Adjusted rand index results of categorical dataset algorithms

5 Conclusions

In this paper, we have introduced a new k-means like method for clustering categorical data based on an information theoretic based dissimilarity measure and a kernel density estimate-based concept of cluster centers for categorical objects. Several variations of the proposed clustering method have also been examined. The experimental results on real-world datasets from the UCI Machine Learning Repository have shown that the proposed clustering method outperforms the k-means like algorithms previously developed for clustering categorical data.

As future work, we plan to extend the proposed approach to the problem of clustering mixed numeric and categorical datasets, as well as to fuzzy clustering. Applications of the developed clustering methods in areas such as chemical compound analysis, DNA sequence analysis, social network analysis, and patient similarity for treatment regimen discovery are also subjects of our future studies.