1 Introduction

During the last decades, data mining has emerged as a rapidly growing interdisciplinary field that merges databases, statistics, machine learning and other related areas in order to extract useful knowledge from data (Han and Kamber 2001). Clustering is one of the fundamental tasks in data mining and machine learning; it aims at partitioning a set of data objects into multiple clusters so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. Dissimilarities and similarities between objects are assessed based on the attribute values describing the objects and often involve distance measures. Clustering has been widely applied in a variety of fields (Tellaroli et al. 2016; Kushwaha and Pant 2018), ranging from medical sciences, economics, computer science, and engineering to social sciences and earth sciences. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize clustering methods for a large number of applications (Berkhin 2002; Fahad 2014; Xu and Wunsch 2005; Xu and Tian 2015). However, clustering massive data sets with categorical and mixed-type data is still a challenge in many applications of big data mining.

Additionally, according to LeCun et al. (2015), unsupervised learning will become far more important in the longer term, because labeling data is costly and time consuming, and sometimes impossible, especially in the context of big data. As an important branch of unsupervised learning, clustering has recently reemerged as an active research topic in big data mining. In particular, it can be considered an important tool for analyzing the massive volume of data generated by modern applications: it groups big data into a compact format that is still an informative version of the entire data (Fahad 2014). Applying clustering techniques to big data is difficult, however, because of the new challenges that big data raises (Shirkhorshidi et al. 2014). In particular, big data mining adds to clustering the complications of very large datasets with many attributes of different types (Berkhin 2002).

Typically, in clustering, objects can be considered as vectors in an n-dimensional space, where n is the number of features. When objects are described by numerical features, similarity between objects can be defined based on distance measures such as the Euclidean or Manhattan distance. However, such distance measures are not applicable to categorical data, whose values come from domains such as gender or location. Clustering data with categorical attributes has gained considerable attention in the research community (Ganti et al. 1999; Gibson et al. 2000; Guha et al. 1998, 2000; Huang 1997, 1998; Sumangali and Aswani Kumar 2019). For categorical data, the simple matching measure, based on equality comparison of categorical values, is often used to define similarity between objects (Huang 1997). However, this metric does not distinguish between the different values taken by an attribute, since it only measures the equality between pairs of values, as argued in Ienco et al. (2009).

The classical k-means algorithm (MacQueen 1967) is probably the most popular and widely used clustering technique. Numerous k-means-like algorithms have also been proposed in the literature, each of which typically uses a different similarity measure (Kogan et al. 2005). The k-means-like algorithms are easy to implement, have time complexity linear in the size of the data, and almost surely converge to local optima (Selim and Ismail 1984). Consequently, they are very efficient for handling large data sets. However, working only on numerical data prevents them from being used for clustering categorical data.

So far, numerous attempts have been made to overcome the numerical-only limitation of the k-means algorithm and make it applicable to categorical data clustering, such as the k-modes algorithm (Huang 1998), the k-representative algorithm (San et al. 2004), and the k-centers algorithm (Chen and Wang 2013). In particular, the k-modes algorithm (Huang 1998) uses the simple matching similarity measure to compute the distance between categorical objects, and uses “modes” instead of means as cluster centers. The mode of a cluster is a data point in which the value of each attribute is the most frequent value of that attribute within the cluster. Furthermore, Huang (1998) also combined the k-modes algorithm with the k-means algorithm in order to deal with mixed numerical and categorical databases. It is worth noting, however, that in the k-modes algorithm a cluster can have more than one mode, and therefore the performance of the k-modes algorithm depends strongly on the selection of modes during the clustering process. This observation motivated San et al. (2004) to introduce a new notion of “cluster centers” called representatives for categorical objects and to develop the so-called k-representative algorithm for clustering categorical data. In particular, the representative of a cluster is defined by making use of the distributions of the categorical values appearing in the cluster. The dissimilarity between a categorical object and the representative of a cluster is then defined based on the relative frequencies of categorical values within the cluster and the simple matching measure between categorical values. In this way, the k-representative algorithm is formulated in a fashion similar to the k-means algorithm. In fact, it has been shown that the k-representative algorithm is very effective for clustering categorical data (Ng et al. 2007). More recently, Chen and Wang (2013) proposed a new kernel density based method for defining cluster centers in central clustering of categorical data, and developed the so-called k-centers algorithm, which incorporates this formulation of cluster centers and an attribute weight calculation scheme. Their experimental results showed that the k-centers algorithm performs well, especially for the task of recognizing biological concepts in DNA sequences.

While the above-mentioned k-means based algorithms use a clustering procedure similar to that of the k-means algorithm, they differ in the way they define the “cluster center” or the “similarity measure” for categorical data. In this paper, we propose a new extension of the k-means algorithm for clustering categorical data. In particular, to measure dissimilarity between categorical objects, we make use of the information theoretic definition of similarity proposed in Lin (1998), which is intuitively defined based on the amount of information contained in the statement of commonality between values in the domain set of a categorical attribute. On the other hand, the definition of cluster centers is generalized using kernel-based density estimates for categorical clusters, as similarly considered in Chen and Wang (2013), instead of the frequency estimates originally used in San et al. (2004). We then develop a new clustering method by incorporating a feature weighting scheme that automatically measures the contribution of individual attributes to the formation of the clusters.

The rest of this paper is organized as follows. Section 2 briefly recalls the k-means algorithm and then presents its existing extensions for clustering categorical data. The proposed method is discussed in Sect. 3, and the experimental results are presented in Sect. 4. Finally, Sect. 5 wraps up the paper with some conclusions.

2 k-Means algorithm and its extensions for categorical data

Assume that DB is a data set consisting of N objects, each of which is characterized by D attributes with finite domains \(O_1,\ldots ,O_D\), respectively. That is, each object in DB is represented by a tuple in \(O_1\times \cdots \times O_D\), and the dth attribute takes \(|O_d|(> 1)\) discrete values. In addition, the categories in \(O_d\) will be denoted by \(o_{dl}\), for \(l=1,\ldots ,\vert O_d\vert \), and each data object in DB will be denoted by X, with subscript if necessary, which is represented as a tuple \(X=(x_1,\ldots ,x_D) \in O_1\times \cdots \times O_D\). Let \({\mathcal {C}}=\{C_1,\ldots ,C_k\}\) be the set of k clusters of DB, i.e. we have

$$\begin{aligned} C_{j} \cap C_{j'} = \emptyset {\text { if }} j\not = j' {\text { and }} DB = \bigcup _{j=1}^k C_j \end{aligned}$$

Regarding the clustering problem discussed in this paper, we consider two types of data: numerical and categorical. The domain of a numerical attribute consists of continuous real values, so a distance measure such as the Euclidean or Manhattan distance can be used. A domain \(O_d\) is defined as categorical if it is finite and unordered, so that only a comparison operation is allowed in \(O_d\); that is, for any \(x,y \in O_d\), we have either \(x=y\) or \(x \ne y\).

2.1 k-Means algorithm

The k-means algorithm (MacQueen 1967) is one of the most popular partitional (non-hierarchical) clustering algorithms. Given a set DB of N numerical data objects, a natural number \(k \le N\), and a distance measure \({\text {dis}}(\cdot ,\cdot )\), the k-means algorithm searches for a partition of DB into k non-empty disjoint clusters that minimizes the overall sum of the squared distances between data objects and their cluster centers. Mathematically, the problem can be formulated as the following optimization problem:

Minimize

$$\begin{aligned} {P(U,{\mathcal {V}})=\sum _{j=1}^{k}\sum _{i=1}^{N}u_{i,j}{\text {dis}}(X_i,V_j)} \end{aligned}$$
(1)

subject to

$$\begin{aligned}&\sum _{j=1}^{k} u_{i,j}=1, \quad 1\le i \le N, \nonumber \\&u_{i,j} \in \{0,1\}, \quad 1 \le i \le N, \quad 1 \le j \le k, \end{aligned}$$
(2)

where \(U = [u_{i,j}]_{N\times k}\) is a partition matrix (\(u_{i,j}\) takes value 1 if object \(X_i\) is in cluster \(C_j\), and 0 otherwise), \({\mathcal {V}} =\{V_1,\dots ,V_k\}\) is a set of cluster centers, and \({\text {dis}}(\cdot ,\cdot )\) is the squared Euclidean distance between two objects.

The problem P can be solved by iteratively solving two problems:

  • Fix \({\mathcal {V}} = \hat{{\mathcal {V}}}\) then solve the reduced problem \(P(U,\hat{{\mathcal {V}}})\) to find \({{\hat{U}}}\).

  • Fix \(U={{\hat{U}}}\) then solve the reduced problem \(P({{\hat{U}}}, {\mathcal {V}})\) to find \(\hat{{\mathcal {V}}}\).

Basically, the k-means algorithm iterates through a three-step process until \(P(U, {\mathcal {V}})\) converges to some local minimum:

  1. Select an initial \({\mathcal {V}}^{(0)} = \{V_1^{(0)},\dots , V_k^{(0)}\}\), and set \(t=0\).

  2. Keep \({\mathcal {V}}^{(t)}\) fixed and solve \(P(U, {{\mathcal {V}}^{(t)}})\) to obtain \(U^{(t)}\). That is, having the cluster centers, we assign each object to the cluster of its nearest cluster center.

  3. Keep \(U^{(t)}\) fixed and generate \({\mathcal {V}}^{(t+1)}\) such that \(P(U^{(t)}, {\mathcal {V}}^{(t+1)})\) is minimized. That is, construct new cluster centers according to the current partition.

  4. In the case of convergence, or if a given stopping criterion is fulfilled, output the result and stop. Otherwise, set \(t=t+1\) and go to step 2.

In numerical clustering problems, the Euclidean norm is often chosen as a natural distance measure in the k-means algorithm. With this distance measure, the partition matrix in step 2 is updated according to (3) below, and each cluster center is computed as the mean of the objects in its cluster.

$$\begin{aligned}&{\text {if }} {\text {dis}}(X_i, V_j) \le {\text {dis}}(X_i, V_p) {\text { then}} \nonumber \\&u_{i,j}=1, {\text {and }} u_{i,p}=0, \quad {\text {for }} 1 \le p \le k, p \ne j \end{aligned}$$
(3)
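As a concrete illustration of the alternating procedure above, the following minimal NumPy sketch implements the standard k-means iteration; the function and variable names are ours, and the random initialization is only one of many possible choices.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means; X is an (N, n) array of numerical objects."""
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centers at random from the data.
    V = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: fix the centers and assign each object to its nearest center, as in (3).
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean
        labels = d2.argmin(axis=1)
        # Step 3: fix the partition and recompute each center as its cluster mean.
        new_V = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else V[j]
                          for j in range(k)])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_V, V):
            break
        V = new_V
    return labels, V

X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 8.0], [8.2, 7.9]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```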

2.2 Extensions of k-means for categorical data

2.2.1 k-Modes algorithm

It was shown in Huang (1997) that the k-means method can be extended to categorical data by using a simple matching distance measure for categorical objects and using the most frequent values to define the “cluster centers”, called modes. Let \(X_1\) and \(X_2\) be two categorical objects in DB, with \(X_1 = (x_{11},\dots ,x_{1D})\) and \(X_2 = (x_{21},\dots ,x_{2D})\). The dissimilarity between \(X_1\) and \(X_2\) can be computed as the total number of mismatches between the corresponding attribute values of the two objects. Formally,

$$\begin{aligned} {\text {dis}}(X_1, X_2)=\sum _{d=1}^{D}\delta (x_{1d},x_{2d}) \end{aligned}$$
(4)

where

$$\begin{aligned} \delta (x_{1d},x_{2d}) = \left\{ \begin{array}{l l} 0 &{} {\text {if }} x_{1d} = x_{2d},\\ 1 &{} {\text {if }} x_{1d} \ne x_{2d}. \end{array} \right. \end{aligned}$$

Given a cluster \(\{X_1, \dots ,X_p\}\) of categorical objects, with \(X_i = (x_{i1},\dots ,x_{iD})\), \(1 \le i \le p\), its mode \(V = (o_1,\dots ,o_D)\) is defined by assigning to \(o_d\), \(1 \le d \le D\), the value that appears most frequently in \(\{x_{1d} , \dots , x_{pd}\}\). With these modifications, Huang (1998) developed the k-modes algorithm, which mimics the k-means method to cluster categorical data. However, as mentioned previously, the mode of a cluster is in general not unique. This makes the algorithm unstable, since its result depends on the selection of modes during the clustering process.
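These two ingredients, the simple matching distance of (4) and the mode of a cluster, can be sketched as follows; the function names and the toy cluster are ours, and ties between equally frequent categories are broken arbitrarily, which is precisely the source of the non-uniqueness discussed above.

```python
from collections import Counter

def matching_distance(x1, x2):
    """Simple matching dissimilarity (4): number of mismatching attributes."""
    return sum(1 for a, b in zip(x1, x2) if a != b)

def mode_of_cluster(cluster):
    """Mode of a cluster: per attribute, one most frequent category.
    Ties are broken arbitrarily, so the mode is not unique in general."""
    D = len(cluster[0])
    return tuple(Counter(x[d] for x in cluster).most_common(1)[0][0] for d in range(D))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_of_cluster(cluster))                                 # ('red', 'small')
print(matching_distance(("red", "small"), ("blue", "small")))   # 1
```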

2.2.2 k-Representative algorithm

Instead of using modes for cluster centers as in Huang (1997), San et al. (2004) proposed the notion of representatives for clusters, defined as follows.

Again, let \(C=\{X_1,\ldots ,X_p\}\) be a cluster of categorical objects and

$$\begin{aligned} X_i=(x_{i1},\ldots ,x_{iD}), \quad 1\le i\le p. \end{aligned}$$

For each \(d=1,\ldots ,D\), let \(O^{C}_{d}\) denote the set formed from the categorical values \(x_{1d},\ldots ,x_{pd}\). Then the representative of C is defined by \(V_C = (v^C_{1}, \ldots , v^C_{D})\), with

$$\begin{aligned} v^C_d=\{(o_{dl},f_C({o_{dl}}))\,\vert \, o_{dl}\in O^{C}_{d}\}, \end{aligned}$$
(5)

where \(f_C({o_{dl}})\) is the relative frequency of category \(o_{dl}\) within C,  i.e.

$$\begin{aligned} f_C({o_{dl}})=\frac{\#_C(o_{dl})}{p} \end{aligned}$$
(6)

where \(\#_C(o_{dl})\) is the number of objects in C having the category \(o_{dl}\) at the dth attribute. More formally, each \(v^C_d\) is a distribution on \(O^{C}_{d}\) defined by the relative frequencies of the categorical values appearing within the cluster.

Then, the dissimilarity between object \(X=(x_1,\ldots ,x_D)\) and representative \(V_C\) is defined based on the simple matching measure \(\delta \) by

$$\begin{aligned} {\text {dis}}(X,V_C)=\sum _{d=1}^{D}\sum _{o_{dl}\in O^{C}_{d}}f_C({o_{dl}})\cdot \delta (x_d,o_{dl}) \end{aligned}$$
(7)

As such, the dissimilarity \({\text {dis}}(X,V_C)\) depends mainly on the relative frequencies of the categorical values within the cluster and on the simple matching between categorical values.
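For illustration, the representative of (5)–(6) and the dissimilarity of (7) can be computed as in the following sketch; the function names and the toy cluster are ours, and a full k-representative run would embed these two steps in the k-means-style loop of Sect. 2.1.

```python
from collections import Counter

def representative(cluster):
    """Cluster representative (5)-(6): per attribute, the relative
    frequencies of the categories appearing in the cluster."""
    p, D = len(cluster), len(cluster[0])
    return [{o: c / p for o, c in Counter(x[d] for x in cluster).items()}
            for d in range(D)]

def dis_to_representative(x, rep):
    """Dissimilarity between object x and a representative (7): per attribute,
    the total relative frequency of the categories that do not match x_d."""
    return sum(f for d in range(len(x))
               for o, f in rep[d].items() if o != x[d])

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
rep = representative(cluster)
print(rep)                                            # [{'red': 0.67, 'blue': 0.33}, ...] (approx.)
print(dis_to_representative(("red", "small"), rep))   # 1/3 + 1/3 = 2/3
```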

2.2.3 k-Centers algorithm

More generally, Chen and Wang (2013) have recently proposed a generalized definition for centers of categorical clusters as follows. The center of a cluster \(C_j\) is defined as

$$\begin{aligned} V_j = [{\varvec{\nu }}_{j1}, \ldots , {\varvec{\nu }}_{jD}] \end{aligned}$$
(8)

in which the dth element \({\varvec{\nu }}_{jd}\) is a probability distribution on \(O_d\) estimated by a kernel density estimation method (Aitchison and Aitken 1976). More particularly, let \(X_d\) denote a random variable associated with the observations \(x_{id}\), for \(i=1,\ldots , \vert C_j\vert \), appearing in \(C_j\) at the dth attribute, and \(p(X_d)\) its probability density. Let \(O_{jd}\) be the set formed from the categorical values \(\{x_{id}\}_{i=1}^{\vert C_j\vert }\). Then the kernel density based estimate of \(p(X_d)\), denoted by \({\hat{p}}(X_d,\lambda _j\vert C_j)\), is of the following form (see, e.g., Titterington 1980):

$$\begin{aligned} {\hat{p}}(X_d,\lambda _j\vert C_j) = \sum _{o_{dl}\in O_{jd}} f_j(o_{dl})K(X_d,o_{dl}\vert \lambda _j) \end{aligned}$$
(9)

where \(K(\cdot ,o_{dl}\vert \lambda _j)\) is a so-called kernel function, \(\lambda _j\in [0,1]\) is a smoothing parameter called the bandwidth, and \(f_j\) is the frequency estimator for \(C_j\), i.e.

$$\begin{aligned} f_j(o_{dl})=\frac{\#_j(o_{dl})}{|C_j|} \end{aligned}$$
(10)

with \(\#_j(o_{dl})\) being the number of \(o_{dl}\) appearing in \(C_j\). Note that another equivalent form of (9) was used in Chen and Wang (2013) for defining a kernel density estimate of \(p(X_d)\).

Also, Chen and Wang (2013) used a variation of Aitchison and Aitken’s kernel function (Aitchison and Aitken 1976) defined by

$$\begin{aligned} K(X_d,o_{dl}\vert \lambda _j)=\left\{ \begin{array}{ll} 1-\frac{\vert O_d\vert -1}{\vert O_d\vert }\lambda _j&{} \text{ if } X_d = o_{dl}\\ \frac{1}{\vert O_d\vert }\lambda _j &{} \text{ if } X_d \not = o_{dl} \end{array}\right. \end{aligned}$$
(11)

to derive the estimate \({\hat{p}}(X_d,\lambda _j\vert C_j)\), which is then used to define \({\varvec{\nu }}_{jd}\).

It is worth noting here that the kernel function \(K(X_d,o_{dl}\vert \lambda _j)\) is defined in terms of the cardinality of the whole domain \(O_d\) but not in terms of the cardinality of the subdomain \(O_{jd}\) of the given cluster \(C_j\).

From (9)–(11), it easily follows that \({\varvec{\nu }}_{jd}\) can be represented as

$$\begin{aligned} {\varvec{\nu }}_{jd} = [ P_{jd}(o_{d1}), \dots , P_{jd}(o_{dl}), \dots , P_{jd}(o_{d|O_d|}) ] \end{aligned}$$

where

$$\begin{aligned} P_{jd}(o_{dl})=\lambda _j\frac{1}{|O_d|} + (1-\lambda _j)f_j(o_{dl}) \end{aligned}$$
(12)

and \(\lambda _j \in [0,1]\) is the bandwidth for \(C_j\).

When \(\lambda _j = 0\), the center degenerates to the pure frequency estimator, which is the estimator originally used in the k-representative algorithm to define the center of a categorical cluster.

To measure the dissimilarity between a data object and its center, each data object \(X_i\) is represented by a set of vectors \(\{y_{id}\}^D_{d=1}\), with

$$\begin{aligned} y_{id} = [ I(x_{id} = o_{d1}),\dots ,I(x_{id} = o_{dl}),\dots ,I(x_{id} = o_{d|O_d|}) ] \end{aligned}$$

Here \(I(\cdot )\) is an indicator function whose value is either 1 or 0, indicating whether \(x_{id}\) is the same as \(o_{dl} \in O_d\) or not. The dissimilarity on the dth dimension is then measured by

$$\begin{aligned} {\text {dis}}_d(X_i, V_j)=||y_{id}-{\varvec{\nu }}_{jd}||_2 \end{aligned}$$
(13)

We can see that k-centers calculates the dissimilarities between objects and cluster centers in a different way, but the comparison of two categorical values is still based on the simple matching method (represented by the indicator function \(I(\cdot )\)). The remainder of the k-centers algorithm mimics the k-means algorithm.
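A minimal sketch of these two ingredients of the k-centers algorithm, the smoothed center of (12) and the per-dimension distance of (13), is given below; the function names, the toy domain and the fixed bandwidth are ours (in Chen and Wang (2013) the bandwidth \(\lambda _j\) is estimated from the data).

```python
import numpy as np
from collections import Counter

def smoothed_center_dim(cluster_vals, domain, lam):
    """Kernel-smoothed center on one attribute (12):
    P_jd(o) = lam / |O_d| + (1 - lam) * f_j(o), over the whole domain O_d."""
    n = len(cluster_vals)
    freq = Counter(cluster_vals)
    return np.array([lam / len(domain) + (1 - lam) * freq.get(o, 0) / n
                     for o in domain])

def dis_dim(x_val, center_dim, domain):
    """Per-dimension dissimilarity (13): Euclidean norm between the
    one-hot encoding of x_d and the smoothed center distribution."""
    y = np.array([1.0 if x_val == o else 0.0 for o in domain])
    return np.linalg.norm(y - center_dim)

domain = ["red", "blue", "green"]        # the whole attribute domain O_d
cluster_vals = ["red", "red", "blue"]    # values of the d-th attribute within C_j
nu_jd = smoothed_center_dim(cluster_vals, domain, lam=0.3)
print(nu_jd, dis_dim("red", nu_jd, domain))
```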

3 The proposed clustering method

Basically, the clustering method proposed in this paper follows a general procedure for clustering categorical data (Fig. 1) outlined as follows.

  1. Based on kernel smoothing techniques (Aitchison and Aitken 1976; Titterington 1980), we will develop a kernel-based method for representing cluster centers of categorical objects. Such a kernel-based approach can provide an interpretation of cluster centers that is consistent with the statistical interpretation of the cluster means for numerical data.

  2. Taking the underlying distribution of categorical attributes into consideration, we will define an information theoretic-based measure of dissimilarity for categorical data. This dissimilarity measure will be further extended for computing the distance between categorical objects and cluster centers.

  3. Under the above considerations, we can formulate the problem of clustering categorical data in a fashion similar to partitioning-based clustering. We also investigate incorporating the weighting scheme introduced in Huang et al. (2005) into partitioning-based clustering for categorical and mixed data.

Fig. 1 The general procedure for clustering categorical data

In particular, we now introduce a new extension of the k-means clustering for categorical data by combining a kernel-based estimation method of cluster centers and an information theoretic based dissimilarity measure for categorical objects.

3.1 Representation of cluster centers

As in the k-centers algorithm (Chen and Wang 2013), for each cluster \(C_j\) we define the center of \(C_j\) as

$$\begin{aligned} V_j = [{\varvec{\nu }}_{j1}, \ldots , {\varvec{\nu }}_{jD}] \end{aligned}$$
(14)

where \({\varvec{\nu }}_{jd}\) is a probability distribution on \(O_d\) estimated by a kernel density estimation method.

As our aim is to derive a kernel density based estimate \({\hat{p}}(X_d,\lambda _j\vert C_j)\) for the dth attribute of cluster \(C_j\), instead of directly using Chen and Wang’s kernel function as above, we use another version of the kernel function defined in terms of the cardinality of the subdomain \(O_{jd}\) as follows.

For any \(o_{dl}\in O_d\), if \(o_{dl}\in O_{jd}\) then we define

$$\begin{aligned} K(X_d,o_{dl}\vert \lambda _j)=\left\{ \begin{array}{ll} 1-\frac{\vert O_{jd}\vert -1}{\vert O_{jd}\vert }\lambda _j&{} \text{ if } X_d = o_{dl}\\ \frac{1}{\vert O_{jd}\vert }\lambda _j &{} \text{ if } X_d \not = o_{dl} \end{array}\right. \end{aligned}$$
(15)

otherwise, i.e. \(o_{dl}\not \in O_{jd}\), we let \(K(X_d,o_{dl}\vert \lambda _j)=0.\) Then, from (9), (10) and (15) it easily follows that \({\varvec{\nu }}_{jd}\) can be obtained as

$$\begin{aligned} {\varvec{\nu }}_{jd} = \left[ P_{jd}(o_{d1}), \dots , P_{jd}(o_{dl}), \dots , P_{jd}(o_{d|O_d|})\right] \end{aligned}$$
(16)

where

$$\begin{aligned} P_{jd}(o_{dl})=\left\{ \begin{array}{ll} \lambda _j\frac{1}{|O_{jd}|} + (1-\lambda _j)f_j(o_{dl}) &{} \text{ if } o_{dl}\in O_{jd}\\ 0 &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(17)

and \(\lambda _j \in [0,1]\) is the smoothing parameter for \(C_j\).

The parameter \(\lambda _j\) is selected using the least squares cross validation (LSCV) as done in Chen and Wang (2013), which is based on the principle of selecting a bandwidth that minimizes the total error of the resulting estimation over all the data objects. Specifically, the optimal \(\lambda _j^*\) is determined by the following equation:

$$\begin{aligned} \lambda _j^*= \frac{1}{\vert C_j\vert -1} \frac{\sum _{d=1}^{D}\left( 1-\sum _{o_{dl}\in O_{jd}}[f_j(o_{dl})]^2\right) }{{\sum _{d=1}^{D}}\left( \sum _{o_{dl}\in O_{jd}}[f_j(o_{dl})]^2- \frac{1}{|O_{jd}|}\right) } \end{aligned}$$
(18)
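As a sketch (with our own function names and a toy cluster), the center of (17) and the LSCV bandwidth of (18) can be computed as follows; the only deviation from the formulas is that the computed bandwidth is clipped to [0, 1], the range stated above for \(\lambda _j\).

```python
from collections import Counter

def center_dim(cluster_vals, domain, lam):
    """Proposed smoothed center on one attribute (17): the smoothing mass is
    spread only over the subdomain O_jd of categories observed in the cluster."""
    n = len(cluster_vals)
    freq = Counter(cluster_vals)
    return {o: (lam / len(freq) + (1 - lam) * freq[o] / n) if o in freq else 0.0
            for o in domain}

def lscv_bandwidth(cluster):
    """LSCV-optimal bandwidth for a cluster (18), clipped to [0, 1]."""
    p, D = len(cluster), len(cluster[0])
    num = den = 0.0
    for d in range(D):
        freq = Counter(x[d] for x in cluster)
        s = sum((c / p) ** 2 for c in freq.values())   # sum of squared frequencies
        num += 1.0 - s
        den += s - 1.0 / len(freq)                     # len(freq) = |O_jd|
    lam = num / ((p - 1) * den) if p > 1 and den != 0 else 0.0
    return min(max(lam, 0.0), 1.0)                     # lambda_j lies in [0, 1]

cluster = [("red", "small")] * 5 + [("red", "large")] * 2 + [("blue", "small")] * 3
lam = lscv_bandwidth(cluster)                          # about 0.32 for this cluster
print(lam, center_dim([x[0] for x in cluster], ["red", "blue", "green"], lam))
```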

3.2 An information theoretic-based dissimilarity measure

Instead of using the simple matching measure as in Huang (1997) and San et al. (2004), or the Euclidean norm as in Chen and Wang (2013), we first introduce a dissimilarity measure for categorical values that makes use of the information-theoretic definition of similarity proposed by Lin (1998). We then propose a new method for computing the distance between categorical objects and cluster centers, which combines the kernel density based definition of centers with this information-theoretic dissimilarity measure for categorical data.

In Lin (1998), the author developed an information-theoretic framework for similarity within which a formal definition of similarity can be derived from a set of underlying assumptions. Basically, Lin’s definition of similarity is stated in information theoretic terms, as quoted “the similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are.” Formally, the similarity between A and B is generally defined as

$$\begin{aligned} \mathrm {sim}(A,B) = \frac{\log P(\mathrm {common}(A,B))}{\log P(\mathrm {description}(A,B))} \end{aligned}$$
(19)

where P(s) is the probability of a statement s, and the information contained in s is measured by \([- \log P(s)]\). To show the universality of the information-theoretic definition of similarity, Lin (1998) also discussed it in different settings, including ordinal domain, string similarity, word similarity and semantic similarity.

Boriah et al. (2008) applied Lin’s framework to the categorical setting and proposed a similarity measure for categorical data as follows. Let DB be a data set consisting of objects defined over a set of D categorical attributes with finite domains denoted by \(O_1,\ldots ,O_D\), respectively. For each \(d=1,\ldots ,D\), the similarity between two categorical values \(o_{dl},o_{dl'}\in O_d\) is defined by

$$\begin{aligned} \mathrm {sim}_d(o_{dl},o_{dl'}) =\left\{ \begin{array}{ll} 2\log f_d(o_{dl}) &{} \text{ if } o_{dl}=o_{dl'}\\ 2\log (f_d(o_{dl})+f_d(o_{dl'})) &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(20)

where

$$\begin{aligned} f_d(x) = \frac{\#(x)}{\vert DB\vert } \end{aligned}$$

with \(\#(x)\) being the number of objects in DB having the category x at dth attribute. In fact, Boriah et al. (2008) also proposed another similarity measure derived from Lin’s framework and conducted an experimental evaluation of many different similarity measures for categorical data in the context of outlier detection.

It should be emphasized here that the similarity measure \(\mathrm {sim}_d(\cdot ,\cdot )\) does not satisfy Assumption 4 of Lin's framework (Lin 1998), which states that the similarity between a pair of identical objects is 1. In particular, the range of \(\mathrm {sim}_d(o_{dl},o_{dl'})\) for \(o_{dl}=o_{dl'}\) is \(\left[ -2\log \vert DB\vert ,0\right] \), with the minimum attained when \(o_{dl}\) occurs only once and the maximum attained when \(O_d=\{o_{dl}\}\). Similarly, the range of \(\mathrm {sim}_d(o_{dl},o_{dl'})\) for \(o_{dl}\not =o_{dl'}\) is \(\left[ -2 \log \frac{\vert DB\vert }{2}, 0\right] \), with the minimum attained when \(o_{dl}\) and \(o_{dl'}\) each occur only once, and the maximum attained when \(o_{dl}\) and \(o_{dl'}\) each occur \(\frac{\vert DB\vert }{2}\) times, as pointed out in Boriah et al. (2008).

Based on the general definition of similarity given in (19) and its application to similarity between ordinal values briefly discussed in Lin (1998), we introduce another similarity measure for categorical values as follows.

For any two categorical values \(o_{dl},o_{dl'}\in O_d\), their similarity, denoted by \(\mathrm {sim}^*_d(o_{dl},o_{dl'})\), is defined by

$$\begin{aligned} \mathrm {sim}^*_d(o_{dl},o_{dl'}) =\frac{2\log f_d(\{o_{dl},o_{dl'}\})}{\log f_d(o_{dl})+ \log f_d(o_{dl'})} \end{aligned}$$
(21)

where

$$\begin{aligned} f_d(\{o_{dl},o_{dl'}\}) = \frac{\#(\{o_{dl},o_{dl'}\})}{\vert DB\vert } \end{aligned}$$

with \(\#(\{o_{dl},o_{dl'}\})\) being the number of categorical objects in DB that take a value belonging to \(\{o_{dl},o_{dl'}\}\) at the dth attribute. Clearly, we have \(\mathrm {sim}^*_d(o_{dl},o_{dl'}) = 1\) if \(o_{dl}\) and \(o_{dl'}\) are identical, which satisfies Assumption 4 stated above.

Then, the dissimilarity measure between two categorical values \(o_{dl},o_{dl'}\in O_d\) can be defined by

$$\begin{aligned} {\text {dis}}^*_d(o_{dl},o_{dl'}) = 1 -\mathrm {sim}^*_d(o_{dl},o_{dl'}) =1- \frac{2\log f_d(\{o_{dl},o_{dl'}\})}{\log f_d(o_{dl})+ \log f_d(o_{dl'})} \end{aligned}$$
(22)

Let \(X_i = [x_{i1},x_{i2},\dots ,x_{iD}]\in DB\) and \(V_j = [{\varvec{\nu }}_{j1}, \ldots , \varvec{\nu }_{jD}]\) be the center of cluster \(C_j\). We are now able to extend the dissimilarity between categorical values of \(O_d\) to the dissimilarity on the dth attribute between \(X_i\) and \(V_j\), i.e. the dissimilarity between the dth component \(x_{id}\in O_d\) of \(X_i\) and the dth component \({\varvec{\nu }}_{jd}\) of the center \(V_j\), as follows. Without danger of confusion, we shall also use \({\text {dis}}^*_d\) to denote this dissimilarity and

$$\begin{aligned} {\text {dis}}^*_d(X_i,V_j)= \sum _{o_{dl} \in O_{jd}} P_{jd}(o_{dl}) {\text {dis}}^*_d(x_{id},o_{dl}) \end{aligned}$$
(23)
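The following sketch (function names and toy data are ours) illustrates (21)–(23): the frequencies \(f_d\) are estimated from the whole dataset DB, while the weights \(P_{jd}\) come from the kernel-smoothed center of the cluster and are passed in here as a plain dictionary.

```python
import math
from collections import Counter

def value_dissimilarity(o1, o2, db_column):
    """Information-theoretic dissimilarity between two categories (21)-(22).
    `db_column` holds the values of the d-th attribute over the whole dataset DB."""
    if o1 == o2:
        return 0.0                     # sim* = 1 for identical categories
    N = len(db_column)
    freq = Counter(db_column)
    f1, f2 = freq[o1] / N, freq[o2] / N
    sim = 2 * math.log(f1 + f2) / (math.log(f1) + math.log(f2))
    return 1.0 - sim

def object_center_dissimilarity_dim(x_d, center_dim, db_column):
    """Dissimilarity on one attribute between an object and a cluster center (23):
    the P_jd-weighted sum of the value dissimilarities over the subdomain O_jd."""
    return sum(p * value_dissimilarity(x_d, o, db_column)
               for o, p in center_dim.items() if p > 0)

db_column = ["red", "red", "blue", "green", "red", "blue"]   # d-th attribute of DB
center_dim = {"red": 0.6, "blue": 0.4, "green": 0.0}         # P_jd, e.g. from (17)
print(value_dissimilarity("red", "blue", db_column))          # about 0.80
print(object_center_dissimilarity_dim("red", center_dim, db_column))
```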

3.3 The clustering algorithm

With the modifications made above, we are now ready to formulate the problem of clustering categorical data in a way similar to k-means clustering. Adapting Huang's W-k-means algorithm (Huang et al. 2005), we also use a weighting vector \(W=[w_1, w_2, \dots ,w_D]\) for the D attributes and a parameter \(\beta \) for the attribute weights, where \(0 \le w_d \le 1\) and \(\sum _d w_d = 1\). The principle of attribute weighting is to assign a larger weight to an attribute that has a smaller sum of within-cluster distances and a smaller weight to an attribute that has a larger sum of within-cluster distances. More details of this weighting scheme can be found in Huang et al. (2005). The weighted dissimilarity between data object \(X_i\) and cluster center \(V_j\), denoted by \({\text {dis}}^*(X_i,V_j)\), is then defined by

$$\begin{aligned} {\text {dis}}^*(X_i,V_j)= \sum _{d=1}^{D} w_d^\beta {\text {dis}}^*_d(X_i,V_j)= \sum _{d=1}^{D} w_d^\beta \sum _{o_{dl} \in O_{jd}} P_{jd}(o_{dl}) {\text {dis}}^*_d(x_{id},o_{dl}) \end{aligned}$$
(24)

Based on these definitions, the clustering algorithm now aims to minimize the following objective function:

$$\begin{aligned} J(U, {\mathcal {V}}, W) = \sum _{j=1}^{k} \sum _{i=1}^{N} \sum _{d=1}^{D} u_{i,j}w_d^\beta {\text {dis}}^*_d(X_i,V_j) \end{aligned}$$
(25)

subject to

$$\begin{aligned} \begin{array}{lr} \sum _{j=1}^k u_{i,j} = 1, &{} 1 \le i \le N\\ u_{i,j} \in \{0,1\}, &{} 1 \le i \le N, 1 \le j \le k \\ \sum _{d=1}^D w_d = 1, &{} 0 \le w_d \le 1 \end{array} \end{aligned}$$

where \(U = [u_{i,j}]_{N\times k}\) is a partition matrix.

The proposed clustering algorithm is then formulated as below.

Algorithm 1 The proposed clustering algorithm
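To make the overall flow of Algorithm 1 concrete, the following Python sketch puts the pieces together under stated assumptions: the initialization, the handling of empty clusters, the stopping rule, and the attribute-weight update (taken here to be the W-k-means-style update commonly attributed to Huang et al. 2005) are our own choices and are not prescribed by the text above; all function names are ours, and readers should consult Huang et al. (2005) for the exact weighting scheme.

```python
import math
from collections import Counter

def column(data, d):
    """Values of the d-th attribute over a list of tuples."""
    return [x[d] for x in data]

def lscv_bandwidth(cluster):
    """Bandwidth of (18), clipped to [0, 1]."""
    p, D = len(cluster), len(cluster[0])
    num = den = 0.0
    for d in range(D):
        freq = Counter(column(cluster, d))
        s = sum((c / p) ** 2 for c in freq.values())
        num += 1.0 - s
        den += s - 1.0 / len(freq)
    lam = num / ((p - 1) * den) if p > 1 and den != 0 else 0.0
    return min(max(lam, 0.0), 1.0)

def center(cluster):
    """Cluster center of (17): one smoothed distribution per attribute,
    supported on the subdomain O_jd of observed categories."""
    lam, p = lscv_bandwidth(cluster), len(cluster)
    V = []
    for d in range(len(cluster[0])):
        freq = Counter(column(cluster, d))
        V.append({o: lam / len(freq) + (1 - lam) * c / p for o, c in freq.items()})
    return V

def dis_value(o1, o2, f):
    """(21)-(22); f maps each category of the attribute to its frequency in DB."""
    if o1 == o2:
        return 0.0
    return 1.0 - 2 * math.log(f[o1] + f[o2]) / (math.log(f[o1]) + math.log(f[o2]))

def dis_weighted(x, V, w, beta, freqs):
    """Weighted object-to-center dissimilarity of (24)."""
    return sum((w[d] ** beta) * sum(p * dis_value(x[d], o, freqs[d])
                                    for o, p in V[d].items())
               for d in range(len(x)))

def cluster_categorical(data, k, beta=8.0, max_iter=50):
    """A sketch of the overall iteration of Algorithm 1 (assumed structure)."""
    N, D = len(data), len(data[0])
    freqs = [{o: c / N for o, c in Counter(column(data, d)).items()} for d in range(D)]
    w = [1.0 / D] * D                    # equal initial attribute weights (assumption)
    labels = [i % k for i in range(N)]   # naive initial partition (assumption)
    for _ in range(max_iter):
        groups = [[x for x, l in zip(data, labels) if l == j] for j in range(k)]
        # Empty clusters fall back to the center of the whole dataset (assumption).
        centers = [center(g) if g else center(data) for g in groups]
        # Assignment step: nearest center under the weighted dissimilarity (24).
        new_labels = [min(range(k), key=lambda j: dis_weighted(x, centers[j], w, beta, freqs))
                      for x in data]
        # Weight update in the W-k-means style of Huang et al. (2005), assumed here:
        # attributes with a smaller within-cluster distance sum get a larger weight.
        Dd = [sum(sum(p * dis_value(x[d], o, freqs[d]) for o, p in centers[l][d].items())
                  for x, l in zip(data, new_labels)) for d in range(D)]
        w = [0.0 if Dd[d] == 0 else
             1.0 / sum((Dd[d] / Dd[t]) ** (1.0 / (beta - 1)) for t in range(D) if Dd[t] != 0)
             for d in range(D)]
        if new_labels == labels:         # stop when the partition no longer changes
            break
        labels = new_labels
    return labels

data = [("red", "small"), ("red", "large"), ("blue", "small"),
        ("green", "large"), ("green", "large"), ("blue", "small")]
print(cluster_categorical(data, k=2))
```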

4 Experimental studies

In this section, we report experiments conducted on ten datasets obtained from the UCI Machine Learning Repository (Blake and Merz 1998) to compare the clustering performance of k-modes, k-representatives, and the proposed clustering method (denoted by New). In addition, in order to assess the effectiveness and efficiency of the new dissimilarity measure and of our modification of the kernel-based concept of cluster centers, we also conduct experiments with modified versions of k-representatives and of the proposed clustering method. In the modified version of k-representatives (denoted by M-k-representatives), we simply replace the simple matching dissimilarity measure used in k-representatives with the information theoretic-based dissimilarity measure defined by (22). In the modified version of the proposed clustering method (denoted by M-k-Centers), we use the information theoretic-based dissimilarity measure defined by (22) in combination with the concept of cluster centers defined by Chen and Wang (2013). That is, M-k-Centers is the same as Algorithm 1 formulated above, but with the cluster centers updated by (12) instead of (17).

4.1 Datasets

The main characteristics of the datasets are summarized in Table 1. These datasets are chosen to test our algorithm because they are publicly available and all their attributes can be treated as categorical.

The Balance-Scale dataset was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The Breast Cancer dataset was obtained from the University of Wisconsin Hospitals. Each instance has one of two possible classes: benign or malignant. The Car Evaluation dataset was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, 1990). Each instance belongs to one of four classes: unacc, acc, good and vgood. The Lenses dataset concerns the fitting of contact lenses; all instances are classified into three classes: hard contact lenses, soft contact lenses, and no contact lenses. The Mushroom dataset consists of records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). It includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is identified as edible or poisonous.

The Nursery dataset was derived from a hierarchical decision model originally developed to rank applications for nursery schools. All instances in this dataset belong to five classes: not_recom, recommend, very_recom, priority, spec_prior. The Soybean (Large) dataset comes from the research of R. S. Michalski and R. L. Chilausky in 1980; it contains 307 instances with 35 attributes, classified into 19 classes. The Soybean (Small) dataset is a small subset of the original soybean database; it contains 47 instances classified into 4 classes. The Tictactoe dataset encodes the complete set of possible board configurations at the end of tic-tac-toe games, where “x” is assumed to have played first. It contains 958 instances that belong to two classes: positive and negative. The Tae dataset consists of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant assignments at the Statistics Department of the University of Wisconsin-Madison. All instances belong to three classes: low, medium and high.

Table 1 Categorical datasets for the experiment

4.2 Clustering quality evaluation

Evaluating clustering quality is often a hard and subjective task (Ienco et al. 2009). Generally, objective functions in clustering are purposely designed to achieve high intra-cluster similarity and low inter-cluster similarity. This can be viewed as an internal criterion for the quality of a clustering (Manning et al. 2008). However, as observed in the literature, good scores on an internal criterion do not necessarily translate into good effectiveness in an application. Here, in the same way as in Ienco et al. (2012), we use three external criteria to evaluate the results: purity, normalized mutual information (NMI) and adjusted Rand index (ARI). These criteria compare the original class information of each object with the cluster to which the object has been assigned, in order to evaluate how well the clustering result matches the original classes.

We denote by \(C = \{C_1, \dots , C_J \}\) the partition of the dataset built by the clustering algorithm, and by \(P = \{P_1, \dots , P_I \}\) the partition inferred by the original classification. J and I are respectively the number of clusters |C| and the number of classes |P|. We denote by N the total number of objects.

4.2.1 Purity metric

Purity is a simple and transparent evaluation measure. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned objects and dividing by the total number of objects in the dataset. High purity is easy to achieve when the number of clusters is large; thus, purity cannot be used to trade off the quality of the clustering against the number of clusters.

$$\begin{aligned} Purity(C,P)= \frac{1}{N} \sum _{j}\max _i |C_j \cap P_i| \end{aligned}$$
(26)

4.2.2 NMI metric

The second metric, NMI, provides a measure that is independent of the number of clusters (Strehl and Ghosh 2003). It takes its maximum value when the clustering partition completely matches the original partition. NMI is computed as the mutual information between the cluster partition and the class partition, normalized by the geometric mean of their entropies:

$$\begin{aligned} NMI(C,P)=\frac{\sum _{i=1}^{I} \sum _{j=1}^{J}|C_j \cap P_i| \log \frac{N |C_j \cap P_i|}{|C_j||P_i|}}{\sqrt{\sum _{j=1}^{J}|C_j| \log \frac{|C_j|}{N} \sum _{i=1}^{I}|P_i| \log \frac{|P_i|}{N}}} \end{aligned}$$
(27)

4.2.3 ARI metric

The third metric is the adjusted Rand index (Hubert and Arabie 1995). Let a be the number of object pairs that belong to the same cluster in C and to the same class in P. This metric captures the deviation of a from its expected value under the hypothesis that C and P are two random, independent partitions.

The expected value of a denoted by E[a] is computed as follows:

$$\begin{aligned} E[a]= \frac{\pi (C)\pi (P)}{N(N-1)/2} \end{aligned}$$
(28)

where \(\pi (C)\) and \(\pi (P)\) denote respectively the number of object pairs from the same clusters in C and from the same class in P. The maximum value for a is defined as:

$$\begin{aligned} \max (a)= \frac{1}{2} (\pi (C)+\pi (P)) \end{aligned}$$
(29)

The agreement between C and P can be estimated by the adjusted rand index as follows:

$$\begin{aligned} ARI(C, P) = \frac{a-E[a]}{\max (a)-E[a]} \end{aligned}$$
(30)

When \(ARI(C, P) = 1\), the two partitions are identical.
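For reference, the three criteria in (26)–(30) can be computed directly from the cluster labels and the class labels, as in the sketch below; the function and variable names are ours, and equivalent implementations are available in standard libraries.

```python
import math
from collections import Counter

def pairs(n):
    return n * (n - 1) // 2

def purity(clusters, classes):
    """(26): each cluster votes for its majority class."""
    N = len(clusters)
    contingency = Counter(zip(clusters, classes))
    return sum(max(c for (j, _), c in contingency.items() if j == cj)
               for cj in set(clusters)) / N

def nmi(clusters, classes):
    """(27): mutual information normalized by the geometric mean of entropies."""
    N = len(clusters)
    cj, pi = Counter(clusters), Counter(classes)
    joint = Counter(zip(clusters, classes))
    num = sum(n * math.log(N * n / (cj[j] * pi[i])) for (j, i), n in joint.items())
    den = math.sqrt(sum(n * math.log(n / N) for n in cj.values())
                    * sum(n * math.log(n / N) for n in pi.values()))
    return num / den

def ari(clusters, classes):
    """(28)-(30): adjusted Rand index."""
    N = len(clusters)
    a = sum(pairs(n) for n in Counter(zip(clusters, classes)).values())
    pi_c = sum(pairs(n) for n in Counter(clusters).values())
    pi_p = sum(pairs(n) for n in Counter(classes).values())
    expected = pi_c * pi_p / pairs(N)       # (28)
    max_a = 0.5 * (pi_c + pi_p)             # (29)
    return (a - expected) / (max_a - expected)

clusters = [0, 0, 1, 1, 1, 2]               # labels produced by a clustering algorithm
classes  = ['a', 'a', 'b', 'b', 'c', 'c']   # original class labels
print(purity(clusters, classes), nmi(clusters, classes), ari(clusters, classes))
```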

In many previous studies, only the purity metric has been used to analyze the performance of clustering algorithms. However, high purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each data object gets its own cluster. Besides, many different partitions can have the same purity while differing, for example, in the number of objects in each cluster and in which objects constitute the clusters. Therefore, we need the other two metrics to obtain an overall picture of how well our clustering results match the original classes.

4.3 Experimental results

The experiments were run on a Mac with a 3.66 GHz Intel quad-core processor and 8 GB of RAM, running Mac OS X 10.10. For each categorical dataset, we ran each algorithm 300 times. The parameter k was set equal to the number of classes in each dataset. The three evaluation metrics are reported as averages over the 300 runs. The weighting exponent \(\beta \) was set to 8, as experimentally recommended in Huang et al. (2005).

As we can see from Tables 2, 3 and 4, the proposed clustering method produces the best results on eight out of ten datasets in terms of purity and NMI, and on seven out of ten datasets in terms of the adjusted Rand index. Comparing the performance of k-representatives and M-k-representatives, we can see that the new dissimilarity measure yields better results, especially in NMI and the adjusted Rand index. Similarly, comparing the performance of M-k-Centers and the proposed algorithm shows that our kernel-based concept of cluster centers significantly improves the clustering performance over the kernel-based concept of cluster centers previously defined in Chen and Wang (2013). In conclusion, the new approach has been shown to enhance the performance of previously developed k-means like algorithms for clustering categorical data.

Table 2 Purity results of categorical dataset algorithms
Table 3 NMI results of categorical dataset algorithms
Table 4 Adjusted rand index results of categorical dataset algorithms

5 Conclusions

In this paper, we have introduced a new k-means like method for clustering categorical data based on an information theoretic based dissimilarity measure and a kernel density estimate-based concept of cluster centers for categorical objects. Several variations of the proposed clustering method have also been examined. The experimental results on real-world datasets from the UCI Machine Learning Repository have shown that the proposed clustering method outperforms the k-means like algorithms previously developed for clustering categorical data.

As future work, we plan to extend the proposed approach to the problem of clustering mixed numeric and categorical datasets, as well as to fuzzy clustering. Applications of the developed clustering methods in areas such as chemical compound analysis, DNA sequence analysis, social network analysis, and patient similarity for treatment regimen discovery are also subjects of our future studies.