1 Introduction

The aim of cluster analysis is to partition the data into meaningful homogeneous groups that differ considerably from each other. The problem is made more difficult by the presence of mixed-type data: ordinal and continuous variables. Two main approaches exist to address it, based either on a model describing the data generation process or on a distance able to capture the dissimilarity between two entities. Before summarizing the main features of the two approaches, let us specify that when we use the term categorical data, we are still referring to ordinal variables. Following the definition given in [1], ordinal variables are categorical variables with ordered categories.

As regards the model-based approach, the literature on clustering for continuous data is rich and wide; the most commonly used model-based clustering approach is the finite mixture of Gaussians [17]. By contrast, the literature for categorical data is still limited. In the Underlying Response Variable (URV) approach, mainly developed in the SEM framework (see, e.g., [11, 14, 20]), the ordinal variables are seen as a discretization of continuous latent variables jointly distributed as a finite mixture (see [5, 16, 23]). However, this makes maximum likelihood estimation rather complex, because it requires the computation of many high-dimensional integrals. The problem is usually solved by approximating the likelihood function with a surrogate one; useful surrogate functions include the variational likelihood [7] and the composite likelihood [21, 23, 24]. A further problem arises when we consider the joint distribution of continuous and ordinal variables. Under the local independence assumption, the issue can be easily solved by factorizing the joint density into the product of univariate marginals; however, this assumption is unrealistic and too restrictive.

Following the URV approach, [5, 23] proposed a model in which the variables follow a Gaussian mixture model, where some variables, the ordinal ones, are only partially observed through their discretization. As a side note, at this stage nominal variables cannot be included in the model, since there is no notion of proximity among unordered categories.

Besides these methods, there are others based on Gower's distance [8]. This is computed as the average of partial dissimilarities across subjects (or entities), where the type of partial dissimilarity depends on the type of the variable. A k-medoids algorithm (the PAM algorithm, [13, 25]) can then be used to cluster the data. These are not the only clustering methods in the literature: there are many techniques for mixed-type data and many reviews; see, for example, [2, 6, 10]. Comparing clustering techniques is extremely useful, and interest in benchmarking in cluster analysis has been growing; a good discussion can be found in [18].

This paper aims to explore and compare the behavior of the mixture model for mixed-type data with the distance-based methods, as well as with some more naive approaches in which ordinal data are treated as metric.

The plan of the paper is as follows. In Sect. 2, we describe the model-based approach to cluster mixed-type data. The Gower distance method followed by the PAM algorithm is described in Sect. 3. In Sect. 4, we compare these clustering techniques through a simulation study. The last section provides some concluding remarks.

2 The Model-Based Approach

Let \(\mathbf{x} =[x_1,\ldots , x_{O}]^{\prime }\) and \(\mathbf{y} ^{\bar{O}}=[y_{O+1}, \ldots , y_P]^{\prime }\) be O ordinal and \(\bar{O}=P-O\) continuous variables, respectively. The associated categories for each ordinal variable are denoted by \(c_{i}=1,2,\ldots , C_{i}\) with \(i=1,2,\ldots , O\).

Following the Underlying Response Variable (URV) approach, the ordinal variables \(\mathbf{x} \) are considered as a categorization of a continuous multivariate latent variable \(\mathbf{y} ^{O}=[y_{1}, \ldots , y_O]^{\prime }\). The latent relationship between \(\mathbf{x} \) and \(\mathbf{y} ^{O}\) is explained by the threshold model,

$$x_{i}=c_{i} \Leftrightarrow \gamma _{c_i-1}^{(i)} \le y_{i} < \gamma _{c_{i}}^{(i)}, $$

where \(-{\infty } =\gamma _{0}^{(i)}< \gamma _{{1}}^{(i)}<\ldots< \gamma _{{C_i-1}}^{(i)}< \gamma _{{C_i}}^{(i)}=+{\infty } \) are the thresholds defining the \(C_i\) categories collected in a set \(\boldsymbol{\varGamma }\). To accommodate both cluster structure and dependence within the groups, we assume that \(\mathbf{y} =[\mathbf{y} ^{O \prime },\mathbf{y} ^{\bar{O} \prime }]^{\prime }\) follows a heteroscedastic Gaussian mixture, \(f\left( \mathbf{y} \right) =\sum _{g=1}^{G}\tau _{g}\phi _p\left( \mathbf{y} ; \boldsymbol{\mu }_{g},\boldsymbol{\varSigma }_{g}\right) \), where the \(\tau _g\)’s are the mixing weights and \(\phi _p\left( \mathbf{y} ; \boldsymbol{\mu }_{g},\boldsymbol{\varSigma }_{g}\right) \) is the density of a P-variate normal distribution with mean vector \(\boldsymbol{\mu }_g\) and covariance matrix \(\boldsymbol{\varSigma }_g\).
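To fix ideas, the following minimal Python sketch generates a mixed-type sample through this mechanism; the choice of one ordinal and one continuous variable, and all numerical values, are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: P = 2 variables (one ordinal with C = 3
# categories, one continuous) and G = 2 components; all numbers are made up.
tau = np.array([0.5, 0.5])                               # mixing weights
mu = [np.array([0.0, 0.0]), np.array([2.0, 3.0])]        # component means
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]  # component covariances
gamma = np.array([-np.inf, -0.5, 0.5, np.inf])           # thresholds gamma_0..gamma_C

N = 500
g = rng.choice(len(tau), size=N, p=tau)                  # latent component labels
y = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in g])

# Threshold model: x = c  <=>  gamma_{c-1} <= y_1 < gamma_c
x = np.digitize(y[:, 0], gamma[1:-1]) + 1                # categories 1, 2, 3
observed = np.column_stack([x, y[:, 1]])                 # ordinal + continuous sample
```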

Let us set \(\boldsymbol{\psi }=\left\{ \tau _1,\ldots ,\tau _{G},\boldsymbol{\mu }_1,\ldots ,\boldsymbol{\mu }_G,\boldsymbol{\varSigma }_1,\ldots ,\boldsymbol{\varSigma }_G,\boldsymbol{\varGamma } \right\} \in \boldsymbol{\Psi }\), where \(\boldsymbol{\Psi }\) is the parameter space. For a random i.i.d. sample of size N: \((\mathbf{x} _1,\mathbf{y} ^{\bar{O}}_1),\ldots ,(\mathbf{x} _N,\mathbf{y} ^{\bar{O}}_N)\), the log-likelihood is

$$\begin{aligned} \ell (\boldsymbol{\psi })= & {} \sum _{n=1}^N\log \left[ \sum _{g=1}^G \tau _g \phi _{\bar{O}}(\mathbf{y} ^{\bar{O}}_n; \boldsymbol{\mu }^{\bar{O}}_g,\boldsymbol{\varSigma }_g^{\bar{O}\bar{O}}) \pi _{n}\left( \boldsymbol{\mu }_{n;g}^{O\mid \bar{O}},\boldsymbol{\varSigma }_{g}^{O\mid \bar{O}},\boldsymbol{\varGamma }\right) \right] , \end{aligned}$$
(1)

where with obvious notation

$$\begin{aligned} \pi _n\left( \boldsymbol{\mu }_{n;g}^{O\mid \bar{O}},\boldsymbol{\varSigma }_{g}^{O\mid \bar{O}},\boldsymbol{\varGamma }\right)= & {} \int _{\gamma _{c_{1}-1}^{(1)}}^{\gamma _{c_{1}}^{(1)}}\cdots \int _{\gamma _{c_O-1}^{(O)}}^{\gamma _{c_{O}}^{(O)}} \phi _O(\mathbf{u} ;\boldsymbol{\mu }_{n;g}^{O\mid \bar{O}},\boldsymbol{\varSigma }_{g}^{O \mid \bar{O}})d\mathbf{u} \\ \boldsymbol{\mu }_{n;g}^{O\mid \bar{O}}= & {} \boldsymbol{\mu }_g^{O}+\boldsymbol{\varSigma }_g^{O\bar{O}}(\boldsymbol{\varSigma }_g^{\bar{O}\bar{O}})^{-1}(\mathbf{y} _n^{\bar{O}}-\boldsymbol{\mu }_g^{\bar{O}}),\\ \boldsymbol{\varSigma }_g^{O\mid \bar{O}}= & {} \boldsymbol{\varSigma }_g^{OO}-\boldsymbol{\varSigma }_g^{O\bar{O}}(\boldsymbol{\varSigma }_g^{\bar{O}\bar{O}})^{-1}\boldsymbol{\varSigma }_g^{\bar{O}O}. \end{aligned}$$

\(\pi _n\left( \boldsymbol{\mu }_{n;g}^{O\mid \bar{O}},\boldsymbol{\varSigma }_{g}^{O\mid \bar{O}},\boldsymbol{\varGamma }\right) \) is the conditional joint probability of response pattern \(\mathbf{x} _n=(c_{1;n},\ldots ,c_{O;n})\) given the cluster g and the values \(\mathbf{y} _n^{\bar{O}}\) for the continuous variables. Finally, \(\tau _g\) is the probability of belonging to group g subject to \(\tau _g>0\) and \(\sum _{g=1}^{G}\tau _g=1\).

The presence of multidimensional integrals makes maximum likelihood estimation computationally demanding, and infeasible as the number of ordinal variables increases. To overcome this, a composite likelihood approach is adopted [15]: it allows us to simplify the problem by replacing the full likelihood with a surrogate function. As suggested in [21, 23, 24] in a similar context, the full log-likelihood can be replaced by a sum over the \(O(O-1)/2\) marginal distributions, each composed of a pair of ordinal variables and the \(\bar{O}\) continuous variables. In this way, the computational complexity is greatly reduced, because the evaluation of the new function requires the calculation of bivariate, rather than O-variate, integrals. This leads to the following surrogate function

$$\begin{aligned} c\ell (\boldsymbol{\psi })=\sum _{n=1}^N\sum _{i=1}^{O-1}\sum _{j=i+1}^O\sum _{c_{i}=1}^{C_i} \sum _{c_{j}=1}^{C_j}\delta _{nc_{i}c_{j}}^{(ij)}\log \Bigg [ \sum _{g=1}^G \tau _g \phi _{\bar{O}}(\mathbf{y} _n^{\bar{O}}; \boldsymbol{\mu }_g^{\bar{O}},\boldsymbol{\varSigma }_g^{\bar{O}\bar{O}})\,\pi _{c_{i}c_{j}}^{(ij\mid \bar{O})}\left( \boldsymbol{\mu }_{n;g}^{(ij\mid \bar{O})},\boldsymbol{\varSigma }_g^{(ij\mid \bar{O})},\boldsymbol{\varGamma }^{(ij)}\right) \Bigg ], \end{aligned}$$

where \(\delta _{n c_{i}c_{j}}^{(ij)} \) is a dummy variable taking value 1 if the nth observation presents the combination of categories \(c_{i}\) and \(c_{j}\) for variables \(x_{i}\) and \(x_{j}\), respectively, and 0 otherwise; \( \pi _{c_{i}c_{j}}^{(ij\mid \bar{O})}(\boldsymbol{\mu }_{n;g}^{(ij\mid \bar{O})},\boldsymbol{\varSigma }_g^{(ij\mid \bar{O})},\boldsymbol{\varGamma }^{(ij)})\) is the conditional probability of the pair \((x_i=c_i, x_j=c_j)\), obtained by integrating the density of a bivariate normal distribution with parameters \((\boldsymbol{\mu }_{n;g}^{(ij\mid \bar{O})},\boldsymbol{\varSigma }_g^{(ij\mid \bar{O})})\) between the corresponding threshold parameters contained in the set \(\boldsymbol{\varGamma }^{(ij)}\). Parameter estimation is carried out through an EM-like algorithm that works in the same manner as the standard EM; likewise, it suffers from the problem of local optima.
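As an illustration of the quantities entering one pair term of the surrogate function, the sketch below computes the conditional parameters (formulas given after Eq. (1)) and the bivariate rectangle probability; the helper names conditional_params and pi_rect are ours, and the truncation of infinite thresholds to finite values is only a numerical convenience.

```python
import numpy as np
from scipy.stats import multivariate_normal

def conditional_params(mu_g, Sigma_g, y_cont, O):
    """Conditional mean and covariance of the latent ordinal block given the
    continuous block; mu_g, Sigma_g are the full P-dimensional component
    parameters, O is the number of ordinal variables (placed first)."""
    A = Sigma_g[:O, O:] @ np.linalg.inv(Sigma_g[O:, O:])
    mu_c = mu_g[:O] + A @ (y_cont - mu_g[O:])
    Sigma_c = Sigma_g[:O, :O] - A @ Sigma_g[O:, :O]
    return mu_c, Sigma_c

def pi_rect(lo, hi, mean, cov):
    """P(lo <= U < hi) for a bivariate normal U, by CDF inclusion-exclusion;
    infinite thresholds are replaced by large finite stand-ins."""
    lo, hi = np.clip(lo, -10, 10), np.clip(hi, -10, 10)
    Phi = lambda u1, u2: multivariate_normal.cdf([u1, u2], mean=mean, cov=cov)
    return (Phi(hi[0], hi[1]) - Phi(lo[0], hi[1])
            - Phi(hi[0], lo[1]) + Phi(lo[0], lo[1]))

# one pair term: restrict the conditional parameters to the pair (i, j), e.g.
# mu_c, Sig_c = conditional_params(mu_g, Sigma_g, y_n_cont, O)
# p = pi_rect([g_i_lo, g_j_lo], [g_i_hi, g_j_hi],
#             mu_c[[i, j]], Sig_c[np.ix_([i, j], [i, j])])
```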

In the simulation study, the partition has been initialized randomly; the output of a mixture model for continuous data has then been taken as a good rational starting point for the component parameters. The initial values for the thresholds have been computed as follows: for each variable, we considered the empirical relative frequency of each category and minimized the quadratic difference between this frequency and the corresponding quantile of the mixture.
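One way to implement this initialization for a single ordinal variable is sketched below; since the mixture CDF is monotone, the quantile matching reduces to inverting the CDF at each cumulative frequency. Function and argument names are ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def init_thresholds(x, tau, mu, sigma):
    """Initial thresholds for one ordinal variable x (numpy array of
    categories): match the empirical cumulative category frequencies to
    quantiles of a univariate Gaussian mixture with weights tau, means mu,
    and standard deviations sigma (1-D arrays)."""
    _, counts = np.unique(x, return_counts=True)
    cum_freq = np.cumsum(counts)[:-1] / x.size     # drop the last value (= 1)
    mix_cdf = lambda t: float(np.sum(tau * norm.cdf(t, loc=mu, scale=sigma)))
    # invert the (monotone) mixture CDF at each cumulative frequency;
    # the bracket [-20, 20] assumes the mixture mass lies well inside it
    return np.array([brentq(lambda t, p=p: mix_cdf(t) - p, -20, 20)
                     for p in cum_freq])
```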

2.1 Classification, Model Selection, and Identifiability

The classification is obtained by assigning the observations to the component with the maximum scaled composite fit, i.e., the CMAP criterion [23, 24]. As regards model selection, the best model is chosen by minimizing the composite version of penalized likelihood selection criteria such as BIC or CLC (see [22] and references therein).

Finally, as regards identifiability, under a composite likelihood approach the sufficient condition should be reformulated in terms of the Godambe information matrix, the analogue of the Fisher information matrix; as far as we know, however, such a modification has not been formally investigated yet. About the necessary condition, we note that the number of essential parameters in the block of ordinal variables equals the number of parameters of a log-linear model with only two-factor interaction terms. This means that we can estimate fewer parameters than with a full maximum likelihood approach. Furthermore, under the URV approach, the means and the variances of the latent variables are set to 0 and 1, respectively, because they are not identified. This identification constraint identifies the mixture components uniquely (ignoring the label switching problem), as well described in [19]; it is sufficient to estimate both thresholds and component parameters provided that all the observed variables have at least three categories and the groups are known. Given the particular structure of the mean vectors and covariance matrices, it is preferable to adopt an alternative, but equivalent, parametrization, analogous to the one used by [12]: the first two thresholds are set to 0 and 1, respectively, without constraining means and variances, so that there is a one-to-one correspondence between the two sets of parameters. If a variable is binary, the variance of the corresponding latent variable is set equal to 1 (while its mean is still kept free).
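To make this correspondence explicit, consider a single latent variable under the standardized URV convention, \(y_{i}\sim N(0,1)\), with free thresholds. The monotone affine map

$$\begin{aligned} z_{i}=\frac{y_{i}-\gamma _{1}^{(i)}}{\gamma _{2}^{(i)}-\gamma _{1}^{(i)}} \sim N\left( \frac{-\gamma _{1}^{(i)}}{\gamma _{2}^{(i)}-\gamma _{1}^{(i)}},\ \frac{1}{\left( \gamma _{2}^{(i)}-\gamma _{1}^{(i)}\right) ^{2}}\right) , \qquad \tilde{\gamma }_{c}^{(i)}=\frac{\gamma _{c}^{(i)}-\gamma _{1}^{(i)}}{\gamma _{2}^{(i)}-\gamma _{1}^{(i)}}, \end{aligned}$$

preserves the events \(\lbrace \gamma _{c-1}^{(i)}\le y_{i}<\gamma _{c}^{(i)}\rbrace =\lbrace \tilde{\gamma }_{c-1}^{(i)}\le z_{i}<\tilde{\gamma }_{c}^{(i)}\rbrace \) and yields \(\tilde{\gamma }_{1}^{(i)}=0\) and \(\tilde{\gamma }_{2}^{(i)}=1\), with the mean and variance of \(z_{i}\) now free: this is the one-to-one mapping between the two parametrizations.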

3 The Gower Distance Method

Gower distance is computed as the average of partial dissimilarities across observations (subjects or objects), where the computation of the partial dissimilarities depends on the type of the variable. For the continuous variables, a range-normalized Manhattan distance is used; the ordinal variables are first ranked, and then the Manhattan distance is used with a special adjustment for ties. A weighted sum of the partial dissimilarities is then calculated to create the final distance matrix. It is important to note, however, that as the sample size increases, the storage of this \(N \times N\) matrix may become infeasible.
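A minimal, unweighted sketch of this computation is given below; it omits the tie adjustment and the variable weights of the full definition. In practice one would rely on an established implementation such as the daisy function in R's cluster package.

```python
import numpy as np
from scipy.stats import rankdata

def gower_matrix(X_cont, X_ord):
    """Unweighted Gower sketch for mixed data: X_cont (N x p1) continuous,
    X_ord (N x p2) ordinal codes. Continuous columns use range-normalized
    Manhattan distance; ordinal columns are ranked first (ties get average
    ranks) and then treated the same way."""
    parts = []
    for col in X_cont.T:
        parts.append(np.abs(col[:, None] - col[None, :]) / (col.max() - col.min()))
    for col in X_ord.T:
        r = rankdata(col)                      # ranks of the ordered categories
        parts.append(np.abs(r[:, None] - r[None, :]) / (r.max() - r.min()))
    return np.mean(parts, axis=0)              # average of partial dissimilarities
```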

One of the popular partitioning algorithms for mixed-type data is k-medoids (the PAM algorithm [13, 25]), here applied to Gower's distance. The k-means and PAM algorithms are briefly described in Sects. 3.1 and 3.2. Both suffer from reaching local optima; indeed, different initializations can lead to different partitions. Finally, the choice of the number of clusters can be made according to different criteria; the most common one is to choose the number of clusters corresponding to an elbow in the scree plot of the within-cluster deviance versus the number of clusters.

3.1 k-means

By letting \(\mathbf{X} =\lbrace \mathbf{x} _n: n=1,\ldots ,N \rbrace \) be the sample of P-dimensional observations, k-means is based on the minimization of the loss function

$$\begin{aligned} \ell _{km}\left( \mathbf {\psi },\mathbf{Z} ; \mathbf{X} \right) =\sum _{n=1}^{N}\sum _{g=1}^{G}z_{ng}d^2(\mathbf{x} _n,\boldsymbol{\upmu }_g), \end{aligned}$$
(2)

where \(d^2(\mathbf{x} _n,\boldsymbol{\upmu }_g)\) is a squared distance, usually the classical unweighted squared Euclidean distance between \(\mathbf{x} _n\) and \(\boldsymbol{\upmu }_g\); \(\mathbf{Z} =[z_{ng}]\) is a binary membership matrix, with rows summing to 1, such that \(z_{ng}=1\) if observation n belongs to cluster g and 0 otherwise; and \(\mathbf {\psi }=\lbrace \boldsymbol{\upmu }_1,\ldots ,\boldsymbol{\upmu }_G\rbrace \) is the set of cluster centroids.
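The loss (2) is typically minimized by alternating nearest-centroid assignment and centroid updates; a bare-bones sketch of one such (Lloyd) iteration follows, with empty clusters simply left untouched. In practice, an established implementation such as sklearn.cluster.KMeans would be used.

```python
import numpy as np

def lloyd_step(X, centroids):
    """One k-means iteration: build Z by assigning each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster; the
    loss (2) never increases under these two updates."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # N x G
    z = d2.argmin(axis=1)                      # assignment step
    new_centroids = centroids.copy()
    for g in range(centroids.shape[0]):        # update step
        members = X[z == g]
        if members.size:
            new_centroids[g] = members.mean(axis=0)
    return z, new_centroids
```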

3.2 k-medoids

The PAM algorithm is an iterative algorithm composed of the following steps (a minimal code sketch follows the list):

  1. choose k random entities to become the medoids;

  2. assign every entity to its closest medoid using the precomputed distance matrix;

  3. for each cluster, re-assign as medoid the observation with the lowest average distance to the other members of the cluster;

  4. if at least one medoid has changed, return to step 2; otherwise the algorithm has converged.
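A bare-bones sketch of these steps on a precomputed dissimilarity matrix D could look as follows; note that this is the simple iterative variant described above, not the original BUILD/SWAP formulation of PAM, for which one would use, e.g., the pam function in R's cluster package.

```python
import numpy as np

def pam(D, k, seed=None):
    """Bare-bones k-medoids on an N x N dissimilarity matrix D, following
    steps 1-4 above; returns the cluster assignment and the medoid indices."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)   # step 1
    while True:
        z = D[:, medoids].argmin(axis=1)                      # step 2
        new_medoids = medoids.copy()
        for g in range(k):                                    # step 3
            members = np.flatnonzero(z == g)
            if members.size:
                within = D[np.ix_(members, members)].mean(axis=1)
                new_medoids[g] = members[within.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            return z, medoids                                 # step 4: converged
        medoids = new_medoids
```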

Both k-means and k-medoids are partitioning algorithms, and both attempt to minimize the distance between the points assigned to a cluster and the point designated as the center of that cluster. However, k-means cluster centers are defined by Euclidean geometry (i.e., centroids), while cluster centers for PAM are restricted to be observations themselves (i.e., medoids). Furthermore, k-medoids can be based on an arbitrary dissimilarity matrix. As a consequence, k-medoids is more robust, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.

Table 1 True values of the observed/latent three-component mixture model and thresholds under different scenarios
Table 2 Simulation results: ARI values for the different clustering methods across the eight scenarios, with \(N=100, 500\), groups with a high (H) or low (L) level of separation, and 3 or 5 ordinal variables, with \(G=3\). The Gower distance method (Gower + PAM, G-PAM) and k-means were initialized using 10 random starting points

4 Simulation Study

To evaluate empirically the performance of the different clustering methods, a simulation study has been conducted. We compare: a mixture of Gaussians treating all variables as continuous (Naive), a mixture model for mixed-type data (Mixed), the PAM algorithm based on the Gower distance, and k-means treating all variables as continuous. The performance has been evaluated in terms of recovery of the true cluster structure, using the Adjusted Rand Index (ARI) [9] between the true hard partition matrix and the estimated one. The ARI counts the pairs of entities that are assigned to the same or different clusters under both partition matrices; it has expected value zero for independent clusterings and maximum value 1 for identical clusterings.
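The ARI can be computed, for instance, with scikit-learn's adjusted_rand_score; the toy check below confirms that only the induced partition matters, not the labels themselves.

```python
from sklearn.metrics import adjusted_rand_score

true_part = [0, 0, 1, 1, 2, 2]
est_part = [1, 1, 0, 0, 2, 2]                 # same grouping, permuted labels
print(adjusted_rand_score(true_part, est_part))             # 1.0: identical
print(adjusted_rand_score(true_part, [0, 1, 2, 0, 1, 2]))   # -0.25: below chance
```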

We simulated 250 samples from a latent mixture of Gaussians with three components. We considered 8 scenarios defined by three experimental factors: the sample size (\(N=100,500\)), the separation between clusters (well separated or not), and the number of ordinal variables (3 ordinal and 5 continuous variables, or the other way around).

In order to have approximately the same computational time for each method, the model-based approaches (Naive and Mixed) were initialized using only the single rational starting point described in Sect. 2, while 10 different random starting points were used for the remaining methods.

Data were generated from a three-component mixture model, partially observed, with 3 or 5 ordinal variables (5 categories each) and 5 or 3 continuous variables. In Table 1, we report the true values used to generate the data. The overlap between groups is measured by the Bhattacharyya distance [3, 4]: when the groups are well separated, it equals 19.00 for \(g=1, 2\), 26.27 for \(g=1, 3\), and 34.27 for \(g=2, 3\); when the groups are not well separated, it equals 5.96 for \(g=1, 2\), 12.98 for \(g=1, 3\), and 11.24 for \(g=2, 3\). In the simulation study, the number of groups is kept fixed, since the purpose of the study is to assess the ability of the algorithms to capture the cluster structure. In Table 2, we report the simulation results.
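For reference, the Bhattacharyya distance between two Gaussian components \(\phi (\cdot ; \boldsymbol{\mu }_1, \boldsymbol{\varSigma }_1)\) and \(\phi (\cdot ; \boldsymbol{\mu }_2, \boldsymbol{\varSigma }_2)\) has a standard closed form, sketched below.

```python
import numpy as np

def bhattacharyya_gauss(mu1, Sigma1, mu2, Sigma2):
    """D_B = (1/8) (mu1-mu2)' S^{-1} (mu1-mu2)
           + (1/2) ln( det S / sqrt(det Sigma1 * det Sigma2) ),
    with S = (Sigma1 + Sigma2) / 2."""
    S = (Sigma1 + Sigma2) / 2
    diff = mu1 - mu2
    maha = diff @ np.linalg.solve(S, diff) / 8
    logdet = 0.5 * np.log(np.linalg.det(S)
                          / np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2)))
    return maha + logdet
```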

Analyzing the results in Table 2, we note that, as expected, all clustering methods improve their performance as N increases and as the level of separation between groups becomes higher. In almost all scenarios, the mixture model for mixed-type data behaves better than the others: in terms of mean or median ARI it is the best, followed by k-means and by PAM based on the Gower distance matrix, while the poorest performance is shown by the naive approach. However, the mixture model for mixed-type data is not always the best compared to the non-model-based approaches: when \(N=100\) and the groups are not well separated, it seems to be more affected by the issue of local maxima. Furthermore, we note that ARI values decrease when there are more ordinal than continuous variables, although the worsening is not significant when N increases. This is expected, since the more ordinal variables we have, the more information about the underlying cluster structure is lost. Finally, although it is still common to treat ordinal data as metric, we have shown that this can lead to wrong results, especially when the groups are not well separated.

5 Concluding Remarks

In this paper, we compared the model-based approach and the Gower distance methods for clustering mixed-type data. From the simulation study, it is possible to conclude that when the groups are less separated, the clustering performance of the Gower distance methods seems to be more affected by the choice of the random starting points. As N increases, the model-based approach for mixed-type data becomes the best one, both in terms of mean and median ARI. However, it is important to note that larger sample sizes may cause computational problems. On the one hand, for larger N it is still possible to compute the Gower matrix, but its storage may become infeasible. On the other hand, a larger N increases the number of bivariate integrals involved in the composite likelihood; this increase is linear, however, and thus still feasible.