1 Introduction

Clustering of data containing only nominal variables, also referred to as nominal data, has become an important topic in recent years. With the increasing number of machine learning tasks such as text mining, gene expression analysis, or questionnaire survey data processing, the demand for nominal data clustering grows. Unfortunately, nominal data clustering has not been examined as thoroughly as the clustering of quantitative data, where one can choose from many well-researched clustering approaches.

Currently, there are two major approaches to nominal data clustering, namely model-based clustering and distance-based clustering. Model-based clustering assumes that the clusters are defined by parametric distributions and that the whole dataset is a mixture of such distributions (Anderlucci and Hennig 2014). An object is assigned to the cluster with the highest conditional probability. The most commonly used representative is latent class analysis (LCA) (Hagenaars and McCutcheon 2002), which usually provides good-quality clusters. LCA is a complex method intended not only for clustering but also for other statistical tasks, such as causal analysis (Hagenaars and McCutcheon 2002). In R (R Core Team 2019), LCA can be run using the function poLCA() in the poLCA package (Linzer and Lewis 2011).

The traditional distance-based approach considers clusters as compact groups of objects with small within-group distances and a large separation from other groups (Everitt et al. 2009). It can be further divided into partitioning and hierarchical methods. The partitioning methods, also known as flat clustering, assign objects to a certain number of clusters. Their general principle can be demonstrated on the commonly used \(k\)-means algorithm (MacQueen 1967). The first step is to set the initial cluster centers randomly. In the following iterations, objects are moved between clusters to minimize the total within-cluster variability. For nominal data, this approach is represented by the \(k\)-modes method (Chaturvedi et al. 2001), where the cluster centers are defined as the vectors of modes of the variable values belonging to the clusters. An advantage of the method is its computation speed, enabling a researcher to cluster objects in large datasets. However, the method has some drawbacks that need to be considered. First, since the initial clusters are chosen randomly, the iterative process tends to find a locally optimal solution that can differ from the global one. Second, the method requires setting the number of clusters in advance. Fortunately, many indices can help a researcher determine the optimal number of clusters based on the clustered data, such as the silhouette index or the Baker-Hubert Gamma (Halkidi et al. 2015). However, since these indices require calculating a dissimilarity matrix, the method's advantage of higher computing speed is lost. Third, the simple matching approach (Sokal and Michener 1958), where the similarity of two categories is evaluated either by one (match) or by zero (mismatch), is used to calculate the distance of an object from its cluster center. The use of this approach often leads to less compact clusters in terms of their variability expressed, e.g., by the entropy (Ng et al. 2007). In R, \(k\)-modes clustering is provided by the function kmodes() in the klaR package (Weihs et al. 2005).

Hierarchical cluster analysis (HCA) creates a hierarchy of clusters. Many hierarchical clustering methods have been proposed, e.g., TwoStep cluster analysis (Bacher et al. 2004) or divisive clustering (Everitt et al. 2009). However, agglomerative clustering is used in most cases (Everitt et al. 2009). At the beginning of the algorithm, all objects are considered separate clusters. Then, in each step, they are successively merged until only one cluster remains at the end of the algorithm. The clusters are merged based on the dissimilarity (proximity) matrix, which contains dissimilarities for all pairs of objects. The way of joining clusters depends on a linkage method, which defines the distance between two clusters based on the dissimilarity matrix. Agglomerative clustering is easy to implement. It also allows a researcher to choose the number of clusters ex-post based on a dendrogram, a typical output of hierarchical clustering. However, the algorithm has two main drawbacks. First, if two clusters are merged in one step, they cannot be split in any of the following steps, even if the merge later proves to be a poor one. Second, the method requires a dissimilarity matrix, which is computationally demanding to obtain. Thus, agglomerative clustering is suitable for smaller datasets. Throughout the rest of this paper, the term hierarchical clustering will refer solely to agglomerative hierarchical clustering.

In the hierarchical clustering of nominal data, the selection of (dis)similarity measures and linkage methods is limited compared with the approaches designed for quantitative data. One can either use the simple matching approach applied to the nominal variables or use similarity measures for binary data, e.g., (Todeschini et al. 2012), on binary-coded variables. Both approaches have their limitations. The simple matching approach ignores important characteristics of the nominal variables that could be used for a better similarity definition, such as their frequency distribution or the number of categories (Boriah et al. 2008). On the other hand, commonly used similarity measures for binary-coded data often provide the same cluster partitions as the simple matching approach. Todeschini et al. (2012) discovered that many of these measures are identical or strongly linearly dependent. Cibulková et al. (2020) divided 66 similarity measures for binary-coded data into just four distinct groups, one of them providing the same cluster partitions as the simple matching approach.

The selection of linkage methods in nominal data clustering is limited to distance-based linkages, such as the complete or average linkage methods. Some linkages, e.g., Ward's or the centroid method, work with concepts that are not applicable to nominal data even though they can technically be used. In R, hierarchical clustering with nominal variables can be performed using the daisy() function in the cluster package (Maechler et al. 2019). Its output can be used in the hclust() function or the agnes() function from the cluster package. The daisy() function is designed for the clustering of mixed data using the Gower measure (Gower 1971). It can also be used for a purely categorical dataset, where it applies the standard simple matching approach.

The next option is to use the nomclust() function in the nomclust package (Šulc and Řezanková 2015), which was developed by the authors of this paper. The function enables a researcher to choose from 11 similarity measures for nominal data, which were summarized by Boriah et al. (2008) and Morlini and Zani (2012), and from six evaluation criteria for nominal data introduced by Řezanková et al. (2011).

In this paper, we present the second generation of the nomclust R package (Šulc et al. 2020) that we developed for the hierarchical clustering of nominal data. This version covers the complete hierarchical clustering process, from the dissimilarity matrix calculation, through the selection of a clustering method, to the evaluation of the final clusters. It is available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=nomclust.

Compared to the first generation of the package, the second generation represents a considerable step forward. The most noteworthy changes include the completely redesigned evaluation criteria based on data variability, on the likelihood concept (adjusted for categorical data), and on distance, whose ability to determine the optimal number of clusters was investigated by Šulc et al. (2018). In addition, two new similarity measures for nominal data, introduced and evaluated by Šulc and Řezanková (2019), were added. Next, we addressed the low computational speed of hierarchical clustering by rewriting the critical parts of the code in C++, which substantially reduces the clustering time. Finally, support for S3 generic functions and the ability to plot dendrograms and the values of the evaluation criteria were added to the package.

The rest of the paper is organized as follows. Section 2 describes the theoretical background of the package, including the used similarity measures, linkage methods, evaluation criteria, and optimization. Section 3 presents demonstrations of the main package functionalities. Finally, Sect. 4 concludes the paper.

2 Methods used in nomclust 2.0

The process of hierarchical clustering consists of three main steps: calculating a dissimilarity matrix using a selected (dis)similarity measure, creating a hierarchy of clusters with a chosen linkage method, and, optionally, evaluating the created clusters using one or more evaluation criteria. The subsections of Sect. 2 describe the theoretical background of all these steps in the nomclust package in detail.

2.1 Dissimilarity matrix

A dissimilarity matrix is a square symmetric matrix with zeros on the diagonal that contains dissimilarities for all pairs of objects in a dataset. With nominal data, one can only recognize whether two categories match or mismatch, which is the principle of the simple matching approach. However, similarity measures have been introduced that determine the dissimilarity between objects based on specific characteristics of a dataset. For instance, they use the numbers of categories or the absolute or relative frequencies of categories to improve the dissimilarity definition. The nomclust package contains 12 such measures, which originate from six publications. The measures Goodall 1 (G1), Goodall 2 (G2), Goodall 3 (G3), Goodall 4 (G4), which are based on the original Goodall measure (Goodall 1966), and Lin 1 (LIN1) were introduced by Boriah et al. (2008). The Eskin (ES) measure was proposed by Eskin et al. (2002), the Lin (LIN) measure by Lin (1998), the simple matching coefficient (SM) by Sokal and Michener (1958), the measures occurrence frequency (OF) and inverse occurrence frequency (IOF) by Spärck Jones (1972), and the measures variable entropy (VE) and variable mutability (VM) by Šulc and Řezanková (2019).

When calculating a (dis)similarity measure, all the calculations are applied directly to the data matrix \({\varvec{\mathrm {X}}} = [x_{ic}]\), where \(i=1,2, \dots , n\) (\(n\) is the total number of objects) and \(c = 1, 2, \dots , m\) (\(m\) is the total number of variables). The number of categories of the \(c\)th variable is denoted as \(K_c\), the absolute frequency of the category equal to the value \(x_{ic}\) as \(f(x_{ic})\), and its relative frequency as \(p(x_{ic})\). Some (dis)similarity measures require the relative frequency of the \(u\)th category (\(u = 1, 2, \dots , K_c\)) in the \(c\)th variable. Such a relative frequency is denoted as \(p_{cu}\).

Similarity measures in the nomclust package are calculated in two steps. In the first one, similarities between values of the \(c\)th variable for the \(i\)th and \(j\)th objects \({ S }_{ c }\left( { x }_{ ic },{ x }_{ jc } \right) \) are computed separately. The \(S_c\) computation differs based on the match \(x_{ic} = x_{jc}\) or mismatch \(x_{ic} \ne x_{jc}\) of categories. In the second step, the similarity \(S\left( {\varvec{\mathrm { x }}}_{ i },{\varvec{\mathrm { x }}}_{ j } \right) \) between the objects \({\varvec{\mathrm { x }}}_{ i }\) and \({\varvec{\mathrm { x }}}_{ j }\) is determined either according to the formula

$$\begin{aligned} S\left( {\varvec{\mathrm { x }}}_{ i },{\varvec{\mathrm { x }}}_{ j } \right) =\frac{ \sum _{ c=1 }^{ m }{ { S }_{ c }\left( { x }_{ ic },{ x }_{ jc } \right) } }{ m } \end{aligned}$$
(1)

or as

$$\begin{aligned} S\left( {\varvec{\mathrm { x }}}_i, {\varvec{\mathrm { x }}}_j \right) = \frac{{ \sum _{ c=1 }^{ m }{ { S }_{ c }\left( { x }_{ ic },{ x }_{ jc } \right) } }}{\sum _{ c=1 }^{ m }{ \left( \mathrm {ln} \;p\left( { x }_{ ic } \right) +\mathrm {ln} \;p\left( { x }_{ jc } \right) \right) } } \end{aligned}$$
(2)

for certain similarity measures. Table 1 summarizes the formulas of the 12 similarity measures in the nomclust package. Their detailed properties have been described by Boriah et al. (2008) and Šulc and Řezanková (2019).

In order to calculate a dissimilarity matrix, it is necessary to compute dissimilarities \(D\left( {\varvec{\mathrm { x }}}_i, {\varvec{\mathrm { x }}}_j \right) \) between all pairs of objects. All the dissimilarities in the nomclust package are obtained from similarities according to the formula

$$\begin{aligned} D\left( {\varvec{\mathrm { x }}}_i, {\varvec{\mathrm { x }}}_j \right) = 1 - S\left( {\varvec{\mathrm { x }}}_{ i },{\varvec{\mathrm { x }}}_{ j } \right) \end{aligned}$$
(3)

for the similarity measures that take on values from zero to one, or as

$$\begin{aligned} D\left( {\varvec{\mathrm { x }}}_i, {\varvec{\mathrm { x }}}_j \right) = \frac{1}{S\left( {\varvec{\mathrm { x }}}_{ i },{\varvec{\mathrm { x }}}_{ j } \right) } - 1 \end{aligned}$$
(4)

for the similarity measures whose possible maximum is lower than one.

Dissimilarities derived from specific similarity measures, namely G1, G2, G3, G4, LIN1, VE, and VM, do not reach zero dissimilarity for two identical objects, as might be expected. The reason is that these similarity measures use weights in the case of a match of categories, see Table 1. Thus, a pair of identical objects whose matches occur in frequent categories has a different similarity than another pair of identical objects whose matches occur in rare categories. After the similarities are transformed into dissimilarities using Eq. (3) or Eq. (4), this relationship must be preserved, which requires non-zero dissimilarities. Dissimilarities different from zero are not an obstacle to performing HCA since the dissimilarity matrix requirements are met, i.e., \(D\left( {\varvec{\mathrm { x }}}_i, {\varvec{\mathrm { x }}}_i \right) = 0\), \(D\left( {\varvec{\mathrm { x }}}_i, {\varvec{\mathrm { x }}}_j \right) \ge 0\) and \(D\left( {\varvec{\mathrm { x }}}_i, {\varvec{\mathrm { x }}}_j \right) = D\left( {\varvec{\mathrm { x }}}_j, {\varvec{\mathrm { x }}}_i \right) \) (Everitt et al. 2009). However, if one wants to guarantee zero dissimilarities between identical objects, the remaining similarity measures in the nomclust package can be used for this purpose.
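To make the two-step computation concrete, the following R sketch (ours, not the package's implementation) applies Eqs. (1) and (3) to the simple matching coefficient, for which \(S_c\) equals one in the case of a match and zero in the case of a mismatch.

# Illustrative sketch of Eqs. (1) and (3) for the SM measure; not the nomclust code.
sm_diss <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x); m <- ncol(x)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      s_c <- as.numeric(x[i, ] == x[j, ])  # per-variable similarities S_c
      d[i, j] <- 1 - sum(s_c) / m          # Eq. (1) followed by Eq. (3)
    }
  }
  d
}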

Table 1 Formulas of the similarity measures in the nomclust package

In the nomclust package, dissimilarity matrices for all the available similarity measures can be calculated separately from other analysis steps, e.g., as an input for other R packages. Table 2 presents the function calls for these measures and their most important properties.

Table 2 Function calls and properties of the similarity measures

2.2 Linkage methods

After a dissimilarity matrix is calculated, a hierarchy of clusters needs to be created. The nomclust package uses the agglomerative clustering from the cluster package. This type of clustering considers each object a separate cluster at the start of the clustering process. Then, the two most similar clusters are joined into a new one, and the dissimilarity matrix is recalculated. This step is repeated until only one cluster is left. The created hierarchy of clusters can then be cut at any point to get the desired number of clusters. The linkage method plays an important role in the creation of the cluster hierarchy because it defines how the dissimilarity between clusters is recalculated after two clusters are merged.

Compared to quantitative data, where various linkage methods are available, the number of linkage methods for nominal data is limited. The reason is that some linkage methods for quantitative data work with concepts that do not apply to categorical data, such as cluster centroids in the centroid method or the within-cluster variance in Ward's method. Therefore, only linkage methods using between-cluster distances based on dissimilarities are available in the nomclust package, namely the single, complete, and average linkage methods.

The single linkage method takes the distance between two clusters as the dissimilarity between the two closest objects of the merged clusters. This linkage can cause the chaining phenomenon, in which clusters are merged based on their closest elements even though some of the other cluster elements are very distant. Such groups of objects are challenging to interpret. The complete linkage method takes the distance between two clusters as the dissimilarity between the two most distant objects of these clusters. This linkage is sensitive to outlying objects in clusters, which may prevent two clusters from merging. However, it usually provides compact groups with approximately equal diameters. The average linkage method takes the average pairwise dissimilarity between the objects of the two merged clusters. Thus, the distance between two clusters should be represented better than by the previous two linkage methods, which both depend on a single dissimilarity value.

2.3 Evaluation criteria

After a hierarchy is created, the next step of the analysis can be to choose the number of clusters. One can use evaluation criteria that exploit internal information about the dataset to suggest the optimal number of clusters. Based on the results presented by Šulc et al. (2018), who compared selected evaluation criteria, eight internal evaluation criteria were incorporated into the nomclust package. The criteria within-cluster entropy (WCE), within-cluster mutability (WCM), pseudo F index based on the entropy (PSFE), and pseudo F index based on the mutability (PSFM) were originally introduced by Řezanková et al. (2011). The BK index (BK), intended to determine the best number of clusters \(k\), was proposed by Chen and Liu (2009), the modified Akaike and Bayesian information criteria (AIC, BIC) were presented in SPSS (2001), and the silhouette index (SI) was proposed by Rousseeuw (1987).

The used criteria can be divided into three groups: the variability-based ones (WCE, WCM, PSFE, PSFM, BK), which express the cluster quality by a low within-cluster variability; the likelihood-based ones (BIC, AIC), which maximize a categorical approximation of the likelihood function; and the distance-based one (SI), which assesses the cluster quality using the within- and between-cluster distances. According to the results presented by Šulc et al. (2018), the BK and BIC criteria have the best ability to identify the optimal number of clusters.

Table 3 provides an overview of the used evaluation criteria with their most important properties. The column Optimum indicates whether the maximal or the minimal value of a criterion marks the optimal number of clusters or whether one should rely on the elbow of the curve of the criterion values. The elbow method (Thorndike 1953) allows a researcher to subjectively choose the optimal number of clusters. It requires a researcher to find a point, representing a certain number of clusters, where the curve of criterion values visibly bends from a high slope to a low slope. At this point, increasing the number of clusters by one no longer brings a sufficient decrease in the total within-cluster variability. However, in some situations, no clear elbow is visible; then, it is recommended to use a different evaluation criterion.

The recommended number of clusters may differ for different evaluation criteria in the package. There are no strict guidelines on how to proceed in such a case. One can use the number of clusters that is recommended by the majority of the evaluation criteria. It is also good to inspect one lower and one higher number of clusters than the recommended one. A researcher should pay attention to the situation when almost every criterion suggests a different number of clusters. It may indicate that the clusters in data are badly separated or that there are no clusters at all.

Table 3 Overview of the evaluation criteria

The majority of the evaluation criteria in the package use either the entropy or the mutability (the Gini coefficient). These two variability measures represent alternative approaches to measuring the amount of distinction in categorical variables. Whereas the mutability is a simpler measure that can be interpreted as the probability of distinguishing two randomly chosen objects, the entropy is a more complex measure that can be interpreted as the average amount of information needed to make all the distinctions in the data (Ellerman 2013). However, both variability measures provide comparable results, so it is up to a researcher's preference which of the two approaches is selected.

The entropy of the \(c\)th variable in the \(g\)th cluster, where \(g = 1, 2, \dots , k\) (\(k\) is the total number of clusters), is calculated as

$$\begin{aligned} H_{gc} = -\sum _{u = 1}^{K_c} \left( \frac{n_{gcu}}{n_g}\ln \frac{n_{gcu}}{n_g}\right) , \end{aligned}$$
(5)

where \(n_g\) is the number of objects in the \(g\)th cluster and \(n_{gcu}\) is the number of objects in the \(g\)th cluster with the \(u\)th category of the \(c\)th variable (\(u = 1, 2, \dots , K_c\)). Equation (5) is defined only for relative frequencies different from zero; if \(\frac{n_{gcu}}{n_g} = 0\), the corresponding addend is set to zero.

The mutability of the \(c\)th variable in the \(g\)th cluster is calculated as

$$\begin{aligned} G_{gc} = 1- \sum _{u=1}^{K_c}\left( \frac{n_{gcu}}{n_g}\right) ^2. \end{aligned}$$
(6)

The within-cluster entropy (WCE) and within-cluster mutability (WCM) coefficients, which were introduced by Řezanková et al. (2011), express the within-cluster variability by the standardized entropy and mutability according to the formulas

$$\begin{aligned} WCE\left( k \right) =\sum _{ g=1 }^{ k }{ \frac{ n_{ g } }{ n\cdot m } \sum _{ c=1 }^{ m }{ \frac{ 1 }{ \ln { K }_{ c } } } H_{gc} } \end{aligned}$$
(7)

and

$$\begin{aligned} WCM\left( k \right) =\sum _{ g=1 }^{ k }{ \frac{ n_{ g } }{ n\cdot m } \sum _{ c=1 }^{ m }{ \frac{ { K }_{ c } }{ { K }_{ c }-1 } G_{gc} } }. \end{aligned}$$
(8)

Both coefficients range from zero to one, and their values always decrease with an increasing number of clusters. Values close to one indicate a high within-cluster variability and, thus, poorly separated clusters. The optimal number of clusters can only be assessed using the elbow method, which chooses a cluster solution in which a significant drop in the within-cluster variability occurs. However, this method may provide ambiguous results; therefore, these evaluation criteria are better suited for the cluster quality evaluation.
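As an illustration of Eqs. (5)–(8), a possible R implementation is sketched below; the function name wc_variability and its arguments are ours, and clusters is assumed to be a vector of cluster memberships.

# Illustrative sketch of Eqs. (5)-(8); not the nomclust code.
wc_variability <- function(x, clusters) {
  x <- as.data.frame(x)
  n <- nrow(x); m <- ncol(x)
  K <- sapply(x, function(v) length(unique(v)))  # numbers of categories K_c
  wce <- 0; wcm <- 0
  for (g in unique(clusters)) {
    xg <- x[clusters == g, , drop = FALSE]
    ng <- nrow(xg)
    for (c in seq_len(m)) {
      p <- as.numeric(table(xg[[c]])) / ng       # relative frequencies n_gcu / n_g
      p <- p[p > 0]                              # zero addends are omitted, see Eq. (5)
      H <- -sum(p * log(p))                      # entropy, Eq. (5)
      G <- 1 - sum(p^2)                          # mutability, Eq. (6)
      wce <- wce + ng / (n * m) * H / log(K[c])          # Eq. (7)
      wcm <- wcm + ng / (n * m) * K[c] / (K[c] - 1) * G  # Eq. (8)
    }
  }
  c(WCE = wce, WCM = wcm)
}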

The pseudo F coefficient based on the entropy (PSFE) and the pseudo F coefficient based on the mutability (PSFM) are the ratios of the weighted between-cluster variability and the weighted within-cluster variability for the \(k\)-cluster solution. The between-cluster variability is calculated as the difference between the total variability of the whole original dataset and the within-cluster variability. The total variability of the whole original dataset (one-cluster solution) is denoted as \(WCEN(1)\) or \(WCMN(1)\). The pseudo F coefficients are calculated according to the formulas

$$\begin{aligned} { PSFE\left( k \right) }\, =\, \frac{ \left( n-k \right) \left[ WCEN\left( 1 \right) -WCEN\left( k \right) \right] }{ \left( k-1 \right) WCEN\left( k \right) }, \end{aligned}$$
(9)

where \(WCEN(k)\) represents the non-normalized within-cluster variability of the \(k\)-cluster solution based on the entropy, calculated as

$$\begin{aligned} WCEN\left( k \right) =\sum _{ g=1 }^{ k }{ \frac{ n_{ g } }{ n\cdot m } \sum _{ c=1 }^{ m }{ H_{gc} }} \end{aligned}$$

and

$$\begin{aligned} { PSFM\left( k \right) }\, =\, \frac{ \left( n-k \right) \left[ WCMN\left( 1 \right) -WCMN\left( k \right) \right] }{ \left( k-1 \right) WCMN\left( k \right) }, \end{aligned}$$
(10)

where \(WCMN(k)\) represents the non-normalized within-cluster variability of the \(k\)-cluster solution based on the mutability, calculated as

$$\begin{aligned} WCMN\left( k \right) =\sum _{ g=1 }^{ k }{ \frac{ n_{ g } }{ n\cdot m } \sum _{ c=1 }^{ m }{ G_{gc} }}. \end{aligned}$$

For both coefficients, the maximal value across all the examined cluster solutions indicates the optimal one.

The BK index (Chen and Liu 2009) is defined as the second-order difference of the incremental entropy of the dataset with \(k\) clusters

$$\begin{aligned} BK(k)=\Delta ^2 I(k) = (I(k-1) - I(k)) - (I(k) - I(k+1)), \end{aligned}$$
(11)

where \(I(k)\) is the incremental expected entropy in the \(k\)-cluster solution with the formula

$$\begin{aligned} I(k) = H_E(k) - H_E(k+1), \end{aligned}$$

where \(H_E\) is the expected entropy in a dataset expressed as

$$\begin{aligned} H_E(k) = \sum _{g=1}^k\frac{n_g}{n}\sum _{c=1}^m\frac{H_{gc}}{\ln K_c}. \end{aligned}$$

The maximal value of the index across all the examined cluster solutions indicates the optimal number of clusters.

The likelihood-based evaluation criteria maximize the likelihood of the data while penalizing complex models (Biem 2003). The nomclust package uses the categorical variants of the Bayesian (BIC) and Akaike (AIC) information criteria, which were presented in SPSS (2001) and further described by Bacher et al. (2004).

The modified BIC criterion for categorical data is computed according to the formula

$$\begin{aligned} BIC(k) = 2\sum _{g=1}^k n_g \sum _{c=1}^m H_{gc} + k \sum _{c=1}^m (K_c - 1) \ln n \end{aligned}$$
(12)

and the modified AIC criterion as

$$\begin{aligned} AIC(k) = 2\sum _{g=1}^k n_g \sum _{c=1}^m H_{gc} + 2k \sum _{c=1}^m (K_c - 1). \end{aligned}$$
(13)

Similarly to the information criteria for quantitative data, the minimal value of these criteria across the examined cluster solutions indicates the optimal number of clusters.

The silhouette index (SI) (Rousseeuw 1987), also known as the average silhouette width, can be written as

$$\begin{aligned} SI(k) = \frac{1}{n}\sum _{i=1}^n\frac{b(i)-a(i)}{\max {\{a(i),b(i)\}}} \end{aligned}$$
(14)

where \(a(i)\) is the average dissimilarity of the \(i\)th object to the other objects in the same cluster, and \(b(i)\) is the minimum average dissimilarity of the \(i\)th object to other objects in any cluster not containing the \(i\)th object.

The silhouette index takes on values from \(-1\) to \(1\). The values close to one indicate well-separated clusters with low within-cluster distances and high between-cluster distances. The values close to zero or the negative ones suggest badly separated clusters. The maximal value of the index across all the examined cluster solutions indicates the optimal number of clusters.

2.4 Optimization

The dissimilarity matrix calculation is the most time-demanding part of the hierarchical clustering process because the number of values in this matrix that need to be calculated increases with the square of the number of objects. Even for relatively small datasets (several thousand objects), the calculation can take minutes. Therefore, hierarchical cluster analysis is usually recommended for datasets with up to 10,000 rows. Apart from the number of objects, the calculation time of the dissimilarity matrix depends on the similarity measure used, the dataset properties (the number of variables or categories), and the computer speed.

The high computational demands of the dissimilarity matrix calculation are caused by the many loops in its code, which are not processed efficiently in the R language. Therefore, the critical parts of the code of all the used similarity measures were rewritten in C++, which handles loops much more effectively. The C++ code was integrated using the Rcpp package (Eddelbuettel and Francois 2011).
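The following sketch only illustrates the general idea with the Rcpp package; the function smDissCpp and its body are ours and merely mimic the approach, they are not the package's actual C++ routines.

library(Rcpp)
# A hypothetical C++ routine for an SM-based dissimilarity matrix, compiled from R.
cppFunction('
NumericMatrix smDissCpp(IntegerMatrix x) {
  // x: data matrix with categories coded as integers
  int n = x.nrow(), m = x.ncol();
  NumericMatrix d(n, n);
  for (int i = 0; i < n; ++i) {
    for (int j = i + 1; j < n; ++j) {
      double mismatches = 0;
      for (int c = 0; c < m; ++c) {
        if (x(i, c) != x(j, c)) mismatches += 1;  // simple matching
      }
      d(i, j) = mismatches / m;
      d(j, i) = d(i, j);
    }
  }
  return d;
}')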

To assess the effect of the C++ implementation, an experiment on 60 generated datasets was conducted. The same datasets were used as in Šulc and Řezanková (2019). The average calculation times of clustering with a certain similarity measure for the old and new versions of the package are presented in the columns nomclust 1.0 and nomclust 2.0 in Table 4. Considering the much lower computation times of the new version, it is clear that the C++ implementation was successful. Whereas the average calculation time was 58.41 s with the first release, it was just 0.38 s with the second version of the package. The Speed-up column shows how many times the second release is faster than the previous one when using a certain similarity measure. On average, the new version of the package performed 154 times faster in the experiment.

Table 4 Performance comparison of the first and second versions of the package on the available similarity measures

3 Illustrations of use

The nomclust package can be used either to perform the complete clustering process, from the dissimilarity matrix calculation to the evaluation of the obtained clusters, using the nomclust() function, or to perform only a part of the clustering process using one of the subsidiary functions. Typical use will be demonstrated on the CA.methods dataset, which is included in the package.

3.1 The whole clustering process

The most important function in the nomclust package is the nomclust() function, which covers the complete hierarchical clustering of nominal data. Its function call is presented below.

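A sketch of the call is given below; the default values of clu.high and eval are our assumptions, whereas the remaining defaults follow the description in the next paragraph.

nomclust(data, measure = "lin", method = "average", clu.high = 6, eval = TRUE, prox = 100)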

The only mandatory argument is data, which represents the categorical data entering the cluster analysis. The measure argument stands for the similarity measure used for the dissimilarity matrix calculation; any of the measures presented in Table 2 can be used here. The method argument enables a researcher to choose one of the linkage methods presented in Sect. 2.2. By default, the average linkage with the lin similarity measure is used because this combination usually provides the most coherent clusters (Šulc and Řezanková 2019). The clu.high argument defines the upper limit for the number of cluster solutions provided in the final output. The logical argument eval indicates whether the evaluation criteria presented in Sect. 2.3 will be calculated in the nomclust object. The prox argument indicates whether the dissimilarity matrix will be saved in the nomclust object. It can be set either as a logical value or as an integer specifying the maximal number of objects in a dataset for which the dissimilarity matrix is kept in the output. Since a dissimilarity matrix is needed for the dendrogram construction, 100 was chosen as the default value of this argument. Thus, one can display dendrograms in small datasets, where they are most helpful, whereas a large dissimilarity matrix based on a sizable dataset will not be saved in the nomclust object.

The use of the nomclust() function is demonstrated on the CA.methods dataset, which contains five different characteristics of 24 clustering algorithms; six of them are displayed below.

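The first six observations can be displayed, for instance, as follows (output omitted here).

library(nomclust)
# CA.methods is shipped with the package; head() shows its first six rows.
head(CA.methods)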

Hierarchical clustering with the G1 similarity measure and the average linkage method is then performed using the following syntax.

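A call along these lines performs the clustering; the lowercase string "g1" is assumed to be the code of the G1 measure from Table 2.

hca.G1 <- nomclust(CA.methods, measure = "g1", method = "average")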

The resulting output consists of six components in the form of a list. The mem component contains cluster membership partitions for two to six clusters. For instance, the four-cluster solution can be obtained using the syntax below.

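Assuming the partitions in mem are named clu_2 to clu_6, the four-cluster solution can be extracted as follows.

hca.G1$mem$clu_4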

The eval component contains the eight evaluation criteria as vectors in a list. To see them all at once, the form of a data.frame is more appropriate.

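One possible way to obtain such a view (output omitted here):

as.data.frame(hca.G1$eval)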

The opt component is present in the output together with the eval component. It displays the optimal numbers of clusters according to the evaluation criteria from the eval component, except for WCM and WCE, for which the optimal number of clusters can be determined only by the elbow method.

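The component can simply be printed (output omitted here).

hca.G1$opt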

Since the number of objects in the dataset is lower than 100, the prox component containing the dissimilarity matrix is present in the output. Below, the first six rows and columns of its matrix form are displayed.

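For instance (as.matrix() is applied in case the component is stored as a dist object; output omitted here):

as.matrix(hca.G1$prox)[1:6, 1:6]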

The dend component contains all the necessary information for dendrogram creation, and the call component the function call.

3.2 Subsidiary functions

If one needs to perform only a part of the clustering process, e.g., to calculate a dissimilarity matrix for a different clustering algorithm or to evaluate a cluster solution that was not produced by the nomclust package, one of the available subsidiary functions can be used.

Dissimilarity matrices based on the similarity measures available in the package can be calculated using the function calls from Table 2. The resulting matrices are objects of the dist class. They can also be used as an input for hierarchical clustering functions in other R packages, e.g., the agnes() function in the cluster package or the hclust() function in the stats package.

A dissimilarity matrix based on the simple matching coefficient, which can serve as a reference measure because it is well-known and commonly used, is obtained using the sm() function.

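For example (the object name sm.diss is ours):

sm.diss <- sm(CA.methods)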

The SM measure, which does not use any additional information about the categorical variables, provides a dissimilarity matrix with many identical values. This makes the objects challenging to cluster unambiguously with different clustering algorithms, e.g., hclust() vs. agnes().

Sometimes, it is necessary to use a dissimilarity matrix calculated outside the package. In such a situation, the nomprox() function can be used. It enables a user to run hierarchical clustering based on a provided dissimilarity matrix and, if the original dataset is available, to calculate the set of evaluation criteria available in the nomclust package. It has the following syntax.

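A sketch of the call; the default values shown are our assumptions based on the nomclust() defaults.

nomprox(diss, data = NULL, method = "average", clu.high = 6, eval = TRUE, prox = 100)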

The diss argument stands for a dissimilarity matrix of class matrix or dist. The data argument represents the dataset from which the dissimilarity matrix was calculated. It is not mandatory for the calculation of the cluster partitions, but it is necessary for the calculation of the evaluation criteria. The method, clu.high, eval, and prox arguments work in the same way as in the nomclust() function.

Hierarchical clustering based on the already calculated dissimilarity matrix of the SM measure and the original dataset can be performed using the following syntax.

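For instance, reusing the sm.diss object created above:

hca.SM <- nomprox(diss = sm.diss, data = CA.methods, method = "average")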

The resulting object contains the mem, eval, opt, dend, prox, and call components. For instance, the cluster membership variables in the form of a data.frame can be obtained using the following syntax.

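For example (output omitted here):

as.data.frame(hca.SM$mem)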

In certain situations, the need may arise to apply a set of evaluation criteria from the nomclust package to cluster membership partitions obtained by different clustering algorithms, e.g., LCA or \(k\)-modes. For such cases, the evalclust() function can be used. It has the syntax expressed below.

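A sketch of the call:

evalclust(data, clusters, diss = NULL)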

The data argument represents the dataset used for clustering, the clusters argument stands for a data.frame or a list with cluster membership partitions, and the optional diss argument denotes the dissimilarity matrix for the objects in the dataset.

The output of the evalclust() function consists of the eval component with a set of evaluation criteria, the opt component with the optimal numbers of clusters based on these criteria, and the call component with the function call. If the dissimilarity matrix is not provided in the diss argument, the function does not provide values for the silhouette index, as demonstrated in the following example.

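One possible call, reusing the partitions stored in hca.SM; because diss is not supplied, no silhouette index values are returned (output omitted here).

evalclust(CA.methods, clusters = hca.SM$mem)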

When comparing the outputs for the G1 and SM similarity measures on this particular dataset, the within-cluster variability expressed by the WCM and WCE criteria decreases faster for G1, which suggests that the clusters based on the G1 measure are more homogeneous. On the other hand, the evaluation criteria for determining the optimal number of clusters seem to be more consistent for the SM measure, where they prefer the four-cluster solution.

3.3 Graphical functions

Graphical outputs can help a researcher choose the optimal number of clusters or evaluate the quality (and interpretability) of the created clusters. The nomclust package offers two graphical functions, one for evaluation criteria visualization and the second one for dendrogram creation.

To visualize the evaluation criteria from the eval component, the eval.plot() function with the following syntax can be used.

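A sketch of the call; the default values are omitted here, as they are described only verbally below.

eval.plot(x, criteria, style, opt.col, main, ...)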

The x argument represents an output of the nomclust(), nomprox(), or evalclust() functions containing the eval and opt components. The criteria argument specifies the evaluation criteria to be visualized; one particular criterion, a vector of criteria, or all the available criteria can be selected. The style argument defines the graphical style of the produced plots. There are two predefined styles in the nomclust package, namely greys and dark, but a custom color scheme can be set by a user as a vector of four colors. The opt.col argument specifies the color used to mark the optimal number of clusters, and the main argument determines the title of a chart. The symbol ... indicates that certain graphical arguments from the generic plot() function can be used.

The eval.plot() function can be used to obtain graphical representations of the AIC evaluation criterion for the hca.G1 and hca.SM objects using the syntax below.

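For instance (assuming the criterion is selected by its abbreviation):

eval.plot(hca.G1, criteria = "AIC")
eval.plot(hca.SM, criteria = "AIC")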
Fig. 1 The optimal number of clusters for HCA with the G1 and SM measures based on AIC

Figure 1 shows that the AIC reaches its lowest value at four clusters for both similarity measures, so the solution with four clusters should be preferred. One can also examine the cluster solutions with slightly higher values of the evaluation criterion, e.g., the three- or five-cluster solutions. Because a graphical representation helps a researcher notice small differences between the values, it makes the determination of the optimal number of clusters easier.

A dendrogram visualizes a hierarchy of clusters, and it can help a researcher decide on the number of clusters, especially for smaller datasets. To produce a dendrogram in the nomclust package, the dend.plot() function is used.

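A sketch of the call; the argument order and default values are not documented here and are our assumptions.

dend.plot(x, clusters, style, colorful, clu.col, ac, main)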

The x argument represents an output of the nomclust() or nomprox() functions containing the dend component. The clusters argument determines the number of clusters that will be marked in the dendrogram. It can be set either as a number or as the name of an evaluation criterion if the eval and opt components are present in the output. The style and main arguments work in the same way as in the eval.plot() function. The logical argument colorful specifies whether the output will be in color or in black and white. The optional argument clu.col allows a researcher to apply user-defined colors to distinguish the clusters in the dendrogram. The ac argument indicates whether the value of the agglomerative coefficient from the cluster package is displayed below the dendrogram.

The dend.plot() function is demonstrated on a comparison of the four-cluster solutions provided by the G1 and SM measures.

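For instance:

dend.plot(hca.G1, clusters = 4)
dend.plot(hca.SM, clusters = 4)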
Fig. 2 Dendrograms for the G1 and SM measures with four marked clusters

Figure 2 shows that clustering with the G1 similarity measure provides more suitable clusters than the SM measure in the CA.methods dataset. The clusters obtained by G1 seem to be much more meaningful. The first cluster contains the hierarchical clustering methods, the second one comprises the majority of the partitioning methods, the third one contains the clustering algorithms primarily intended for mixed-type data, and the fourth cluster consists of the grid- and density-based clustering methods. In contrast, most of the clusters provided by the SM measure do not have a logical structure, despite the higher value of the agglomerative coefficient.

Considering Fig. 2, it is apparent that different similarity measures may lead to substantially different results. Therefore, it is always good to examine more combinations of similarity measures and linkage methods. A common practice is trying to find well-separated clusters that are meaningful to a researcher. In a small dataset, such as the demonstrated one, a dendrogram is sufficient. In a large dataset, the created clusters can be characterized by a series of contingency tables containing the variable categories broken down by the cluster membership variable.

3.4 Generic functions

The second generation of the nomclust package added support for common generic functions, such as summary() or print(). They can be applied to the outputs of the nomclust(), nomprox(), or evalclust() functions, which are of the class nomclust. The extent of the generic function outcomes depends on the number of components in the output. Thus, the most complex outcomes are provided by the nomclust() function, whereas the evalclust() function usually provides the most limited ones.

The summary() function can be used if one wants quick information about the clustering results. The outcome contains frequency distribution tables for the created numbers of clusters, so a researcher can see whether the cluster sizes are balanced or whether one-object clusters occur. The function also provides the optimal numbers of clusters according to all the calculated evaluation criteria and the value of the agglomerative coefficient. The outcome of the summary() function is shown in the following example.

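For example (output omitted here):

summary(hca.G1)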

The print() function offers all the necessary information for cluster quality evaluation, namely values of the calculated evaluation criteria, the corresponding optimal numbers of clusters, and the agglomerative coefficient.

Sometimes, a researcher may need to apply additional functionalities to the obtained clustering outcomes that require an object of the hclust class as an input. Therefore, the class of the clustering object can easily be changed by the as.hclust() function. Moreover, we introduce the as.agnes() function, which transforms a nomclust object into an object of the classes agnes and twins, which is occasionally also required. The introduced function is not generic, but it works as expected. The use of both functions is shown below.

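For example (the target object names are ours):

hca.G1.hclust <- as.hclust(hca.G1)
hca.G1.agnes  <- as.agnes(hca.G1)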

The plot() function applied to the output of the nomclust() or nomprox() functions creates a basic dendrogram that is suitable for a quick look at the hierarchy of clusters. Compared to the more complex dend.plot() function, it only allows a researcher to change the title of the chart.

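For instance, using the objects created above:

plot(hca.G1)
plot(hca.G1.hclust)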

Figure 3 depicts the outcomes of the plot() function applied to the objects of the nomclust and hclust classes.

Fig. 3 Outputs of the plot() function applied to the nomclust and hclust objects

4 Conclusion

In this paper, the second major version of the nomclust package for the hierarchical clustering of categorical (nominal) data was introduced. This completely reworked release focuses primarily on the calculation time optimization and on adding support for graphical outcomes and generic functions. Moreover, there are many smaller changes that improve the overall functionality and intuitiveness of the package. However, this comes at the cost of compatibility with the first release, which could not be fully maintained. Thus, some code written for the first version of the package will no longer work with the second version and will need to be updated.

The variety of available similarity measures and evaluation criteria makes the package unique. The second version contains two more similarity measures and four additional evaluation criteria compared to the first release. Thus, a researcher can try out even more similarity measures and linkage methods to find the preferred combination. This approach usually represents the best practice in clustering since there is no ideal clustering procedure for all the datasets.

The package also enables a researcher to work with other R packages. For example, it can be easily used with an already calculated dissimilarity matrix or cluster membership partitions from a different package. Moreover, the output of the package’s main functions can be effortlessly transformed to the commonly used hclust object, which can be used as an input for the follow-up methods.

In the future releases of the package, we plan to focus on implementing the object and variable weighting schemes. This functionality could further improve the clustering performance of similarity measures in the package.