1 Introduction

In the information society, unlabeled categorical data are increasingly common. Many fields, such as social media, bioinformatics, news reporting and web search engines, generate large numbers of unlabeled categorical data sets [7, 11, 18, 26, 30, 32]. Labeling data by experts incurs a large overhead, and labels cannot always be obtained, so extracting knowledge and patterns from unlabeled categorical data is an important problem [28, 35]. Clustering is an unsupervised learning method that can mine knowledge and patterns from unlabeled data, and it has been applied to text sentiment analysis, bioinformatics and recommender systems [12, 17, 27]. Because categorical data have no geometric structure, partitioning such data is challenging.

The k-modes algorithm is an important clustering algorithm for categorical data; it uses the matching distance to represent the dissimilarity of two samples and partitions the data into clusters according to this distance [5]. Since k-modes was proposed, researchers have developed a series of algorithms that can mainly be divided into three categories: uncertainty-based clustering algorithms, tree clustering algorithms and subspace clustering algorithms.

Michael proposes a new dissimilarity measure for the k-modes algorithm [24]. Considering that the original matching method may cause the similarity of intra-cluster samples to be too low, the measure uses the equivalence class of the current attribute value to determine the dissimilarity: for an attribute, if the attribute values of two samples are not equal, the dissimilarity on that attribute is 1; otherwise the dissimilarity is computed from the equivalence class on the attribute. The new k-modes algorithm thus converts the dissimilarity from a 0-1 match into a real value, which preserves more dissimilarity information. Chen et al. propose an entropy-based method to determine the best k for clustering categorical data [6]. They define an incremental entropy based on the merging of different clusters and derive the relation between k and the incremental entropy; the k value at which the function value drops sharply is taken as the final result. Andritsos et al. propose a tree clustering algorithm based on entropy called LIMBO [2]. The algorithm defines a new distance and builds a DCF tree; LIMBO scans the DCF tree from top to bottom and assigns data points to leaf nodes, and the data points in the same leaf node are regarded as a cluster. If the information loss of merging two similar clusters is less than a threshold, the two clusters are merged, and the merging step is repeated until the number of clusters is k. Guha et al. propose a robust hierarchical clustering algorithm for categorical data called ROCK [15]. ROCK introduces neighbourhoods to measure the similarity of two clusters, defines a link function based on the cardinality of the intersection of two clusters, and uses it in the evaluation criterion function; similar clusters are merged until there are k clusters. Because a heap is used in the clustering process, ROCK is fast. Gao et al. propose a rough ensemble subspace-based clustering algorithm for categorical data [13]. The algorithm employs a discernibility matrix to remove redundant attributes and form a series of subspaces; it then ranks the subspaces by a metric and keeps those with large metric values. The clustering algorithm is executed in the retained subspaces, and the final clustering result is the fusion of the multiple clustering results. Parmar et al. propose a clustering algorithm based on Min-Min-Roughness called MMR [25]. MMR is a tree clustering algorithm that uses the modes of most samples as the cluster modes. For each attribute, it computes the minimum roughness of each attribute value and then obtains the Min-Min-Roughness over all attributes; the attribute corresponding to the Min-Min-Roughness is selected as the splitting attribute of the tree, and the cluster in the leaf node is split into two clusters. These steps are repeated until the clustering algorithm terminates. Li et al. propose an incremental entropy-based clustering algorithm for categorical data streams with concept drift [21]. Unlike traditional clustering algorithms, it can deal with massive data streams and has a self-adaptation mechanism for concept drift. The algorithm defines an entropy distance between a sample and a cluster; if this distance is greater than a threshold, the sample is regarded as an outlier and a new cluster is generated. To detect concept drift, a cluster vector is defined: at a given time, the cluster vector consists of the numbers of samples in the clusters, and if the similarity of the cluster vectors at adjacent time moments is less than a threshold, concept drift is said to have happened.

The above works have promoted the development of clustering for categorical data; however, for most algorithms, how to measure dissimilarity in a categorical data set and generate compact clusters is still a problem. In this paper, a new fuzzy rough clustering algorithm (FRC) for categorical data is proposed. FRC introduces rough set theory to compute the information granularity of each attribute, and the weights of the attributes are determined by the information granularity. In addition, we define the significance of an attribute for a sample, by which a categorical data set can be transformed into a numeric data set. Different from existing distances for categorical data, the new distance is a weighted distance in which an attribute with good discriminative ability is given a large weight. Therefore FRC can obtain a clustering result with minimum intra-cluster distance and maximum inter-cluster distance. The contributions of the paper are as follows:

  • We propose a fuzzy rough clustering algorithm for categorical data sets; FRC can obtain a clustering result with minimum intra-cluster distance and maximum inter-cluster distance.

  • We define a significance measure of an attribute for a sample; a categorical data set can be transformed into a numeric data set by this measure, and we introduce a nonlinear dimension reduction algorithm to decrease the dimensions of the data set.

  • In the FRC algorithm, we employ a weighted distance to calculate the dissimilarity. For the categorical data set, the weight of each attribute is determined by its information granularity. After the categorical data set is transformed into a numeric data set, the new weight of each attribute is determined by its standard deviation.

The rest of the paper is structured as follows: Sect. 2 reviews the concepts of rough set theory and the basic principles of the k-modes algorithm; the detailed theory and steps of the FRC algorithm are explained in Sect. 3; FRC and the comparison algorithms are executed on real data sets and the experimental results are discussed in Sect. 4; finally, Sect. 5 concludes the paper and gives some future research directions.

2 Preliminaries

In this section, we give brief introductions to rough set theory and the k-modes algorithm.

2.1 Rough set

Rough set theory is an effective method for data analysis. It can deal with categorical data without any prior knowledge. Since rough set theory was proposed, it has been applied to association rule extraction, attribute reduction, data classification and clustering analysis [8,9,10, 14, 19, 23, 29, 31, 33]. Rough set theory assumes that objects with the same attribute values should be divided into the same class. The following background knowledge can be found in the references [1, 20, 36, 37].

An information system can be described as a quadruple \(<U,A,f,V>\) where \(U =\left\{ x_{1}, x_{2},x_{3},\ldots ,x_{n}\right\}\) is a universe, \(A =\left\{ a_{1}, a_{2},a_{3},\ldots ,a_{m}\right\}\) is an attribute set, V is a set of attribute values and f is an information function which constructs the mapping between attributes and attribute values, \(f:U\times A\rightarrow V\), which means \(f(x,a)\in V\) for \(\forall x\in U\) and \(\forall a\in A\). Let \(B\subseteq A\) and \(B\ne \varnothing\); the indiscernibility relation of the objects on B is defined as

$$\begin{aligned} IND(B)=\left\{ \left( x,y \right) \in U\times U|f(x,a)=f(y,a),\forall a\in B \right\} . \end{aligned}$$
(1)

If \((x,y)\in IND(B)\), it means the attribute values of x and y are the same on the attribute set B. For \(\forall x\in U\), the equivalence class of x on B is defined as

$$\begin{aligned} \left[ x \right] _{B} =\left\{ y\in U|\left( x,y \right) \in IND(B) \right\} . \end{aligned}$$
(2)

A partition of the universe can be obtained according to IND(B), which is denoted as U/B. If \(U/B=\left\{ X_{1},X_{2},\ldots ,X_{l} \right\}\), then \(\bigcup _{i=1}^{l}X_{i}=U\) and \(X_{i}\cap X_{j}=\varnothing\) for \(\forall X_{i},X_{j}\subseteq U\) and \(i\ne j\).
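As an illustration only (not code from the paper), the partition U/B can be computed by grouping the objects that share the same value tuple on B; the toy universe and attribute values below are hypothetical.

from collections import defaultdict

def partition(U, B):
    """Group object indices by their value tuple on the attribute subset B."""
    blocks = defaultdict(list)
    for i, x in enumerate(U):
        key = tuple(x[a] for a in B)   # f(x, a) for every a in B
        blocks[key].append(i)
    return list(blocks.values())       # U/B as a list of equivalence classes

# toy universe with three attributes indexed 0..2
U = [("a", "y", "p"), ("a", "y", "q"), ("b", "n", "p"), ("a", "y", "p")]
print(partition(U, B=[0, 1]))          # [[0, 1, 3], [2]]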

For an information system \(<U,A,V,f>\), \(B\subseteq A\), \(B\ne \varnothing\) and \(R_{B}\) is the equivalence relation. For \(\forall X\subseteq U\), the lower approximation set \(\underline{R_{B}}X\) and the upper approximation set \(\overline{R_{B}}X\) are defined as

$$\begin{aligned} \underline{R_{B}}X= & {} \left\{ x\in U|\left[ x \right] _{B}\subseteq X \right\} \ \ \text {and} \ \ \nonumber \\ \overline{R_{B}}X= & {} \left\{ x\in U|\left[ x \right] _{B}\cap X\ne \varnothing \right\} . \end{aligned}$$
(3)

The universe is divided into three parts by rough set theory. \(POS_{R}(X)=\underline{R_{B}}X\) is called the positive region of X on B, \(NEG_{R}(X)=U-\overline{R_{B}}X\) is the negative region of X on B and \(BN_{R}(X)=\overline{R_{B}}X-\underline{R_{B}}X\) is the boundary region of X on B. The boundary region represents the uncertainty of the set: a large boundary region means a large uncertainty. If \(BN_{R}(X)=\varnothing\), X is said to be crisp; otherwise, X is rough. For a partition \(U/B=\left\{ X_{1},X_{2},\ldots ,X_{l} \right\}\), in order to measure the uncertainty, the information granularity of the knowledge is defined as

$$\begin{aligned} GK_{B}(U)=\frac{1}{\left| U \right| }\sum _{i=1}^{l}\frac{\left| X_{i} \right| ^{2}}{\left| U \right| ^{2}}. \end{aligned}$$
(4)

It is obvious that \(\frac{1}{\left| U \right| ^{2}}\le GK_{B}(U)\le \frac{1}{\left| U \right| }\). When \(X_{i}=\left\{ x_{i} \right\} (i=1,2,\ldots ,\left| U \right| )\), the information granularity is minimum and the uncertainty is also minimum; when \(X_{i}=U\) and \(X_{j}=\varnothing \ (j=1,2,\ldots ,l,j\ne i)\), the information granularity is maximum and the uncertainty is also maximum.

2.2 k-modes algorithm

The k-modes algorithm [5] is a simple and practical clustering algorithm for categorical data. It uses the difference degree to replace the Euclidean distance. For \(x_{i}=\left[ x_{i1},x_{i2},\ldots ,x_{im} \right]\) and \(x_{j}=\left[ x_{j1},x_{j2},\ldots ,x_{jm} \right]\), the difference degree of two samples on each attribute is defined as

$$\begin{aligned} \varphi \left( x_{ip},x_{jp} \right) =\left\{ \begin{matrix} 1 \ \ \text {if} \ x_{ip}\ne x_{jp}\\ 0\ \ \text {otherwise} \end{matrix}\right. (p=1,2,\ldots ,m). \end{aligned}$$
(5)

Therefore the difference degree of two samples is

$$\begin{aligned} \varphi \left( x_{i},x_{j} \right) =\sum _{p=1}^{m}\varphi \left( x_{ip},x_{jp} \right) . \end{aligned}$$
(6)

A smaller difference degree means that the two samples are more similar. The k-modes algorithm is a greedy algorithm: it repeatedly adjusts the clustering result until the result converges. The k-modes algorithm is summarized as Algorithm 1.

[Algorithm 1: the k-modes clustering algorithm]

From Algorithm 1, it is known that the k-modes algorithm can deal with categorical data, but the clustering result is a local optimum even though the algorithm iteratively adjusts the result. There are still some problems with the k-modes algorithm. For many unlabeled data sets, determining the number of clusters is not easy; in addition, the weight of each attribute is equal in Eq. (6), and this equal weighting cannot make the difference degree represent the similarity well because it does not reflect the significance of each attribute. Therefore it is necessary to improve the k-modes algorithm.
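For concreteness, the following is a minimal sketch of the k-modes procedure of Algorithm 1 using the difference degree of Eqs. (5)-(6); it is illustrative code, not the authors' implementation, and it assumes the categorical values are encoded as non-negative integers.

import numpy as np

def k_modes(X, k, iters=100, seed=0):
    """X: (n, m) array of categorical codes; returns cluster labels and modes."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(n, k, replace=False)].copy()       # random initial modes
    labels = np.zeros(n, dtype=int)
    for it in range(iters):
        # Eq. (6): difference degree = number of mismatching attributes
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                                            # assignments unchanged: converged
        labels = new_labels
        for t in range(k):                                   # recompute cluster modes
            members = X[labels == t]
            if len(members):
                modes[t] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes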

3 Fuzzy rough clustering algorithm

In this section, we introduce the detailed principles of FRC. First, we explain how to determine the weights of attributes by rough set theory; then we describe the dimension reduction method and the steps of FRC.

3.1 The method for determining the weights of attributes

The attribute weighting mechanism is widely used in clustering tasks. For many tasks, if there are many redundant or irrelevant attributes, the distance or difference degree cannot measure the dissimilarity well. The attribute weighting mechanism gives a weight to each attribute; a larger weight means that the attribute has a greater influence on the distance or difference degree. Therefore the attribute weighting mechanism can reduce or eliminate the adverse effects of redundant attributes, achieving the effect of dimension reduction without information loss. Rough set theory is an effective tool for analyzing categorical data sets, and it can obtain knowledge and patterns without any prior knowledge. The performance of k-modes can be improved by rough set theory.

Let \(U=\left\{ x_{1},x_{2},\ldots ,x_{n} \right\}\) and \(A=\left\{ a_{1},a_{2},\ldots ,a_{m} \right\}\). For \(\forall a\in A\), the partition of the universe on a is \(U/a=\left\{ X_{1},X_{2},\ldots ,X_{l_{a}} \right\}\), where \(l_{a}\) is the number of equivalence classes on a. Therefore the information granularity of the partition on a is

$$\begin{aligned} GK_{a}(U)=\frac{1}{\left| U \right| }\sum _{i=1}^{l_{a}}\frac{\left| X_{i} \right| ^{2}}{\left| U \right| ^{2}}. \end{aligned}$$
(7)

\(GK_{a}(U)\) is the uncertainty of the knowledge on a; a small information granularity means the uncertainty of the partition on a is small; in other words, the objects of the universe can be discriminated well on the attribute a. Therefore the attribute a should be given a large weight when computing the distance or difference degree. The weight of the attribute a, denoted as \(\omega _{a}\), is defined as

$$\begin{aligned} Weight_{a}=\frac{\sum \limits _{\forall b\in A}GK_{b}(U)}{GK_{a}(U)}\ \ \text {and}\ \ \omega _{a}=\frac{Weight_{a}}{\sum \limits _{\forall b\in A}Weight_{b}}. \end{aligned}$$
(8)
Table 1 The decision table of an information system

In order to explain the principle of the attribute weights, an example is introduced. Table 1 is the decision table of an information system with \(U=\left\{ x_{1},x_{2},x_{3},x_{4},x_{5},x_{6} \right\}\) and \(A=\left\{ a_{1},a_{2},a_{3},a_{4}\right\}\). From the decision table, \(U/a_{1}=\left\{ \left\{ x_{1}\right\} ,\left\{ x_{2} \right\} ,\left\{ x_{3},x_{4} \right\} ,\left\{ x_{5}\right\} ,\left\{ x_{6}\right\} \right\}\), \(U/a_{2}=\left\{ \left\{ x_{1},x_{3},x_{4},x_{5},x_{6}\right\} ,\left\{ x_{2} \right\} \right\}\), \(U/a_{3}=\left\{ \left\{ x_{1},x_{2}\right\} ,\left\{ x_{3},x_{4} \right\} ,\left\{ x_{5},x_{6} \right\} \right\}\) and \(U/a_{4}=\left\{ \left\{ x_{1},x_{2},x_{5},x_{6}\right\} ,\left\{ x_{3}, x_{4}\right\} \right\}\). From the partitions of the attributes, it is obvious that the objects of the universe can be discriminated better by \(a_{1}\) than by the other attributes. Therefore the attribute \(a_{1}\) has the largest weight, as the following computation shows.

$$\begin{aligned} \therefore GK_{a_{1}}(U)= & {} \frac{1}{6}\cdot \left( \frac{1}{36}+\frac{1}{36}+\frac{4}{36}+\frac{1}{36}+\frac{1}{36} \right) \approx 0.0370 \\ GK_{a_{2}}(U)= & {} \frac{1}{6}\cdot \left( \frac{1}{36}+\frac{25}{36} \right) \approx 0.1204\\ GK_{a_{3}}(U)= & {} \frac{1}{6}\cdot \left( \frac{4}{36}+\frac{4}{36}+\frac{4}{36} \right) \approx 0.0556 \\ \ GK_{a_{4}}(U)= & {} \frac{1}{6}\cdot \left( \frac{16}{36}+\frac{4}{36}\right) \approx 0.0926\\ \therefore \omega _{a_{1}}= & {} 0.42152 \ \ \ \ \omega _{a_{2}}=0.12954\\ \omega _{a_{3}}= & {} 0.28051 \ \ \ \ \omega _{a_{4}}=0.16843. \end{aligned}$$
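The worked example can be checked with a short script. The code below only illustrates Eqs. (7)-(8); it takes the equivalence-class sizes directly from the partitions listed above, and the resulting weights agree with the values above up to rounding.

def gk(class_sizes, n):
    """Eq. (7): (1/|U|) * sum(|X_i|^2 / |U|^2) = sum(|X_i|^2) / |U|^3."""
    return sum(s * s for s in class_sizes) / (n ** 3)

n = 6
sizes = {"a1": [1, 1, 2, 1, 1], "a2": [5, 1], "a3": [2, 2, 2], "a4": [4, 2]}
gks = {a: gk(s, n) for a, s in sizes.items()}             # ~0.0370, 0.1204, 0.0556, 0.0926
raw = {a: sum(gks.values()) / g for a, g in gks.items()}  # Weight_a in Eq. (8)
total = sum(raw.values())
weights = {a: w / total for a, w in raw.items()}          # omega_a: ~0.42, 0.13, 0.28, 0.17
print(gks, weights)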

3.2 Dimension reduction

For \(U/A=\left\{ X_{1},X_{2},\ldots ,X_{l} \right\}\), the partition can be seen as an initial clustering result. However, because of redundant and irrelevant attributes, the initial clustering result often does not accord with the real result. Attribute reduction is a common way to improve the quality of the partition for categorical data, but the complexity of rough set algorithms is high, and a clustering algorithm with rough set attribute reduction incurs a large time overhead. Manifold learning is an effective way to decrease the number of dimensions; it reconstructs data points in a low-dimensional embedding space according to their neighbors [22]. Therefore, in this paper, we transform categorical data into numerical data and utilize manifold learning to decrease the dimension of the data points.

For \(\forall a\in A\), let \(U/a=\left\{ X_{1},X_{2},\ldots ,X_{l_{a}} \right\}\); it is obvious that the difference between data points in the same equivalence class should be as small as possible and the difference between data points in different equivalence classes should be as large as possible. For \(\forall x_{i}\in U\), the dissimilarity between \(x_{i}\) and a data point in the same equivalence class is defined as follows

$$\begin{aligned} d_{s}\left( x_{i},y \right) =\sum _{j=1}^{m}\omega _{j}\cdot \varphi \left( x_{ij},y_{j} \right) \ \text {and} \ y\in \left[ x_{i} \right] _{a}. \end{aligned}$$
(9)

For \(\forall x_{i}\in U\), the dissimilarity between \(x_{i}\) and a data point in a different equivalence class is defined as follows

$$\begin{aligned} d_{f}\left( x_{i},y \right) =\sum _{j=1}^{m}\omega _{j}\cdot \varphi \left( x_{ij},y_{j} \right) \ \text {and} \ y\notin \left[ x_{i} \right] _{a}. \end{aligned}$$
(10)

Therefore the significance of the attribute a for \(x_{i}\) can be defined as follows

$$\begin{aligned} sig_{a}\left( x_{i} \right)= & {} \frac{1}{1+\sum _{y\in U} d_{s}\left( x_{i},y \right) /\left| \left[ x_{i} \right] _{a} \right| }\nonumber \\&+\frac{1}{1+\exp \left( -\sum _{y\in U} d_{f}\left( x_{i},y \right) /\left| U- \left[ x_{i} \right] _{a} \right| \right) }. \end{aligned}$$
(11)

In Eq. (11), \(sig_{a}\left( x_{i} \right)\) includes the discrimination information of the equivalence partition. A large value of \(sig_{a}\left( x_{i} \right)\) means that the attribute a is more important for \(x_{i}\). If the significance is calculated for every data point, a new significance matrix can be obtained. The significance matrix \({\varvec{S}}\in {\mathbb {R}}^{m\times n}\) is defined as

$$\begin{aligned} {\varvec{S}}=\begin{bmatrix} sig_{a_{1}}\left( x_{1} \right)&\cdots&sig_{a_{1}}\left( x_{n} \right) \\ \vdots&\ddots&\vdots \\ sig_{a_{m}}\left( x_{1} \right)&\cdots&sig_{a_{m}}\left( x_{n} \right) \end{bmatrix}_{m\times n}. \end{aligned}$$
(12)

In Eq. (12), \({\varvec{S}}\) is numerical, and the significance matrix can be seen as a new representation of \({\varvec{X}}\), which means that the categorical data set \({\varvec{X}}\) is transformed into a numerical data set. Therefore we can introduce manifold learning to decrease the dimensions of \({\varvec{S}}\).
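A possible way to build the significance matrix of Eq. (12) from Eqs. (9)-(11) is sketched below. This is illustrative code only; it assumes the categorical values are integer-encoded and that `weights` holds the \(\omega_a\) values of Eq. (8).

import numpy as np

def significance_matrix(X, weights):
    """X: (n, m) array of categorical codes; weights: (m,) attribute weights."""
    n, m = X.shape
    S = np.zeros((m, n))
    # weighted matching distance between every pair of samples, Eqs. (9)-(10)
    D = np.array([[np.sum(weights * (X[i] != X[j])) for j in range(n)]
                  for i in range(n)])
    for a in range(m):
        for i in range(n):
            same = X[:, a] == X[i, a]                 # [x_i]_a
            diff = ~same
            ds = D[i, same].sum() / same.sum()        # mean distance inside the class
            df = D[i, diff].sum() / max(diff.sum(), 1)  # guard for an all-equal attribute (assumption)
            S[a, i] = 1.0 / (1.0 + ds) + 1.0 / (1.0 + np.exp(-df))   # Eq. (11)
    return S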

Let \({\varvec{Y}}=\left[ {\varvec{Y}}_{1},{\varvec{Y}}_{2},\ldots ,{\varvec{Y}}_{n} \right] ={\varvec{P}}^{T}{\varvec{S}}\) with \({\varvec{P}}\in {\mathbb {R}}^{m\times d}\) \(\left( d< m \right)\); the data points can then be reconstructed by their neighbors. Therefore the problem can be expressed as follows

$$\begin{aligned} \begin{aligned}&\underset{{\varvec{P}}}{\min }\ \sum _{i=1}^{n}\sum _{j=1}^{n}W_{ij}\left\| {\varvec{Y}}_{i}-{\varvec{Y}}_{j} \right\| _{2}^{2}+\alpha \left\| {\varvec{P}}\right\| _{F}^{2}\\&s.t. \ {\varvec{P}}^{T}{\varvec{S}}{\varvec{D}}{\varvec{S}}^{T}{\varvec{P}}={\varvec{1}} \end{aligned}. \end{aligned}$$
(13)

where \(W_{ij}\) is the weight between \({\varvec{Y}}_{i}\) and \({\varvec{Y}}_{j}\) and \(\alpha\) is a parameter with \(\alpha >0\). The first term of Eq. (13) is the reconstruction error; the second term controls the generalization ability. Let \({\varvec{D}}_{ii}=\sum _{j=1}^{n}W_{ij}\) and \({\varvec{L}}={\varvec{D}}-{\varvec{W}}\); Eq. (13) can then also be expressed as follows

$$\begin{aligned} \begin{aligned}&\underset{{\varvec{P}}}{\min } \ {\varvec{tr}}\left( {\varvec{P}}^{T}{\varvec{S}}{\varvec{L}}{\varvec{S}}^{T}{\varvec{P}}\right) +\alpha \left\| {\varvec{P}}\right\| _{F}^{2}\\&s.t. \ {\varvec{P}}^{T}{\varvec{S}}{\varvec{D}}{\varvec{S}}^{T}{\varvec{P}}={\varvec{1}} \end{aligned}. \end{aligned}$$
(14)

From Eq. (14), \({\varvec{P}}\) can be solved by the following generalized eigenvalue problem:

$$\begin{aligned} {\varvec{S}}{\varvec{L}}{\varvec{S}}^{T}{\varvec{P}}+\alpha {\varvec{P}}=\lambda {\varvec{S}}{\varvec{D}}{\varvec{S}}^{T}{\varvec{P}}. \end{aligned}$$
(15)

Therefore, from Eq. (15), it follows that

$$\begin{aligned} \left( {\varvec{S}}{\varvec{L}}{\varvec{S}}^{T}+\alpha {\varvec{I}}\right) {\varvec{P}}=\lambda {\varvec{S}}{\varvec{D}}{\varvec{S}}^{T}{\varvec{P}}. \end{aligned}$$
(16)

\({\varvec{P}}\) is made up of the d eigenvectors corresponding to the d smallest eigenvalues. Once \({\varvec{P}}\) is solved, \({\varvec{Y}}={\varvec{P}}^{T}{\varvec{S}}\) is the result of dimension reduction.
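Assuming a precomputed symmetric neighbourhood-weight matrix W (for example a k-nearest-neighbour affinity, which the paper does not specify in detail), the projection P of Eq. (16) can be obtained with a dense generalized eigensolver. The sketch below adds a small ridge to keep \(S D S^{T}\) positive definite; this ridge is our assumption, not part of the original formulation.

import numpy as np
from scipy.linalg import eigh

def projection(S, W, d, alpha=5.0):
    """S: (m, n) significance matrix; W: (n, n) symmetric weight matrix (assumed given)."""
    Dg = np.diag(W.sum(axis=1))                      # degree matrix D
    L = Dg - W                                       # graph Laplacian L = D - W
    A = S @ L @ S.T + alpha * np.eye(S.shape[0])     # left-hand side of Eq. (16)
    B = S @ Dg @ S.T + 1e-8 * np.eye(S.shape[0])     # small ridge for numerical stability (assumption)
    vals, vecs = eigh(A, B)                          # generalized eigenpairs, ascending eigenvalues
    P = vecs[:, :d]                                  # d eigenvectors of the d smallest eigenvalues
    return P, P.T @ S                                # Y = P^T S is the reduced representation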

3.3 The clustering processes of FRC

After the data points are processed by the dimension reduction algorithm, a low-dimensional representation of the data points is obtained. The k-means clustering algorithm with a fuzzy partition can then be used to get the final clusters. Let k be the number of clusters and \({\varvec{Z}}=\left[ {\varvec{z}}_{1},{\varvec{z}}_{2},\ldots ,{\varvec{z}}_{k} \right] \in {\mathbb {R}}^{d\times k}\) be the cluster center points. Selecting an appropriate k value is an important problem. Rough set theory generates a partition of the universe for each attribute, so the partition results can be employed to decrease the range of the number of clusters. Let U be a categorical data set and A be an attribute set; for \(\forall a\in A\), \(U/a=\left\{ X_{1},X_{2},\ldots ,X_{l_{a}} \right\}\); the range of k should satisfy the following equation:

$$\begin{aligned} 2\le k \le \max \left\{ \left| U/a \right| \right\} \ \ \forall a\in A. \end{aligned}$$
(17)

For the new data points \({\varvec{Y}}\), let \(\delta _{i}\) be the standard deviation of \({\varvec{Y}}_{i}\) \(\left( i=1,2,\ldots ,d \right)\). For an unlabeled data set, \(\delta _{i}\) reflects the scatter of the data set on this attribute, and a large \(\delta _{i}\) means good discernibility. \(\delta _{i}\) is normalized as follows

$$\begin{aligned} \delta _{i}\leftarrow \delta _{i}/\sum _{j=1}^{d}\delta _{j}. \end{aligned}$$
(18)

It is obvious that \(\delta _{i}\) can be seen as the weight of the ith attribute. Therefore the objective optimization function of the clustering algorithm can be expressed as follows

$$\begin{aligned} \begin{aligned}&\underset{{\varvec{U}},{\varvec{Z}}}{\min }\ \frac{1}{n}\sum _{i=1}^{n}\sum _{j=1}^{d}\sum _{t=1}^{k}u_{it}\cdot \delta _{j}\cdot \left( x_{ij}-z_{tj} \right) ^{2}\\&\quad -\frac{1}{k}\sum _{j=1}^{d}\sum _{t=1}^{k}\sum _{l=t+1}^{k}\delta _{j}\cdot \left( z_{tj}-z_{lj} \right) ^{2}\ \\&s.t. \ 0\le u_{it}\le 1 \end{aligned}. \end{aligned}$$
(19)

Therefore the membership \(u_{it}\) is updated as

$$\begin{aligned} u_{it}=1/\left( 1+\sum _{j=1}^{d}\delta _{j}\cdot \left( x_{ij}-z_{tj} \right) ^{2} \right) \ \ i=1,2,\ldots ,n ; \ t=1,2,\ldots ,k. \end{aligned}$$
(20)

The center point \(z_{tj}\) is updated as

$$\begin{aligned} z_{tj}=\frac{\sum _{i=1}^{n}u_{it}x_{ij}}{\sum _{i=1}^{n}u_{it}}\ \ \ t=1,2,\ldots ,k; \ j=1,2,\ldots ,d. \end{aligned}$$
(21)

From the above descriptions, the detailed steps of the FRC algorithm are summarized as Algorithm 2.

[Algorithm 2: the FRC clustering algorithm]

From Algorithm 2, it is known that the range of the number of clusters is narrowed by the number of equivalence classes, which is based on rough set theory. In Eq. (11), the significance of an attribute for a sample is measured by the weight. In addition, the weighting mechanism of Eq. (19) makes the dissimilarity better represent the real dissimilarity. FRC also differs from the k-modes algorithm in that it transforms the categorical data set into a numeric data set; therefore many nonlinear dimension reduction algorithms can be used to decrease the number of dimensions, which is different from the attribute reduction algorithms based on rough set theory.

4 Experiments and results

In this section, we choose k-modes [5], DKmodes [3] and WKModes [4] as comparison algorithms; all algorithms are executed in MATLAB R2017a. The details of the data sets are shown in Table 2. In order to measure the experimental results of the algorithms, the following evaluation criteria are used in this paper [34].

  1. Jaccard coefficient:

    $$\begin{aligned} J=\frac{SS}{SS+SD+DS}. \end{aligned}$$
    (22)
  2. Fowlkes and Mallows index:

    $$\begin{aligned} FM=\sqrt{\frac{SS}{SS+SD}\cdot \frac{SS}{SS+DS}}. \end{aligned}$$
    (23)
  3. Czekanowski–Dice index:

    $$\begin{aligned} CD=\frac{2\cdot SS}{2\cdot SS+DS+SD}. \end{aligned}$$
    (24)
  4. Kulczynski index:

    $$\begin{aligned} K=\frac{1}{2}\cdot \left( \frac{SS}{SS+SD} +\frac{SS}{SS+DS}\right) . \end{aligned}$$
    (25)

where SS is the number of pairs of data points that are in the same cluster and also belong to the same class; SD is the number of pairs that are in the same cluster but belong to different classes; DS is the number of pairs that are in different clusters but belong to the same class; DD is the number of pairs that are in different clusters and also belong to different classes. For these evaluation criteria, \(0\le J, FM, CD, K\le 1\) and a larger value indicates a better performance.
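The pair-counting quantities and the four indices of Eqs. (22)-(25) can be computed as in the following sketch; the code is illustrative, assumes non-degenerate clusterings, and the label array names are our own.

from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Count SS, SD, DS, DD over all unordered pairs of data points."""
    SS = SD = DS = DD = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_cluster = labels_pred[i] == labels_pred[j]
        same_class = labels_true[i] == labels_true[j]
        if same_cluster and same_class: SS += 1
        elif same_cluster:              SD += 1
        elif same_class:                DS += 1
        else:                           DD += 1
    return SS, SD, DS, DD

def indices(labels_true, labels_pred):
    SS, SD, DS, _ = pair_counts(labels_true, labels_pred)
    J  = SS / (SS + SD + DS)                              # Eq. (22)
    FM = (SS / (SS + SD) * SS / (SS + DS)) ** 0.5         # Eq. (23)
    CD = 2 * SS / (2 * SS + DS + SD)                      # Eq. (24)
    K  = 0.5 * (SS / (SS + SD) + SS / (SS + DS))          # Eq. (25)
    return J, FM, CD, K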

Table 2 The details of the experimental data sets

In order to test the performance of FRC, we execute k-modes, DKmodes, WKModes and FRC on the experimental data sets. For FRC, \(\alpha\) is set to 5 and \(d=0.85\cdot col\), where col is the number of attributes. The test results are shown in Tables 3, 4, 5, 6, 7 and 8.

Table 3 The test results of the algorithms on Germany data set
Table 4 The test results of the algorithms on student data set
Table 5 The test results of the algorithms on nursery data set
Table 6 The test results of the algorithms on Thoracic data set
Table 7 The test results of the algorithms on adult data set
Table 8 The test results of the algorithms on car data set

Tables 3, 4, 5, 6, 7 and 8 show the test results of the four algorithms on the experimental data sets, with the best results in bold. From the results, it can be seen that the J, FM, CD and K of the FRC algorithm are better than those of the three comparison algorithms on all experimental data sets except the nursery and student data sets. The results show that FRC is an effective algorithm for the clustering task. FRC introduces an attribute weighting mechanism and a data set type conversion mechanism. The attribute weighting mechanism gives a weight to each attribute, and an attribute with a larger weight has a larger impact on the distance. By the data set type conversion mechanism, a categorical data set can be transformed into a numeric data set, and then linear or nonlinear dimension reduction algorithms can be used to decrease the dimension of the data set. These mechanisms effectively avoid the influence of redundant attributes on the performance of the algorithm; therefore the clustering results of FRC are better than those of the comparison algorithms on most data sets.

Tables 3, 4, 5, 6, 7 and 8 also show the time cost of the four algorithms. From the results, it is known that the k-modes algorithm costs the least time, WKModes the second least and DKmodes the third least, while FRC spends much more time on the clustering task than the comparison algorithms, which means the time complexity of FRC is high. WKModes and DKModes are developments of the k-modes algorithm: WKModes uses the equivalence partition of the current attribute to determine the importance of the attribute in each cluster, and DKmodes considers the frequency of mode components in the current cluster when determining the dissimilarity between a data point and a center point. These more complex dissimilarity measures make WKModes and DKModes cost more time than the k-modes algorithm. FRC uses the information granularity of the equivalence partition on each attribute to determine the weight of the attribute and also uses the significance of each attribute for each data point to transform the categorical data set into a numeric data set; the time complexity of these steps is \({\mathcal {O}}\left( mn^{2} \right)\). The time complexity of the dimension reduction of FRC is \({\mathcal {O}}\left( n^{2} \right)\), and the time complexity of Eq. (19) is \({\mathcal {O}}\left( n\cdot k\cdot d\cdot I \right) \approx {\mathcal {O}}\left( n \right)\), where I is the number of iterations; therefore the time complexity of the FRC algorithm is \({\mathcal {O}}\left( mn^{2} \right)\). The comparison algorithms do not have the data set type conversion mechanism, so the time complexity of each comparison algorithm is not larger than \({\mathcal {O}}\left( mn^{2} \right)\). From this analysis, the complexity of FRC is the highest.

In order to test the impact of the parameter \(\alpha\) on the performance of the FRC algorithm, we choose Germany as the experimental data set and execute FRC on it; \(\alpha\) is varied in \(\left[ 0,100 \right]\) and the other parameters are set as in the above experiment. The test results are shown in Figs. 1, 2, 3 and 4.

Fig. 1 The J of FRC algorithm with different \(\alpha\) values

Fig. 2 The FM of FRC algorithm with different \(\alpha\) values

Fig. 3 The CD of FRC algorithm with different \(\alpha\) values

Fig. 4 The K of FRC algorithm with different \(\alpha\) values

Figures 1, 2, 3 and 4 show the results of the FRC algorithm on the Germany data set with different \(\alpha\) values. From the results, it can be seen that the improvement in the performance of FRC is largest when \(\alpha =5\). When \(\alpha <15\), the performance of FRC increases as \(\alpha\) increases; after \(\alpha\) exceeds 15, the performance remains almost stable. For FRC, \(\alpha\) is a parameter that affects the generalization ability of the algorithm. If \(\alpha \rightarrow 0\), Eq. (13) is equal to the LPP algorithm [16]. If \(\alpha > 0\), the optimization problem of Eq. (13) takes the generalization ability into account, so FRC can obtain better generalization ability, and the test results in Figs. 1, 2, 3 and 4 also support this conclusion. If \(\alpha\) is large, the impact of the first term of Eq. (13) is reduced; if \(\alpha\) is too large, the first term is almost ignored. In a word, an appropriate \(\alpha\) value is important for the performance of FRC.

In order to test the scalability of the FRC algorithm, we choose adult as the experimental data set; the parameter is set as \(\alpha =5\), and FRC is executed on the adult data set with different dimensions. The test results are shown in Figs. 5, 6, 7 and 8 and Table 9.

Fig. 5 The J of FRC algorithm with different dimensions

Fig. 6 The FM of FRC algorithm with different dimensions

Fig. 7 The CD of FRC algorithm with different dimensions

Fig. 8 The K of FRC algorithm with different dimensions

Table 9 The time cost of FRC algorithm on adult data set with different dimensions

Figures 5, 6, 7 and 8 show the test results of the FRC algorithm on the adult data set with different dimensions. From the results, it can be seen that the values of the evaluation criteria differ for different d; in other words, d affects the performance of FRC. For FRC, d determines the dimensionality of the data set after dimension reduction. If d is too small, much effective discriminant information is removed from the data. If d is too large, much redundant or ineffective information is retained in the data, which decreases the performance of FRC. Table 9 shows the time cost of FRC with different dimensions. From the results, it is known that the time cost generally increases with d. The time cost of FRC is mainly determined by the data processing of Sects. 3.1 and 3.2, whose time complexity is \({\mathcal {O}}\left( mn^{2} \right)\); changing d affects the time cost of the clustering steps, whose time complexity from Eq. (19) is \({\mathcal {O}}\left( n\cdot k\cdot d\cdot I \right)\). Therefore the time cost of FRC changes as d changes.

In order to test the effectiveness of FRC, we choose Germany as the experimental data set and execute the FRC algorithm and the FRC algorithm without dimension reduction, which is denoted as UFRC. For the two algorithms, \(\alpha\) is set to 5 and \(d=3\). The results are shown in Fig. 9.

Fig. 9 The test results of the FRC algorithm and the UFRC algorithm on the Germany data set

Figure 9 shows the results of the FRC algorithm and the UFRC algorithm on the Germany data set. From the results, it is known that FRC outperforms UFRC on the J, FM, K and CD evaluation criteria, which means that dimension reduction can improve the performance of the algorithm. For FRC, the purpose of dimension reduction is to remove irrelevant and redundant information. The dimension reduction algorithm introduces a nonlinear transformation through which irrelevant or redundant information can be removed; therefore it improves the performance of FRC.

5 Conclusions

In this paper, a fuzzy rough clustering algorithm (FRC) for categorical data is proposed. The FRC algorithm uses the partitions of rough set theory to compute the information granularity of each attribute and introduces the information granularity to determine the weights of the attributes. Different from the original k-modes algorithm, FRC transforms the categorical data set into a numeric data set and employs a nonlinear dimension reduction algorithm to decrease the dimensions of the data set. The objective optimization function of FRC considers the intra-cluster distance and inter-cluster distance of the clustering result, so FRC can obtain a clustering result with minimum intra-cluster dissimilarity and maximum inter-cluster dissimilarity. FRC and the comparison algorithms are executed on real data sets. The experimental results show that FRC outperforms the comparison algorithms on most data sets, which proves that FRC is an effective clustering algorithm for categorical data. However, FRC randomly chooses the k modes, which reduces the convergence speed; in addition, the weights of the attributes are the same for every cluster, which does not fit the theory of subspace learning. Therefore, in the future, we can introduce a method to select high-quality initial center points and use subspace clustering to further improve the performance of FRC.