1 Introduction

As a research hotspot in the field of rough sets, three-way decisions theory [29] is an extension of classical two-way decisions theory, in which a deferment decision (or non-commitment decision) is adopted when the current information is limited or insufficient to make an acceptance or rejection decision. Three-way decisions is a decision-making theory that describes how human beings deal with uncertain information, based on the semantic studies of positive rules, boundary rules and negative rules originating from rough set theory. When decision makers have sufficient confidence in the given information, the acceptance or rejection decisions of two-way decisions are usually adopted to output an immediate result. However, when decision makers have insufficient confidence in the prejudgment based on the current information, the deferment decision is often adopted to reduce the risk brought by immediate decisions. For uncertain information, whether one accepts or rejects, there is a high probability of making the wrong decision.

Three-way clustering [32, 35, 36] is an important application of three-way decisions theory, which can effectively solve the problem of inaccurate partitioning caused by incomplete information or insufficient data in traditional two-way clustering methods. Compared with two-way clustering methods, three-way clustering incorporates the concept of a fringe region (or boundary region) for uncertain samples, and the clustering result is mainly affected by the number of clusters and the thresholds for making three-way decisions. In existing works, people usually choose the number of clusters based on expert opinion and select the same constant threshold for all data in the iterations of making three-way decisions. However, such fixed thresholds and cluster numbers do not reflect the differences between clusters or data sets well, especially for data sets with different sizes and densities.

Generally, three-way clustering can be regarded as an effective combination of unsupervised clustering and three-way partitioning of clusters [33]. First, from the perspective of clustering validity, it is expected that the samples in different clusters are easy to distinguish after clustering, that is, the distance between different clusters should be as large as possible. For the samples in the same cluster, it is expected that they are as close as possible, that is, the smaller the distance between the samples within a cluster, the better [6]. Similarly, the three-way partitioning of clusters improves the accuracy of the samples contained in each cluster by introducing the fringe region, which benefits the accuracy of the overall clustering result [32]. However, the introduction of the fringe region also increases the uncertainty of the knowledge representation of the clustering result. Therefore, from the perspective of partition validity, the sample distribution in the fringe region and its uncertainty are the key factors affecting the performance of the cluster partitioning. In view of this, three-way clustering is expected to improve the clustering performance by selecting an appropriate number of clusters and appropriate thresholds for three-way partitioning. Specifically, the number of clusters should be selected to make the samples within the same cluster closer and the samples in different clusters farther apart, and the thresholds for three-way partitioning should be selected to maximize the accuracy of the samples contained in each cluster and minimize the uncertainty of the knowledge representation of the final clustering result.

To address the above problems, in this paper, we propose an automatic three-way clustering method based on sample similarity, in which the quantitative changes of the relationships between all samples and clusters are considered. First, we define the sample similarity to describe the quantitative relationship among samples. Based on the sample similarity, we redefine the concept of roughness to measure the uncertainty of the clusters after partitioning. Moreover, by considering the roughness and the sample distribution of the fringe region, we define a partitioning validity index to measure the performance of the cluster partitions corresponding to different partition thresholds. Second, based on the sample similarity, we introduce the concepts of intra-class sample similarity and inter-class sample similarity to describe the quantitative relationships of samples within the same cluster and samples in different clusters, respectively. By combining the two kinds of sample similarities, we define a clustering validity index to measure the clustering performance corresponding to the number of clusters. Third, we design an automatic three-way clustering method based on the proposed partitioning validity index and clustering validity index, which aims to obtain an appropriate number of clusters and appropriate thresholds of three-way partitioning for achieving high clustering performance. Finally, we implement comparative experiments to verify the effectiveness of the proposed method.

The remainder of this paper is organized as follows. Section 2 introduces the preliminary knowledge and related work on three-way decisions theory and three-way clustering. Section 3 presents an automatic threshold selection algorithm based on the proposed sample similarity. Section 4 presents an automatic cluster number selection algorithm. Section 5 integrates the threshold selection algorithm and the cluster number selection algorithm into an automatic three-way clustering method. Section 6 reports the comparative experimental results. Section 7 concludes this paper.

2 Preliminary knowledge and related work

In this section, we will introduce the preliminary knowledge and related work on three-way decisions theory and three-way clustering.

2.1 Three-way decisions

In classical two-way decisions theory, one has only two choices: accept or reject. However, in many real-world applications, if the existing information is insufficient to support an explicit acceptance or rejection, either choice may lead to an inaccurate decision. Three-way decisions (3WD) theory [29] is an extension of classical two-way decisions (2WD) theory, in which a deferment decision (or non-commitment decision) is adopted when the current information is limited or insufficient to make an acceptance or rejection decision.

In 3WD theory, a set of samples X can be divided into three pairwise disjoint regions through three-way decisions, namely the positive region POS(X), the boundary region BND(X) and the negative region NEG(X), which correspond to the acceptance decision, the deferment decision and the rejection decision, respectively. A simple representation of 3WD is to construct three-way decision rules by incorporating a pair of thresholds. To be specific, for a given sample x and a subset of samples X, let p(X|x) be the conditional probability that x belongs to X. Given a pair of thresholds \((\alpha , \beta )\) with \(0\le \beta \le \alpha \le 1\), we can make the following 3WD-based rules:

(P): If \(p(X|x) \ge \alpha \), decide \(x\in \mathrm{POS}(X)\);

(B): If \(\beta < p(X|x) < \alpha \), decide \(x\in \mathrm{BND}(X)\);

(N): If \(p(X|x)\le \beta \), decide \(x\in \mathrm{NEG}(X)\).

For the (P) rule, if the probability that x belongs to X is greater than or equal to \(\alpha \), we make an acceptance decision and classify x into the positive region of X. For the (B) rule, if the probability that x belongs to X is greater than \(\beta \) and less than \(\alpha \), we make a deferment decision and classify x into the boundary region of X. For the (N) rule, if the probability that x belongs to X is less than or equal to \(\beta \), we make a rejection decision and classify x into the negative region of X.
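To make the rules concrete, the following is a minimal sketch (ours, not from the original formulation) of how the (P), (B) and (N) rules could be coded, assuming the conditional probability \(p(X|x)\) has already been estimated; the function name and threshold values are illustrative.

```python
def three_way_decide(p, alpha, beta):
    """Apply the (P), (B), (N) rules to an estimated conditional probability
    p = p(X|x), assuming 0 <= beta <= alpha <= 1 (illustrative helper)."""
    if p >= alpha:
        return "POS"   # (P): acceptance decision
    if p <= beta:
        return "NEG"   # (N): rejection decision
    return "BND"       # (B): deferment decision

# Example: with (alpha, beta) = (0.7, 0.3), a sample with p(X|x) = 0.5 is deferred.
print(three_way_decide(0.5, alpha=0.7, beta=0.3))  # -> BND
```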

3WD has attracted extensive attention from researchers, and many results on 3WD have been reported from the perspectives of theory, model and application. Here we take some representative achievements from these three perspectives to introduce the related work on 3WD. In terms of theory, starting from the 3WD concept proposed by Yao, it has gradually expanded from 3W decisions to 3W + X theories, such as 3W classifications [10, 14, 19, 39, 40], 3W attribute reduction [8, 9, 11, 17], 3W clustering [1, 34, 36], 3W active learning [23], 3W concept learning [13], 3W concept lattices [26, 27], etc. Furthermore, Yao [30] discussed a wide sense of 3WD and proposed a trisecting-acting-outcome (TAO) model of 3WD. In addition, Liu and Liang [22] proposed a function-based three-way decisions to generalize the existing models. Li et al. [18] generalized 3WD models based on subset evaluation. Hu [5] studied three-way decision spaces. In terms of models, 3WD has been combined with existing models to generate many new models, such as the 3WD-based Bayesian network [4], the neighbourhood three-way decision-theoretic rough set model [15], the 3WD-based intuitionistic fuzzy decision-theoretic rough sets model [20], dynamic 3WD models [38], three-way enhanced convolutional neural networks (3W-CNN) [42], etc. In terms of application, 3WD has been successfully applied in sentiment analysis [41, 42], image processing [12], medical decision making [28], spam filtering [7], software defect detection [16], etc.

2.2 Three-way clustering

As an effective data processing method in machine learning and data mining, clustering can divide samples without class information into multiple clusters composed of similar samples, thereby determining the sample distribution in each cluster and improving the efficiency of data post-processing.

Three-way clustering can be viewed as an improvement of the K-means algorithm. To address the problem of inaccurate decision making caused by inaccurate information or insufficient data in traditional clustering methods, three-way clustering further divides the clusters in the traditional K-means-based result by introducing core and fringe regions, which forms three regions of a cluster C, namely the core region (denoted by Co(C)), the fringe region (denoted by Fr(C)) and the trivial region (denoted by Tr(C)). If \(x\in Co(C)\), the object x definitely belongs to the cluster C; if \(x\in Fr(C)\), the object x might belong to C; if \(x\in Tr(C)\), the object x definitely does not belong to C. Moreover, assuming \(\pi _K=\{U_1,U_2,\dots ,U_K\}\) is the set of K clusters of all objects U obtained by the K-means clustering method, the three regions of all clusters are represented as follows:

$$\begin{aligned} \pi _K^T=\{Co(\pi _K), Fr(\pi _K), Tr(\pi _K)\}. \end{aligned}$$
(1)

According to 3WD theory, if \(x\in Co(\pi _K)\), then x is determined to belong to a certain cluster in \(\pi _K\); if \(x\in Fr(\pi _K)\), then x may be related to multiple clusters in \(\pi _K\); if \(x\in Tr(\pi _K)\), then x is determined not to belong to any cluster in \(\pi _K\). Usually, we have \(Tr(\pi _K)=U-Co(\pi _K)-Fr(\pi _K)= \emptyset \) and \(Co(\pi _K)\cap Fr(\pi _K)=\emptyset \).

In three-way clustering, each cluster can be represented by a pair consisting of its core region and fringe region. A K-cluster partitioning of the sample set satisfies the following properties:

(1) \(\forall U_i \in \pi _K\), \(Co(U_i)\ne \emptyset \);

(2) \(U = \bigcup _{i=1}^K Co(U_i) \cup \bigcup _{i=1}^K Fr(U_i)\).

Property (1) requires that the core region of each cluster is non-empty after division, that is, each cluster contains at least one sample; property (2) ensures that the clustering method provides an effective division that covers all samples. Thus, the three-way clustering result can be represented as follows:

$$\begin{aligned} \pi _K = \{(Co(U_1), Fr(U_1)), \dots , (Co(U_K), Fr(U_K))\}. \end{aligned}$$
(2)

For the fringe regions \(Fr(U_i)\), if \(\bigcup _{i=1}^K Fr(U_i) = \emptyset \), then \(\pi _K = \{Co(U_1), \dots , Co(U_K)\}\) degenerates into a two-way clustering result. In traditional clustering methods, each sample can only be assigned to a uniquely determined cluster. In the three-way clustering method, if a sample is closely related to multiple clusters at the same time, it may belong to multiple clusters; in this case, it is reasonable to classify the sample into the fringe regions of the closely related clusters.
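As an illustration, the representation of Eq. (2) can be stored as (core, fringe) pairs; the following minimal sketch (with made-up samples and clusters) also checks properties (1) and (2) from above.

```python
# Hypothetical universe and a three-way clustering result in the form of Eq. (2):
# each cluster is a (core, fringe) pair of sample identifiers.
U = {"x1", "x2", "x3", "x4", "x5", "x6"}
pi_K = [
    ({"x1", "x2"}, {"x3"}),        # (Co(U1), Fr(U1))
    ({"x4", "x5"}, {"x3", "x6"}),  # (Co(U2), Fr(U2)); x3 lies in two fringe regions
]

# Property (1): every core region is non-empty.
assert all(core for core, _ in pi_K)

# Property (2): the union of all cores and fringes covers U.
covered = set().union(*(core | fringe for core, fringe in pi_K))
assert covered == U

# If every fringe region were empty, pi_K would reduce to a two-way clustering result.
```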

With the in-depth study of clustering methods, more and more scholars believe that traditional clustering methods are based on hard partitioning, which cannot truly reflect the actual relationship between samples and clusters. To address this problem, approaches such as fuzzy clustering, rough clustering and interval clustering have been proposed to deal with this kind of uncertain relationship between samples and clusters. These approaches are also called soft clustering or overlapping clustering, based on the idea that a sample can belong to more than one cluster [25, 31]. By using the concept of membership instead of traditional distance-based measures to describe the degree to which each sample belongs to a certain cluster, Friedman and Rubin first proposed the concept of fuzzy clustering [3]. Then, Dunn proposed the widely used fuzzy C-means (FCM) clustering algorithm based on the related concepts of fuzzy clustering [2]. Later, by extending concepts of rough set theory, Lingras et al. proposed a rough C-means (RCM) clustering method [21], which considers the different influences of the positive and fringe regions of each cluster on the calculation of the corresponding center. Yu introduced 3WD theory into traditional clustering and proposed a framework of three-way cluster analysis [32, 33, 35].

3 An automatic threshold selection method based on sample similarity

In three-way clustering methods, each cluster has a corresponding positive region and fringe region. The thresholds for partitioning the regions of each cluster are the key factor determining the final clustering performance: different partition thresholds generate different regions for the clusters. Three-way decisions theory makes the deferment decision for samples with uncertain information by classifying them into the fringe region. The introduction of such a fringe region can reduce the possibility of classifying a sample directly into the wrong cluster. However, for the partitioned clusters, the introduction of the fringe region also brings additional uncertainty to the knowledge representation of the clustering result: the more samples the fringe region contains, the greater the uncertainty of the cluster representation. In view of this, in order to select the optimal thresholds that give the clusters the best partition performance, we define the roughness degree by introducing the sample similarity, so as to measure the quantitative change of the sample distribution in the positive and fringe regions. Moreover, based on the roughness degree, we define a novel partitioning validity index to measure the cluster partitions and propose an automatic threshold selection method for categorical data.

3.1 Partitioning validity index

We first give the definition of sample similarity as follows.

Definition 1

(Sample similarity) [11] Let \(IS=(U, A)\) be an information system, where U denotes the set of all samples and A denotes the set of attributes describing each sample. For any two samples \(x_i\in U\) and \(x_j\in U\), the similarity of \(x_i\) and \(x_j\) is:

$$\begin{aligned} S_A(x_i,x_j) = \frac{\sum _{k=1}^{|A|}I(x_{ik}, x_{jk})}{|A|}, \end{aligned}$$
(3)

where |A| represents the number of attributes. For categorical data, the function \(I(x_{ik}, x_{jk})\) is computed by:

$$\begin{aligned} I(x_{ik}, x_{jk}) = \begin{cases} 1, &{} x_{ik} = x_{jk}\\ 0, &{} x_{ik}\ne x_{jk} \end{cases}. \end{aligned}$$
(4)

As defined above, the sample similarity indicates how similar two samples are: the greater the value, the more similar the two samples.
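For categorical data, Eqs. (3) and (4) amount to counting matching attribute values. The following is a minimal sketch (our own helper, with illustrative attribute values):

```python
def sample_similarity(x_i, x_j):
    """Sample similarity of Eq. (3): the fraction of attributes on which the two
    samples take the same value, using the indicator function of Eq. (4)."""
    assert len(x_i) == len(x_j), "both samples must be described by the same attributes"
    matches = sum(1 for a, b in zip(x_i, x_j) if a == b)  # sum of I(x_ik, x_jk)
    return matches / len(x_i)                             # divided by |A|

# Example with two samples described by 4 categorical attributes.
print(sample_similarity(["red", "round", "small", "ripe"],
                        ["red", "oval",  "small", "ripe"]))  # -> 0.75
```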

Through the introduction of the sample similarity, the related concepts in 3WD theory can be redefined as follows.

Definition 2

(Equivalence relation) [11] Given an information system \(IS=(U,A)\), the equivalence relation based on a threshold \(\alpha \) is represented by \(E_{\alpha }\):

$$\begin{aligned} E_{\alpha } = \{(x,y)\in U\times U | S_{A}(x,y)>\alpha \}. \end{aligned}$$
(5)

Similarly, for sample x, its equivalence class is computed by:

$$\begin{aligned} {[}x]_{\alpha } = \{y\in U | S_{A}(x,y)>\alpha \}. \end{aligned}$$
(6)

The equivalence relation \(E_{\alpha }\) can divide U into several disjoint sample subsets, denoted by \(\pi _{\alpha }\). For a given objective subset \(X\subseteq U\), we can define its lower and upper approximations based on \(\alpha \) as follows:

$$\begin{aligned} \begin{aligned}&\underline{apr}_{\alpha }(X) = \{x\in U | [x]_{\alpha }\subseteq X\},\\&\overline{apr}_{\alpha }(X) = \{x\in U | [x]_{\alpha }\cap X \ne \emptyset \}. \end{aligned} \end{aligned}$$
(7)

Furthermore, we can use the positive region \(\mathrm{POS}_{\alpha }(X)\) and the fringe region \(\mathrm{BND}_{\alpha }(X)\) to describe the objective subset X:

$$\begin{aligned} \begin{aligned}&\mathrm{POS}_{\alpha }(X) = \underline{apr}_{\alpha }(X),\\&\mathrm{BND}_{\alpha }(X) = \overline{apr}_{\alpha }(X) - \underline{apr}_{\alpha }(X). \end{aligned} \end{aligned}$$
(8)

Usually, the positive region \(\mathrm{POS}_{\alpha }(X)\) contains the samples that definitely belong to X and the fringe region contains the samples that possibly belong to X. We can use the roughness to compute the uncertainty of X represented by the two regions.

Definition 3

(Roughness) [24] Given an information system \(IS=(U,A)\), for objective subset \(X\subseteq U\), \(RN_{\alpha }(X)\) is the roughness of X based on the threshold \(\alpha \):

$$\begin{aligned} RN_{\alpha }(X) = 1 - \frac{|\underline{apr}_{\alpha }(X)|}{|\overline{apr}_{\alpha }(X)|} = \frac{|\mathrm{BND}_{\alpha }(X)|}{|\mathrm{POS}_{\alpha }(X)| + |\mathrm{BND}_{\alpha }(X)|}. \end{aligned}$$
(9)

Given a threshold, the roughness degree is related to the distribution of samples in the positive region and the fringe region for each cluster, which also leads to a monotonicity property between the roughness and the value of the threshold.
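The quantities of Eqs. (6)-(9) can be computed directly from the sample similarity. The sketch below is our own illustration, assuming U is a list of categorical samples, sim is the sample_similarity function sketched earlier, and the objective subset X is given as a set of sample indices X_idx.

```python
def equivalence_class(x, U, alpha, sim):
    """[x]_alpha of Eq. (6): indices of samples whose similarity to x exceeds alpha."""
    return {i for i, y in enumerate(U) if sim(x, y) > alpha}

def approximations(X_idx, U, alpha, sim):
    """Lower and upper approximations of Eq. (7) for the objective subset X_idx."""
    lower, upper = set(), set()
    for i, x in enumerate(U):
        cls = equivalence_class(x, U, alpha, sim)
        if cls <= X_idx:          # [x]_alpha contained in X
            lower.add(i)
        if cls & X_idx:           # [x]_alpha intersects X
            upper.add(i)
    return lower, upper

def roughness(X_idx, U, alpha, sim):
    """RN_alpha(X) of Eq. (9); returns 0 if the upper approximation is empty."""
    lower, upper = approximations(X_idx, U, alpha, sim)
    return 1 - len(lower) / len(upper) if upper else 0.0
```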

Property 1

Given an information system \(IS=(U,A)\), \(X\subseteq U\) is an objective subset of samples, for any two thresholds \(\alpha \) and \(\beta \), if \(0\le \beta < \alpha \le 1\), then \(RN_{\beta }(X)\ge RN_{\alpha }(X)\).

Proof

For any sample \(x\in U\), if \(\beta < \alpha \), according to Eq. (6) and Eq. (7), we have \([x]_{\alpha }\subseteq [x]_{\beta }\), \(\underline{apr}_{\beta }(X)\subseteq \underline{apr}_{\alpha }(X)\) and \(\overline{apr}_{\alpha }(X) \subseteq \overline{apr}_{\beta }(X)\), thus \(\frac{|\underline{apr}_{\alpha }(X)|}{|\overline{apr}_{\alpha }(X)|} \ge \frac{|\underline{apr}_{\beta }(X)|}{|\overline{apr}_{\beta }(X)|}\). Based on Eq. (9), now we have \(RN_{\beta }(X)\ge RN_{\alpha }(X)\). \(\square \)

As mentioned above, the introduction of fringe region in three-way clustering is a mixed blessing. On the one hand, it can reduce the classification error by classifying samples with less certainty into the fringe region. On the other hand, it also brings additional uncertainty for knowledge representation. By considering the trade-off relationship between the size of fringe region and the roughness, we define a novel partitioning validity index as follows.

Definition 4

(Partitioning Validity Index) Given an information system \(IS=(U,A)\), \(X\subseteq U\) is a subset of samples, for a specific threshold \(\alpha \), the partitioning validity index of X is defined as:

$$\begin{aligned} PVI_{\alpha }(X) = (1 - RN_{\alpha }(X))\cdot \frac{|\mathrm{BND}_{\alpha }(X) \cap X|}{|X|}. \end{aligned}$$
(10)

In this definition, \(\frac{|\mathrm{BND}_{\alpha }(X) \cap X|}{|X|}\) represents the ratio of the number of samples of X that fall within its fringe region to the total number of samples in X. Obviously, \(0 \le PVI_{\alpha }(X) \le 1\).

In the three-way clustering problem, we prefer to obtain a partition with low classification error by putting uncertain samples into the fringe region to delay decision making. Enlarging the fringe region, however, also increases the roughness, so we prevent the fringe region from becoming too large by keeping the roughness small.

Example 1

Table 1 is an information system, where \(U=\{x_1, x_2, \dots , x_9\}\) and \(A = \{a_1, a_2, \dots , a_6\}\). For objective subset \(X=\{x_1, x_2\}\), if the threshold \(\alpha = 0.8\), we have \(RN_{0.8}(X) = 3/4\) and \(PVI_{0.8}(X) = 3/8\). Similarly, if \(\alpha = 0.6\), we have \(RN_{0.6}(X) = 5/5\) and \(PVI_{0.6}(X) = 0\). Apparently, \(\alpha = 0.8\) is a better choice for objective subset \(\{x_1,x_2\}\).

Table 1 An information system

3.2 An automatic threshold selection algorithm

For the partitioning validity index, its value tends to 0 when either the fringe region is maximized or the roughness is minimized. We therefore take the threshold with the largest PVI as the balance point between the roughness and the size of the fringe region, which reflects the best comprehensive partition performance. Thus, we can formulate the threshold selection problem as an optimization problem, in which the optimal threshold \(\alpha \) is the one that maximizes \(PVI_{\alpha }(X)\):

$$\begin{aligned} \mathop {\arg \max _{\alpha }}\ PVI_{\alpha }(X). \end{aligned}$$
(11)

Since \(PVI_{\alpha }(X)\) presents a kind of trade-off relationship between the size of fringe region and the roughness, there does not exist a monotonicity property between \(PVI_{\alpha }(X)\) and \(\alpha \). In order to obtain the optimal value of \(\alpha \), we traverse all candidate partition thresholds by setting the appropriate step size. Algorithm 1 shows the description of the proposed algorithm.

[Algorithm 1]

In the algorithm, the sample similarities of all possible pairs are computed first. The values from the minimal similarity \(S_{\mathrm{min}}\) to the maximal similarity \(S_{\mathrm{max}}\), with step-size 0.01, constitute the candidate threshold space. For each candidate threshold, we compute the corresponding positive region and fringe region of the objective subset X and obtain the current partitioning validity index PVI. Finally, the threshold with the maximal PVI is output as the optimal threshold. Assume the objective subset X has n samples; the time complexities of computing all sample similarities and of iteratively solving for the optimal threshold are \(O(n^2)\) and O(n), respectively, thus the time complexity of Algorithm 1 is \(O(n^2)\).
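As a minimal sketch of Algorithm 1 (the authoritative steps are given in the algorithm above), the threshold search could look as follows; it reuses the approximations helper sketched in Sect. 3.1, builds the candidate thresholds from the observed similarities with step 0.01, and does not handle degenerate cases such as a single-sample universe.

```python
import numpy as np

def select_threshold(X_idx, U, sim, step=0.01):
    """Sketch of Algorithm 1: return the threshold alpha maximizing PVI_alpha(X)
    of Eq. (10), together with the maximal PVI value."""
    # Candidate threshold space: from the minimal to the maximal pairwise similarity.
    sims = [sim(U[i], U[j]) for i in range(len(U)) for j in range(i + 1, len(U))]
    candidates = np.arange(min(sims), max(sims) + step, step)

    best_alpha, best_pvi = None, -1.0
    for alpha in candidates:
        lower, upper = approximations(X_idx, U, alpha, sim)   # regions of Eq. (8)
        if not upper:
            continue
        bnd = upper - lower
        rn = 1 - len(lower) / len(upper)                      # roughness, Eq. (9)
        pvi = (1 - rn) * len(bnd & X_idx) / len(X_idx)        # PVI, Eq. (10)
        if pvi > best_pvi:
            best_alpha, best_pvi = alpha, pvi
    return best_alpha, best_pvi
```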

4 An automatic cluster number selection method based on sample similarity

In the process of clustering, once the clustering method is determined, the number of clusters becomes the key factor influencing the clustering performance: different numbers of clusters result in completely different clustering results. In general, most existing clustering methods set the number of clusters by expert opinion or experimental cross-validation. In this section, we give an automatic cluster number selection method based on sample similarity.

In general, all samples can easily be divided into two clusters if there is a large inter-class distance between the two clusters and a small intra-class distance within each cluster. Similarly, we can change the number of clusters by changing the inter-class distance between any two clusters and the intra-class distance within each cluster, thereby affecting the final clustering performance. To select an appropriate number of clusters, we first define the intra-class sample similarity and the inter-class sample similarity, which reflect the tightness of the sample distribution within each cluster and the difference of sample distributions between clusters, respectively. Then, by integrating the intra-class sample similarity and the inter-class sample similarity, we define a clustering validity index to measure the clustering performance associated with the number of clusters.

4.1 Clustering validity index

We first give the definition of intra-class similarity as follows.

Definition 5

(Intra-class Similarity for a Cluster) Given an information system \(IS=(U,A)\), the intra-class similarity \(IntraS(U_l)\) represents the sample similarity of cluster \(U_l\subset U\) under the set of attributes A, which is computed as:

$$\begin{aligned} IntraS(U_l) = \frac{2\cdot \sum _{i=1}^{|U_l|-1}\sum _{j>i}^{|U_l|}S_A(U_{li},U_{lj})}{|U_l|\cdot (|U_l|-1)}, \end{aligned}$$
(12)

where \(|U_l|\) is the number of samples in cluster \(U_l\), \(U_{li}\) and \(U_{lj}\) are the i-th sample and the j-th sample in the cluster \(U_l\), respectively.

Similarly, we can define the intra-class similarity for the partition of U as follows.

Definition 6

(Intra-class Similarity for All Clusters) Given an information system \(IS=(U,A)\), the intra-class similarity \(IntraS(\pi _K)\) represents the sample similarity of all clusters \(\pi _K = \{U_1, U_2, \dots , U_K\}\), which is computed as:

$$\begin{aligned} IntraS(\pi _K) = \frac{\sum _{l=1}^{K} IntraS(U_l)}{K}, \end{aligned}$$
(13)

where \(U_l\in \pi _K\) is the l-th cluster in the partition \(\pi _K\).

Different from the classification problem, the samples in the clustering problem have no label information. Therefore, it is not enough to only consider the intra-class similarity for each cluster during the clustering process. The inter-class similarity is also a kind of measurement to evaluate the clustering performance, which reflects the degree of differentiation of different clusters.

Definition 7

(Inter-class Similarity for Two Clusters) Given an information system \(IS=(U,A)\), \(\pi _K = \{U_1, U_2, \dots , U_K\}\) is the partition of U with K clusters. The inter-class similarity \(InterS(U_i, U_j)\) between any two clusters \(U_i\) and \(U_j\) under the set of attributes A is computed as:

$$\begin{aligned} InterS(U_i, U_j) = \frac{\sum _{m=1}^{|U_i|}\sum _{n=1}^{|U_j|} S_A(U_{im}, U_{jn})}{|U_i|\cdot |U_j|}, \end{aligned}$$
(14)

where \(U_{im}\) and \(U_{jn}\) are the m-th sample in cluster \(U_i\) and the n-th sample in cluster \(U_j\), respectively.

Moreover, we can define the inter-class similarity for the partition \(\pi _K\) as follows.

Definition 8

(Inter-class Similarity for All Clusters) Given an information system \(IS=(U,A)\), the inter-class similarity \(InterS(\pi _K)\) represents the sample similarity of the partition \(\pi _K = \{U_1, U_2, \dots , U_K\}\), which is computed as:

$$\begin{aligned} InterS(\pi _K) = \frac{2\cdot \sum _{i=1}^{K-1}\sum _{j>i}^{K} InterS(U_i, U_j)}{K\cdot (K-1)}. \end{aligned}$$
(15)

As discussed above, both the intra-class similarity for all clusters and the inter-class similarity for all clusters are two important factors that reflect the clustering performance. Thus, by integrating the two kinds of similarities, we define a novel clustering validity index as follows.

Definition 9

(Clustering Validity Index) Given an information system \(IS=(U,A)\), let \(\pi _K = \{U_1, U_2, \dots , U_K\}\) be the K clusters; the clustering validity index for \(\pi _K\) is computed as:

$$\begin{aligned} CVI(\pi _K) = IntraS(\pi _K) - InterS(\pi _K). \end{aligned}$$
(16)

This definition tells us that a good clustering result with a large \(CVI(\pi _K)\) will have a large intra-class sample similarity and a small inter-class sample similarity.
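The following is a minimal sketch of Eqs. (12)-(16) (our own helpers, reusing the sample_similarity function from Sect. 3.1); treating a singleton cluster as perfectly cohesive is our assumption, since Eq. (12) is undefined for \(|U_l| = 1\).

```python
from itertools import combinations

def intra_similarity(cluster, sim):
    """IntraS(U_l) of Eq. (12): mean similarity over all sample pairs in a cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

def inter_similarity(c_i, c_j, sim):
    """InterS(U_i, U_j) of Eq. (14): mean similarity over all cross-cluster pairs."""
    return sum(sim(a, b) for a in c_i for b in c_j) / (len(c_i) * len(c_j))

def cvi(clusters, sim):
    """CVI(pi_K) of Eq. (16) = IntraS(pi_K) - InterS(pi_K), via Eqs. (13) and (15)."""
    intra = sum(intra_similarity(c, sim) for c in clusters) / len(clusters)
    cluster_pairs = list(combinations(clusters, 2))
    inter = sum(inter_similarity(a, b, sim) for a, b in cluster_pairs) / len(cluster_pairs)
    return intra - inter
```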

Example 2

For the information system shown in Table 1, assume all samples are divided into the two clusters \(\pi _2 = \{\{x_1,x_2,x_3,x_4,x_5\}, \{x_6, x_7, x_8, x_9\}\}\); then we have \(IntraS(\pi _2) = 0.5305\) and \(InterS(\pi _2)= 0.6111\), thus the clustering validity index based on \(\pi _2\) is \(CVI(\pi _2) = -0.0806\). Considering another situation, assume all samples are divided into the three clusters \(\pi _3 = \{ \{x_1,x_2\}, \{x_3,x_4,x_5\}, \{x_6, x_7, x_8, x_9\}\}\); then we have \(IntraS(\pi _3) = 0.5833\) and \(InterS(\pi _3)= 0.5796\), thus the clustering validity index based on \(\pi _3\) is \(CVI(\pi _3) = 0.0037\). Apparently, \(\pi _3\) is a better choice because of its larger clustering validity index.

4.2 An automatic cluster number selection method

For the clustering validity index, the larger the value of \(CVI(\pi _K)\), the better the performance of the clustering result determined by the current cluster number K. Therefore, to choose an appropriate number of clusters, we can construct an optimization problem based on the clustering validity index, which is represented as follows:

$$\begin{aligned} \mathop {\arg \max _{K}}\ CVI(\pi _K). \end{aligned}$$
(17)

This optimization problem aims to obtain the number of clusters that maximizes the clustering validity index after clustering. Since the distribution of samples (i.e., the number of samples) in each cluster is not monotonic with the cluster number K, the clustering validity index does not satisfy a monotonicity property. Thus, we traverse all candidate numbers of clusters to obtain the optimal K. Algorithm 2 shows the description of the proposed algorithm. In the algorithm, based on the suggestion in Ref. [37], the search range of K is set from 2 to \(\sqrt{|U|}\) with step-size 1. Then, we apply the K-means clustering algorithm to calculate the initial clustering result and compute the corresponding clustering validity index. Finally, the cluster number with the maximal clustering validity index \(CVI(\pi _K)\) is output as the optimal one. Assume \(|U|=n\); since the time complexity of K-means is O(nK), the time complexity of Algorithm 2 is \(O(n^{\frac{3}{2}}K)\). Usually, K is a constant, thus the final time complexity is \(O(n^{\frac{3}{2}})\).

[Algorithm 2]
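The following is a minimal sketch of Algorithm 2 under our assumptions: the categorical samples are clustered via scikit-learn's KMeans on their one-hot encoding (as in the experiments of Sect. 6), and the cvi helper sketched above scores each candidate K; degenerate cases such as empty clusters are not handled.

```python
import math
import numpy as np
from sklearn.cluster import KMeans

def select_cluster_number(samples, encoded, sim, random_state=0):
    """Sketch of Algorithm 2: search K from 2 to sqrt(|U|) and return the K
    (and its labels) with the largest clustering validity index of Eq. (17).
    `samples` are categorical records, `encoded` their one-hot encoding."""
    n = len(samples)
    best_k, best_cvi, best_labels = None, -np.inf, None
    for k in range(2, int(math.sqrt(n)) + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(encoded)
        clusters = [[samples[i] for i in range(n) if labels[i] == c] for c in range(k)]
        score = cvi(clusters, sim)
        if score > best_cvi:
            best_k, best_cvi, best_labels = k, score, labels
    return best_k, best_labels
```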

5 An automatic three-way clustering method based on sample similarity

In fuzzy and rough clustering methods, the positive region and the fringe region have different influences on the cluster centers. Similarly, in the three-way clustering method, the positive region and the fringe region also exert different influences on the cluster centers. Therefore, based on the threshold \(\alpha \) and the sample similarity, the cluster center in the three-way clustering method can be defined as follows.

Definition 10

(Cluster center in three-way clustering method) Given an information system \(IS=(U,A)\), let \(\pi _K=\{U_1, U_2, \dots , U_K\}\) be the K clusters determined by the K-means method. For each cluster \(U_i\), assume \(\alpha _i\) is the optimal threshold, and let \(PU=\mathrm{POS}_{\alpha _i}(U_i)\) and \(BU=\mathrm{BND}_{\alpha _i}(U_i)\) represent the positive region and the fringe region of \(U_i\) based on \(\alpha _i\), respectively. Let \(x_m\) and \(x_n\) represent the m-th sample in PU and the n-th sample in BU, respectively, i.e., \(x_m=[x_{m1},x_{m2},\dots ,x_{m|A|}]\) and \(x_n=[x_{n1},x_{n2},\dots ,x_{n|A|}]\). Then the center of cluster \(U_i\) is represented by a vector \(v_i=[v_{i1},v_{i2},\dots ,v_{i|A|}]\) with |A| attribute values, and the value of \(v_i\) on attribute j is computed as:

$$\begin{aligned} v_{ij} = \frac{\omega _P\cdot \sum _{m=1}^{|PU|} x_{mj} + \omega _B\cdot \sum _{n=1}^{|BU|} x_{nj}}{\omega _P\cdot |PU| + \omega _B\cdot |BU|}, \end{aligned}$$
(18)

where \(\omega _P\) and \(\omega _B\) are two weight factors for the positive region and the fringe region.
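A minimal sketch of Eq. (18) on a numeric (e.g., one-hot) representation of the samples; the default weights \(\omega _P = 1\) and \(\omega _B = 0.5\) are the values used later in our experiments, and the function name is ours.

```python
import numpy as np

def cluster_center(pos_samples, bnd_samples, w_p=1.0, w_b=0.5):
    """Weighted cluster center of Eq. (18): positive-region samples are weighted
    by w_p and fringe-region samples by w_b, attribute by attribute."""
    pos = np.atleast_2d(np.asarray(pos_samples, dtype=float))
    bnd = np.asarray(bnd_samples, dtype=float)
    num = w_p * pos.sum(axis=0)
    den = w_p * len(pos)
    if bnd.size:                       # the fringe region may be empty
        bnd = np.atleast_2d(bnd)
        num = num + w_b * bnd.sum(axis=0)
        den = den + w_b * len(bnd)
    return num / den
```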

In general, \(\omega _P > \omega _B \ge 0\); in our experiments, \(\omega _P\) and \(\omega _B\) are set to 1 and 0.5, respectively. From the calculation of the cluster centers and the above definitions, it can be seen that a change of the threshold in three-way clustering also affects the division of the positive and fringe regions of each cluster, and thus the update of the cluster centers and the final clustering result. Similarly, the selection of the number of clusters affects the selection of the optimal thresholds and the division of the regions. Therefore, it is not appropriate to separate the selection of the optimal number of clusters from the selection of the optimal thresholds in the three-way clustering process. In view of this, in this paper, by integrating the clustering validity index and the partitioning validity index, an automatic three-way clustering method based on sample similarity is proposed, which considers the interaction between the change of the number of clusters and the variation of the thresholds, and yields an optimal three-way clustering result based on the optimal number of clusters and the optimal thresholds. Algorithm 3 shows the description of the proposed algorithm.

[Algorithm 3]

In this algorithm, the Euclidean distance is used to calculate the distance between any two samples. The optimal threshold based on Algorithm 1 is selected first. When the cluster centers are updated iteratively, if the distance between the old and new centers is less than \(\epsilon \), the centers are no longer changed and the iteration stops. The first part of the algorithm combines Algorithm 1 and Algorithm 2 to obtain the optimal thresholds and the optimal number of clusters, and the latter part gives the three-way representation of the clustering result. The time complexity of Algorithm 3 is \(O(n^{\frac{5}{2}})\).
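To summarize how the pieces fit together, the following is a high-level sketch (ours; the authoritative steps are given in Algorithm 3 above) that chains the helpers sketched in the previous sections. It assumes `encoded` is a 2-D numpy array holding the one-hot representation of `samples`, and degenerate cases, such as clusters that become empty during reassignment, are not handled.

```python
import numpy as np

def three_way_clustering(samples, encoded, sim, eps=1e-4, max_iter=100):
    """Sketch of Algorithm 3: choose K with Algorithm 2, choose per-cluster
    thresholds with Algorithm 1, split each cluster into (core, fringe),
    and update the centers via Eq. (18) until they move less than eps."""
    k, labels = select_cluster_number(samples, encoded, sim)           # Algorithm 2
    centers = np.array([encoded[labels == c].mean(axis=0) for c in range(k)])

    result = []
    for _ in range(max_iter):
        result = []
        for c in range(k):
            idx = {i for i in range(len(samples)) if labels[i] == c}
            alpha, _ = select_threshold(idx, samples, sim)             # Algorithm 1
            lower, upper = approximations(idx, samples, alpha, sim)
            result.append((lower, upper - lower))                      # (core, fringe)
        new_centers = np.array([cluster_center(encoded[list(core)], encoded[list(fringe)])
                                for core, fringe in result])
        if np.linalg.norm(new_centers - centers) < eps:                # convergence test
            break
        centers = new_centers
        # reassign each sample to its nearest updated center (Euclidean distance)
        dists = np.linalg.norm(encoded[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
    return result
```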

6 Experiments

6.1 Experiments setting

To evaluate the effectiveness of the proposed method, we implement several comparison experiments on 15 categorical data sets from the UCI repository. Brief descriptions of these data sets are listed in Table 2. In these data sets, attributes representing ID information have been removed in the preprocessing procedure. Since all attributes in our data sets are categorical, it is not easy to calculate the distance between two samples directly. Therefore, to facilitate the calculation of sample distances, we applied one-hot encoding to convert the categorical data to integer data.

Table 2 The brief descriptions of 15 data sets

We compare our method with four clustering methods: K-means, FCM [2], RCM [21] and a classical three-way clustering method (TWC) [32]. The evaluation measures are the accuracy (ACC) and the normalized mutual information (NMI), which are commonly used in clustering-related works. The clustering accuracy is measured by counting the number of correctly assigned samples and dividing by the total number of samples, which can be computed as follows:

$$\begin{aligned} \mathrm{ACC}=\frac{\sum _{i=1}^{|\pi _L|}\max \{|\pi _{Li}\cap \pi _{Cj}| \mid \pi _{Cj}\in \pi _C\}}{|U|}, \end{aligned}$$
(19)

where \(\pi _C\) and \(\pi _L\) are the real clusters in U and the predicted clusters, respectively; \(\pi _{Li}\) represents the i-th cluster of \(\pi _L\) and \(\pi _{Cj}\) represents the j-th cluster of \(\pi _C\). The normalized mutual information is used to measure the similarity between two clustering results:

$$\begin{aligned} \mathrm{NMI}(\pi _C,\pi _L) = 2\cdot \frac{I(\pi _C;\pi _L)}{H(\pi _C)+H(\pi _L)}, \end{aligned}$$
(20)

where \(I(\pi _C;\pi _L)\) represents the mutual information of \(\pi _C\) and \(\pi _L\), \(H(\pi _C)\) and \(H(\pi _L)\) are the related entropy of \(\pi _C\) and \(\pi _L\), respectively.
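A minimal sketch of both measures, assuming the true and predicted clusterings are given as integer label arrays; the NMI of Eq. (20) is available directly in scikit-learn.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(pred_labels, true_labels):
    """ACC of Eq. (19): each predicted cluster is credited with its best-matching
    real class, and the matched sample counts are summed and divided by |U|."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    total = 0
    for c in np.unique(pred_labels):
        members = true_labels[pred_labels == c]
        total += np.bincount(members).max()   # best overlap with any real class
    return total / len(true_labels)

# NMI of Eq. (20), e.g.: normalized_mutual_info_score(true_labels, pred_labels)
```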

6.2 Experimental result of clustering performance

The comparison results on ACC and NMI are shown in Tables 3 and 4, respectively. In each table, 'score' denotes the number of best results achieved by the corresponding clustering method.

Table 3 The accuracy of all comparison methods on 15 data sets
Table 4 The normalized mutual information of all comparison methods on 15 data sets
Fig. 1 The trend graph of the clustering validity index with the change of the number of clusters on each data set

Fig. 2 The trend graph of the partitioning validity index with the change of the threshold on the data set zoo

From Table 3, we can see that our proposed method significantly improves the clustering accuracy on 12 data sets. Both our algorithm and the comparison algorithms have low accuracy on data sets with a large number of actual classes, such as the data set zoo with 7 classes and the data set primary-tumor with 22 classes, which also indicates that our algorithm performs better on data sets with a small number of actual classes.

Similarly, Table 4 shows that our proposed method obtains the best results on 11 of the 15 data sets. In particular, for the NMI measure, our proposed method beats TWC on all data sets except the heart-statlog data set.

6.3 Experimental result of number of clusters

We also examine the number of clusters obtained by our proposed clustering method. Table 5 compares the result of our proposed method with the ground truth. It can be seen that our proposed method obtains the true number of clusters on 8 data sets and approximate results on 5 data sets.

Another interesting phenomenon is that our algorithm tends to produce very few clusters; for example, the data sets zoo and primary-tumor actually have 7 and 22 classes, respectively, but our algorithm divides both of them into two clusters. On this point, we suspect that the clusters themselves are not directly distinguishable and that many of them are merged into the fringe regions of the two clusters. The tendency to use fewer clusters to represent the data and to put the uncertainty into the fringe region is itself a feature of the three-way clustering method.

Table 5 The comparison between the number of clusters calculated by the proposed algorithm and the actual number of clusters

Figure 1 shows the trend of the clustering validity index with the change of the number of clusters on the 15 data sets. In the above comparison experiments, for each data set, the number of clusters with the largest CVI value is selected.

6.4 Experimental analysis of partitioning threshold

In order to visually show the trend of the partitioning validity index with the partitioning threshold selected for each cluster, we also analyze the PVI values of the two clusters of the data set zoo as the selected partitioning threshold changes. In Fig. 2, for the two clusters in zoo, the selected thresholds vary within [0.19, 1.0] and [0.44, 1.0], respectively. In the above comparison experiments, we also choose the thresholds with the largest PVI values.

7 Conclusion

As an extension of traditional clustering methods, the three-way clustering method can effectively deal with incomplete and inaccurate data. In view of the problem that existing three-way clustering methods often select the number of clusters and the partitioning thresholds by subjective tuning during implementation, in this paper, by introducing the sample similarity, we defined the clustering validity index and the partitioning validity index to automatically compute the number of clusters and the partitioning thresholds, respectively. Furthermore, we proposed an automatic three-way clustering approach by combining the proposed cluster number selection method and the threshold selection method. Comparison experiments validate the effectiveness of the proposed method.