
1 Introduction

Clustering is one of the most significant unsupervised learning problems and has been applied in diverse areas such as machine vision, pattern recognition, and medical applications. The fundamental objective of data clustering is to group similar objects into one cluster and divide dissimilar objects into different clusters. Research on clustering algorithms has received much attention, and a number of clustering methods have been developed over the past decades.

The various methods for clustering can be divided into two categories: hard clustering and soft clustering. Hard clustering methods, such as C-means [11] and spectral clustering [4], are based on the assumption that a cluster is represented by a set with a crisp boundary. That is, a data point is either in or not in a specific cluster. The requirement of a sharp boundary leads to easy analytical results, but may not adequately reflect the fact that a cluster may not have a well-defined boundary. Furthermore, forcing boundary objects into a single cluster is not always appropriate; in such cases, an object on the boundary should be allowed to belong to more than one cluster.

In order to relax the constraints of hard clustering methods, many soft clustering methods have been proposed for different application backgrounds. Fuzzy sets are a well-known generalization of crisp sets, first introduced by Zadeh [23]. Incorporating fuzzy sets into C-means clustering, Bezdek [2] proposed Fuzzy C-Means (FCM), which assumes that a cluster is represented by a fuzzy set that models a gradually changing boundary. Another effective tool for uncertain data analysis is rough set theory [15], which uses a pair of exact concepts, called the lower and upper approximations, to approximate a rough (imprecise) concept. Based on rough set theory, Lingras and West [10] introduced Rough C-Means (RCM) clustering, which describes each cluster not only by a center but also by a pair of lower and upper bounds. Incorporating membership into the RCM framework, Mitra et al. [13] put forward the Rough-Fuzzy C-Means (RFCM) clustering method. The shadowed set, proposed by Pedrycz [16], provides an alternative mechanism for handling uncertainty. As a conceptual and algorithmic bridge between rough sets and fuzzy sets, shadowed sets have been successfully used for clustering analysis, resulting in Shadowed C-Means (SCM) [14]. A brief summary of existing clustering methods is shown in Fig. 1.

Fig. 1. Classification diagram of existing clustering methods

Although many clustering methods exist, the performance of many clustering algorithms depends critically on the characteristics of the data set and the input parameters. Improper input parameters may lead to clusters that deviate from those in the data set, and no single clustering method can identify every form of data distribution. In order to address this problem, Strehl and Ghosh [17] proposed the cluster ensemble algorithm, which combines multiple clusterings of a set of objects into one clustering result without accessing the original features of the objects. Cluster ensembles have been shown to be useful in many applications, such as knowledge reuse [6], multi-view clustering [8], distributed computing [12], and improving the quality and robustness of clustering results [3, 5, 7, 9].

Recently, three-way decisions for problem solving were proposed by Yao [18, 19] as an extension of the commonly used binary-decision model obtained by adding a third option. The approach of three-way decisions divides the universe into the positive, negative, and boundary regions, which denote the regions of acceptance, rejection, and non-commitment for ternary classifications. Specifically, for objects that only partially satisfy the classification criteria, it is difficult to identify them directly without uncertainty. Instead of making a binary decision, we use thresholds on the degrees of satisfiability to make one of three decisions: accept, reject, or non-commitment. The third option may also be referred to as a deferred decision that requires further judgment. Three-way decisions have been shown to build on solid cognitive foundations and are a class of effective methods commonly used in human problem solving and information processing [20]. Many soft computing models for learning uncertain concepts, such as interval sets, rough sets, fuzzy sets, and shadowed sets, have tri-partitioning properties and can be reinvestigated within the framework of three-way decisions [19].
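As a minimal sketch of the thresholded three-way decision rule described above (the function name and the concrete thresholds are illustrative, not taken from the original formulation):

```python
def three_way_decide(score, alpha=0.7, beta=0.3):
    """Three-way decision on a satisfiability degree in [0, 1].

    Accept when the score reaches alpha, reject when it falls to
    beta or below, and defer (non-commitment) in between.
    The thresholds 0.7 and 0.3 are illustrative defaults only.
    """
    if score >= alpha:
        return "accept"
    if score <= beta:
        return "reject"
    return "non-commitment"
```

With these thresholds, an object with satisfiability 0.5 is deferred rather than forced into an accept/reject decision.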

Motivated by the three-way strategy, Yu [21, 22] proposed a new soft clustering framework, three-way clustering, which uses two regions rather than one set to represent a cluster: a core region (Co) and a fringe region (Fr). The core region is an area where the elements of a cluster are highly concentrated, and the fringe region is an area where the elements are loosely concentrated. Different clusters may share common elements in their fringe regions.

This paper presents a three-way clustering method that applies cluster ensembles and three-way decisions to the results of hard clustering. In the proposed method, hard clustering methods are used to produce different clustering results, and cluster label matching is used to align the clustering results to a given order. The three-way ensemble re-clustering results are then obtained by the following strategy: the intersection of the clusters with the same label is regarded as the core region, and the difference between the union and the intersection of the clusters with the same label is regarded as the fringe region of the corresponding cluster.

The remainder of this study is organized as follows. Section 2 presents the three-way ensemble re-clustering process, Sect. 3 reports experimental results, and Sect. 4 concludes the paper.

2 A Three-Way Ensemble Re-clustering

By following the ideas of cluster ensembles and three-way decisions, we present a three-way ensemble re-clustering algorithm. In this section, we assume that the universe has been divided into k disjoint sets m times by existing hard clustering algorithms. We discuss how to design a valid consensus function to obtain a three-way clustering based on these hard clustering results.

We begin our discussion by introducing some notation. Suppose that \(V =\{v_1,\cdots , v_n\}\) is a set of n objects and \(\mathbb {C}_i\ (i=1, \cdots, m)\) denotes the i-th clustering of V, where \(\mathbb {C}_i=\{C_{i1}, \cdots , C_{ik}\}\) is a hard clustering result of V. Although we have obtained the clustering results of V, the \(\mathbb {C}_i\) cannot be directly used in the next stage due to the lack of a priori category information. As an example, consider the dataset \(V = \{v_1, v_2, v_3, v_4, v_5, v_6\}\) consisting of six objects, and let \(\mathbb {C}_1, \mathbb {C}_2 \) and \(\mathbb {C}_3\) be three clusterings of V, shown in Table 1. Although they are expressed in different orders, they represent the same clustering result. Therefore, in order to combine the clustering results, the cluster labels must be matched to establish the correspondence between them.

Table 1. Different ways of the same clustering results

In general, the number of identical objects covered by corresponding cluster labels should be the largest, so the cluster labels can be matched based on this heuristic. Suppose there are two clustering results \(\mathbb {C}_1\) and \(\mathbb {C}_2\), each dividing the dataset into k clusters, denoted by \(\{C_{11}, \cdots , C_{1k}\}\) and \(\{C_{21}, \cdots , C_{2k}\}\), respectively. First, the number of identical objects covered by each pair of cluster labels \(C_{1i}\) and \(C_{2j}\) is recorded in a \(k\times k\) overlap matrix. Then the pair of cluster labels covering the largest number of identical objects is selected to establish a correspondence, and the corresponding row and column are removed from the overlap matrix. This process is repeated until all cluster labels have established their correspondences.

When there are \(m\ (m>2)\) clustering results, we can randomly select one as the matching criterion and match the other clustering results against it. The matching algorithm only needs to examine the remaining \(m-1\) clustering results and stores the overlap matrices in \((m-1)\times k^2\) space. The whole matching process is fast and efficient.
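The greedy overlap-matrix matching described above can be sketched as follows; the function name `match_labels` and the encoding of labels as integers \(0, \cdots, k-1\) are our assumptions for illustration:

```python
# Sketch of greedy cluster-label matching via an overlap matrix.
import numpy as np

def match_labels(ref, other, k):
    """Relabel `other` so its cluster labels align with `ref`.

    ref, other: label arrays over the same n objects, labels in 0..k-1.
    Repeatedly pairs the two labels whose clusters share the most
    objects, then removes that pair from the overlap matrix.
    """
    # overlap[i, j] = number of objects with label i in ref and j in other
    overlap = np.zeros((k, k), dtype=int)
    for i in range(k):
        for j in range(k):
            overlap[i, j] = np.sum((ref == i) & (other == j))
    mapping = {}
    for _ in range(k):
        # pick the pair with the largest remaining overlap
        i, j = np.unravel_index(np.argmax(overlap), overlap.shape)
        mapping[j] = i
        overlap[i, :] = -1  # remove matched row and column
        overlap[:, j] = -1
    return np.array([mapping[l] for l in other])
```

For example, matching `other = [1, 1, 2, 2, 0, 0]` against `ref = [0, 0, 1, 1, 2, 2]` relabels it to `[0, 0, 1, 1, 2, 2]`.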

After all cluster labels have been matched, the objects of V can be divided into three types for a given label j, based on the results of label matching:

$$\begin{aligned} \text{Type I} &= \{v\mid \forall i=1, \cdots , m,\ v\in C_{ij} \},\\ \text{Type II} &= \{v\mid \exists i\ne p,\ i, p=1, \cdots , m,\ v\in C_{ij}\wedge v\notin C_{pj}\},\\ \text{Type III} &= \{v\mid \forall i=1, \cdots , m,\ v\notin C_{ij}\}. \end{aligned}$$

From the above classification, the objects in Type I are assigned to the j-th cluster in all clustering results, the objects in Type II are assigned to the j-th cluster in only part of the clustering results, and the objects in Type III are not assigned to the j-th cluster in any clustering result. Based on the ideas of three-way decisions and three-way clustering, the elements in Type I clearly belong to the j-th cluster and should be assigned to its core region. The elements in Type II should be assigned to the fringe region of the j-th cluster, and all the elements in Type III should be assigned to the trivial region of the j-th cluster. From the above discussion, we obtain the following strategy for producing a three-way clustering by cluster ensemble:

$$\begin{aligned} \text{Co}(C_j) &= \{v\mid \forall i=1, \cdots , m,\ v\in C_{ij} \}=\bigcap _{i=1}^m C_{ij},\\ \text{Fr}(C_j) &= \{v\mid \exists i\ne p,\ i, p=1, \cdots , m,\ v\in C_{ij}\wedge v\notin C_{pj}\}=\bigcup _{i=1}^m C_{ij}-\bigcap _{i=1}^m C_{ij}. \end{aligned}$$
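Once the labels are matched, the core and fringe regions defined above reduce to plain set operations; a minimal sketch (the function name and the input convention, a list of k sets per clustering, are illustrative):

```python
def three_way_ensemble(clusterings, k):
    """Core/fringe regions from m label-matched hard clusterings.

    clusterings: list of m clusterings; each is a list of k sets of
    objects, already label-matched so index j means cluster j everywhere.
    Returns (core, fringe), each a list of k sets.
    """
    core, fringe = [], []
    for j in range(k):
        inter = set.intersection(*[c[j] for c in clusterings])
        union = set.union(*[c[j] for c in clusterings])
        core.append(inter)            # Co(C_j): in cluster j in every result
        fringe.append(union - inter)  # Fr(C_j): in cluster j in only some results
    return core, fringe
```

For the two matched clusterings `[{1, 2, 3}, {4, 5, 6}]` and `[{1, 2}, {3, 4, 5, 6}]`, this yields core regions `{1, 2}` and `{4, 5, 6}`, with object 3 in the fringe region of both clusters.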

The above clustering method is called three-way ensemble re-clustering. Its procedure consists of three main steps.

  1. Obtain a group of hard clustering results \(\mathbb {C}_i\ (i=1, \cdots, m)\) using existing methods.

  2. Randomly select one clustering result from step 1 as the matching criterion and match the other clustering results against it.

  3. For each label, compute the intersection of the clusters with the same label and the difference between the union and the intersection of the clusters with the same label.

The above procedure is depicted in Fig. 2. Finally, we present Algorithm 1, which describes the proposed three-way ensemble re-clustering based on hard clustering results. In Algorithm 1, we choose the first clustering result \(\mathbb {C}_1\) as the matching criterion and match the other clustering results against \(\mathbb {C}_1\) during label matching.

Fig. 2. Procedure diagram of three-way ensemble re-clustering

Table 2. A description of 5 data sets

3 Experimental Illustration

To illustrate the effectiveness of Algorithm 1, experiments on UCI [1] data sets are conducted in this section. Before running Algorithm 1 on the data sets, we need to obtain m clustering results; in our experiments, we use NJW spectral clustering with different scale parameters to obtain 10 clustering results. All code is run in Matlab R2013b on a personal computer. The details of these data sets are shown in Table 2.

To measure test accuracy, we use the Macro F\(_1\) and Micro F\(_1\) values [24] of the clustering results as evaluation indicators; these are two commonly used measures of classification performance. The results on the above 5 UCI data sets obtained by spectral clustering and by three-way ensemble re-clustering are computed respectively. The experimental results are listed in Table 3.
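Macro F\(_1\) averages the per-class F\(_1\) scores, while Micro F\(_1\) pools the true positives, false positives, and false negatives over all classes before computing a single F\(_1\). A minimal sketch of both (assuming predicted labels already aligned to the ground-truth labels; the function name is illustrative):

```python
import numpy as np

def macro_micro_f1(true, pred, k):
    """Macro and Micro F1 over k classes.

    true, pred: label arrays over the same objects, labels in 0..k-1;
    pred is assumed to be already label-matched to true.
    """
    tps = fps = fns = 0
    f1s = []
    for c in range(k):
        tp = np.sum((true == c) & (pred == c))
        fp = np.sum((true != c) & (pred == c))
        fn = np.sum((true == c) & (pred != c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        tps, fps, fns = tps + tp, fps + fp, fns + fn  # pooled counts
    micro_p = tps / (tps + fps) if tps + fps else 0.0
    micro_r = tps / (tps + fns) if tps + fns else 0.0
    micro = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return float(np.mean(f1s)), micro  # (Macro F1, Micro F1)
```

For example, with `true = [0, 0, 1, 1]` and `pred = [0, 1, 1, 1]`, the per-class F\(_1\) scores are 2/3 and 4/5, giving Macro F\(_1\) = 11/15 and Micro F\(_1\) = 0.75.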

Table 3. Comparisons of clustering results

A closer investigation of Table 3 shows that both the Macro F\(_1\) and Micro F\(_1\) values of three-way ensemble re-clustering are higher than those of spectral clustering. From Table 3, we can conclude that three-way ensemble re-clustering significantly improves the clustering results compared with the traditional spectral clustering algorithm.

4 Concluding Remarks

In this paper, we developed a three-way ensemble re-clustering method by employing the ideas of three-way decisions and cluster ensembles. Hard clustering methods are used to produce different clustering results, and cluster label matching is used to align the clustering results to a given order. The intersection of the clusters with the same label is regarded as the core region, and the difference between the union and the intersection of the clusters with the same label is regarded as the fringe region of the corresponding cluster. Based on this strategy, a three-way interpretation of the clusters is naturally formed, and experimental results demonstrate that the new algorithm can significantly improve the clustering results compared with the traditional clustering algorithm. The present study is a first step in the research of three-way clustering; how to determine the number of clusters is an interesting topic for further research.