
1 Introduction

Given a set of data points described by a set of attributes or features, the main task of clustering is to divide these data points into groups such that the data points in the same group are as similar as possible and those in different groups are as dissimilar as possible. Each group is called a cluster, and the family of all groups is called a clustering. The results of some popular clustering methods [2, 4, 5, 8, 16] depend on their initial configurations, which involve a priori parameters such as a given number of clusters. In order to improve robustness and accuracy, these methods are usually applied repeatedly with different initial configurations. The family of produced clusterings is then combined into a single clustering via consensus clustering. This is one of the main motivations for consensus clustering, which produces a final clustering by synthesizing a set of input clusterings.

Consensus clustering methods based on a co-association matrix [6, 7, 12, 13, 21, 22] are popular and well studied in the literature. The first step in the main procedure is to synthesize the set of input clusterings into an \(n \times n\) co-association matrix, where n is the total number of data points. The values in the matrix reflect how likely the two corresponding data points are to be clustered together according to the input clusterings. The second step is to obtain the final clustering by applying a basic clustering method to the matrix. These consensus clustering methods are easy to understand and implement. However, since they consider all data-point pairs when constructing the matrix, they usually have a high computational cost on large datasets, which restricts their applications.

Consensus clustering can be viewed as a decision-making process. In the co-association matrix based methods, we decide whether or not to cluster two data points together based on the information provided by the input clusterings. The theory of three-way decisions [23] offers a framework of decision making by dividing a set of objects into three disjoint decision regions according to some criterion. Each region is associated with a specific decision. Generally, the three regions are the positive, negative and boundary regions. The objects in the positive region are associated with an acceptance decision, that is, we accept that these objects satisfy the criterion. The objects in the negative region are associated with a rejection decision, that is, we decide that these objects do not satisfy the criterion. Those in the boundary region cannot be definitely determined to satisfy the criterion or not; due to this uncertainty, they are associated with a third, non-commitment decision. The theory of three-way decisions has been applied to basic clustering methods by several researchers [27, 28, 29, 30].

The sequential three-way decision model [26] iteratively applies the three-way decision model to refine the boundary region and reduce the uncertainty. Definite decisions (i.e., acceptance and rejection) are made on objects in each stage if sufficient information is available. Otherwise, the decision on these objects is postponed to the next stage, where more detailed and sufficient information is involved. The model has been applied to many real-world applications such as face recognition [14, 15]. Four modes of sequential three-way decisions are examined in [26]: multiple levels of granularity, probabilistic rough set theory, multiple models of classification, and ensemble classifications. The approach presented in this paper follows a mode similar to ensemble classifications.

The presented approach integrates the sequential three-way decision model into the construction of a co-association matrix. In each stage, based on a set of input clusterings, we put a data-point pair into the positive region if the corresponding value in the matrix is high enough, or into the negative region if the value is low enough. The corresponding entry in the matrix is then updated with the largest value 1 or the smallest value 0, respectively. Otherwise, the pair is put into a third boundary region, and the corresponding entry is to be determined in the next stage, which involves more input clusterings. In this way, we determine the entries in the matrix and, correspondingly, make quick decisions on the clustering of some data points in early stages. As a result, we may be able to reduce the overall computational cost of constructing the matrix.

The remainder of this paper is organized as follows. Section 2 reviews consensus clustering methods based on the co-association matrix. The sequential three-way approach to constructing the matrix is presented in Sect. 3. Section 4 shows the experimental results. Section 5 concludes the paper and discusses possible directions for future work.

2 A Review of Co-association Matrix Based Consensus Clustering Methods

The main task of consensus clustering is to combine different clusterings of a dataset into one single clustering, usually without referring to the original features or attributes of the data points. A general framework of consensus clustering includes two steps [20], namely, the Generation and Consensus steps. The Generation step generates the set of input clusterings for a given dataset. They can be produced by different basic clustering methods or multiple applications of the same method with different parameters. The Consensus step combines the input clusterings into a final consensus clustering according to a particular consensus function.

A co-association matrix based method includes two steps in the main procedure. The first step is to synthesize the input clusterings into an intermediate representation called a co-association matrix. Each entry in the matrix measures how many times the two corresponding data points are associated or clustered together in the input clusterings. The second step is to get the final consensus clustering by applying a basic clustering method to the matrix.

Suppose \(X=\{x_1,x_2,\cdots ,x_n\}\) is a given dataset and \(\mathbb {C}^{in} = \{\mathcal {C}_1, \mathcal {C}_2, \cdots , \mathcal {C}_m\}\) is a set of input clusterings on X. In a co-association matrix based method, an input clustering \(\mathcal {C}_k (1 \le k \le m)\) is commonly represented by an \(n \times n\) matrix. Moreover, the input clusterings are widely assumed to be hard clusterings where a data point belongs to exactly one cluster. Thus, the entries in a matrix \(\mathcal {C}_k (1 \le k \le m)\) are formally defined as: for \(1 \le i \le n\) and \(1 \le j \le n\),

$$\begin{aligned} \mathcal {C}_k(i,j)= \begin{cases} 1, &{} \text {if } x_i \text { and } x_j \text { are clustered together},\\ 0, &{} \text {otherwise}. \end{cases} \end{aligned}$$
(1)
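To make the representation concrete, the following Python sketch builds the binary matrix of Eq. (1) from a hard clustering given as a list of clusters of data-point indices. It is only an illustration (the paper's implementation is in R), and the function name clustering_to_matrix is our own.

```python
import numpy as np

def clustering_to_matrix(clusters, n):
    """Binary n x n matrix of Eq. (1) for a hard clustering.

    `clusters` is a list of clusters, each a collection of 0-based
    data-point indices; every point belongs to exactly one cluster.
    """
    C = np.zeros((n, n), dtype=int)
    for cluster in clusters:
        idx = np.fromiter(cluster, dtype=int)
        # Mark every pair (x_i, x_j) inside the same cluster with 1;
        # the diagonal entries C(i, i) are trivially set to 1 as well.
        C[np.ix_(idx, idx)] = 1
    return C

# Example: a clustering of 5 points into clusters {0, 1} and {2, 3, 4}.
C1 = clustering_to_matrix([{0, 1}, {2, 3, 4}], n=5)
```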

Based on the set \(\mathbb {C}^{in}\), a simple way to construct the co-association matrix \(M_{n \times n}\) is to use the proportion of input clusterings where the two corresponding data points are associated, which is the evidence accumulation framework proposed in [7]. Accordingly, M is constructed as: for \(1 \le i \le n\) and \(1 \le j \le n\),

$$\begin{aligned} M(i, j) = \frac{1}{m} \sum _{k=1}^{m} \mathcal {C}_k(i,j). \end{aligned}$$
(2)
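As a minimal sketch, Eq. (2) is simply the entrywise mean of the m input matrices; assuming the clustering_to_matrix helper sketched above:

```python
import numpy as np

def co_association(matrices):
    """Eq. (2): the entrywise mean of the input matrices C_1, ..., C_m."""
    return np.mean(matrices, axis=0)

# Usage: M = co_association([clustering_to_matrix(c, n) for c in clusterings])
```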

More complex measures have been proposed to construct the matrix by taking more information into account. The Connected-Triple based Similarity (CTS) and SimRank based Similarity (SRS) [12] consider the transitivity property of clustering data points. A Weighted Co-Association Matrix is presented in [21], which takes into consideration the size of the clusters containing the two data points and the total number of clusters in the corresponding input clustering. The Probability Accumulation Matrix [22] considers the size of the clusters containing the two data points and the number of attributes used to describe the data points.

To cluster the data points based on the co-association matrix, two hierarchical clustering methods are proposed in [7, 13]. A graph based method proposed in [19] generates a similarity graph based on the matrix and obtains the final clustering by partitioning the graph. Two threshold based methods are presented in [6, 7].

The co-association matrix based methods are advantageous in several respects. They use the co-association idea to avoid the labeling correspondence problem, a common difficulty in some popular categories of consensus clustering methods. For instance, in the relabeling and voting based methods [20], the first step is to relabel the clusters in all the input clusterings, which requires solving the labeling correspondence problem in order to match clusters across different clusterings. The labeling correspondence problem can only be solved, with certain accuracy, when the input clusterings have the same number of clusters, which is a very restrictive condition. Besides, the co-association matrix based methods are easy to understand and implement, since the construction of the matrix and the basic clustering methods are usually quite intuitive. However, since they need to compute a value for each data-point pair to construct the co-association matrix, they usually have a high computational cost on large datasets, which restricts their applications.

3 A Sequential Three-Way Approach to Constructing a Co-association Matrix

Based on a general framework of sequential three-way decisions proposed in [26], we present a sequential three-way approach to progressively constructing a co-association matrix in multiple stages.

3.1 An \((\alpha , \beta )\)-cut of a Co-association Matrix

The values in a co-association matrix quantitatively evaluate how likely two data points are clustered together. In order to decide whether two data points should be clustered together in the final clustering, it may be sufficient to qualitatively know whether they are likely enough to be associated, that is, whether the corresponding value in the matrix is large enough. Similarly, to decide whether they should be separated into different clusters, a qualitatively small enough value may be sufficient. Based on this idea, we can use a pair of thresholds to cut the values and divide the data-point pairs into three decision regions. The matrix is then updated by assigning different values to the pairs in different regions.

Suppose \((\alpha , \beta )\) is a pair of thresholds with \(0 \le \beta < \alpha \le 1\) and \(eval: X \times X \rightarrow [0,1]\) is a measure to evaluate how likely two data points are associated based on a set of input clusterings (e.g., Eq. (2)). By using the pair \((\alpha , \beta )\) to cut the evaluation values, the set of data-point pairs \(\mathbb {X} = X \times X\) is divided into three disjoint positive POS, negative NEG and boundary BND regions:

$$\begin{aligned} \mathrm{POS}(\mathbb {X})&= \{ (x_i,x_j) \in \mathbb {X} \mid eval(x_i,x_j) \ge \alpha \}, \nonumber \\ \mathrm{NEG}(\mathbb {X})&= \{ (x_i,x_j) \in \mathbb {X} \mid eval(x_i,x_j) \le \beta \}, \nonumber \\ \mathrm{BND}(\mathbb {X})&= \{ (x_i,x_j) \in \mathbb {X} \mid \beta< eval(x_i,x_j) < \alpha \}. \end{aligned}$$
(3)

The entries in the co-association matrix \(M_{n \times n}\) are accordingly determined as:

\((M^\mathrm{P})\): If \((x_i,x_j) \in \mathrm{POS}(\mathbb {X})\), then \(M(i,j) = 1\);

\((M^\mathrm{N})\): If \((x_i,x_j) \in \mathrm{NEG}(\mathbb {X})\), then \(M(i,j) = 0\);

\((M^\mathrm{B})\): If \((x_i,x_j) \in \mathrm{BND}(\mathbb {X})\), then \(M(i,j)=eval(x_i,x_j)\) or a constant value \(v \in (0,1)\).

As a result, for two data points \(x_i\) and \(x_j\), if their evaluation value \(eval(x_i,x_j)\) is high enough to indicate that they are associated (i.e., \(eval(x_i,x_j) \ge \alpha \)), then we cluster them together by assigning the largest evaluation value 1 to the entry \(M(i,j)\). If the evaluation value is low enough to indicate that they are not associated (i.e., \(eval(x_i,x_j) \le \beta \)), then we separate them into different clusters by assigning the smallest evaluation value 0 to the entry \(M(i,j)\). Otherwise, we cannot make a definite decision due to insufficient information. The entry \(M(i,j)\) may take the original evaluation value or a default constant value \(v \in (0,1)\) such as 0.5.
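A sketch of the \((\alpha , \beta )\)-cut under the rules \((M^\mathrm{P})\), \((M^\mathrm{N})\) and \((M^\mathrm{B})\) is given below; the boundary entries keep their evaluation values unless a constant boundary_value (our name for v) is supplied.

```python
import numpy as np

def alpha_beta_cut(E, alpha, beta, boundary_value=None):
    """Apply an (alpha, beta)-cut to a matrix E of evaluation values.

    Returns the updated co-association matrix M and a boolean mask of
    the boundary region. Positive entries (E >= alpha) become 1 and
    negative entries (E <= beta) become 0; boundary entries keep their
    evaluation value, or take `boundary_value` if one is given.
    """
    assert 0 <= beta < alpha <= 1
    M = np.where(E >= alpha, 1.0, np.where(E <= beta, 0.0, E))
    bnd = (E > beta) & (E < alpha)
    if boundary_value is not None:
        M[bnd] = boundary_value
    return M, bnd
```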

3.2 An l-stage Sequential Three-Way Approach to Constructing a Co-association Matrix

In the \((\alpha ,\beta )\)-cut discussed in the previous subsection, a definite decision cannot be made on the data-point pairs in the boundary region due to insufficient information provided by the input clusterings. By involving more input clusterings, we may be able to refine the boundary region, which results in a sequential three-way approach to constructing a co-association matrix.

Suppose we have the following sequence of sets of input clusterings:

$$\begin{aligned} \mathbb {C}^{in}_1 \subsetneq \mathbb {C}^{in}_2 \subsetneq \cdots \subsetneq \mathbb {C}^{in}_l. \end{aligned}$$
(4)

The proper subset relationship \(\mathbb {C}^{in}_k \subsetneq \mathbb {C}^{in}_{k+1} (1 \le k < l)\) ensures that \(\mathbb {C}^{in}_{k+1}\) contains at least one more input clustering than \(\mathbb {C}^{in}_k\), which provides more information about the clustering of the data points. By using these sets one by one, we obtain an l-stage sequential three-way approach to constructing the co-association matrix. Suppose X is the given dataset and \(\mathbb {X}_k\) is the set of data-point pairs considered in the kth stage. Let \(\mathbb {X}_1=X \times X\) and \(\mathbb {X}_k = \mathrm{BND}_{k-1}(\mathbb {X}_{k-1})\) for \(1 < k \le l\); the three regions in the kth stage are constructed as:

$$\begin{aligned} \mathrm{POS}_k(\mathbb {X}_k)&= \{ (x_i,x_j) \in \mathbb {X}_k \mid eval(x_i,x_j | {\mathbb {C}^{in}_k}) \ge \alpha _k \}, \nonumber \\ \mathrm{NEG}_k(\mathbb {X}_k)&= \{ (x_i,x_j) \in \mathbb {X}_k \mid eval(x_i,x_j | {\mathbb {C}^{in}_k}) \le \beta _k \}, \nonumber \\ \mathrm{BND}_k(\mathbb {X}_k)&= \{ (x_i,x_j) \in \mathbb {X}_k \mid \beta _k < eval(x_i,x_j | {\mathbb {C}^{in}_k}) < \alpha _k \}, \end{aligned}$$
(5)

where \(eval(x_i,x_j | {\mathbb {C}^{in}_k})\) is the evaluation value of \(x_i\) and \(x_j\) calculated based on the set \({\mathbb {C}^{in}_k}\), and the thresholds satisfy the condition \(0 \le \beta _k < \alpha _k \le 1\). Accordingly, the entries in the co-association matrix \(M_{n \times n}\) are determined as follows:

$$\begin{aligned} (M_k^\mathrm{P})&\quad \text {If } (x_i,x_j) \in \mathrm{POS}_k(\mathbb {X}_k), \text { then } M(i,j) = 1, \nonumber \\ (M_k^\mathrm{N})&\quad \text {If } (x_i,x_j) \in \mathrm{NEG}_k(\mathbb {X}_k), \text { then } M(i,j) = 0, \nonumber \\ (M_k^\mathrm{B})&\quad \text {If } (x_i,x_j) \in \mathrm{BND}_k(\mathbb {X}_k), \text { then } M(i,j) = eval(x_i,x_j | {\mathbb {C}^{in}_k}). \end{aligned}$$

One may take special actions to deal with a nonempty final boundary region \(\mathrm{BND}_l(\mathbb {X}_l)\) instead of using the original evaluation values. For example, one may use a two-way process with a threshold r (e.g., 0.5) to clean up the boundary region or use a fixed value (e.g., 0.5) to replace the original evaluation values.
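Assuming the evaluation measure of Eq. (2), one possible sketch of the l-stage construction is the following; the final two-way clean-up with a threshold corresponds to the first option mentioned above, and the stage representation is our own choice.

```python
import numpy as np

def sequential_co_association(stages, n, final_threshold=0.5):
    """l-stage sequential construction of the co-association matrix.

    `stages` is a list of (matrices, alpha_k, beta_k) tuples, where
    `matrices` is the growing set C^in_k of n x n binary matrices.
    Entries fixed to 1 or 0 in an early stage are never revisited;
    each stage only refines the boundary region of the previous one.
    """
    M = np.full((n, n), np.nan)          # undetermined entries
    bnd = np.ones((n, n), dtype=bool)    # X_1 = X x X
    for matrices, alpha, beta in stages:
        E = np.mean(matrices, axis=0)    # eval(. | C^in_k), Eq. (2)
        M[bnd & (E >= alpha)] = 1.0      # POS_k: accept
        M[bnd & (E <= beta)] = 0.0       # NEG_k: reject
        bnd = bnd & (E > beta) & (E < alpha)
        M[bnd] = E[bnd]                  # keep the evaluation values
        if not bnd.any():
            break
    # Two-way clean-up of a nonempty final boundary region.
    M[bnd] = (M[bnd] >= final_threshold).astype(float)
    return M
```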

There are several assumptions in the above sequential three-way approach. Firstly, it is assumed that we are more biased towards putting data-point pairs into the boundary region in an early stage, where limited information is available. This leads to the following relationships among all the thresholds [25]: \(0 \le \beta _1 \le \beta _2 \le \cdots \le \beta _l < \alpha _l \le \alpha _{l-1} \le \cdots \le \alpha _1 \le 1\). By using a more restrictive pair of thresholds in an early stage, a data-point pair is more likely to be put into the boundary region, which reflects a more conservative opinion under limited information. Another assumption is that we do not go back to update the positive and negative regions constructed in earlier stages. In other words, the definite decisions associated with these regions are not revised, even though they might turn out to be inappropriate when more input clusterings become available in a later stage. Consequently, in each stage, we only focus on refining the boundary region constructed in the previous stage.

Example 1

We illustrate the construction of a co-association matrix by the presented approach. Suppose the dataset is \(X = \{o_1, o_2, \ldots , o_{10}\}\). The set \(\mathbb {C}^{in}\) of all input clusterings on X includes the following ten clusterings:

$$\begin{aligned} \mathcal {C}_1&= \big \{ \{o_1,o_2,o_8\}, \{o_3,o_9,o_{10}\}, \{o_4,o_6,o_7\}, \{o_5\} \big \},\\ \mathcal {C}_2&= \big \{ \{o_1,o_4,o_6\}, \{o_2,o_5,o_8\}, \{o_3,o_7,o_9,o_{10}\} \big \},\\ \mathcal {C}_3&= \big \{ \{o_1,o_4,o_6\}, \{o_2,o_8\}, \{o_3,o_5,o_9,o_{10}\}, \{o_7\} \big \},\\ \mathcal {C}_4&= \big \{ \{o_1,o_2,o_7,o_8\}, \{o_3,o_5,o_9,o_{10}\}, \{o_4,o_6\} \big \},\\ \mathcal {C}_5&= \big \{ \{o_1,o_2,o_7,o_8\}, \{o_3,o_9,o_{10}\}, \{o_4,o_5,o_6\} \big \},\\ \mathcal {C}_6&= \big \{ \{o_1,o_4,o_6\}, \{o_2,o_3,o_5,o_9\}, \{o_7\}, \{o_8,o_{10}\} \big \},\\ \mathcal {C}_7&= \big \{ \{o_1,o_4,o_6,o_7\}, \{o_2,o_3,o_8\}, \{o_5,o_9,o_{10}\} \big \},\\ \mathcal {C}_8&= \big \{ \{o_1,o_3,o_7,o_9,o_{10}\}, \{o_2,o_8\}, \{o_4,o_6\}, \{o_5\} \big \},\\ \mathcal {C}_9&= \big \{ \{o_1,o_2,o_4\}, \{o_3,o_5,o_9,o_{10}\}, \{o_6,o_7,o_8\} \big \},\\ \mathcal {C}_{10}&= \big \{ \{o_1,o_4,o_6\}, \{o_2,o_7,o_8\}, \{o_3,o_5,o_9,o_{10}\} \big \}. \end{aligned}$$

We use Eq. (2) to calculate the evaluation values; since it is a symmetric measure, we only need to compute the entries in the upper triangle of the matrix, excluding the diagonal. Suppose \(\mathbb {C}^{in}_1 = \{\mathcal {C}_1, \mathcal {C}_2, \mathcal {C}_3, \mathcal {C}_4, \mathcal {C}_5, \mathcal {C}_6\}\). The evaluation values are given in Table 1(a). By using the thresholds (1, 0), the entries with a grey background are in the boundary region and the remaining entries are in either the positive or the negative region. In stage 2, \(\mathbb {C}^{in}_2 = \mathbb {C}^{in}_1 \cup \{\mathcal {C}_7\}\). The evaluation values for the previous boundary region are updated and given in Table 1(b). By using the thresholds (0.9, 0.1), the previous boundary region stays the same. In stage 3, \(\mathbb {C}^{in}_3 = \mathbb {C}^{in}_2 \cup \{\mathcal {C}_8\}\) and the evaluation values are given in Table 1(c). By using the thresholds (0.8, 0.2), some entries in the previous boundary region are moved to either the positive or the negative region, and the corresponding values in the matrix are changed to 1 or 0, respectively. This process goes on with stage 4, using \(\mathbb {C}^{in}_4 = \mathbb {C}^{in}_3 \cup \{\mathcal {C}_9\}\) and thresholds (0.7, 0.3), and stage 5, using \(\mathbb {C}^{in}_5 = \mathbb {C}^{in}\) and thresholds (0.6, 0.4). If we do not allow overlap between clusters (i.e., we consider hard clusterings) and assume that two data points are clustered together if they are both clustered together with a third data point, then the nonempty boundary region in stage 5 can be cleaned up and the final consensus clustering is \(\big \{ \{o_1,o_4,o_6\}, \{o_2,o_8\}, \{o_3,o_5,o_9,o_{10}\} \big \}\).
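As a usage illustration, individual stage-1 evaluation values of Table 1(a) can be checked with the sketches given earlier (the indices are 0-based, so \(o_i\) becomes index \(i-1\)):

```python
# Stage 1 of Example 1 (o_1, ..., o_10 mapped to indices 0-9), using the
# clustering_to_matrix and co_association sketches from Sect. 2.
stage1 = [
    [{0, 1, 7}, {2, 8, 9}, {3, 5, 6}, {4}],    # C_1
    [{0, 3, 5}, {1, 4, 7}, {2, 6, 8, 9}],      # C_2
    [{0, 3, 5}, {1, 7}, {2, 4, 8, 9}, {6}],    # C_3
    [{0, 1, 6, 7}, {2, 4, 8, 9}, {3, 5}],      # C_4
    [{0, 1, 6, 7}, {2, 8, 9}, {3, 4, 5}],      # C_5
    [{0, 3, 5}, {1, 2, 4, 8}, {6}, {7, 9}],    # C_6
]
mats = [clustering_to_matrix(c, n=10) for c in stage1]
E = co_association(mats)
# o_1 and o_4 are clustered together in C_2, C_3 and C_6 only, so
# eval(o_1, o_4) = 3/6 = 0.5: a boundary entry under thresholds (1, 0).
print(E[0, 3])
```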

Table 1. The construction of a co-association matrix in Example 1

3.3 Two Issues in the Presented Approach

The first issue in the presented sequential three-way approach is to avoid an easy agreement on a definite decision in early stages, where we have limited input clusterings. In other words, data-point pairs should be less likely to be put into the positive and negative regions in early stages. There are at least two possible solutions. One is to use very restrictive thresholds in early stages, such as (1, 0) in the first few stages. Another is to carefully select the input clusterings used in an early stage so that it is not easy for them to agree on a definite decision. This involves determining a proper total number of input clusterings and selecting the basic clustering methods used to generate them. Intuitively, the group of input clusterings should be large enough, since a small group is more likely to agree on a definite decision. The basic clustering methods used to generate the input clusterings should be as diverse as possible so that we can capture different views of clustering the data points. Repeated applications of the same method, such as k-means, are likely to produce similar clusterings even though they start with different initial configurations. We should therefore involve basic clustering methods of various categories, such as density-based methods [5] that model clusters as areas of high density, and EM algorithms [2] that model clusters as probability distributions.

The second issue is the determination of the thresholds. The computation and interpretation of thresholds have been studied for one-step three-way decisions, for example, through a probabilistic approach [24], a game-theoretic approach [9], and a decision-theoretic approach [3]. In order to apply these studies in the presented approach, we need to generalize the current methods to the sequential case and to the specific setting of consensus clustering.

These two issues can also be addressed empirically by tuning the related parameters in experiments. For instance, one may use a fixed decreasing step for \(\alpha \) and a fixed increasing step for \(\beta \) in each stage. The two step lengths can then be tuned through experiments to find the optimal values.

4 Experiments

The experiments are implemented in RStudio based on Microsoft R Open 3.4.2. The implemented algorithm, called the Sequential Three-Way algorithm to Consensus Clustering based on a Co-Association Matrix (S3WCC-CAM), constructs a co-association matrix from a set of input matrices representing the input clusterings and applies a hierarchical clustering method to generate the final clustering. The main procedure of S3WCC-CAM is given as follows, with a code sketch after the list.

  • Input:

    • A set \(\mathbb {C}^{in}\) of \(n \times n\) matrices where n is the number of data points in the dataset. The values in these matrices are in the unit interval [0,1].

    • A number m of input matrices to be used in the first iteration.

    • A number \(r (r \ge 1)\) used to refine the thresholds.

  • Output: A hierarchical final clustering \(\mathcal {HC}\) of the dataset.

  • Step 1: Construct the co-association matrix \(M_{n \times n}\).

    1. Generate a sequence Seq of thresholds refined by r.

    2. Initialize all the entries in the co-association matrix \(M_{n \times n}\) to be \(N{\slash }A\) (i.e., not available) and the subset \(\mathbb {C}^{in}_{it}\) of input matrices to be empty. Throughout the procedure, \(\mathbb {C}^{in}_{it}\) is the set of visited input matrices in \(\mathbb {C}^{in}\) and \((\mathbb {C}^{in} - \mathbb {C}^{in}_{it})\) is the set of non-visited input matrices.

    3. Perform the following steps iteratively until either the boundary region or the set \((\mathbb {C}^{in} - \mathbb {C}^{in}_{it})\) is empty:

      • Get the next pair of thresholds \((\alpha , \beta )\) from the sequence Seq.

      • If it is the first iteration, select a set of m matrices from \((\mathbb {C}^{in} - \mathbb {C}^{in}_{it})\) and add them to \(\mathbb {C}^{in}_{it}\). Otherwise, select one matrix from \((\mathbb {C}^{in} - \mathbb {C}^{in}_{it})\) and add it to \(\mathbb {C}^{in}_{it}\).

      • Based on the set \(\mathbb {C}^{in}_{it}\), update the evaluation values of all data-point pairs in the current boundary region, divide these pairs into three regions and update the entries in M accordingly.

    4. If the boundary region is not empty, update all its entries with 0.5.

  • Step 2: Generate the hierarchical clustering \(\mathcal {HC}\) by applying a hierarchical clustering method to M.
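The following Python sketch condenses Step 1 of the above procedure. It is our illustration, not the authors' R code; in particular, it uses random selection in place of the preference for non-visited algorithms described below, and it applies the step length of Eq. (6) given later.

```python
import random
import numpy as np

def s3wcc_build_matrix(inputs, m, r):
    """Step 1 of S3WCC-CAM (sketch): iterative matrix construction.

    `inputs` is a list of n x n matrices representing C^in; m matrices
    are used in the first iteration and one more matrix in each later
    iteration, with thresholds refined by r as in Eq. (6).
    """
    n = inputs[0].shape[0]
    max_iter = len(inputs) - m + 1
    delta = 1.0 / (2 * max_iter - 1) / r       # step length of Eq. (6)
    alpha, beta = 1.0, 0.0                     # most restrictive pair
    M = np.full((n, n), np.nan)                # all entries start as N/A
    bnd = np.ones((n, n), dtype=bool)
    visited, remaining = [], list(inputs)
    while bnd.any() and remaining:
        take = m if not visited else 1
        random.shuffle(remaining)              # in place of the preference
        visited += remaining[:take]            # for non-visited algorithms
        remaining = remaining[take:]
        E = np.mean(visited, axis=0)           # evaluation values, Eq. (2)
        M[bnd & (E >= alpha)] = 1.0            # positive region
        M[bnd & (E <= beta)] = 0.0             # negative region
        bnd &= (E > beta) & (E < alpha)
        M[bnd] = E[bnd]
        alpha, beta = alpha - delta, beta + delta
    M[bnd] = 0.5                               # step (4): fill the rest
    return M
```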

The input matrices in \(\mathbb {C}^{in}\) are produced by applying basic clustering algorithms to a dataset. These include 12 algorithms implemented in the package diceR [1], namely, AP, BLOCK, CMEANS, GMM, SC, SOM, DIANA_Euclidean, HC_Euclidean, HDBSCAN, KM_Euclidean, NMF_Scd (or NMF_Lee), and PAM_Euclidean. Each algorithm can be applied repeatedly with different tuning parameters, such as a given number of clusters and a distance measure. In the current implementation, we only consider the Euclidean distance and run each algorithm three times with the number of clusters set to 3, 4, and 5, respectively. In total, this produces 36 clusterings represented by 36 \(n \times n\) matrices that comprise the input set \(\mathbb {C}^{in}\).

The sequence Seq of thresholds starts from the most restrictive pair (1, 0). The other pairs are generated according to two step lengths, one \(\delta _{\alpha }\) for decreasing \(\alpha \) and another \(\delta _{\beta }\) for increasing \(\beta \). In the current implementation, we consider a simple case where \(\delta _{\alpha } = \delta _{\beta } = \delta \). The step length \(\delta \) is calculated as:

$$\begin{aligned} \delta = \frac{1}{2(|\mathbb {C}^{in}| - m + 1) - 1} \cdot \frac{1}{r}, \end{aligned}$$
(6)

where the number \(|\mathbb {C}^{in}| - m + 1\) is the maximum number of iterations.
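A sketch of the generation of the sequence Seq from Eq. (6) is given below; the function name threshold_sequence is ours.

```python
def threshold_sequence(n_inputs, m, r):
    """Generate Seq of (alpha, beta) pairs, starting from (1, 0).

    Both thresholds move by the common step length delta of Eq. (6);
    the number of pairs equals the maximum number of iterations.
    """
    max_iter = n_inputs - m + 1
    delta = 1.0 / (2 * max_iter - 1) / r
    return [(1.0 - k * delta, 0.0 + k * delta) for k in range(max_iter)]

# Example: 36 input matrices, m = 6, r = 1; seq[0] == (1.0, 0.0).
seq = threshold_sequence(36, 6, 1)
```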

Each iteration in step (3) of Step 1 represents a stage of the presented sequential three-way approach. In order to use input clusterings that are as varied as possible, when selecting matrices from \((\mathbb {C}^{in} - \mathbb {C}^{in}_{it})\), we prefer matrices produced by non-visited clustering algorithms, that is, algorithms that have not produced any matrix in the set \(\mathbb {C}^{in}_{it}\) of visited matrices. If there are more candidate matrices than required, we randomly select the required number from them. To deal with a nonempty boundary region after the iterations, we update all its entries with the value 0.5. The hierarchical clustering method used in Step 2 adopts an agglomerative strategy with average linkage (UPGMA) [18] to find and merge similar clusters, as implemented in the package diceR [1].
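The paper's Step 2 uses the agglomerative method of diceR; an analogous step in Python, treating \(1 - M(i,j)\) as a dissimilarity and applying average linkage via SciPy, might look as follows. This is a substitute sketch, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import average, fcluster
from scipy.spatial.distance import squareform

def consensus_hierarchy(M, n_clusters):
    """Average-linkage (UPGMA) clustering on the co-association matrix.

    The dissimilarity between two data points is taken as 1 - M(i, j).
    """
    D = 1.0 - M
    np.fill_diagonal(D, 0.0)                  # required by squareform
    Z = average(squareform(D, checks=False))  # condensed distance form
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```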

The algorithm S3WCC-CAM is applied to two datasets: iris from the UCI repository and hgsc from the diceR package. The dataset iris includes 150 data points described by 4 attributes; a fifth attribute of class labels is ignored in the clustering process and used as an external reference in the evaluations. The dataset hgsc includes 489 data points described by 321 attributes, without an attribute of class labels. Due to the limitations of our experimental environment, the algorithm is not applied to large datasets in the current experiments; this might be a direction for future work. The evaluation value of a data-point pair is computed as the proportion of times that the two data points are clustered together out of the times that they are both chosen in the bootstrap resampling [1], as implemented in the package diceR. Table 2 lists the configurations of m and r considered in our experiments.

Table 2. Configurations of m and r in the experiments

The results of S3WCC-CAM are compared with the Cluster-based Similarity Partitioning Algorithm (CSPA) [19] and the Link-based Cluster Ensemble method (LCE) [11]. The clustering results are measured by both internal and external indices implemented in the package diceR [1]. The internal indices include avg_within, which measures the average distance within clusters; avg_between, which measures the average distance between clusters; and avg_silwidth, the average Silhouette width. Thus, a smaller avg_within, a larger avg_between and a larger avg_silwidth indicate a better clustering. The external indices measure the similarity of two clusterings by using the class labels as an external reference. The two external indices used in our experiments are the corrected Rand index (corrected_rand) [10] and Meila's variation of information index (vi) [17]. The corrected Rand index ranges from \(-1\) to 1, with \(-1\) indicating no agreement and 1 indicating perfect agreement. Meila's variation index measures the variation of information between two clusterings based on mutual information; it has an upper bound of \(\log n\), where n is the number of data points, and a smaller value indicates a better clustering. Table 3 summarizes the results of all the above indices. In addition, Table 3 shows the run time (run_time) and, for S3WCC-CAM, the percentage of the boundary region when the iterations stop (BND_perc). Since the dataset hgsc does not contain class labels, only the internal indices are evaluated on it.
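For reference, both external indices can be computed from two label vectors. The sketch below uses scikit-learn's adjusted (corrected) Rand index and derives Meila's variation of information from its definition \(VI = H(U) + H(V) - 2I(U,V)\); it is our illustration, not the diceR implementation.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import adjusted_rand_score, mutual_info_score

def variation_of_information(labels_a, labels_b):
    """Meila's VI = H(U) + H(V) - 2 I(U, V), in nats.

    Both label vectors are assumed to be non-negative integer arrays.
    """
    h_a = entropy(np.bincount(labels_a))   # entropy normalizes the counts
    h_b = entropy(np.bincount(labels_b))
    return h_a + h_b - 2.0 * mutual_info_score(labels_a, labels_b)

# corrected_rand = adjusted_rand_score(true_labels, predicted_labels)
```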

Table 3. A summary of the experiment results

As shown in Table 3, S3WCC-CAM generally produces clustering results as good as those of CSPA and LCE according to the internal and external indices. In terms of run time, S3WCC-CAM outperforms LCE under all configurations and CSPA under most configurations, especially on the dataset hgsc. Different configurations of m and r in S3WCC-CAM have a significant influence on run_time and BND_perc. A further study, either experimental or theoretical, of the optimal configuration is necessary and might be a direction for future work.

5 Conclusions and Future Work

We present a sequential three-way approach to progressively constructing a co-association matrix in multiple stages. In each stage, we calculate the evaluation values based on a set of input clusterings. A pair of thresholds is then used to cut the evaluation values and, accordingly, divide the data-point pairs into three disjoint positive, negative and boundary regions. The entries in the co-association matrix corresponding to the positive and negative regions are updated with the highest evaluation value 1 and the lowest evaluation value 0, respectively; this amounts to a definite decision to either cluster the two data points together or separate them. By gradually involving more input clusterings, we refine the evaluation values in the boundary regions and make definite decisions where possible. By fixing entries to 1 or 0 as soon as sufficient information is obtained from the input clusterings, the presented approach makes quick definite decisions on the clustering of some data points in early stages. In this way, we may reduce the overall computational cost of constructing the co-association matrix and obtaining the final clustering.

One direction of future work is to address the two issues in the presented approach mentioned above. A second direction is to generalize the presented sequential approach to other consensus clustering methods that do not use a co-association matrix. A third direction is a further experimental study, including the optimal configuration of S3WCC-CAM as well as its application to larger datasets.