1 Introduction

Co-clustering, also known as biclustering or block clustering, involves the simultaneous clustering of the set of observations and the set of features of a data matrix. By permuting rows and columns, co-clustering algorithms aim to reorganize the initial data matrix into homogeneous blocks. These blocks, also called co-clusters, can therefore be defined as submatrices characterized by a subset of observations and a subset of features whose elements are similar. Other types of co-clustering approaches can be found in [11] and [12]. Co-clustering algorithms present several advantages: they reduce the initial matrix to a simpler form with the same basic structure and require far less computation than separate processing of the two sets; see for instance [8]. As a result, these methods are of great interest in data mining.

In this work, we focus on co-clustering methods that seek a block diagonal structure, i.e. methods in which the number of row clusters is equal to the number of column clusters. An illustration is given in Fig. 1, where (a) represents an original binary matrix, (b) represents the same matrix after a proper permutation of rows, and (c) adds a permutation of columns, resulting in a clear block diagonal structure. These methods have proven efficient for the document-word co-clustering problem, whose objective is to group documents based on the words they contain and to group words based on the documents in which they appear. The dataset is typically represented by a \(document\times word\) matrix. In [10], the author proposed a block diagonal algorithm for binary data, which alternates the clustering of observations and features while minimizing the error between the original data matrix and the matrix reconstructed from the cluster structure. In [3], the author proposed a spectral solution: a bipartite graph is built from the \(document\times word\) matrix and partitioned so as to minimize the cut objective function.

In this paper we propose a new diagonal co-clustering algorithm based on the minimization of a heterogeneity measure of blocks. This measure takes into account both the within-block variance and a quantity named the centroid effect [5], defined as the squared deviation between the mean entry of each block and the maximum entry of the input matrix. The proposed algorithm, in addition to being very efficient for co-clustering sparse data, is also faster than most state-of-the-art algorithms and can therefore handle high-dimensional data.

Fig. 1. Original binary data (a), the same data reorganized according to rows (b), and reorganized according to rows and columns (c).

The remainder of this paper is organized as follows. Section 2 provides the necessary background on the Double Kmeans (DKM) algorithm and presents the challenge of diagonal co-clustering. Section 3 presents the Diagonal Double Kmeans (DDKM) algorithm that we propose. Section 4 is devoted to numerical experiments on real datasets showing the appropriateness of our contribution for binary and continuous data. The final section sums up the study and gives recommendations for further research.

Notation. Let \(\mathbf {X}:=\{x_{ij};i \in I, j \in J\}\) be a data matrix of size \(n \times p\) where \(I=\{1,\ldots ,n\}\) and \(J=\{1,\ldots ,p\}\). The set I corresponds to the set of n objects and the set J to the set of p attributes. In the sequel, our aim consists in obtaining a co-clustering of \(\mathbf {X}\). Let \(\mathbf {Z}=\{z_{1},\ldots ,z_{n}\}\), where \(z_{i} \in \{1,\ldots ,K\}\), be a label vector denoting the partition of I into K clusters, and \(\mathbf {W}=\{w_{1},\ldots ,w_{p}\}\), where \(w_{j} \in \{1,\ldots ,H\}\), a label vector denoting the partition of J into H clusters. The partition of I (respectively J) can also be represented by a binary matrix whose rows belong to \(\{0,1\}^K\) (respectively \(\{0,1\}^H\)) and satisfy \(\sum _{k=1}^K z_{ik}=1\) (respectively \(\sum _{h=1}^H w_{jh}=1\)). Finally, to simplify the notation, the sums relating to rows, columns, row clusters and column clusters will be subscripted respectively by the letters \(i=1,\ldots ,n\), \(j=1,\ldots ,p\), \(k=1,\ldots ,K\) and \(h=1,\ldots ,H\), without indicating the implicit limits of variation. For example, the sum \(\sum _{i,k}\) stands for \(\sum _{i=1}^n\sum _{k=1}^K\).
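To make the two representations of a partition concrete, the following short sketch (Python with NumPy, purely illustrative; the label values are hypothetical) converts label vectors into the equivalent binary indicator matrices.

```python
import numpy as np

def labels_to_indicator(labels, n_clusters):
    """Turn a label vector with values in 1..n_clusters into a binary indicator
    matrix whose rows sum to 1, matching the constraint sum_k z_ik = 1."""
    labels = np.asarray(labels)
    indicator = np.zeros((len(labels), n_clusters), dtype=int)
    indicator[np.arange(len(labels)), labels - 1] = 1
    return indicator

# Hypothetical labels: n = 5 objects in K = 2 clusters, p = 4 features in H = 3 clusters.
Z = labels_to_indicator([1, 2, 1, 1, 2], n_clusters=2)   # shape (5, 2)
W = labels_to_indicator([3, 1, 2, 3], n_clusters=3)      # shape (4, 3)
print(Z.sum(axis=1), W.sum(axis=1))                      # every row sums to 1
```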

2 Co-clustering and Diagonal Block Structure

Co-clustering can be formulated as the search for a good approximation of the original data matrix \(\mathbf {X}\). The quality of the approximation is determined by the approximation error, which can be measured by a large class of loss functions such as the squared Euclidean distance. The approximation is generally obtained through an alternating least squares minimization process (see for instance [1, 7]). The Double Kmeans algorithm [16] is based on this principle.

2.1 Double Kmeans Algorithm

Formally, the aim is to minimize an objective function \(J(\mathbf {Z},\mathbf {W},\mathbf {G})\) where \(\mathbf {Z}\) and \(\mathbf {W}\) are the partitions and \(\mathbf {G}:=\{g_{kh}; k \in \{1,\ldots ,K\}, h \in \{1,\ldots ,H\}\}\) is a \(K \times H\) matrix which can be viewed as a summary of the data matrix \(\mathbf {X}\) (see Fig. 2).

Fig. 2. Original data matrix \(\mathbf {X}\) and its summary after co-clustering into 6 co-clusters.

Each element \(g_{kh}\) of \(\mathbf {G}\) is called the prototype of co-cluster \(\mathbf {X}_{kh}:=\{x_{ij}; z_{ik}w_{jh}=1\}\). Double Kmeans (DKM) adopts the squared Euclidean distance to measure the dissimilarity between the matrix \(\mathbf {X}\) and the structure described by \(\mathbf {Z}\), \(\mathbf {W}\) and \(\mathbf {G}\). Therefore, \(J(\mathbf {Z},\mathbf {W},\mathbf {G})\) is given by

$$\begin{aligned} J(\mathbf {Z},\mathbf {W},\mathbf {G})= \sum _{i,k,j,h}z_{ik}\, w_{jh}(x_{ij} - g_{kh})^{2}=||\mathbf {X}-\mathbf {Z}\mathbf {G}\mathbf {W}^T||^2, \end{aligned}$$
(1)

where ||.|| denotes the Frobenius norm. It is easy to see that for fixed \((\mathbf {Z},\mathbf {W})\) the optimal values of \(\mathbf {G}\) are the means of the \(\mathbf {X}_{kh}\)'s. The optimal partitions \(\mathbf {Z}\) and \(\mathbf {W}\) are found using an iterative algorithm. A version of Double Kmeans is presented in Algorithm 1, where \(z_{.k}\) (resp. \(w_{.h}\)) denotes the cardinality of the k\(^{th}\) row cluster (resp. h\(^{th}\) column cluster).

Algorithm 1. Double Kmeans (DKM).
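As a complement to Algorithm 1, the following NumPy sketch illustrates the alternating DKM updates. It is illustrative only: it assumes dense data and random initialisation, and it omits the safeguards against empty clusters that a full implementation would need.

```python
import numpy as np

def double_kmeans(X, K, H, n_iter=20, seed=0):
    """Minimal Double Kmeans sketch: alternately update G, Z and W
    to decrease ||X - Z G W^T||^2 (Eq. 1)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    z = rng.integers(K, size=n)            # row labels in 0..K-1
    w = rng.integers(H, size=p)            # column labels in 0..H-1
    for _ in range(n_iter):
        Z, W = np.eye(K)[z], np.eye(H)[w]  # indicator matrices (n, K) and (p, H)
        counts = np.outer(Z.sum(0), W.sum(0))
        G = (Z.T @ X @ W) / np.maximum(counts, 1)   # block means: optimum of Eq. 1 for fixed Z, W
        # Assign each row to the closest row of the reconstruction G W^T.
        z = np.argmin(((X[:, None, :] - (G @ W.T)[None, :, :]) ** 2).sum(2), axis=1)
        Z = np.eye(K)[z]
        counts = np.outer(Z.sum(0), W.sum(0))
        G = (Z.T @ X @ W) / np.maximum(counts, 1)
        # Assign each column to the closest column of Z G.
        w = np.argmin(((X.T[:, None, :] - (Z @ G).T[None, :, :]) ** 2).sum(2), axis=1)
    return z, w, G
```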

2.2 Block Diagonal Structure

The DKM algorithm is not well suited when one looks for a one-to-one correspondence between the two partitions \(\mathbf {Z}\) and \(\mathbf {W}\). In order to deal with this specific case, we first consider \(w_{jk}\) instead of \(w_{jh}\); in other words, we assume that \(H=K\). Secondly, the diagonal structure involves imposing constraints on \(\mathbf {G}\), for instance by taking \(g_{kk}=\delta \) \(\forall k\). This leads us to consider the following criterion:

$$\begin{aligned} J(\mathbf {X}, \mathbf {Z},\mathbf {W})=\sum _{i,j,k}z_{ik}w_{jk}(x_{ij}-\delta )^2, \end{aligned}$$
(2)

where \(\delta \) is assumed to be known. The choice of this parameter will be discussed in the next section. The couple of partitions \((\mathbf {Z},\mathbf {W})\) optimizing the criterion given in Eq. 2 is found using the following iterative algorithm:

  1.

    Update \(\mathbf {Z}\) the partition of objects, with \(\mathbf {W}\) fixed. This leads to the following formula

    $$\begin{aligned} z_{i}=\mathop {{{\mathrm{arg\,min}}}}\limits _{k}\sum _{j}w_{jk}(x_{ij}-\delta )^2, \end{aligned}$$
  2.

    Update \(\mathbf {W}\), the partition of features, with \(\mathbf {Z}\) fixed. This leads to the following formula

    $$\begin{aligned} w_{j}=\mathop {{{\mathrm{arg\,min}}}}\limits _{k}\sum _{i}z_{ik}(x_{ij}-\delta )^2. \end{aligned}$$

From these formulae, one observes that seeking a diagonal structure of blocks indirectly introduces a strong dependency between object assignments (respectively feature assignments) to a block and the number of features (respectively objects) belonging to this block. If we consider object assignments, we have \((x_{ij}-\delta )^2\ge 0, \forall i,j\); therefore, the larger the number of features in a block, the lower the chance that an object is assigned to this particular block. The same phenomenon occurs for feature assignments. This leads us to take the size of each co-cluster into account in order to avoid empty blocks, as illustrated by the sketch below.
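The following toy computation (hypothetical values, for illustration only) shows this bias: with the unnormalized rules derived from Eq. 2, a large block whose entries all lie close to \(\delta \) can still cost more than a small, poorly fitting block, so assignments drift towards small blocks.

```python
import numpy as np

delta = 1.0
small_block = np.full(5, 0.2)     # 5 features, entries far from delta
large_block = np.full(50, 0.7)    # 50 features, entries close to delta

# Unnormalized assignment costs, as in the update rules above:
cost_small = np.sum((small_block - delta) ** 2)   # 5  * 0.64 = 3.2
cost_large = np.sum((large_block - delta) ** 2)   # 50 * 0.09 = 4.5
print(cost_small, cost_large)     # the object is (wrongly) drawn to the small block

# Dividing by the block size, as done in the next section, removes the bias:
print(cost_small / 5, cost_large / 50)            # 0.64 vs 0.09
```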

3 Diagonal Double Kmeans

3.1 Criterion and Proposed Algorithm

In order to correct the bias introduced by the diagonal structure and to prevent some blocks from vanishing, we propose a modified criterion that takes the number of elements in each block into account. This criterion takes the following form:

$$\begin{aligned} J(\mathbf {X}, \mathbf {Z},\mathbf {W})=\sum _{k}\frac{1}{z_{.k}w_{.k}}\sum _{i,j}z_{ik}w_{jk}(x_{ij}-\delta )^2. \end{aligned}$$
(3)

where \(z_{.k}\) and \(w_{.k}\) denote respectively the number of objects and the number of features in the k-th block. Furthermore, it is interesting to note that the criterion given in Eq. 3 can be expressed in terms of the variance of a given block \((\mathbf {Z}_k,\mathbf {W}_k)\) and the squared deviation of its mean from \(\delta \) (the centroid effect [5] when \(\delta \) is the maximum entry of the data):

$$\begin{aligned} J(\mathbf {X}, \mathbf {Z},\mathbf {W})&= \sum _{i,j,k}\frac{z_{ik}w_{jk}}{z_{.k}w_{.k}}(x_{ij}-\overline{x}_{k})^2 + \sum _{i,j,k}\frac{z_{ik}w_{jk}}{z_{.k}w_{.k}}(\overline{x}_{k}-\delta )^2 \nonumber \\&= \sum _{k} \overline{s}_{k}^2+ \sum _{k}(\overline{x}_{k}-\delta )^2. \end{aligned}$$
(4)

where

$$\overline{x}_{k}=\frac{1}{z_{.k}w_{.k}}\sum _{i,j}z_{ik}w_{jk}x_{ij} \text{ and } \overline{s}_{k}^2=\frac{1}{z_{.k}w_{.k}}\sum _{i,j}z_{ik}w_{jk}(x_{ij}-\overline{x}_{k})^2$$

denote respectively the mean and the variance of the k-th block. The first term of Eq. 4 ensures the homogeneity of each block, while the second one draws the block centers towards \(\delta \). The objective function (Eq. 3) is optimized by alternately minimizing two conditional criteria, given \(\mathbf {W}\) and given \(\mathbf {Z}\) respectively:

$$\begin{aligned} \tilde{J}_1(\mathbf {X}, \mathbf {Z}|\mathbf {W})=\sum _{k} \frac{1}{z_{.k}}\sum _{i}z_{ik}\frac{1}{w_{.k}}\sum _{j}w_{jk}(x_{ij}-\delta )^2 \end{aligned}$$

and

$$\begin{aligned} \tilde{J}_2(\mathbf {X}, \mathbf {W}|\mathbf {Z})=\sum _{k}\frac{1}{w_{.k}}\sum _{j}w_{jk}\frac{1}{z_{.k}}\sum _{i}z_{ik}(x_{ij}-\delta )^2. \end{aligned}$$

The optimization of \(\tilde{J}_1\) and \(\tilde{J}_2\) leads to the following update rules:

$$\begin{aligned} z_{i}=\mathop {{{\mathrm{arg\,min}}}}\limits _{k}\frac{1}{w_{.k}}\sum _{j}w_{jk}(x_{ij}-\delta )^2, \end{aligned}$$
(5)
$$\begin{aligned} w_{j}=\mathop {{{\mathrm{arg\,min}}}}\limits _{k}\frac{1}{z_{.k}}\sum _{i}z_{ik}(x_{ij}-\delta )^2. \end{aligned}$$
(6)
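As a sanity check, the short sketch below (random data and arbitrary partitions, illustrative only) verifies numerically that the criterion of Eq. 3 coincides with its decomposition into within-block variance and centroid effect given in Eq. 4.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, K = 30, 20, 3
X = rng.random((n, p))
z = rng.integers(K, size=n)                  # arbitrary row partition
w = rng.integers(K, size=p)                  # arbitrary column partition
delta = X.max()

J_eq3, J_eq4 = 0.0, 0.0
for k in range(K):
    block = X[np.ix_(z == k, w == k)]
    if block.size == 0:
        continue
    J_eq3 += np.mean((block - delta) ** 2)               # Eq. (3), contribution of block k
    J_eq4 += block.var() + (block.mean() - delta) ** 2   # Eq. (4), same block
print(np.isclose(J_eq3, J_eq4))              # True
```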

The proposed algorithm, Diagonal Double Kmeans (DDKM) (Algorithm 2), is computationally efficient and its complexity can be shown to be \(O(\tau \times npK)\), where \(\tau \) denotes the number of iterations required to reach convergence, and n, p and K are respectively the number of objects (i.e. rows), features (i.e. columns) and clusters.

Algorithm 2. Diagonal Double Kmeans (DDKM).
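To complement the pseudo-code of Algorithm 2, here is a compact NumPy sketch of DDKM under simplifying assumptions (dense data, \(\delta \) given, random initialisation, no special handling of empty clusters); it implements the update rules of Eqs. 5 and 6.

```python
import numpy as np

def diagonal_double_kmeans(X, K, delta, n_iter=30, seed=0):
    """Minimal DDKM sketch: alternately apply the size-normalized
    update rules (5) and (6) to decrease the criterion of Eq. (3)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    z = rng.integers(K, size=n)                     # row labels in 0..K-1
    w = rng.integers(K, size=p)                     # column labels in 0..K-1
    D = (X - delta) ** 2                            # (x_ij - delta)^2, computed once
    for _ in range(n_iter):
        W = np.eye(K)[w]                            # (p, K) column-cluster indicator
        w_sizes = np.maximum(W.sum(axis=0), 1)
        z = np.argmin((D @ W) / w_sizes, axis=1)    # rule (5): mean cost per column cluster
        Z = np.eye(K)[z]                            # (n, K) row-cluster indicator
        z_sizes = np.maximum(Z.sum(axis=0), 1)
        w = np.argmin((D.T @ Z) / z_sizes, axis=1)  # rule (6): mean cost per row cluster
    return z, w
```

Each iteration costs \(O(npK)\), in line with the complexity stated above; for a binary document-word matrix, \(\delta \) would typically be set to 1, the maximum entry, as discussed in the next subsection.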

3.2 Choice of \(\delta \)

Herein, we discuss the choice of \(\delta \). Specifically, we compare the optimal value of \(\delta \) with the maximum entry of the matrix.

  1.

    If we consider \(\delta \) as an unknown parameter, its optimal value for the minimized criterion is the average of the block means. Indeed, with \(\mathbf {Z}\) and \(\mathbf {W}\) fixed, setting the derivative of J (Eq. 3) to zero yields \( \delta =\frac{1}{K}\sum _{k}\overline{x}_{k} \), where \(\overline{x}_{k}\) denotes the mean of the k-th block. Although this value of \(\delta \) is optimal, we observe that in the context of sparse data, i.e. when the data matrix contains a high percentage of zeros, it tends to 0, leading to a diagonal structure of blocks of zeros. An illustration of the co-clustering obtained with this value on the CSTR dataset (described in the numerical experiments section) is given in Fig. 3(c).

  2.

    Another way to proceed is to set the value of \(\delta \) at the initialisation step. DDKM aims at grouping objects and features with the strongest possible association. For instance, in the case of a binary data matrix \(\mathbf {X}\), the strongest association between an object i and a feature j is expressed by \(x_{ij}=1\), which is the maximum entry of \(\mathbf {X}\). Choosing the maximum therefore guarantees the homogeneity of the diagonal blocks while favouring blocks of 0 outside the diagonal. In [5, 13], the authors proposed hierarchical algorithms based on this idea. We use the same example as for the optimal value to show the result in Fig. 3(b). It is important to stress that this approach requires the values of the data matrix to be comparable. This is the case for binary data or normalized data, as we will see in the next section devoted to document-word partitioning. A small sketch contrasting the two choices of \(\delta \) is given after this list.
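The sketch below is illustrative only: the sparse binary matrix and the partitions are randomly generated stand-ins, used solely to contrast the two choices of \(\delta \).

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
X = (rng.random((30, 20)) < 0.1).astype(float)   # hypothetical sparse binary matrix
z = rng.integers(K, size=X.shape[0])             # some row partition
w = rng.integers(K, size=X.shape[1])             # some column partition

# Choice 1: optimal value for Eq. 3, the average of the block means;
# on sparse data it is close to 0, yielding diagonal blocks of zeros (Fig. 3(c)).
blocks = [X[np.ix_(z == k, w == k)] for k in range(K)]
delta_opt = np.mean([b.mean() for b in blocks if b.size > 0])

# Choice 2: the maximum entry of X (here 1), which favours diagonal blocks
# of strong associations (Fig. 3(b)).
delta_max = X.max()
print(delta_opt, delta_max)
```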

Fig. 3. (a) The original CSTR dataset, (b) CSTR reorganised according to the partitions when \(\delta =\max _{i,j} x_{ij}\), (c) CSTR reorganised according to the partitions when \(\delta \) is estimated by \(\frac{1}{K}\sum _{k}\overline{x}_{k}\).

4 Numerical Experiments

4.1 Performance Evaluation

In order to assess and compare the performance of the proposed algorithm, we use three commonly adopted metrics: the accuracy, the Normalized Mutual Information [15] and the Adjusted Rand Index [9]. We focus only on the quality of the row clustering. The clustering accuracy, noted Acc, is one of the most widely used evaluation criteria and is defined as:

$$\begin{aligned} \text {Acc}=\frac{1}{n}max\;\left[ \sum _{\mathcal {C}_{k},\mathcal {L}_{\ell }}T(\mathcal {C}_{k},\mathcal {L}_{\ell })\right] \end{aligned}$$

where \(\mathcal {C}_{k}\) is the k\(^{th}\) cluster in the final result and \(\mathcal {L}_{\ell }\) is the true \(\ell ^{th}\) class. \(T(\mathcal {C}_{k},\mathcal {L}_{\ell })\) is the number of objects common to \(\mathcal {C}_{k}\) and \(\mathcal {L}_{\ell }\), i.e. \(T(\mathcal {C}_{k},\mathcal {L}_{\ell })=|\mathcal {C}_{k}\cap \mathcal {L}_{\ell }|\). The accuracy is the maximum, over all one-to-one matchings between clusters and classes, of the sum of \(T(\mathcal {C}_{k},\mathcal {L}_{\ell })\) over the matched pairs. The second measure employed is the Normalized Mutual Information (NMI), calculated as follows:

$$\begin{aligned} \text {NMI}=\frac{\sum _{k,\ell }\frac{n_{k\ell }}{n}\log \frac{n\,n_{k\ell }}{n_{k}\hat{n}_{\ell }}}{\sqrt{\left( \sum _{k}\frac{n_{k}}{n}\log \frac{n_{k}}{n}\right) \left( \sum _{\ell }\frac{\hat{n}_{\ell }}{n}\log \frac{\hat{n}_{\ell }}{n}\right) }} \end{aligned}$$

where \(n_{k}\) denotes the number of objects contained in the cluster \(\mathcal {C}_{k}\) \((1 \le k \le K)\), \(\hat{n}_{\ell }\) the number of objects belonging to the class \(\mathcal {L}_{\ell }\) \((1 \le \ell \le K)\), and \(n_{k\ell }\) the number of objects in the intersection of the cluster \(\mathcal {C}_{k}\) and the class \(\mathcal {L}_{\ell }\). The last measure, the Adjusted Rand Index, noted ARI, measures the similarity between two partitions. From a mathematical standpoint, the Rand index is related to the accuracy. The adjusted form of the Rand index is defined as:

$$\begin{aligned} \text {ARI}=\frac{\sum _{k,\ell }\left( {\begin{array}{c}n_{k\ell }\\ 2\end{array}}\right) -\left[ \sum _{k}\left( {\begin{array}{c}n_{k}\\ 2\end{array}}\right) \sum _{\ell }\left( {\begin{array}{c}\hat{n}_{\ell }\\ 2\end{array}}\right) \right] /\left( {\begin{array}{c}n\\ 2\end{array}}\right) }{\frac{1}{2}\left[ \sum _{k}\left( {\begin{array}{c}n_{k}\\ 2\end{array}}\right) +\sum _{\ell }\left( {\begin{array}{c}\hat{n}_{\ell }\\ 2\end{array}}\right) \right] -\left[ \sum _{k}\left( {\begin{array}{c}n_{k}\\ 2\end{array}}\right) \sum _{\ell }\left( {\begin{array}{c}\hat{n}_{\ell }\\ 2\end{array}}\right) \right] /\left( {\begin{array}{c}n\\ 2\end{array}}\right) }. \end{aligned}$$

The values of these three metrics range between 0 and 1; a value close to 1 indicates a good clustering result.
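The following sketch (assuming NumPy, SciPy and scikit-learn are available) shows one way to compute the three metrics from a true and a predicted label vector; the labels are hypothetical, the accuracy uses the Hungarian algorithm to find the best one-to-one cluster-to-class matching, and the geometric normalization is selected so that NMI matches the definition above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one matching between clusters and classes."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table T(C_k, L_l): number of objects in cluster k and class l.
    T = np.array([[np.sum((y_pred == c) & (y_true == lab)) for lab in classes]
                  for c in clusters])
    row, col = linear_sum_assignment(-T)          # maximise the matched counts
    return T[row, col].sum() / len(y_true)

# Hypothetical label vectors, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 0, 0, 0, 2, 2, 2])

print("Acc :", clustering_accuracy(y_true, y_pred))                                  # 0.875
print("NMI :", normalized_mutual_info_score(y_true, y_pred, average_method="geometric"))
print("ARI :", adjusted_rand_score(y_true, y_pred))
```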

4.2 Compared Algorithms

We compare against state-of-the-art (co-)clustering methods including Spherical Kmeans (SpKM) [4], Double Kmeans (DKM) and Spectral Co-Clustering (SpCo) [3]. We also report the clustering results of Kmeans and Nonnegative Matrix Factorization (NMF) [2] as baselines. The Spherical Kmeans algorithm is basically a Kmeans algorithm that uses the cosine dissimilarity instead of the Euclidean distance; it is known to be very efficient on sparse datasets and to converge quickly. We use the Matlab implementations of Kmeans and NMF. For the SpCo algorithm we use the implementation proposed by Assaf Gottlieb (Footnote 1), and we use the SpKM implementation given in [14]. Finally, we implement CROEUC [6], a fast version of Double Kmeans (DKM) whose advantage lies in the use of intermediate matrices of reduced size rather than the original data.

4.3 Datasets and Results

We study the effectiveness of our algorithm on some well-known text datasets with different sizes and balances (Footnote 2): CSTR, Classic3, WebKB4 and two subsets of the 20 Newsgroups dataset. The 20 Newsgroups dataset is organized into 20 topics, some closely related and others highly unrelated. The topics included in the two subsets are described in Table 1. The NG2 dataset includes topics that are not related (rec.motorcycles and sci.crypt, sci.space), while NG5 includes closely related topics, leading to overlapping clusters (rec.sport.baseball, sci.crypt, sci.med, talk.religion.misc, comp.windows.x, soc.religion.christian, talk.politics.mideast). A detailed description of all datasets can be found in Table 1.

Table 1. Description of the datasets in terms of size (\(n\times p\)), number of clusters (K), sparsity (percentage of zeros) and cluster balance. A partition is considered balanced if the balance coefficient is close to 1 and unbalanced otherwise.

Originally, each cell of these datasets contains the number of occurrences of a word in a document. As we are interested in evaluating our algorithm on both binary and continuous data, we use one version of the datasets in which the data were binarized, i.e. each cell with a value higher than 0 is set to 1 and to 0 otherwise, and a second version to which a TF-IDF (Term Frequency - Inverse Document Frequency) transformation is applied. The TF-IDF normalization is one of the most widely used in text mining and is defined as \(x^{\prime }_{ij}=tf_{ij}\times \log \frac{n}{df_{j}}\), where \(tf_{ij}\) denotes the number of occurrences of the j-th word in the i-th document and \(df_{j}\) denotes the number of documents containing the j-th word.
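A minimal sketch of these two transformations is given below. It follows the exact formula above (which differs, for instance, from scikit-learn's default smoothed TF-IDF) and uses a hypothetical toy count matrix.

```python
import numpy as np

def binarize(X):
    """Binary version: 1 where a word occurs in a document, 0 otherwise."""
    return (X > 0).astype(float)

def tfidf(X):
    """TF-IDF as defined above: x'_ij = tf_ij * log(n / df_j),
    where X is the raw (documents x words) count matrix."""
    n = X.shape[0]
    df = np.count_nonzero(X, axis=0)       # df_j: number of documents containing word j
    df = np.maximum(df, 1)                 # guard against words that never occur
    return X * np.log(n / df)

# Hypothetical toy matrix: 3 documents, 4 words.
X = np.array([[2, 0, 1, 0],
              [0, 3, 1, 0],
              [1, 0, 4, 1]], dtype=float)
print(binarize(X))
print(tfidf(X))
```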

We set the number of clusters to the true number of classes for all datasets. For each method, given the number of clusters, no further parameter selection is needed. We run each algorithm 100 times and report in Tables 2 and 3 the best result (in percentage), i.e. the one corresponding to the best local minimum of the criterion over all trials. Several observations can be made from these results: the proposed algorithm outperforms the other methods on every dataset, whether in its TF-IDF or binary version, except on the TF-IDF version of WebKB4. On NG2 and NG5, whose classes are not well separated, the performance gap is all the more important. We can also note that on the TF-IDF version of NG2, Kmeans is unable to find a partition into two clusters as required.

Table 2. Accuracy, Normalized Mutual Information and Adjusted Rand Index obtained on the TF-IDF datasets.
Table 3. Accuracy, Normalized Mutual Information and Adjusted Rand Index obtained on the binary datasets.
Fig. 4. Running time in seconds of the compared algorithms on the binary and TF-IDF versions of the CSTR (a) and Classic3 (b) datasets.

4.4 Computational Complexity

We study the computational cost of the compared clustering and co-clustering algorithms. We repeat the clustering 100 times for each algorithm on each dataset and report the average convergence time in Fig. 4 for the CSTR (a) and Classic3 (b) datasets. The results show that the proposed DDKM algorithm is only very slightly slower than NMF, while it requires far less time to converge than all the other state-of-the-art algorithms. The same observations were made on the other datasets presented in this article.

5 Conclusion

In this paper we presented DDKM, a fast co-clustering algorithm that looks for homogeneous diagonal blocks. Compared with other methods, our algorithm proves more effective for document-word partitioning, especially in the presence of classes with a high degree of overlap. In addition, DDKM requires less time to converge: up to 20 times less than DKM and 40 times less than SpCo, which is commonly used in the domain of document clustering. In real-world applications, knowledge of the number of co-clusters is generally required. For further research, it will be worthwhile to investigate an efficient way to assess this parameter.