Introduction

Co-clustering (or bi-clustering [10]) is a widely used and powerful unsupervised learning technique that simultaneously clusters the rows and columns of a data matrix to explore inter-correlated patterns. Unlike traditional clustering methods [1], which group only the rows or only the columns of the data matrix into clusters, co-clustering reorganizes the original data matrix into blocks (i.e., co-clusters). Specifically, co-clustering partitions the original data matrix into k row-clusters and l column-clusters (i.e., the total number of co-clusters is \(k\times l\)) using similarity measures according to a certain evaluation criterion. Under this criterion, the similarity between two instances from the same cluster is higher than that between instances from different clusters [21]. This approach differs from subspace clustering, which selects a subset of the original dimensions in an unsupervised manner such that cluster structures become more evident in this subspace [12]. Since co-clustering algorithms exploit the relations between sample clusters and feature clusters, they make the data more predictable and yield better clustering performance than traditional one-side clustering.

However, achieving high co-clustering quality without background information remains a challenging task. To take prior information into account, semi-supervised co-clustering was proposed [34]. Current semi-supervised co-clustering methods focus on incorporating known prior knowledge into the co-clustering algorithms so that co-clustering performance can be improved. Computationally, most semi-supervised co-clustering algorithms lack flexibility because the constraints must be satisfied at each stage. Moreover, these methods only enforce the given constraints and cluster in the original data space; they cannot transform the original data into a lower-dimensional space guided by the constraints. Constraint projection is a prevalent technique to address this problem in semi-supervised clustering [13].

Constraint projection (CP) aims to transform the original data into a lower-dimensional space guided by constraints (usually pairwise constraints), while the reduced data still retain the original class information. Recently, CP technology has also been integrated with semi-supervised co-clustering algorithms. Constraint co-projection can transform the original sample space and feature space into a low-dimensional space by simultaneously performing sample CP (SCP) and feature CP (FCP) [19]. CP has been successful in incorporating prior knowledge into the representation learning process of the original data. However, a major challenge facing CP is its heavy dependence on prior knowledge. In many tasks, even a small amount of prior knowledge is difficult to obtain owing to the high cost of data labeling. It is therefore desirable to apply CP in situations where prior knowledge cannot be directly accessed.

Adjustment learning is a simple and efficient approach that uses chunklets for unsupervised cluster learning in specific scenarios [33]. Chunklets are small groups of points that come from the same, albeit unknown, class. Unlike labels, chunklets can sometimes be obtained automatically without human intervention. For each chunklet, the class label is consistent but unknown for all the data points belonging to it; that is, in adjustment learning each class of data points consists of one or more chunklets. Since chunklet information is not extracted from prior labeled data, adjustment learning falls within the domain of unsupervised learning. However, adjustment learning can easily be extended to semi-supervised and supervised learning situations. In the first case, chunklets come directly from background information, i.e., we know in advance that a small number of samples or features (one or more) belong to the same class. In the second case, chunklets are extracted from fully labeled data. Chunklet constraints can therefore be regarded as another form of positive constraint. Unlike must-link constraints, which specify that two instances must be assigned to the same cluster, chunklet constraints specify instances belonging to the same cluster that do not necessarily appear in pairs; rather, they appear in groups.

In this paper, a novel co-clustering framework, named co-adjustment learning for co-clustering (CALCC), is proposed. Development of the framework was inspired by the following two observations: i) Chunklets, as a new form of positive constraint information, can be effectively incorporated into the representation learning process of the CP model. ii) Simultaneous use of sample chunklets and feature chunklets helps to obtain better insight into the data. To transform the original data space into a space with better separability, so that clustering performance can be effectively enhanced, a novel co-projection model named co-adjustment learning (CAL) is proposed. In the proposed CAL model, sample chunklet constraints and feature chunklet constraints can be used simultaneously to guide the representation learning process. Since CAL is naturally suited to co-clustering, we build the CALCC co-clustering framework on top of it.

The proposed CALCC framework can be used in unsupervised, semi-supervised and supervised learning situations. i) Unsupervised CALCC is designed to handle a frequently encountered problem: in many tasks, it is difficult to delineate the clear range of a cluster, whereas it is easy to delineate the clear range of a chunklet, regardless of whether it resides in the sample space or the feature space. Thus, we can use different unsupervised clustering algorithms to generate clearly structured chunklets on both dimensions of the data matrix according to the characteristics of the data. ii) Semi-supervised CALCC effectively addresses the situation in which we know in advance that some samples or features belong to the same class (such groups of instances can be regarded as chunklets). iii) As mentioned above, chunklets are small groups of instances from the same, albeit unknown, class. In supervised CALCC, we divide the fully labeled samples and features into sample chunklets and feature chunklets, respectively. The entire framework of CALCC is illustrated in Fig. 1.

Fig. 1

Framework of CALCC. \(I+\star =\) semi-supervised CALCC, \(II+\star =\) unsupervised CALCC and \(III+\star =\) supervised CALCC. Connecting lines in different colors represent different ways of obtaining chunklets

To the best of our knowledge, this is the first work to improve the expression ability of the data space by incorporating the chunklet information into the representation learning process of the CP model. Several aspects of the paper are highlighted as follows:

  i) We focus on utilizing the valuable chunklet information to simultaneously enhance the separability of the sample space and the feature space of the original data. This chunklet information can be generated in unsupervised, semi-supervised and supervised settings.

  ii) The proposed CAL model can simultaneously perform sample projection and feature projection under the guidance of this chunklet information. In the transformed representation space, the relations between samples and features are properly preserved and the input data become more predictable.

  iii) Because unsupervised CALCC is the most difficult and valuable case in our framework, the representative CALCC-KM algorithm is proposed for the unsupervised clustering task. In addition, we perform a comparative experiment on the three modalities of unsupervised CALCC. The results on benchmark data sets show the superior performance of the proposed framework.

The remainder of this paper is organized as follows: In Section 2, we introduce existing works on which our approach is based. In Section 3, we provide a detailed illustration of the proposed CAL model. Experimental results are shown in Section 4. The paper is concluded in Section 5.

Related Work

In this section, we review previous research closely related to our work. We first briefly overview representative co-clustering algorithms. Then, we introduce the technology of constraint projection.

Co-clustering has received significant attention from researchers since it was first proposed in the early 1970s [16]. It is widely employed in the fields of gene expression analysis [7, 10, 14], natural language processing [8, 38] and recommender systems [17, 24].

Dhillon et al. [12] proposed a spectral co-clustering method (SCC) in which the document collection is modeled as a document-word bipartite graph; the co-clustering problem is then regarded as a bipartite-graph partitioning problem. To further improve clustering performance by taking background information into account, Shi et al. [34] proposed a semi-supervised spectral co-clustering method (SCM). SCM can efficiently alleviate the poor clustering performance of most co-clustering algorithms caused by data sparsity and the presence of noise. Numerous co-clustering methods are founded on information-theoretic models. The information theoretic co-clustering algorithm (ITCC) [11] seeks to retain as much mutual information as possible between the clustered rows and columns by performing simultaneous clustering on both dimensions at each stage. Banerjee et al. [2] proposed a more general co-clustering framework wherein any Bregman divergence can be used in the objective function. Soon thereafter, Bekkerman et al. [3] extended the co-clustering framework to multi-way clustering, clustering a set of objects by simultaneously clustering their heterogeneous components. Moreover, Bayesian co-clustering (BCC) [31] and nonparametric Bayesian co-clustering ensembles [35] allow mixed membership in row-clusters and column-clusters. Non-negative matrix factorization (NMF) and its graph-regularized extensions have received tremendous research interest over the past several years [18]. Chen et al. [9] presented a semi-supervised NMF method for co-clustering (SSNMFCC), in which relational matrices are computed through simultaneous modality selection and distance metric learning. More recently, Kumar et al. [23] presented a model-based co-clustering transfer learning algorithm to address data-shift problems. Whang et al. [37] presented non-exhaustive, overlapping co-clustering (NEO-CC), an effective solution to non-exhaustive, overlapping problems in co-clustering. Huang et al. [20] proposed a document co-clustering framework with adaptive local structure learning (ALSLCC) in which tri-factorization and intrinsic structure learning are performed simultaneously. Nie et al. [27] proposed a co-clustering algorithm (SOBG) that learns a structured optimal bipartite graph with exactly k connected components from the original data, where k equals the number of clusters.

A large majority of CP methods aim to transform the original instances into a new low-dimensional space guided by the given constraints such that clustering or classification performance is improved in the new representation space. Zhang et al. [40] applied CP to ensemble learning: the original data points are first transformed into a new data space using CP, and the base classifiers are then built in this new space. Combining CP with pairwise constraints, a two-side CP method called constraint co-projection was subsequently proposed [19]. Constraint co-projection simultaneously performs SCP and FCP to project the original sample space and feature space into a low-dimensional space. Moreover, relying on fully labeled data, a supervised subspace projection method was presented for constructing ensembles of classifiers [15]. Algorithms based on using CP for classification are described in [29] and [28].

Despite the widespread application of CP in semi-supervised clustering, research on unsupervised CP remains limited and preliminary. Some previous research focused on using equivalence constraints as side information [22, 32]. Chunklets are a kind of equivalence constraint that can be obtained automatically, without human intervention, in many specific tasks. Adjustment learning is an approach that uses chunklets for unsupervised learning. However, only in scenarios where chunklets arise naturally can adjustment learning be regarded as unsupervised; it is desirable to extend it to general scenarios so that chunklet constraints can be used for CP. Furthermore, it is easy to find chunklets with an evident structure in both the sample space and the feature space using appropriate unsupervised clustering methods. The proposed framework can flexibly employ different unsupervised clustering algorithms according to different data characteristics, generating sample chunklets and feature chunklets on the sample and feature dimensions of the data matrix, respectively. Based on the chunklets we obtain, the original samples and features can be simultaneously projected into a low-dimensional space using chunklet co-projection. Hence, in the new low-dimensional representation space, instances from the same chunklet are close to each other.

Co-Adjustment Learning Model

In this section, the proposed method is described in detail. Specifically, we formulate the objective function of the CAL model and find the solution to it. The algorithm of unsupervised CALCC is also detailed at the end of this section.

Problem Formulation

As defined above, co-clustering seeks coherent blocks of rows and columns to explore inter-correlated patterns in the data. To devise a good co-clustering framework, one must first characterize the “goodness” of a co-clustering. Let \(X=(x_{ij})_{n \times p}\) be a data matrix of n rows and p columns in some input space. Let the two collections \(R = (R_1,R_2,...,R_{k})\) and \(C = (C_1,C_2,...,C_{l})\) denote the partition of the set of rows and the partition of the set of columns of the data matrix, respectively, where k is the number of row-clusters and l is the number of column-clusters. The pair \((R_{\imath },C_{\jmath })\) is called a co-cluster (\(\imath =1,2,...,k\), \(\jmath =1,2,...,l\)). Let \(N_{cc}=k \times l\) be the total number of co-clusters, and let \(|\cdot |\) denote the cardinality of a set. The quality of a co-clustering is assessed by the total variance of the co-clusters, denoted \(T(R_{\imath },C_{\jmath })\). Here,

$$\begin{aligned} T(R_{\imath },C_{\jmath }) = \sum _{\tau =1}^{N_{cc}} \sum _{i\in R_{\imath } } \sum _{j\in C_{\jmath }}(x_{ij}-{\mu _\tau })^2 \end{aligned}$$
(1)

where \({\mu _\tau }\) is the average value in the \(\tau\)-th co-cluster and

$$\begin{aligned} {\mu _\tau } = \frac{\sum _{i\in R_{\imath } } \sum _{j\in C_{\jmath }}x_{ij}}{|R_{\imath }||C_{\jmath }|}. \end{aligned}$$

Generally speaking, a co-clustering with zero total variance is considered perfect, and a co-clustering with lower total variance is better than one with higher total variance [5].
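To make this criterion concrete, the following short Python sketch computes the total co-cluster variance of Eq. (1) from a data matrix and a pair of row/column cluster assignments. The function and argument names are illustrative and not part of the original algorithm.

```python
import numpy as np

def total_cocluster_variance(X, row_labels, col_labels):
    """Total variance T(R, C) of a co-clustering (Eq. 1): sum over all
    co-clusters of squared deviations from the co-cluster mean mu_tau."""
    T = 0.0
    for r in np.unique(row_labels):
        rows = np.where(row_labels == r)[0]
        for c in np.unique(col_labels):
            cols = np.where(col_labels == c)[0]
            block = X[np.ix_(rows, cols)]   # the (r, c) co-cluster
            mu = block.mean()               # mu_tau
            T += ((block - mu) ** 2).sum()
    return T
```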

To obtain a lower total variance for our framework, we strive to simultaneously project the original samples and features into low-dimensional representation space guided by the chunklet constraints. Accordingly, the instances from the same chunklet are close to each other in the new low-dimensional space. Let the collection \({C} = \{x_{11}, ... , x_{1m_1}, \dots , x_{c1}, ... , x_{cm_c}, \dots , x_{\Omega 1}, ... , x_{\Omega m_{\Omega }}\}\) denote the set of data points consisting of \(\Omega\) chunklets, where \(c=1,...,\Omega\) and \(x_{ci}\) denotes the i-th point in the c-th chunklet and \(m_c\) is the number of data points in the c-th chunklet. Let \(m = \sum _{c=1}^{\Omega }m_c\). The within-chunklet covariance matrix W is defined as

$$\begin{aligned} W = \sum _{c=1}^\Omega \sum _{i=1}^{m_c}(x_{ci}-x_c)(x_{ci}-x_c)^T \end{aligned}$$
(2)

where \({x_c}\) is the mean of the c-th chunklet and \({\theta }^T\) denotes the transpose of a vector \({\theta }\).

The basic idea of the CAL model is to simultaneously transform the original samples and features into a low-dimensional space guided by constraint information extracted from chunklets. Let \({C}_s = \{(s_i,s_j)\mid s_i\) and \(s_j\) are samples belonging to the same chunklet\(\}\) and \({C}_f = \{(\ f_i,f_j)\mid f_i\) and \(f_j\) are features belonging to the same chunklet\(\}\). We define the sample within-chunklet covariance matrix \(W_s\) as follows:

$$\begin{aligned} W_s= & \ \frac{1}{|{C}_s|} \sum _{(s_i,s_j)\in {C}_s} [(s_i- \frac{s_i+s_j}{2})(s_i- \frac{s_i+s_j}{2})^T \nonumber \\&+ (s_j- \frac{s_i+s_j}{2})(s_j- \frac{s_i+s_j}{2})^T] \nonumber \\= & \ \frac{1}{2|{C}_s|} \sum _{(s_i,s_j)\in {C}_s} (s_i-s_j){(s_i-s_j)}^T. \end{aligned}$$
(3)

Similarly, the feature within-chunklet covariance matrix \(W_f\) is defined as:

$$\begin{aligned} W_f&= \frac{1}{2|{C}_f|} \sum _{(\ f_i,f_j)\in {C}_f} (\ f_i-f_j){(\ f_i-f_j)}^T. \end{aligned}$$
(4)
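As a minimal illustration, the pairwise form of Eqs. (3) and (4) can be computed directly from chunklet assignments. In this hypothetical helper (the name, and the use of a label vector instead of explicit pair sets, are our assumptions), `instances` holds the samples (rows of X) when building \(W_s\) and the features (rows of \(X^T\)) when building \(W_f\).

```python
import numpy as np
from itertools import combinations

def within_chunklet_covariance(instances, chunklet_labels):
    """Within-chunklet covariance (Eqs. 3-4): average of
    (x_i - x_j)(x_i - x_j)^T / 2 over all pairs that share a chunklet.
    `instances` has shape (num_points, dim)."""
    dim = instances.shape[1]
    W = np.zeros((dim, dim))
    n_pairs = 0
    for label in np.unique(chunklet_labels):
        members = instances[chunklet_labels == label]
        for xi, xj in combinations(members, 2):
            d = (xi - xj)[:, None]   # column vector of differences
            W += d @ d.T
            n_pairs += 1
    return W / (2 * n_pairs) if n_pairs else W

# W_s: apply to the samples (rows of X); W_f: apply to the features (rows of X.T).
```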

The objective of the CAL model is to learn two co-projective matrices \(U=[u_1,u_2,...,u_L]\in {R^{p\times L}}\) and \(M=[m_1,m_2,...,m_G]\in {R^{n\times G}}\), which simultaneously map the original samples and features into a low-dimensional representation space. Hence, the original class information can be reliably retained in the reduced data. At the same time, instances in the same sample (or feature) chunklet are close in the new low-dimensional sample (or feature) space. Thus, the CAL optimization problem is defined as minimizing the objective function J(U, M), where

$$\begin{aligned} J(U,M)= & \ \frac{1}{2| {C}_s|} \sum _{(s_i,s_j)\in {C}_s} \parallel U^T (s_i-s_j) \parallel ^2 \nonumber \\&+ \frac{1}{2| {C}_f|} \sum _{(\ f_i,f_j)\in {C}_f} \parallel M^T (\ f_i-f_j) \parallel ^2. \end{aligned}$$
(5)

The objective seeks to explore the inter-correlated patterns of samples and features. Based on it, we can learn two co-projective matrices U and M, which effectively project the original samples and features into a low-dimensional sample space and feature space, respectively. The new data space learned by the CAL model not only captures the data distribution, but also draws entities of the same chunklet closer together while pushing entities of different chunklets farther apart.

Inference

For convenience in the subsequent discussion, we define \(S_s\) and \(S_f\) as follows:

$$\begin{aligned} S_s = \frac{1}{2|{C}_s|} \sum _{(s_i,s_j)\in {C}_s} \parallel U^T (s_i-s_j) \parallel ^2 \end{aligned}$$
(6)

and

$$\begin{aligned} S_f =\frac{1}{2| {C}_f|} \sum _{(\ f_i,f_j)\in {C}_f} \parallel M^T (\ f_i-f_j)\parallel ^2. \end{aligned}$$
(7)

From the objective function (5), to minimize J(U, M), both \(S_s\) and \(S_f\) should be minimized. Furthermore, an analytical solution can be obtained for simultaneously finding the optimal co-projective matrices U in (6) and M in (7). First, we rewrite (6) as:

$$\begin{aligned} S_s= & {} \frac{1}{2| {C}_s|} \sum _{(s_i,s_j)\in {C}_s} \parallel U^T (s_i-s_j) \parallel ^2 \nonumber \\= & {} \frac{1}{2| {C}_s|} \sum _{(s_i,s_j)\in {C}_s} \sum _l^L U_l^T (s_i-s_j){(s_i-s_j)}^T U_l \nonumber \\= & {} \sum _l^L U_l^T \bigg ( \frac{1}{2| {C}_s|} \sum _{(s_i,s_j)\in {C}_s} (s_i-s_j){(s_i-s_j)}^T \bigg ) U_l \nonumber \\= & {} \sum _l^L U_l^T W_s U_l. \end{aligned}$$
(8)

According to (8), \(S_f\) is rewritten as:

$$\begin{aligned} S_f= & \ \frac{1}{2|\mathcal {C}_f|} \sum _{(\ f_i,f_j)\in \mathcal {C}_f} \parallel M^T (\ f_i-f_j)\parallel ^2 \nonumber \\= & \ \sum _g^G M_g^T W_f M_g. \end{aligned}$$
(9)

In this paper, we call \(S_s\) and \(S_f\) the chunklet scatter values. The most important step of the proposed CAL model is to find the two co-projective matrices \(U=[u_1,u_2,...,u_L]\in {R^{p\times L}}\) and \(M=[m_1,m_2,...,m_G]\in {R^{n\times G}}\) such that the information in the sample chunklets and feature chunklets is most effectively retained in the low-dimensional representation space. From (8) and (9), the objective function (5) becomes

$$\begin{aligned} J(U,M)= & {} \frac{1}{2|\mathcal {C}_s|} \sum _{(s_i,s_j)\in \mathcal {C}_s} \parallel U^T (s_i-s_j) \parallel ^2 \nonumber \\&+ \frac{1}{2|\mathcal {C}_f|} \sum _{(\ f_i,f_j)\in \mathcal {C}_f} \parallel M^T (\ f_i-f_j) \parallel ^2 \nonumber \\= & {} \sum _l^L U_l^T W_s U_l + \sum _g^G M_g^T W_f M_g. \end{aligned}$$
(10)

To simplify the problem, we assume \(L = G\). Nevertheless, with a simple parameter setting, the proposed framework can easily handle the case \(L \ne G\), which is discussed in Section 3.4.

According to simple algebraic theory [40], we further rewrite the objective function in (10) as:

$$\begin{aligned} J(U,M)= & \ \sum _l^L U_l^T W_s U_l + \sum _g^G M_g^T W_f M_g \nonumber \\= & \ Trace\big (U^T W_s U\big )+Trace\big (M^T W_f M\big ) \nonumber \\= & \ Trace\bigg ( {\begin{bmatrix} U \\ M \\ \end{bmatrix} }^T \begin{bmatrix} W_s &{} 0\\ 0 &{} W_f\\ \end{bmatrix} \begin{bmatrix} U \\ M \\ \end{bmatrix} \bigg ) \nonumber \\= & \ Trace\big (V^TWV\big ) \end{aligned}$$
(11)
$$\mathrm{s.t.} {\quad } V^TV=I$$

where \(V=[U;M]\), and

$$W = \begin{bmatrix} W_s &{} 0\\ 0 &{} W_f\\ \end{bmatrix}.$$

To find the solution to this optimization problem, the traditional Lagrange multiplier optimization technique [19] is used. The Lagrangian can be denoted as:

$$\begin{aligned} L_{{V_1},...,{V_\eta }}=\hat{J}\big (V_1,...,V_\eta \big )-\sum _{\varepsilon =1}^{\eta }\delta _\varepsilon \big (V_\varepsilon ^TV_\varepsilon -1\big ). \end{aligned}$$
(12)

We calculate the partial derivative of \(L_{{V_1},...,{V_\eta }}\) with respect to each \(V_\varepsilon\) and set it to zero. Hence,

$$\begin{aligned} \frac{\partial L}{\partial V_\varepsilon }= & \ 2WV_\varepsilon -2\delta _\varepsilon V_\varepsilon =0,{\quad } \forall _\varepsilon =1,...,\eta \nonumber \\\Rightarrow & \ WV_\varepsilon =\delta _\varepsilon V_\varepsilon ,{\quad } \forall _\varepsilon =1,...,\eta . \end{aligned}$$
(13)

It is apparent from (13) that the solution \(V_\varepsilon\) is an eigenvector of W and \(\delta _\varepsilon\) is the corresponding eigenvalue of W. Therefore, to minimize J, V must be composed of the d eigenvectors of W with the smallest eigenvalues, which makes J the sum of the d smallest eigenvalues of W.

Suppose \(\Lambda _d = diag(\gamma _1,\gamma _2,...,\gamma _d)\), with \(\gamma _1\le \gamma _2 \le \cdots \le \gamma _d\), collects the eigenvalues obtained from (13), and \(V_d=[{V }_1,...,{V }_d]\) is the corresponding eigenvector matrix. Denote the transformed within-chunklet covariance matrix \(W^{'} = diag(\gamma _1,\gamma _2,...,\gamma _d).\) Thus, the optimization problem becomes a trace minimization problem whose solution is to choose the indices i that

$$\begin{aligned} Minimize\bigg ( \sum _{i}\gamma _i\bigg ). \end{aligned}$$
(14)

Equation (14) indicates that J is minimized when we choose the smallest non-negative eigenvalues \(\gamma _i\) for the sum. That is, we take the d smallest non-negative eigenvalues of W and construct \(V_d=[U;M]\) from the corresponding eigenvectors. Additionally, we have:

$$\begin{aligned} W^{'} = V_d^TWV_d=\Lambda _d. \end{aligned}$$
(15)

Now, we obtain the co-projection matrices U and M, and the solution to the optimization is found.

However, to obtain the optimal rescaling transformation [36], it is desirable to fix the within-chunklet covariance matrix, i.e., we set it to the identity matrix. To that end, equation (15) is rescaled as:

$$\begin{aligned} W^{'}= & \ \Lambda _d^{-\frac{1}{2}}(V_d^TWV_d) \Lambda _d^{-\frac{1}{2}} \nonumber \\= & \ (V_d{\Lambda _d}^{-\frac{1}{2}})^TW(V_d{\Lambda _d}^{-\frac{1}{2}}) = I. \end{aligned}$$
(16)

Thus, the final co-projective matrix of CALCC is defined as \(F^T\), and

$$\begin{aligned} F=V_d{\Lambda _d}^{-\frac{1}{2}}. \end{aligned}$$
(17)

Based on F, we obtain the final co-projection matrices U and M. Specifically, \({U } = [{F }_1;{F }_2;...;{F }_p]\) (\({F }_i\) denotes the i-th row of \({F }\)) is the sample projection matrix with p rows and d columns, and \({M } = [{F }_{p+1};{F }_{p+2};...;{F }_{p+n}]\) is the feature projection matrix with n rows and d columns. After obtaining the final co-projective matrices \({U }=[{u }_1,{u }_2,...,{u }_d]\in {R^{p\times d}}\) and \({M }=[{m }_1,{m }_2,...,{m }_d]\in {R^{n\times d}}\), we perform the sample projection \(Y_s = XU\) and the feature projection \(Y_f = X^TM\) to transform the original sample space and feature space into a low-dimensional sample space and feature space, respectively. Instances from the same chunklet become closer to each other in the learned representation space. In the new sample space and feature space, we can obtain the row partition matrix and the column partition matrix, from which coherent co-clusters are found by running an unsupervised clustering algorithm. Consequently, we obtain the final co-clusters with the lowest total variance.
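The following Python sketch summarizes the inference above under our reading of Eqs. (11)-(17): it assembles the block-diagonal matrix W, keeps the d smallest strictly positive eigenvalues so that the \(\Lambda _d^{-\frac{1}{2}}\) rescaling remains well defined, splits F into U and M, and performs the two projections. Function and variable names are ours, and the eigenvalue threshold is an implementation assumption.

```python
import numpy as np
from scipy.linalg import block_diag, eigh

def cal_coprojection(X, W_s, W_f, d):
    """CAL co-projection sketch (Eqs. 11-17).
    X: data matrix (n x p); W_s: sample within-chunklet covariance (p x p);
    W_f: feature within-chunklet covariance (n x n); d: target dimension."""
    n, p = X.shape
    W = block_diag(W_s, W_f)              # (p + n) x (p + n), Eq. (11)
    eigvals, eigvecs = eigh(W)            # eigenvalues in ascending order
    keep = np.where(eigvals > 1e-12)[0][:d]   # d smallest positive eigenvalues
    V_d, lam = eigvecs[:, keep], eigvals[keep]
    F = V_d / np.sqrt(lam)                # F = V_d Lambda_d^{-1/2}, Eq. (17)
    U, M = F[:p, :], F[p:, :]             # sample / feature projection matrices
    Y_s = X @ U                           # projected samples,  n x d
    Y_f = X.T @ M                         # projected features, p x d
    return U, M, Y_s, Y_f
```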

CALCC Co-Clustering Framework

Under the guidance of chunklet information, the proposed CAL model transforms the original sample space and feature space into new, more meaningful sample and feature spaces, respectively, which makes it naturally suited to co-clustering. Hence, a novel co-clustering framework, named co-adjustment learning for co-clustering (CALCC), is proposed.

The proposed framework can be used in unsupervised, semi-supervised and supervised learning situations. In unsupervised CALCC, we flexibly choose a suitable unsupervised clustering algorithm to generate sample chunklets and feature chunklets according to the characteristics of the input data. In semi-supervised CALCC, chunklets come directly from prior knowledge. In supervised CALCC, chunklets are extracted from the fully labeled samples and features.

While the performance of supervised CALCC is notable, its reliance on a large number of labeled data items limits its utility in many domains where such data sets are not available. On the other hand, in many situations there is no a priori knowledge that can be exploited, which renders semi-supervised learning inapplicable. Thus, we herein focus on unsupervised CALCC.

Algorithm Description

According to the inference given above, the detailed procedure for unsupervised CALCC is summarized in Algorithm 1. Note that we assume \(L = G\) for simplicity in this paper. In many cases, however, the dimensions of samples and features differ across data sets. Here, by a simple parameter setting in step 5, the corresponding dimensions of U and M can be set differently according to specific needs.

Algorithm 1
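A minimal Python sketch of the unsupervised CALCC-KM procedure is given below. It reuses the hypothetical helpers `within_chunklet_covariance` and `cal_coprojection` defined above; the chunklet numbers follow the \(\tilde{k}=\alpha \times k\), \(\tilde{l}=\alpha \times l\) setting used later in the experiments, and the default dimensionality d is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def calcc_km(X, k, l, alpha=4, d=10):
    """Unsupervised CALCC-KM sketch: (1) generate sample and feature chunklets
    with k-means, (2) learn the CAL co-projection, (3) cluster the projected
    samples and features to obtain the final row and column partitions."""
    # Step 1: chunklet generation on both dimensions of the data matrix
    sample_chunklets = KMeans(n_clusters=alpha * k, n_init=10).fit_predict(X)
    feature_chunklets = KMeans(n_clusters=alpha * l, n_init=10).fit_predict(X.T)
    # Step 2: chunklet co-projection (CAL)
    W_s = within_chunklet_covariance(X, sample_chunklets)
    W_f = within_chunklet_covariance(X.T, feature_chunklets)
    U, M, Y_s, Y_f = cal_coprojection(X, W_s, W_f, d)
    # Step 3: final clustering in the projected sample and feature spaces
    row_labels = KMeans(n_clusters=k, n_init=10).fit_predict(Y_s)
    col_labels = KMeans(n_clusters=l, n_init=10).fit_predict(Y_f)
    return row_labels, col_labels
```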

Experiments

This section evaluates unsupervised CALCC on several real-world data sets. Specifically, CALCC-KM, a representative unsupervised case of CALCC that uses k-means clustering for chunklet generation, is first introduced and compared with several related co-clustering methods. Then, we evaluate the effectiveness of the three modalities of unsupervised CALCC.

Datasets

We conducted experiments on 19 real-world data sets, including 10 image data sets from Microsoft Research Asia Multimedia (MSRA-MM) [25] and 9 data sets from the University of California, Irvine (UCI) machine learning repository [4]. A summary of these 19 data sets is given in Table 1. The first 12 data sets (the 10 image data sets and 2 large UCI data sets) were used to compare the representative CALCC-KM with several related co-clustering methods, and the last 7 data sets were used to demonstrate the effectiveness of unsupervised CALCC in its various modalities.

Table 1 Summary of data sets

To evaluate the effectiveness of the proposed framework, we perform an extensive comparison of CALCC-KM with several related methods, both classic and state of the art, specifically including

  • SCC [12]: a spectral co-clustering method, in which the document collection is modeled as a document-word bipartite graph.

  • ITCC [11]: an information theoretic co-clustering method aiming to enhance inter-related mutual information.

  • BCC [2]: a Bayesian co-clustering method that enables a mixed membership in column-clusters and row-clusters.

  • SOBG [27]: a co-clustering method that seeks to learn a structured optimal bipartite graph.

  • ALSLCC [20]: an adaptive local structure learning method for document co-clustering.

  • SCM [34]: a semi-supervised SCC for solving the problems of sparse data and noise in co-clustering.

  • SSNMFCC [9]: a semi-supervised NMF method for co-clustering, in which relational matrices are computed through simultaneous modality selection and distance metric learning.

  • SNCC [26]: a sparse neighbor constrained co-clustering for alleviating the misclassification of close instances.

  • CoCE [39]: a co-clustering ensemble approach for learning robust co-clusters by combining multiple base co-clusterings.

The classic methods SCC, ITCC, BCC and SCM are widely used as baseline co-clustering algorithms. The other state-of-the-art methods, SOBG, ALSLCC, SSNMFCC, SNCC and CoCE, are either extensions of these classic methods or are most closely related to our work.

In unsupervised CALCC-KM, chunklets are obtained by k-means clustering. To show that the proposed two-step CALCC framework yields better clustering performance than a one-step clustering approach, k-means itself is also included in the comparison with the other co-clustering algorithms. For each data set, clustering results are obtained by applying the k-means algorithm to the sample representations learned by each co-clustering algorithm.

Furthermore, in our experiments, we set the number of sample clusters k equal to the number of feature clusters l. The number of sample chunklets \(\tilde{k}\) was also set equal to the number of feature chunklets \(\tilde{l}\). In particular, we let \(\tilde{k}\) = \(4\times k\) (i.e., \(\tilde{l}\) = \(4\times l\)) in our experiments. The default percentage of pairwise constraints is 5% for the semi-supervised co-clustering methods. For each parameter setting, we repeated the experiment 20 times and recorded the average result for comparison.

Evaluation Metric

A commonly used way to measure co-clustering quality is to compare the row (sample) clustering quality or column (feature) clustering quality of different co-clustering algorithms [19, 20, 34]. For the evaluation, we use clustering accuracy (ACC), a widely used standard measure for clustering [6]. It measures the frequency with which data points from the same class label reside in the same cluster [33]. ACC is defined as:

$$\begin{aligned} ACC=\frac{\sum _{i=1}^{n}\psi (y_i,map(\hat{y_i}))}{n} \end{aligned}$$
(18)

where \(\psi (x,y)\) equals 1 if \(x=y\) and 0 otherwise, n is the number of instances in data set X, and \(y_i\) and \(\hat{y_i}\) are the true label and predicted label of instance \(x_i\), respectively. Additionally, \(map(\hat{y_i})\) is the permutation mapping function that maps predicted labels to the true labels using the Kuhn–Munkres algorithm.
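A compact implementation of ACC might look as follows; here the Kuhn–Munkres step is carried out with SciPy's `linear_sum_assignment` on a cluster-class contingency table, which is one standard way to realize the `map` function in Eq. (18). The function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC (Eq. 18): find the assignment of predicted clusters to true classes
    that maximizes the number of matches, then report the matched fraction."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency table: rows = predicted clusters, columns = true classes
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            counts[i, j] = np.sum((y_pred == c) & (y_true == t))
    row_ind, col_ind = linear_sum_assignment(-counts)  # maximize matches
    return counts[row_ind, col_ind].sum() / len(y_true)
```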

Results

The clustering results of all 11 methods on the 12 data sets are shown in Table 2. CALCC-KM outperforms the 10 other related methods with the best average ACC of 0.6082. We observe that, compared with the other clustering algorithms, CALCC-KM achieves the best clustering results most of the time. Specifically, among the 11 algorithms, CALCC-KM achieves the best clustering result on 7 of the 12 data sets, while the other 10 methods achieve only 5 of the best clustering results in total. Another important observation is that SCM achieves better clustering quality than SCC, which supports the assumption that prior information can reduce noise and effectively enhance clustering performance. We also observe that, on average, most of the co-clustering methods outperform k-means clustering; this is because co-clustering algorithms utilize the relations between sample clusters and feature clusters and, as a consequence, make the data sets more predictable and the clustering performance better than traditional one-side clustering.

Table 2 Accuracy results of the experiment

The Friedman aligned-ranks test [19] is used for a further comparison among these algorithms. Table 3 shows the aligned observations and, in parentheses, the aligned ranks for the 11 algorithms and 12 data sets. As shown in the table, CALCC-KM ranks first on average at 24.58. The Friedman aligned-ranks test checks whether the measured sums of aligned ranks differ significantly from the value \(\hat{R}_j\ =\ 731\) expected under the null hypothesis:

$$\begin{aligned} \sum _{i=1}^{n}\hat{R}_{i,\cdot }^2= & \ 767^2 + 699^2 + ... + 711^2 + 755^2 = 6,429,978 \\ \sum _{j=1}^{k}\hat{R}_{\cdot ,j}^2= & \ 1172^2 + 1164^2 +... + 407^2 + 295^2= 8,224,698 \end{aligned}$$
$$\begin{aligned} T= & \ \frac{ \quad (11-1)[8,224,698-(11\cdot 12^2/4)(11\cdot 12+1)^2] \quad }{11 \times 12(11 \times 12 + 1)(2 \times 11 \times 12 + 1)/6 - 6,429,978/11} \\= & \ 59.68 \end{aligned}$$

With 11 algorithms and 12 data sets, T is distributed according to the chi-square distribution with \(11-1=10\) degrees of freedom. The p-value computed using the \(\chi ^2(10)\) distribution is 0.00000025; thus, the null hypothesis is rejected. This value is far less than 0.05, which shows that the results of the algorithms are significantly different.
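For reference, the statistic above can be reproduced from an accuracy table with a few lines of NumPy/SciPy. This sketch follows the formula as printed; the accuracy-matrix layout and the averaging of tied ranks are our assumptions.

```python
import numpy as np
from scipy.stats import chi2, rankdata

def friedman_aligned_test(acc):
    """Friedman aligned-ranks statistic for an accuracy table `acc` of shape
    (n data sets, k algorithms), following the computation reproduced above."""
    n, k = acc.shape
    aligned = acc - acc.mean(axis=1, keepdims=True)   # subtract each data-set mean
    ranks = rankdata(aligned.ravel()).reshape(n, k)   # rank all n*k values jointly
    R_row = ranks.sum(axis=1)                         # aligned rank sum per data set
    R_col = ranks.sum(axis=0)                         # aligned rank sum per algorithm
    num = (k - 1) * (np.sum(R_col ** 2) - (k * n ** 2 / 4) * (k * n + 1) ** 2)
    den = k * n * (k * n + 1) * (2 * k * n + 1) / 6 - np.sum(R_row ** 2) / k
    T = num / den
    p_value = chi2.sf(T, df=k - 1)                    # chi-square with k-1 d.o.f.
    return T, p_value
```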

Table 3 Aligned observations of 11 algorithms selected in the experimental study. Ranks in parentheses are used in the computation of the Friedman aligned ranks test

Unsupervised CALCC-KM versus Semi-Supervised Co-Clustering

To further illustrate the superior performance of the proposed framework, for each of the 12 data sets we plot the average ACC against an increasing percentage of pairwise constraints for our unsupervised CALCC-KM algorithm and the two semi-supervised algorithms SSNMFCC and SCM. Since CALCC-KM is an unsupervised co-clustering algorithm, its accuracy is not affected by the percentage of pairwise constraints and is therefore constant. The ACC results are shown in Figs. 2 and 3. Notably, as observed in Fig. 2, our unsupervised CALCC-KM significantly outperforms the two semi-supervised methods, SCM and SSNMFCC, in most cases. In Fig. 3, the average ACC of CALCC-KM over the 12 data sets is close to 61%, which is almost 8% and 20% higher than that of SCM and SSNMFCC, respectively. Furthermore, the average accuracies of SCM and SSNMFCC consistently increase as the pairwise-constraint percentage grows (from 5 to 30 percent). In particular, the ACC curve of SCM rises relatively gently as the pairwise constraints increase, which may be caused by constraint conflicts. In general, as an unsupervised co-clustering algorithm, CALCC-KM can achieve far better clustering quality than some related semi-supervised co-clustering algorithms.

Fig. 2

Experimental results for unsupervised co-clustering CALCC-KM and semi-supervised co-clustering SCM and SSNMFCC on 12 data sets against the increasing percentage of pairwise constraints

Fig. 3

Average ACC of 12 data sets for unsupervised co-clustering CALCC-KM and semi-supervised co-clustering SCM and SSNMFCC against the increasing percentage of pairwise constraints

Two-Side CALCC versus Its One-Side Counterpart

The proposed CALCC is a two-side co-clustering framework in which chunklets from both the sample side and the feature side guide the learning process of the CAL model. To verify the usefulness of the information extracted from the feature chunklets, we evaluate the clustering performance of one-side CALCC, which minimizes only the sample chunklet scatter value. In this case, the objective function (5) reduces to \(J(U)=S_s\), and the sample-projective matrix U can easily be obtained following equations (10)-(17) (by simply setting \(V=U\)).
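As a small illustration of this reduction, the one-side projection can be obtained by eigendecomposing \(W_s\) alone; the function name and the eigenvalue threshold are again our assumptions.

```python
import numpy as np

def one_side_projection(X, W_s, d):
    """One-side variant (as in CALCC-KM1): only the sample chunklet scatter
    S_s is minimized, so W reduces to W_s and V reduces to U."""
    eigvals, eigvecs = np.linalg.eigh(W_s)            # ascending eigenvalues
    keep = np.where(eigvals > 1e-12)[0][:d]           # d smallest positive ones
    U = eigvecs[:, keep] / np.sqrt(eigvals[keep])     # U = V_d Lambda_d^{-1/2}
    return X @ U                                      # projected samples only
```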

A one-side clustering method named CALCC-KM1 is used for comparison with its two-side counterpart CALCC-KM. In CALCC-KM1, k-means clustering is used to generate sample chunklets from the original data set. The clustering results of the k-means, CALCC-KM and CALCC-KM1 algorithms over the 12 data sets are shown in Table 4.

Table 4 The clustering results of the K-means, CALCC-KM and CALCC-KM1 algorithms over 12 MSRA-MM and UCI data sets. Note that the highest score is denoted by bold font, and the second is underlined

The table shows that two-side CALCC achieves better performance than its one-side counterpart, supporting the view that two-side co-clustering obtains better insight into the data and makes the input data more predictable than traditional one-side clustering. We also notice that using sample chunklets alone already improves the results, which further verifies the effectiveness of the chunklet information.

Discussion on Chunklet Number

In this subsection, we empirically evaluate the impact of different numbers of chunklets on the performance of unsupervised CALCC. Considering the different characteristics of the original data matrices and the effectiveness of the experiments, we express the number of chunklets as a multiple of the number of clusters. The parameter \(\alpha\) denotes this multiple, i.e., \(\tilde{k}\) = \(\alpha \times k\), \(\tilde{l}\) = \(\alpha \times l\). In our experiments, the value of \(\alpha\) was varied from one to ten in steps of one. The clustering results on the 12 data sets are plotted in Fig. 4. As can be seen from the figure, CALCC-KM achieves varying clustering quality at different \(\alpha\) values for different data sets. In summary, data sets with different distributions may be composed of different numbers of chunklets whose boundaries are easy to find.

Note that, for simplicity, we only searched for the optimal \(\alpha\) in the range \(\alpha \le 10\), so the optimal \(\alpha\) obtained is actually a local optimum. Nevertheless, the experimental results show that even with this local optimum, our CALCC-KM achieves better results than the other related co-clustering algorithms. This suggests that a better parameter setting could yield even better clustering quality than that reported in this paper.

Fig. 4

Cluster performance of all 12 data sets with different cluster-number multiples

Effectiveness of Our Model with Different Modalities

Since the proposed unsupervised CALCC is formulated for more than one modality, we evaluated its effectiveness in three modalities on the last 7 data sets in Table 1. In addition to CALCC-KM, we introduce two other modalities, CALCC-DP and CALCC-AP, in which density peaks (DP) clustering [30] and AP clustering, respectively, are applied to generate chunklets on the sample and feature dimensions. The results are shown in Fig. 5. We observe that all three modalities achieve high cluster quality over the seven data sets. In addition, CALCC-DP outperforms the other two modalities on the Spect and Credi data sets, and CALCC-AP achieves the best performance on the Ionosphere, Sona, Vote and Wdbc data sets. Only on the Syncon data set does CALCC-KM obtain the highest accuracy. The reason might be that, compared with DP and AP clustering, k-means is not the most suitable clustering method for generating clearly structured chunklets on the other six data sets. In general, in unsupervised CALCC, different unsupervised clustering algorithms for chunklet generation can achieve significantly different co-clustering performance depending on the data characteristics.

Fig. 5

Experimental results for three CALCC modalities CALCC-KM, CALCC-DP and CALCC-AP over 7 UCI data sets

Study on Time Complexity

The time complexity of the CAL model is mainly determined by the singular value decomposition (SVD) of the covariance matrix W in Eq. (11). The matrix W is a square matrix of order \(n+p\) for a data matrix \(\mathbf{X} _{n\times p}\) with n rows and p columns. Since the exact SVD of a q-order matrix has time complexity \(O(q^3)\), the time complexity of the model is \(O((n+p)^3)\). The proposed CALCC-KM algorithm additionally requires \(O(n\tilde{k}g_{s}+p\tilde{l}g_{f})\) time to construct chunklets and \(O(n{k}c_s+p{l}c_f)\) time to obtain clustering results, where \(\tilde{k}\) (k) and \(\tilde{l}\) (l) are the numbers of row-chunklets (row-clusters) and column-chunklets (column-clusters), respectively; \(g_{s}\) (\(g_{f}\)) is the number of sample (feature) chunklet generation iterations, and \(c_s\) (\(c_f\)) is the number of sample (feature) clustering iterations.

Conclusion

In this paper, a novel co-clustering framework named co-adjustment learning for co-clustering (CALCC) was proposed. CALCC can be flexibly used in unsupervised, semi-supervised and supervised learning situations. A constraint co-projection model, co-adjustment learning (CAL), was first introduced. The CAL model not only makes full use of the informative chunklet constraints, but also transforms the original data into a new discriminative representation space by simultaneously performing sample projection and feature projection. In the transformed space, the co-clusters with the lowest total variance can be found efficiently. To the best of our knowledge, the presented CAL model is the first to exploit constraint information from chunklets for co-projection. To demonstrate the effectiveness of our framework, an unsupervised case of CALCC was compared extensively with several related co-clustering methods on several image and real data sets. In addition, we performed a comparative experiment on the three modalities of unsupervised CALCC. The experimental results revealed the superior performance of the CAL model in discovering discriminative representations and demonstrated the effectiveness of the proposed framework.