1 Co-Clustering

Clustering methods are well-known tools for analyzing and structuring data, intensively investigated in statistics, machine learning and data science, and broadly used in many application domains such as market and consumer research, psychology and social sciences, microbiology and bioinformatics. The basic problem consists in grouping a given set of objects into homogeneous classes (clusters) on the basis of empirical data that allow one to quantify the ‘similarity’ or ‘dissimilarity’ of the objects and thus to define the homogeneity within, or the separation between, the classes. In the simplest case there is an \(n \times p\) data matrix \(X = (x_{ij})\), where values \(x_{ij}\) are recorded for n objects and p variables, and we look for an appropriate partition \(\mathcal{A} = (A_1,...,A_k)\) of the set of objects \(\mathcal{O} = \{1,...,n\}\) (the rows of X) with classes \(A_1,...,A_k\) such that similar row vectors (objects) are united in the same class while row vectors from different classes are hopefully quite dissimilar. Depending on the context, the detected or constructed clusters will be interpreted as (personality) types, consumer groups, music styles, families of plants, gene clusters, etc. Since the early 1960s, when clustering methods emerged, a large variety of clustering models and clustering algorithms have been developed for different data types, see, e.g., Bock [7, 9], Jain and Dubes [26], Miyamoto, Ichihashi, and Honda [32], McLachlan and Krishnan [31], Basu, Davidson, and Wagstaff [5], Aggarwal and Reddy [1], and Hennig, Meila, Murtagh, and Rocci [25].

Co-clustering (biclustering, two-way clustering, block clustering) means the simultaneous (i.e., not separate) clustering of the rows and columns of a data matrix by determining an appropriate partition \(\mathcal{A} = (A_1,...,A_k)\) of the rows together with an appropriate partition \(\mathcal{B} = (B_1,...,B_\ell )\) of the set of columns \(\mathcal{M} = \{1,...,p\}\) such that both row and column clusters are ‘homogeneous’ and reflect the hidden interplay between row and column effects. Biclustering provides an aggregated view of the similarity structure within the sets \(\mathcal{O}\) and \(\mathcal{M}\) of objects and columns, respectively, and can also serve to reduce a large data table with \(n\cdot p\) entries to a manageable size with only \(k\cdot \ell \) blocks \(A_s\times B_t\) together with their characterizations (data compression). Often the aggregated view of the blocks will provide better insight into the latent relationships and interactions that may exist between objects and variables than a detailed analysis of the numerous entries \(x_{ij}\). Many applications underline the usefulness of co-clustering methods, e.g., in marketing (Arabie, Schleutermann, Daws, & Hubert [3]; Gaul & Schader [18]; Baier, Gaul, & Schader [4]), psychology and social sciences (Kiers, Vicari, & Vichi [27]; Schepers, Bock, & Van Mechelen [38]), bioinformatics and gene analysis (Cheng & Church [16]; Madeira & Oliveira [28]; Turner, Bailey, Krzanowski, & Hemmingway [39]; Alfò, Martella, & Vichi [2]; Martella, Alfò, & Vichi [30]; Cho & Dhillon [17]; Martella & Vichi [29]; Pontes, Giràldez, & Aguilar-Ruiz [33]), and text mining (Dhillon [19]).

A range of co-clustering methods have been proposed in the past, see, e.g., the surveys in Van Mechelen, Bock, and De Boeck [40], Madeira and Oliveira [28], Charrad and Ben Ahmed [14], Govaert and Nadif [23, 24] and methodological articles such as Bock [8, 10,11,12], Govaert [20], Vichi [41], Govaert and Nadif [21, 22, 24], Rocci and Vichi [34], Salah and Nadif [35], Schepers, Bock, and Van Mechelen [38] and Schepers and Hofmans [36]. These methods differ, e.g.,

  • by the type of observed data values \(x_{ij}\), e.g., real-valued, integer, categorical, binary, mixed, etc.

  • by the meaning of the entries \(x_{ij}\), e.g., association values, measurements, frequencies, etc.

  • by the classification structure, e.g., hierarchical versus nonhierarchical, hard versus fuzzy classifications, and mixtures.

  • by the modeling approach using, e.g., probabilistic models, empirical concepts, optimization criteria, and algorithms, etc.

  • by the practical meaning of the rows and columns.

Concerning this latter issue, we may distinguish between cases where rows and columns denote the categories of two given nominal factors (e.g., the crop variety i with the fertilizer j yields \(x_{ij}\) tons of cereals), and cases of the object \(\times \) variable type mentioned in the first paragraph above (e.g., object i realizes the value \(x_{ij}\) for variable j). While the two-factor case is typically symmetric insofar as clustering of both rows and columns is (or may be) based on the nearness of the corresponding entries in the rows and columns, respectively, this may be misleading in the second, asymmetric case since, in contrast to the objects (rows), the similarity of variables (columns) is typically expressed in terms of mutual dependencies or interrelationships. Hence, in the object\(\times \)variable case, clustering of rows and columns should typically be based on different distance or similarity indices that must be integrated into a joint two-way clustering model.

In this paper, we consider two situations of this latter type and provide, as a paradigm for more complex situations, suitable probabilistic co-clustering models and corresponding k-means-type algorithms: In Sect. 2 we describe a two-way two-parameter biclustering model where the row partition \(\mathcal{A}\) refers to the first parameter (cluster means) while the column partition \(\mathcal{B}\) is induced by the values of the second one (class-specific variances). A more sophisticated and novel co-clustering model is described in Sect. 4, where object classes are characterized by a class-specific mean value (main effect) while, additionally, each class of variables is characterized by a class-specific latent factor that is estimated together with the column partition. As a prelude to this latter two-way model, we consider in Sect. 3 a (one-way) clustering algorithm for variables only, proposed by Vigneau and Qannari [43], that is related to correlation and latent-factor concepts, and show that it can be derived from a probabilistic one-way clustering model. In Sect. 4 this model will be integrated into the two-way clustering case. Section 5 concludes with some remarks and possible extensions.

2 Co-clustering with Class-Specific Variances in the Variable Clusters

We have emphasized in Sect. 1 that for an object \(\times \) variable matrix \(X = (x_{ij})\), clustering of variables (columns of X) may be guided by different purposes or characterizations than clustering of objects (rows of X). In this section, we consider a simple example of such a co-clustering problem and describe a model where object clusters are characterized by cluster means (main effects) while clusters of variables are distinguished by different variability of the data. More specifically, we consider the following probabilistic co-clustering model for independent normally distributed random variables \(X_{ij}\):

$$\begin{aligned} \phantom {m} X_{ij} \ \sim \ \mathcal{N}(\mu _s, \sigma ^2_t) \quad \text{ for } \quad i \in A_s, j \in B_t, s=1,...,k, t=1,...,\ell \end{aligned}$$
(1)

with the k-partition \(\mathcal{A} = (A_1,...,A_k)\) of the n rows and the \(\ell \)-partition \(\mathcal{B} = (B_1,..., B_\ell )\) of the p columns of the matrix \(X = (X_{ij})\), where row clusters \(A_s\) are characterized by cluster-specific expectations \(\mu _s\) while column classes \(B_t\) are characterized by class-specific variances \(\sigma _t^2\). In this situation, maximum likelihood estimation of the unknown parameters \(\mathcal{A}, \mathcal{B}, \mu = (\mu _1,..., \mu _k)\), and \(\sigma = (\sigma _1^2,..., \sigma _\ell ^2)\) (for fixed k and \(\ell \)) is equivalent to the minimization of the co-clustering criterion

$$\begin{aligned} Q(\mathcal{A}, \mathcal{B}, \mu , \sigma ; X) := \sum _{s=1}^k \sum _{t=1}^\ell \sum _{i\in A_s} \sum _{j \in B_t} \left[ \frac{(x_{ij} - \mu _s)^2}{\sigma _t^2} + \log \sigma _t^2 \right] \ \rightarrow \min _{\mathcal{A}, \mathcal{B}, \mu , \sigma }. \end{aligned}$$
(2)
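In fact, under model (1) the density of X factorizes over the cells, and

$$\begin{aligned} -2\, \log L(\mathcal{A}, \mathcal{B}, \mu , \sigma ; X) \ = \ \sum _{s=1}^k \sum _{t=1}^\ell \sum _{i\in A_s} \sum _{j \in B_t} \left[ \frac{(x_{ij} - \mu _s)^2}{\sigma _t^2} + \log \sigma _t^2 \right] \ + \ n\, p\, \log (2\pi ) \nonumber \end{aligned}$$

so that maximizing the likelihood is equivalent to minimizing the criterion Q in (2), the additive constant \(n\, p\, \log (2\pi )\) being irrelevant for the optimization.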

Equating to zero the partial derivatives w.r.t. \(\mu _s\) and \(\sigma _t^2\) yields the (implicit) formulas for the estimates \(\hat{\mu }_s\) and \(\hat{\sigma }^2_t\):

$$\begin{aligned} \mu _s= & {} \left[ \sum _{t=1}^\ell \frac{|B_t|}{\sigma ^2_t} \ \bar{x}_{A_s, B_t} \right] / \left[ \sum _{t=1}^\ell \frac{|B_t|}{\sigma ^2_t} \right] \end{aligned}$$
(3)
$$\begin{aligned} \sigma ^2_t= & {} \frac{1}{n\cdot |B_t|} \cdot \sum _{j\in B_t} \sum _{s=1}^k \sum _{i\in A_s} (x_{ij} - \mu _s)^2 \\= & {} \frac{1}{n\cdot |B_t|} \cdot \sum _{j\in B_t} \sum _{s=1}^k \left[ \sum _{i\in A_s} (x_{ij} - \bar{x}_{A_s,j})^2 + |A_s| \cdot (\bar{x}_{A_s,j} - \mu _s)^2 \right] . \nonumber \end{aligned}$$
(4)

Here \(|A_s|\), \(|B_t|\) are the class sizes, and we use largely self-explanatory notations such as

$$\begin{aligned} \bar{x}_{A_s,j}:= & {} \sum _{i\in A_s} \ x_{ij}/|A_s|, \qquad \bar{x}_{i, B_t} := \sum _{j \in B_t} \ x_{ij}/|B_t| \nonumber \\ \bar{x}_{A_s,B_t}:= & {} \sum _{i\in A_s} \sum _{j\in B_t} \ x_{ij}/(|A_s|\cdot |B_t|), \quad \bar{x}_{\cdot , \cdot } := \sum _{i=1}^n \sum _{j=1}^p \ x_{ij}/(n\cdot p). \nonumber \end{aligned}$$

So the estimate \(\hat{\mu }_s = \mu _s\) is a weighted mean of the \(\ell \) block means \(\bar{x}_{A_s, B_t}\) (with weights inversely proportional to \(\sigma ^2_t/|B_t|\), the variance of the mean \(\bar{X}_{i,B_t}\) in the class \(B_t\)), and the estimate \(\hat{\sigma }^2_t = \sigma ^2_t \) comprises terms that measure the variability within \(A_s\) (for the variables \(j \in B_t\)) and the deviation of the class-wise means \(\bar{x}_{A_s, j}\) from the estimated class expectations \(\mu _s\).
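For illustration, a minimal NumPy sketch of one pass through the estimating equations (3) and (4) might look as follows; the partitions are assumed to be encoded as integer label vectors `row_labels` (values 0,...,k-1) and `col_labels` (values 0,...,ℓ-1), and all function and variable names are illustrative choices:

```python
import numpy as np

def update_mu_sigma(X, row_labels, col_labels, sigma2, k, ell):
    """One pass of the estimating equations (3) and (4), for fixed partitions."""
    n, p = X.shape
    B_sizes = np.array([(col_labels == t).sum() for t in range(ell)])

    # (3): mu_s is a weighted mean of block means, weights |B_t| / sigma_t^2
    mu = np.empty(k)
    for s in range(k):
        rows = X[row_labels == s]                       # |A_s| x p block of rows
        block_means = np.array([rows[:, col_labels == t].mean() for t in range(ell)])
        w = B_sizes / sigma2
        mu[s] = (w * block_means).sum() / w.sum()

    # (4): sigma_t^2 is the mean squared deviation of the columns in B_t from mu_s
    sigma2_new = np.empty(ell)
    for t in range(ell):
        cols = X[:, col_labels == t]                    # n x |B_t| block of columns
        resid = cols - mu[row_labels][:, None]          # subtract the class mean of each row
        sigma2_new[t] = (resid ** 2).mean()
    return mu, sigma2_new
```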

Since it is impossible to obtain explicit formulas for both estimates, we propose to solve the co-clustering problem (2) by the following iterative algorithm of the k-means type:

  1.

    Begin with two initial partitions \(\mathcal{A}\) and \(\mathcal{B}\) and an initial estimate for \(\sigma \) (e.g., with \(\sigma _t^2\) the empirical variance of the data values in the \(|B_t|\) columns of X corresponding to \(B_t\));

  2.

    Estimate the object-class-specific expectations \(\mu _s\) by (3) (i.e., minimize Q w.r.t. \(\mu \));

  3.

    Estimate the variable-class-specific variances \(\sigma ^2_t\) by (4) (i.e., minimize Q w.r.t. \(\sigma \));

  4.

    For given \(\mathcal{B}\), \(\mu \), and \(\sigma \) minimize Q w.r.t. the k-partition \(\mathcal{A}\) of the set of objects \(\mathcal{O} = \{1,..., n\}\). An elementary argument shows that the minimum is attained by the generalized minimum-distance partition \(\widetilde{\mathcal{A}}\) with object (row) clusters

    $$\begin{aligned} \widetilde{A}_s := \{\ i \in \mathcal{O}\ | \ s = argmin_{s'=1,...,k} \ d(i, \mu _{s'} | \mathcal{B}, \sigma ) \ \} \ \ s = 1,...,k \nonumber , \end{aligned}$$

    where the distance d is defined by

    $$\begin{aligned} d(i, \mu _{s'}| \mathcal{B}, \sigma ) := \sum _{t=1}^\ell \sum _{j\in B_t} (x_{ij} - \mu _{s'})^2/\sigma _t^2 \ .\nonumber \end{aligned}$$
  5.

    Update the parameter estimates \(\mu \), \(\sigma \) by repeating Steps 2. and 3. for the current partitions \(\widetilde{\mathcal{A}}\) and \(\mathcal{B}\).

  6.

    Given \(\widetilde{\mathcal{A}}\), \(\mu \), and \(\sigma \), minimize Q w.r.t. the \(\ell \)-partition \(\mathcal{B}\) of the set of variables \(\mathcal{M} = \{1,...,p\}\); the solution is given by the generalized minimum-distance partition \(\widetilde{\mathcal{B}}\) with variable (column) clusters

    $$\begin{aligned} \widetilde{B}_t := \{\ j \in \mathcal{M}\ | \ t = argmin_{t'=1,...,\ell }\ \delta (j, \sigma ^2_{t'}| \widetilde{\mathcal{A}}, \mu )\ \} \quad t = 1,...,\ell \nonumber , \end{aligned}$$

    where the distance \(\delta \) is defined by

    $$\begin{aligned} \delta (j, \sigma ^2_{t'}| \widetilde{\mathcal{A}}, \mu ) := \sum _{s=1}^k \sum _{i \in \widetilde{A}_s} (x_{ij} - \mu _s)^2/\sigma _{t'}^2 \ + \ n \cdot \log \ \sigma _{t'}^2 \ . \nonumber \end{aligned}$$
  7.

    Iterate 2. to 6. until convergence.

Obviously this algorithm successively decreases the criterion Q in (2) and thus approximates a solution of the stated co-clustering problem. Note that ties, empty classes, local optima, and oscillating partitions may occur and must be handled or avoided in a corresponding computer program.
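A compact NumPy sketch of Steps 1–7 might look as follows; it is a simplified illustration (random initialization, no safeguards against empty classes, ties, or oscillations), and all names are our own choices:

```python
import numpy as np

def _update_params(X, A, B, sigma2, k, ell):
    """Closed-form updates (3) and (4) for fixed partitions A (rows), B (columns)."""
    mu = np.empty(k)
    for s in range(k):
        w = np.array([(B == t).sum() / sigma2[t] for t in range(ell)])
        bm = np.array([X[np.ix_(A == s, B == t)].mean() for t in range(ell)])
        mu[s] = (w * bm).sum() / w.sum()
    sigma2_new = np.array([((X[:, B == t] - mu[A][:, None]) ** 2).mean() for t in range(ell)])
    return mu, sigma2_new + 1e-12


def coclust_mean_variance(X, k, ell, n_iter=50, seed=0):
    """k-means-type algorithm of Sect. 2 (Steps 1-7) for criterion (2)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 1: initial partitions and initial column-class variances
    A = rng.integers(k, size=n)
    B = rng.integers(ell, size=p)
    sigma2 = np.array([X[:, B == t].var() + 1e-12 for t in range(ell)])

    for _ in range(n_iter):
        # Steps 2 and 3: estimate mu_s and sigma_t^2
        mu, sigma2 = _update_params(X, A, B, sigma2, k, ell)

        # Step 4: minimum-distance partition of the rows, distance d(i, mu_s | B, sigma)
        inv = 1.0 / sigma2[B]                                          # column weights 1/sigma_t^2
        d = np.array([((X - m) ** 2 * inv).sum(axis=1) for m in mu])   # k x n
        A_new = d.argmin(axis=0)

        # Step 5: re-estimate mu and sigma for the new row partition
        mu, sigma2 = _update_params(X, A_new, B, sigma2, k, ell)

        # Step 6: minimum-distance partition of the columns, distance delta(j, sigma_t^2 | A, mu)
        sq_col = ((X - mu[A_new][:, None]) ** 2).sum(axis=0)           # length p
        delta = sq_col[None, :] / sigma2[:, None] + n * np.log(sigma2)[:, None]   # ell x p
        B_new = delta.argmin(axis=0)

        # Step 7: stop when both partitions are stable
        if np.array_equal(A_new, A) and np.array_equal(B_new, B):
            break
        A, B = A_new, B_new
    return A, B, mu, sigma2
```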

3 Clustering of Variables Around Latent Factors

In this section, we describe a method for one-way clustering of the p variables (columns) of a data matrix \(X = (x_{ij})\) that has been proposed by Vigneau and Qannari [43] and uses squared correlations for measuring the similarity between two variables. In fact, we show that this method and the related clustering criterion can be derived, as a special case, from a relatively general probabilistic clustering model that characterizes each cluster of variables by a class-specific latent factor. In the following Sect. 4 this model will be integrated into our co-clustering models (12) and (13) for the objects and variables of X.

In many practical contexts, the similarity of two random variables \(Y_j, Y_{j'}\) is measured by their squared correlation \(r^2(Y_j, Y_{j'}) := Corr^2(Y_j, Y_{j'})\). Similarly, in the case of an \(n \times p\) data matrix \(X = (x_{ij}) = (y_1,...,y_p)\), where the j-th column \(y_j = (x_{1j},...,x_{nj})^\top \) represents the j-th variable, the similarity of \(y_j\) and \(y_{j'}\) (or j and \(j'\)) is measured by the square of the empirical correlation

$$\begin{aligned} \quad r(y_j, y_{j'}) := \frac{ s_{y_j, y_{j'}} }{ \sqrt{ s_{y_j, y_j} s_{y_{j'}, y_{j'}} } } \nonumber \end{aligned}$$

with

$$\begin{aligned} s_{y_j, y_{j'}} := (1/n) \ \sum _{i=1}^n (x_{i,j} - \bar{x}_{\cdot ,j}) (x_{i,j'} - \bar{x}_{\cdot ,j'}) = (1/n)\ y_j^\top y_{j'} \nonumber , \end{aligned}$$

where \(\bar{x}_{\cdot , j} := (\sum _{i=1}^n x_{ij})/n\) is the mean of the n entries in the column j of X and the last equality sign holds for centered columns \(y_j\) (i.e., \(\bar{x}_{\cdot , j} = 0\)).

Vigneau and Qannari have integrated this similarity concept into the search for an optimal \(\ell \)-partition \(\mathcal{B} = (B_1,..., B_\ell )\) of the set \(\mathcal{M}\) of variables (columns of X). In order to formulate a corresponding clustering criterion, they define, for each class \(B_t\), a suitable ‘prototype variable’ or ‘class representative’. Instead of choosing one of the observed variables (columns) \(y_j\) from \(B_t\) (medoid approach), they construct a synthetic one, i.e., a virtual column \(c \in \mathfrak {R}^n\) in X. More specifically (and for centered columns \(y_j\)), they define the prototype vector \(c_{B_t} := (c_{t1},..., c_{tn})^\top \in \mathfrak {R}^n\) to be the vector \(c \in \mathfrak {R}^n\) that is most ‘similar’ to the variables in \(B_t\) in the sense

$$\begin{aligned} S(c; B_t) := \sum _{j\in B_t} r^2(y_j, c) = (1/n)\ c^\top X_{B_t} X^\top _{B_t} c \quad \rightarrow \quad \max _{c \in \mathfrak {R}^n, ||c|| = 1}, \end{aligned}$$
(5)

where \(X_{B_t}\) is the data matrix X restricted to the variables (columns) of \(B_t\). Classical eigenvalue theory shows that the solution \(c_{B_t}\) is given by the standardized eigenvector \(v_t\) that belongs to the largest eigenvalue \(\lambda _t\) of \(X_{B_t} X^\top _{B_t}\) (and also \(X^\top _{B_t} X_{B_t}\)), i.e., by the first principal component in \(B_t\). Finally, Vigneau and Qannari formulate the following criterion for clustering variables:

$$\begin{aligned} g_3(\mathcal{B}; X) := \sum _{t=1}^\ell \sum _{j \in B_t} \ r^2(y_j,c_{B_t}) \quad \rightarrow \quad \max _\mathcal{B} \end{aligned}$$
(6)

that is equivalent to the two-parameter correlation clustering criterion

$$\begin{aligned} g_4(\mathcal{B}, \mathcal{C}; X) := \sum _{t=1}^\ell \sum _{j \in B_t} \ r^2(y_j, c_t) \quad \rightarrow \quad \max _{\mathcal{B}, \mathcal{C}}, \end{aligned}$$
(7)

where maximization is also with respect to the choice of the system \(\mathcal{C} = \{c_1,..., c_\ell \} \) of \(\ell \) standardized class-specific prototype variables (vectors) \(c_1, ..., c_\ell \in \mathfrak {R}^n\).

From its definition as a two-parameter optimization problem it is evident that for the variable clustering problem (7) a (sub-)optimum \(\ell \)-partition \(\mathcal{B}\) of variables can be obtained by a generalized k-means algorithm (a code sketch follows the list below):

  (1)

    Begin with an initial partition \(\mathcal{B} = (B_1,..., B_\ell )\) of \(\mathcal{M} = \{1,..., p\}\).

  (2)

    Partially optimize the clustering criterion with respect to the class prototype system \(\mathcal{C}\) for the classes \(B_t\) according to (5), thus yielding the class-specific eigenvector solutions \(c_{B_t}\) (class-specific principal components).

  (3)

    Build a new \(\ell \)-partition \(\mathcal{B}\) of the variables by assigning each variable \(y_j\) to the ‘most similar’ \(c_{B_t}\), i.e., the one with the largest value of \(r^2(y_j, c_{B_t})\).

  (4)

    Iterate (2) and (3) until convergence.
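A minimal NumPy sketch of this generalized k-means scheme for criterion (6) might look as follows; the columns of X are assumed to be centered (then each prototype, being a leading eigenvector of \(X_{B_t} X_{B_t}^\top \), is centered as well, and \(r^2(y_j, c) = (y_j^\top c)^2/||y_j||^2\) for \(||c|| = 1\)); all names are illustrative:

```python
import numpy as np

def cluster_variables(X, ell, n_iter=100, seed=0):
    """Generalized k-means for criterion (6)/(7): cluster the columns of X around
    class-specific latent factors (first principal components).
    Assumes the columns of X are centered (column means zero)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    B = rng.integers(ell, size=p)                      # (1) initial column labels

    for _ in range(n_iter):
        # (2) class prototypes: leading eigenvector of X_Bt X_Bt^T for each class
        C = np.zeros((ell, n))
        for t in range(ell):
            Xt = X[:, B == t]
            if Xt.shape[1] == 0:                       # empty classes keep a zero prototype
                continue
            _, vecs = np.linalg.eigh(Xt @ Xt.T)        # eigenvalues in ascending order
            C[t] = vecs[:, -1]

        # (3) reassign each variable to the prototype with the largest squared correlation;
        #     with centered columns and ||c_t|| = 1: r^2(y_j, c_t) = (y_j' c_t)^2 / ||y_j||^2
        r2 = (C @ X) ** 2 / (X ** 2).sum(axis=0)       # ell x p matrix of r^2 values
        B_new = r2.argmax(axis=0)

        if np.array_equal(B_new, B):                   # (4) stop at convergence
            break
        B = B_new
    return B, C
```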

Defining the similarity of variables by a correlation coefficient implicitly involves the concept of a linear regression. In fact, the correlation clustering criterion (7) above can be obtained from a probabilistic clustering model in which any variable \(y_j= (x_{1j},..., x_{nj})^\top \) of a class \(B_t\) is generated, up to a random normal error, from the same latent factor (prototype variable) \(c_t = (c_{t1},..., c_{tn})^\top \in \mathfrak {R}^n\) by a linear regression. The corresponding regression-type variable clustering model is given by

$$\begin{aligned} X_{ij} = a_j + b_j c_{ti} + e_{ij} \quad \text{ for } \quad i=1,..., n; j \in B_t \end{aligned}$$
(8)

with variable-specific intercepts \(a_j\), slopes \(b_j\), and independent normal errors \(e_{ij} \sim \mathcal{N}(0, \sigma ^2)\). Estimating the unknown \(a_j, b_j\), the prototype system \(\mathcal{C} = (c_1,..., c_\ell )\) and the \(\ell \)-partition \(\mathcal{B}\) by maximizing the likelihood of \(X = (x_{ij})\) is equivalent to the optimization problem

$$\begin{aligned} g_5(\mathcal{B}, \mathcal{C}, a, b; X) := \sum _{t=1}^\ell \sum _{j\in B_t} \sum _{i=1}^n \ (x_{ij} - a_j - b_j c_{ti})^2 \quad \rightarrow \quad \min _{\mathcal{B}, \mathcal{C}, a, b} \ . \end{aligned}$$
(9)

Partially optimizing the inner sum of \(g_5\) with respect to \(a_j, b_j\) yields the classical regression estimates

$$\begin{aligned} \widehat{b}_j := \frac{s_{y_j c_t}}{s_{c_t c_t}} \quad \quad \text{ and } \quad \quad \widehat{a}_j = \bar{x}_{\cdot , j} - \widehat{b}_j \bar{c}_{t,\cdot } \quad \text{ for } \ j \in B_t \end{aligned}$$
(10)

in \(B_t\) with, e.g., \(\bar{c}_{t,\cdot } := \sum _{i=1}^n c_{ti}/n\), and the partial minimum of the two inner sums of (9) is given by

$$\begin{aligned} h(B_t, c_t) := \sum _{j\in B_t} \sum _{i=1}^n \ (x_{ij} - \widehat{a}_j - \widehat{b}_j c_{ti})^2 = \sum _{j\in B_t} \ n \cdot s_{y_j y_j} (1 - r^2(y_j, c_t)) \nonumber \end{aligned}$$

and characterizes the homogeneity of \(B_t\) for a given prototype variable \(c_t\). Finally, the multiparameter clustering problem (9) reduces to the two-parameter mixed continuous-discrete optimization problem for \((\mathcal{B}, \mathcal{C})\):

$$\begin{aligned} g_6(\mathcal{B}, \mathcal{C}; X):= & {} min_{a, b} \ g_5(\mathcal{B}, \mathcal{C}, a, b; X) = \sum _{t=1}^\ell \ h(B_t, c_t) \nonumber \\= & {} \sum _{t=1}^\ell \sum _{j\in B_t} \ n \cdot s_{y_j y_j} (1 - r^2(y_j, c_t)) \quad \rightarrow \quad \min _{\mathcal{B}, \mathcal{C}}. \end{aligned}$$
(11)

For the special case of standardized column variables \(y_j\), i.e., for \(\bar{x}_{\cdot , j} = 0\) and \(s_{y_j y_j} = ||y_j||^2/n = 1\), this criterion is equivalent to the criterion (6) proposed by Vigneau and Qannari [43]. Thus we have shown that their criterion (6) can be derived from a probabilistic clustering model. A software implementation in R is given by Chavent, Liquet, Kuentz-Simonet, and Saracco [15]. In the next section, a similar model will be used for modeling the co-clustering problem.

4 Co-Clustering, Where Variable Clusters are Characterized by Class-Specific Factors

In this section, we propose a co-clustering model for an \(n \times p\) object \(\times \) variable data table \(X = (x_{ij})\) with normally distributed entries where the clusters of objects (rows) are distinguished only by their levels (main effects) while each cluster of variables (columns) is, additionally, characterized by a cluster-specific factor with a high correlation to the variables within this class. We thereby adopt the basic idea followed in Sect. 3 when clustering the variables only. More specifically, with the notation of the former sections and as an extension of the one-way clustering model (8), we consider the model

$$\begin{aligned} X_{ij} \ = \ \mu + \alpha _s + a_j + b_j c_{ti} + e_{ij}\quad&{\text {for}} \ i \in A_s, j \in B_t, \\&s=1,...,k, t = 1,...,\ell \nonumber , \end{aligned}$$
(12)

where \(\mathcal{A} = (A_1,...,A_k)\) and \(\mathcal{B} = (B_1,...,B_\ell )\) are the unknown partitions of rows and columns, respectively (with known k and \(\ell \)), \(\mu \) is a general effect and \(\alpha _s\) the ‘main effect’ of row class \(A_s\). In this model, the vector \(c_t = (c_{t1},..., c_{tn})^\top \) represents a cluster-specific latent factor that acts, in cluster \(B_t\), as an explanatory variable: it explains the n observations of variable j in the j-th column \(y_j = (x_{1j},...,x_{nj})^\top \) of X, up to the main effects, by a linear regression \(a_j + b_j c_{ti}\) on the components of \(c_t\) with unknown variable-specific coefficients \(a_j\) and \(b_j\). As before, the \(e_{ij}\) are independent \(\mathcal{N}(0, \sigma ^2)\) errors.

The clustering problem then consists in finding estimates for the parameters \(\mu \), \(\alpha = (\alpha _1,...,\alpha _k)\), \(a = (a_1,...,a_p)\), \(b = (b_1,...,b_p), \sigma ^2\), the set of factors \(\mathcal{C} = (c_1,...,c_\ell )\), and the partitions \(\mathcal{A}\) and \(\mathcal{B}\) (under suitable norming constraints). In the model (12), the intercepts \(a_j\) of the linear regression part are specified separately for the variables j. In the following, we consider the more specialized case where these intercepts are the same, \(\beta _t\) say, for all variables j from the same class \(B_t\). This is described by the co-clustering model

$$\begin{aligned} X_{ij} \ = \ \mu + \alpha _s + \beta _t + b_j c_{ti} + e_{ij} \quad&{\text {for}} \ i \in A_s, j \in B_t, \\&s=1,...,k, t = 1,...,\ell \nonumber \end{aligned}$$
(13)

with the constraints

$$\begin{aligned} \widetilde{\alpha }_\cdot := \displaystyle {\sum _{s=1}^k \ \frac{|A_s|}{n} \alpha _s} = 0, \quad \widetilde{\beta }_\cdot := \displaystyle { \sum _{t=1}^\ell \frac{|B_t|}{p} \beta _t = 0}, \quad ||c_t||^2 = 1 \quad (t = 1,...,\ell ). \end{aligned}$$
(14)

It describes a situation with additive class-specific main effects \(\alpha _s\) and \(\beta _t\) while interactions are cell-specific with the product form \(b_j c_{ti}\) (factor model).
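To make the model concrete, the following small simulation sketch draws one data matrix from (13) under the constraints (14); all sizes, parameter values, and names are arbitrary illustrations:

```python
import numpy as np

def simulate_model_13(n=60, p=40, k=3, ell=2, sigma=0.3, seed=1):
    """Draw one data matrix X from model (13) with the constraints (14)."""
    rng = np.random.default_rng(seed)
    A = rng.integers(k, size=n)                        # row (object) classes
    B = rng.integers(ell, size=p)                      # column (variable) classes

    mu = 1.0
    alpha = rng.normal(size=k)
    alpha -= np.bincount(A, minlength=k) @ alpha / n   # enforce sum_s |A_s| alpha_s / n = 0
    beta = rng.normal(size=ell)
    beta -= np.bincount(B, minlength=ell) @ beta / p   # enforce sum_t |B_t| beta_t / p = 0

    C = rng.normal(size=(ell, n))
    C /= np.linalg.norm(C, axis=1, keepdims=True)      # ||c_t|| = 1
    b = rng.normal(scale=2.0, size=p)                  # variable-specific loadings b_j

    E = rng.normal(scale=sigma, size=(n, p))
    # x_ij = mu + alpha_s + beta_t + b_j c_ti + e_ij
    X = mu + alpha[A][:, None] + beta[B][None, :] + C[B].T * b[None, :] + E
    return X, A, B
```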

Invoking the maximum likelihood approach for estimating the parameters in (13), we obtain the following factor-induced co-clustering problem:

$$\begin{aligned} Q(\mathcal{A},&\mathcal{B}; \mu , \alpha , \beta , b, \mathcal{C}; X) \nonumber \\&\,\, := \sum _{s=1}^k \sum _{t=1}^\ell \sum _{i\in A_s} \sum _{j\in B_t} \ (x_{ij} - \mu - \alpha _s - \beta _t - b_j c_{ti})^2 \ \rightarrow \ \min \end{aligned}$$
(15)

where minimization is over all parameters under the constraints (14) (the model (12) may be treated similarly). In the following, we propose a generalized alternating k-means-type algorithm for solving this problem where, in each step, we partially optimize the criterion Q, (15), with respect to the involved parameters in turn.

Step 1: Choose an initial configuration \((\mathcal{A}, \mathcal{B}, \mu , \alpha , \beta , \mathcal{C})\). A reasonable choice might be \(\mu = \bar{x}_{\cdot , \cdot }\), \(\alpha _s = \bar{x}_{A_s, \cdot } - \bar{x}_{\cdot , \cdot }\), \(\beta _t = \bar{x}_{\cdot , B_t}- \bar{x}_{\cdot , \cdot }\), while \(\mathcal{A}\) and \(\mathcal{B}\) could be obtained by separately clustering the rows and columns of X, e.g., by the classical k-means algorithm. Moreover, the class-specific factors \(c_1,...,c_\ell \) might be chosen randomly from the unit sphere in \(\mathfrak {R}^n\).

Step 2: For fixed \((\mathcal{A}, \mathcal{B}, \mu , \alpha , \beta , \mathcal{C})\), determine the optimum regression coefficients \(b_1,..., b_p\) that minimize the criterion Q, (15). For notational convenience, we introduce the ‘adjusted’ \(n \times p\) data matrix \(Z = (z_{ij}(\mathcal{A}, \mathcal{B}))\) with entries

$$\begin{aligned} z_{ij}(\mathcal{A}, \mathcal{B}) \ := \ z_{ij}(\mathcal{A}, \mathcal{B}; \mu ,&\alpha , \beta )\ := \ x_{ij} - \mu - \alpha _s - \beta _t \nonumber \\&\,\,\, {\text {for}} \ i \in A_s, j\in B_t, s=1,...,k, t = 1,..., \ell \nonumber \end{aligned}$$

(where main effects are eliminated) such that this partial optimization problem takes the form

$$\begin{aligned} Q = \sum _{t=1}^\ell \sum _{j \in B_t} \sum _{i=1}^n \ (z_{ij}(\mathcal{A}, \mathcal{B}) - b_j c_{ti})^2 \ = \ \sum _{t=1}^\ell \sum _{j\in B_t} \ Q_j \quad \rightarrow \quad \min _{b_1,...,b_p} . \end{aligned}$$
(16)

Minimizing, separately for each \(j \in B_t\), the inner sum \(Q_j\) yields the estimates:

$$\begin{aligned} \widehat{b}_j = \frac{\sum _{i=1}^n \ z_{ij}(\mathcal{A}, \mathcal{B}) c_{ti}}{\sum _{i=1}^n c_{ti}^2 } = z_j(\mathcal{A}, \mathcal{B})^\top c_t \quad \quad j \in B_t, t=1,...,\ell \nonumber \end{aligned}$$

(with \(z_{j}(\mathcal{A}, \mathcal{B})\) the j-th column of Z; note that \(||c_t||^2 = 1\)) and the partial minimum

$$\begin{aligned} \widetilde{Q}(\mathcal{C}):= & {} \min _{b_1,...,b_p} \ Q \ \ = \ \ \sum _{t=1}^\ell \sum _{j \in B_t} \sum _{i=1}^n \ (z_{ij}(\mathcal{A}, \mathcal{B}) - \widehat{b}_j c_{ti})^2 \nonumber \\= & {} \sum _{t=1}^\ell \sum _{j\in B_t} \left( ||z_{j}(\mathcal{A}, \mathcal{B})||^2 - (z_{j}(\mathcal{A}, \mathcal{B})^\top c_t)^2 \right) \end{aligned}$$
(17)
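In code, Step 2 amounts to little more than forming the adjusted matrix Z and one inner product per column; a minimal sketch (with the label vectors A, B and the factor matrix C, holding the unit-norm \(c_t\) as rows, as assumed inputs) is:

```python
import numpy as np

def step2_regression_coefficients(X, A, B, mu, alpha, beta, C):
    """Step 2: adjusted matrix Z, coefficients b_j = z_j' c_t, and the partial minimum (17)."""
    Z = X - mu - alpha[A][:, None] - beta[B][None, :]   # z_ij = x_ij - mu - alpha_s - beta_t
    b = np.einsum('ij,ij->j', C[B].T, Z)                # b_j = z_j' c_{t(j)}, since ||c_t|| = 1
    Q_partial = (Z ** 2).sum() - (b ** 2).sum()         # criterion (17)
    return Z, b, Q_partial
```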

Step 3: Turning now to the factors \(c_t\), we have to minimize the criterion (17) with respect to \(\mathcal{C} = (c_1,...,c_\ell )\). This amounts to maximizing, separately for each class \(B_t\), the criterion

$$\begin{aligned} \sum _{j\in B_t} (z_{j}(\mathcal{A}, \mathcal{B})^\top c_t)^2 = c_t^\top \underbrace{ \left[ \sum _{j\in B_t} z_{j}(\mathcal{A}, \mathcal{B}) z_{j}(\mathcal{A}, \mathcal{B})^\top \right] }_{\displaystyle {{S_t}}} c_t =: c_t^\top S_t c_t \nonumber \end{aligned}$$

with respect to \(c_t\) under the constraint \(||c_t||=1\). As in Sect. 3, the solution of this problem is given by the normalized eigenvector \(\widehat{c}_t\) of the \(n \times n\) matrix \(S_t = S_t(\mathcal{A}, \mathcal{B}, \mu , \alpha , \beta )\) that belongs to the largest eigenvalue of \(S_t\) (first principal component in \(B_t\)).

Step 4: After having obtained the coefficients \(b_j = \hat{b}_j\) and the factors \(c_t = \widehat{c}_t\) we substitute these estimates in the original co-clustering criterion Q, (15), and minimize it with respect to the global and main effects \(\mu \), \(\alpha \), and \(\beta \) under the norming constraints (14). A brief calculation yields the estimates:

$$\begin{aligned} \widehat{\mu }= & {} \bar{x}_{\cdot , \cdot } - \sum _{t=1}^\ell \frac{|B_t|}{p}\ \bar{b}_{B_t} \bar{c}_{t,\cdot } \quad \quad \text{ with }\ \bar{b}_{B_t} := \sum _{j\in B_t} \frac{b_j}{|B_t|} \nonumber \\ \widehat{\alpha }_s= & {} \bar{x}_{A_s, \cdot } - \bar{x}_{\cdot , \cdot } - \sum _{t=1}^\ell \frac{|B_t|}{p}\ \bar{b}_{B_t} (\bar{c}_{t, A_s} - \bar{c}_{t, \cdot }) \nonumber \\ \widehat{\beta }_t= & {} \bar{x}_{\cdot , B_t} - \bar{x}_{\cdot , \cdot } - \bar{b}_{B_t} \bar{c}_{t, \cdot } + \sum _{\tau = 1}^\ell \frac{|B_\tau |}{p} \ \bar{b}_{B_\tau } \bar{c}_{\tau , \cdot } \nonumber \end{aligned}$$

In Steps 2.–4. we have obtained the estimates for the effects \(\mu \), \(\alpha \), \(\beta \), the coefficients \(b_j\), and the factors \(c_t\), i.e., the configuration \((\mathcal{A}, \mathcal{B}; \hat{\mu }, \hat{\alpha }, \hat{\beta }, \hat{b}, \hat{\mathcal{C}})\), for a fixed choice of the partitions \(\mathcal{A} = (A_1,...,A_k)\) of objects and \(\mathcal{B} = (B_1,...,B_\ell )\) of variables. We now update these partitions by consecutively minimizing the criterion Q, (15), with respect to \(\mathcal{B}\) (Step 5) and \(\mathcal{A}\) (Step 6).

Step 5: Concerning first the partition \(\mathcal{B}\) of variables, the new and partially optimum \(\ell \)-partition \(\widehat{\mathcal{B}} = (\widehat{B}_1,..., \widehat{B}_\ell )\) for Q is the minimum-distance partition of \(\mathcal{M} = \{1,...,p\}\) with the classes

$$\begin{aligned} \widehat{B}_t:= & {} \{\ j \in \mathcal{M}\ | \ t = argmin_{\tau = 1,...,\ell } \ \delta (j, \tau ; \mathcal{A}, \mathcal{B}, \hat{\mu }, \hat{\alpha }, \hat{\beta }, \hat{b}, \hat{\mathcal{C}}) \ \} \end{aligned}$$
(18)

for \(t = 1,...,\ell \) where the distance measure \(\delta \) is defined by

$$\begin{aligned} \delta (j, \tau ; \mathcal{A}, \mathcal{B}, \hat{\mu }, \hat{\alpha }, \hat{\beta }, \hat{b}, \hat{\mathcal{C}}):= & {} ||z_{j}(\mathcal{A}, \mathcal{B})||^2 - (z_{j}(\mathcal{A}, \mathcal{B})^\top c_\tau )^2 \end{aligned}$$
(19)

for \( j=1,...,p, \tau =1,..., \ell \) with \(z_{ij}(\mathcal{A}, \mathcal{B}) = z_{ij}(\mathcal{A}, \mathcal{B}; \hat{\mu }, \hat{\alpha }, \hat{\beta })\). In fact, a look at (17) shows that the best partition \(\hat{\mathcal{B}}\) has to minimize the distance \(\delta \), (19), with respect to \(\tau \) for all variables j. Note that it follows from the original formula (15) for Q that the same partition is obtained when using the expression

$$\begin{aligned} \widetilde{\delta }(j, \tau ; \mathcal{A}, \mathcal{B}, \hat{\mu }, \hat{\alpha }, \hat{\beta }, \hat{b}, \hat{\mathcal{C}}):= & {} \sum _{s=1}^k \sum _{i\in A_s} (x_{ij} - \hat{\mu } - \hat{\alpha }_s - \hat{\beta }_\tau - \hat{b}_j \hat{c}_{\tau i})^2 \nonumber \end{aligned}$$

for \(j=1,...,p, \tau =1,..., \ell \) instead of \(\delta \) in (18).

Step 6: Starting with the partition pair \(\mathcal{A}, \widehat{\mathcal{B}}\) and the current parameters \(\hat{\mu }, \hat{\alpha }, \hat{\beta }, \hat{b}, \hat{\mathcal{C}}\), the estimation Steps 2.–4. are now repeated and will result in new estimates \(\mu ^*, \alpha ^*, \beta ^*, b^*, \mathcal{C}^*\). With these estimates and the partition \(\widehat{\mathcal{B}}\) of variables, the k-partition \(\mathcal{A}\) of the set of objects \(\mathcal{O}\) is updated next: the new k-partition \(\widehat{\mathcal{A}}\) that partially minimizes the criterion \(Q(\mathcal{A}, \widehat{\mathcal{B}}; \mu ^*, \alpha ^*, \beta ^*, b^*, \mathcal{C}^*; X)\), is the minimum-distance partition with classes

$$\begin{aligned} \widehat{A}_s:= & {} \{ i \in \mathcal{O} \ | \ s = argmin_{\sigma = 1,...,k} \ d(i, \sigma ; \mathcal{A}, \hat{\mathcal{B}}, \mu ^*, \alpha ^*, \beta ^*, b^*, \mathcal{C}^*) \} \end{aligned}$$
(20)

for \(s = 1,...,k\), where the distance measure d is defined by

$$\begin{aligned} d(i, \sigma ; \mathcal{A}, \hat{\mathcal{B}}, \mu ^*, \alpha ^*, \beta ^*, b^*, \mathcal{C}^*) := \sum _{t=1}^\ell \sum _{j \in \widehat{B}_t} (x_{ij} - \mu ^* - \alpha _\sigma ^* - \beta _t^* - b_j^* c_{ti}^*)^2 \end{aligned}$$
(21)

for \(i=1,...,n, \sigma =1,...,k\).

Step 7: The Steps 2.–6. are repeated until convergence of the two partitions.
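Putting the steps together, a compact NumPy sketch of the whole procedure of this section might look as follows; it is a simplified illustration (random initialization, a fixed number of inner sweeps for Steps 2–4, no safeguards against empty classes or oscillating partitions), and all names are our own choices:

```python
import numpy as np

def factor_coclustering(X, k, ell, n_iter=30, seed=0):
    """Sketch of the generalized k-means algorithm of Sect. 4 for criterion (15) under (14).
    Partitions are label vectors A (rows) and B (columns); C holds the unit-norm factors c_t as rows."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def row_class_means(A):
        return np.array([X[A == s].mean() for s in range(k)])

    def col_class_means(B):
        return np.array([X[:, B == t].mean() for t in range(ell)])

    def estimate(A, B, mu, alpha, beta, C, sweeps=3):
        """Steps 2-4: factors c_t, coefficients b_j, and main effects under (14)."""
        for _ in range(sweeps):
            Z = X - mu - alpha[A][:, None] - beta[B][None, :]          # adjusted data z_ij
            for t in range(ell):                                       # Step 3: c_t = top eigenvector of S_t
                Zt = Z[:, B == t]
                if Zt.shape[1] > 0:
                    C[t] = np.linalg.eigh(Zt @ Zt.T)[1][:, -1]
            b = np.einsum('ij,ij->j', C[B].T, Z)                       # Step 2: b_j = z_j' c_t
            # Step 4: main effects under the constraints (14)
            nB = np.bincount(B, minlength=ell)
            b_bar = np.array([b[B == t].mean() if nB[t] else 0.0 for t in range(ell)])
            c_bar = C.mean(axis=1)                                     # \bar{c}_{t,.}
            c_bar_A = np.array([[C[t, A == s].mean() for s in range(k)] for t in range(ell)])
            w = nB / p * b_bar
            mu = X.mean() - (w * c_bar).sum()
            alpha = row_class_means(A) - X.mean() - (w[:, None] * (c_bar_A - c_bar[:, None])).sum(axis=0)
            beta = col_class_means(B) - X.mean() - b_bar * c_bar + (w * c_bar).sum()
        return mu, alpha, beta, b, C

    # Step 1: initial configuration
    A = rng.integers(k, size=n)
    B = rng.integers(ell, size=p)
    C = rng.normal(size=(ell, n))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    mu, alpha, beta = X.mean(), row_class_means(A) - X.mean(), col_class_means(B) - X.mean()

    for _ in range(n_iter):
        mu, alpha, beta, b, C = estimate(A, B, mu, alpha, beta, C)

        # Step 5: reassign columns via (19), i.e., maximize (z_j' c_tau)^2 over tau
        Z = X - mu - alpha[A][:, None] - beta[B][None, :]
        B_new = ((C @ Z) ** 2).argmax(axis=0)

        # Step 6: re-estimate for (A, B_new), then reassign rows by the distance (21)
        mu, alpha, beta, b, C = estimate(A, B_new, mu, alpha, beta, C)
        base = X - mu - beta[B_new][None, :] - C[B_new].T * b[None, :]   # x_ij - mu - beta_t - b_j c_ti
        d = np.array([((base - a) ** 2).sum(axis=1) for a in alpha])     # k x n
        A_new = d.argmin(axis=0)

        if np.array_equal(A_new, A) and np.array_equal(B_new, B):        # Step 7
            break
        A, B = A_new, B_new
    return A, B, mu, alpha, beta, b, C
```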

Finally, we have obtained the partitions \(\mathcal{A}\) and \(\mathcal{B}\) of objects and variables (rows and columns), together with their characterizations, i.e.,

  • the main effects \(\alpha _s\) of the classes \(A_s\) of objects;

  • the main effects \(\beta _t\) of the classes \(B_t\) of variables together with the factors (prototype variables) \(c_1,..., c_\ell \in \mathfrak {R}^n\) of these classes.

The components of each factor \(c_t\) describe the contribution of the n objects to the composition of the column cluster \(B_t\) and to the object\(\times \)variable interaction terms \(b_j c_{ti}\). For an easy interpretation of the numerical results we can, e.g. (see the plotting sketch after this list),

  • display, for each variable j from class \(B_t\), the n points \((c_{ti}, x_{ij})\), \(i=1,...,n\), in \(\mathfrak {R}^2\) that should lie close to the corresponding regression line \(\eta = \beta _t + b_j c\);

  • display and compare the latent factors \(c_1,...,c_\ell \) with the discrete curves \((i, c_{ti})\), \(i=1,...,n\), in \(\mathfrak {R}^2\), where the object labels i are arranged such that the object classes form contiguous segments; and

  • visualize the \(\ell \) factors \(c_1,...,c_\ell \in \mathfrak {R}^n\) in a two-dimensional principal component display.
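As one possible realization of the first two display suggestions, a small matplotlib sketch (with the fitted quantities from the algorithm above as assumed inputs) is:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_factor_diagnostics(X, A, B, mu, alpha, beta, b, C, j):
    """Diagnostics for one variable j of class t = B[j]:
    left, the adjusted values x_ij - mu - alpha_s against c_ti with the line beta_t + b_j c;
    right, the ell factor 'curves' (i, c_ti) with objects sorted by row class."""
    t = B[j]
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    z = X[:, j] - mu - alpha[A]                          # remove the general and row main effects
    ax1.scatter(C[t], z, s=12)
    grid = np.linspace(C[t].min(), C[t].max(), 50)
    ax1.plot(grid, beta[t] + b[j] * grid, color="red")   # regression line eta = beta_t + b_j c
    ax1.set_xlabel(f"factor c_{t}")
    ax1.set_ylabel(f"adjusted variable {j}")

    order = np.argsort(A)                                # contiguous segments per object class
    for tau in range(C.shape[0]):
        ax2.plot(C[tau, order], label=f"c_{tau}")
    ax2.set_xlabel("objects i (sorted by row class)")
    ax2.set_ylabel("factor value c_ti")
    ax2.legend()
    fig.tight_layout()
    plt.show()
```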

5 Discussion and Extensions

In this paper, we have proposed two probabilistic approaches for clustering simultaneously the objects (rows) and the variables (columns) of a data matrix. In contrast to other approaches where, e.g., ANOVA models or information distances are considered (see, e.g., Bock [8, 10,11,12]), our approach considers situations where the characterization of object clusters is different from the characterization of clusters of variables. In Sect. 2 this has been illustrated for the case when object clusters are characterized by class-specific means while variable clusters are characterized by class-specific variances. Moreover, in Sect. 4 we have introduced co-clustering models where object clusters were defined by main effects, and variable clusters by their main effects and a class-specific factor that explains the variables via a class-specific regression. This latter model was suggested after analyzing, in Sect. 3, a clustering method for variables only (proposed by Vigneau & Qannari [43]) and formulating a corresponding probabilistic model from which our new model can be derived.

For both co-clustering models, we have proposed an appropriate generalized k-means algorithm that proceeds by successively updating model parameters and partitions. These methods can be modified in various ways, e.g., by varying the initial settings and the order of the partial optimization steps. In this respect, this paper does not provide final results and lends itself to various future investigations. Basically, our models should be seen as a prototype for approaches that combine clustering of objects and clustering of variables in a simultaneous, probability-based framework. They can be extended to other two-parameter distributions, to the case of factor hyperplanes (involving higher principal components in each column class), and also to co-clustering models for three-way data, similarly as in Bocci, Vicari, and Vichi [6], Schepers, Van Mechelen, and Ceulemans [37], Vichi, Rocci, and Kiers [42], Wilderjans and Cariou [44], or Wilderjans and Cariou [13].