1 Introduction

The simultaneous partitioning of features and objects into consistent homogeneous blocks, referred to as co-clusters, is a successful extension of one-sided clustering that can make large datasets easier to analyze (Hartigan 1972; Bock 1979; Govaert 1983; Vichi 2001; Van Mechelen et al. 2004; Rocci and Vichi 2008; Govaert and Nadif 2008, 2013; Bock 2020). Starting from a data matrix, a co-cluster can be defined as a submatrix whose elements share a particular pattern. The basic idea behind co-clustering is to identify a structure that is shared by objects and features through their permutations. A variety of co-clustering methods have been applied in different areas, such as bioinformatics (Madeira and Oliveira 2004; Cho and Dhillon 2008; Tanay et al. 2005; Hanczar and Nadif 2012) to group genes and experimental conditions, collaborative filtering (Hofmann and Puzicha 1999; Deodhar and Ghosh 2010) to group users and items, and text mining (Dhillon 2003; Ailem et al. 2017a; Govaert and Nadif 2018; Salah and Nadif 2019; Role et al. 2019) to group words and documents. Through its ability to relate rows and columns, co-clustering generally gives better results than clustering along a single dimension. In addition, co-clustering performs an implicit adaptive dimensionality reduction that allows the use of efficient scalable algorithms for high-dimensional sparse text data. This is crucial in text mining, since the exponential growth of online documents has created an urgent need for effective methods of handling and interpreting high-dimensional sparse document-term matrices, i.e., matrices where documents are represented in the space of terms, and vice versa. Most importantly, text co-clustering can identify the most discriminating words that characterize topics in document classes.

Standard text-focused co-clustering approaches seek to relate documents and words. They do not usually attempt to incorporate side information such as semantic relationships between words, or similarities in document content. The clustering of documents relating to the same topics might nevertheless benefit from additional information about word similarities, since these documents can be expected to contain semantically related terms. Conversely, word clustering might usefully harness side information about (similarities in) document content, given that it is the documents that provide the context for the words. Side information on document latent space and on word latent space could together improve the co-clustering of document-text data.

2 Related work

Inspired by the recent success of neural word embedding models, Ailem et al. (2017b) proposed performing NMF (Non-negative Matrix Factorization) jointly on the document-word and word-context matrices, with shared word factors. Recently, an extension of NMTF-based (Non-negative Matrix Tri-Factorization) co-clustering, namely WC-NMTF (Word Co-Occurrence regularized NMTF) (Salah et al. 2018), a technique that takes account of semantic relationships between terms, has successfully been applied to various text datasets. As well as being high-dimensional and sparse, text data are also heavily unbalanced, and co-clustering methods that focus on document-term matrices need to take this into account. The DCC algorithm (Directional Co-clustering with a Conscience) (Salah and Nadif 2019) has been shown to be particularly suited to tackling this issue. DCC uses the von Mises–Fisher (vMF) mixture model and introduces a conscience mechanism (DeSieno 1988; Ahalt et al. 1990) to avoid empty or highly unbalanced clusters (Banerjee and Ghosh 2004). It exploits the fact that text data are naturally directional, which means that only the directions of the data vectors are relevant, and not their magnitude (Mardia and Jupp 2009). In contrast to WC-NMTF, DCC does not use any regularization.

In this work, we harness the directional property of text data and describe a Regularized Bi-Directional Co-clustering (RBDCo) algorithm for document-term data. The bi-directional aspect of our approach resides in the use of side information for the two dimensions of the document-term matrix. The primary contribution of this work is a general framework based on a matrix formulation of vMF-based co-clustering. A significant outcome of this novel formulation is a very rich, flexible framework for text co-clustering that allows an easy multiplicative regularization on both the word–word semantic relationships and the document–document content similarities. In contrast to existing methods, which generally rely on additive incorporation of similarities, we propose a bi-directional multiplicative regularization that better encapsulates the underlying text data structure. Another contribution of this work is an original method for evaluating the coherence of word clusters. Experimental results on various real-life datasets provide clear empirical evidence of the effectiveness of our co-clustering framework.

The rest of the paper is organized as follows: After reviewing the von Mises–Fisher-based clustering method in Sect. 3, we introduce a matrix view of a derived co-clustering algorithm, namely Directional Co-Clustering with a Conscience (DCC), in Sect. 4. We then show in Sect. 5 how a generalized regularization framework can be built from the von Mises–Fisher model while taking into account the directional property of text data, and this section also looks at how our generalized framework is linked to a variety of other co-clustering approaches. Section 6 is devoted to comparative numerical experiments that demonstrate the effectiveness of our generalized regularization framework. We conclude and suggest future paths in Sect. 7.

2.1 Notation

Let \({\mathbf {X}}= (x_{ij})\) be a data matrix of size \(n \times d\), \(x_{ij}\in {\mathbb {R}}\). The ith row of this matrix is represented by a vector \({\mathbf {x}}_i=(x_{i1},\ldots ,x_{id})^\top \), where \(\top \) denotes the transpose. The partition of the set of rows into g clusters can be represented by a classification matrix \({\mathbf {Z}}\) with elements \(z_{ik}\in \{0,1\}\) satisfying \(\sum _{k=1}^g z_{ik}=1\). We denote by \(z_{.k}=\sum _{i}z_{ik}\) the cardinality of the kth row cluster. The notation \({\mathbf {z}}=(z_1,\ldots ,z_n)^\top \), where \(z_i \in \{1,\ldots ,g\}\) is the cluster label of i, will also be used. Similarly, to represent the partition of the set of columns we use \({\mathbf {W}}= (w_{jk})\), with \(w_{jk}\in \{0,1\}\) satisfying \(\sum _{k=1}^g w_{jk}=1\), the notation \({\mathbf {w}}=(w_1,\ldots ,w_d)\), where \(w_j \in \{1,\ldots ,g\}\) is the cluster label of column j, represented by the vector \({\mathbf {x}}^j\), and \(w_{.k}=\sum _{j}w_{jk}\), the cardinality of the kth column cluster.
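To make the notation concrete, the following minimal Python sketch (a toy illustration, not taken from the paper; sizes and labels are arbitrary) builds the classification matrices \({\mathbf {Z}}\) and \({\mathbf {W}}\) and the cluster cardinalities from label vectors.

```python
import numpy as np

# Toy sizes: n documents, d words, g co-clusters (all hypothetical values).
n, d, g = 6, 8, 2
rng = np.random.default_rng(0)
z = rng.integers(0, g, size=n)      # row labels (0-based here, {1,...,g} in the text)
w = rng.integers(0, g, size=d)      # column labels
Z = np.eye(g)[z]                    # n x g binary classification matrix, rows sum to 1
W = np.eye(g)[w]                    # d x g binary classification matrix, rows sum to 1
z_dot_k = Z.sum(axis=0)             # z_.k : cardinality of each row cluster
w_dot_k = W.sum(axis=0)             # w_.k : cardinality of each column cluster
```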

3 Directional co-clustering

Mixture models have undoubtedly made a very useful contribution to clustering in that they offer considerable flexibility (McLachlan and Peel 2004). A mixture of von Mises–Fisher (vMF) distributions can be a wise choice (Banerjee et al. 2005; Salah and Nadif 2017b) when dealing with directional data distributed on a unit hypersphere \({\mathbb {S}}\). In fact, this model is one of the most appropriate models for clustering high-dimensional sparse data such as the document-term matrices encountered in text mining. In this kind of application, it has been empirically demonstrated that vMF-based clustering methods perform better than a number of existing approaches, see, e.g., (Zhong and Ghosh 2005; Gopal and Yang 2014).

Fig. 1 Graphical model. The parameters \(\varvec{\mu }_{z_i}^{{\mathbf {w}}}\) and \(\kappa _{z_i}\) are the mean direction and the concentration parameter of the vMF distribution \(f({\mathbf {x}}_i|\varvec{\mu }_{z_i}^{{\mathbf {w}}},\kappa _{z_i}) =c_d(\kappa _{z_i})\exp [\kappa _{z_i}(\varvec{\mu }_{z_i}^{{\mathbf {w}}})^\top {\mathbf {x}}_i]\). The normalization term takes the form \( c_d(\kappa ) = \frac{\kappa ^{\frac{d}{2}-1}}{(2\pi )^{\frac{d}{2}}I_{\frac{d}{2}-1}(\kappa )} \), where \(I_r(\kappa )\) denotes the modified Bessel function of the first kind and order r. For more details on the vMF distribution, the reader can refer to Mardia and Jupp (2009)

In Salah and Nadif (2017a, 2019), the authors proposed a vMF mixture model for co-clustering. The graphical model is depicted in Fig. 1, and its probability density function is given by

$$\begin{aligned} f({\mathbf {x}}_i|\varTheta ) = \sum _{k=1}^g \pi _k f_k({\mathbf {x}}_i|\varvec{\mu }_{k}^{{\mathbf {w}}},\kappa _{k}), \end{aligned}$$

where \(\varTheta = \{ \varvec{\pi }=(\pi _1,\ldots ,\pi _g)\), \(\varvec{\mu }^{{\mathbf {w}}}=(\varvec{\mu }_1^{{\mathbf {w}}},\ldots ,\varvec{\mu }_g^{{\mathbf {w}}})\), \(\kappa _1,\ldots , \kappa _g\}\). Note that \(\varvec{\mu }^{{\mathbf {w}}}\) depends on \({\mathbf {w}}\): \(w_j = k\) if the jth column belongs to the kth column cluster, which is "associated" with the kth row cluster.

Note that with this model the d-dimensional centroids \(\varvec{\mu }_k^{{\mathbf {w}}} = (\mu _{k1},\ldots ,\mu _{k1},\ldots ,\mu _{kg},\ldots ,\mu _{kg})^\top \), where \(\mu _{kh}\) is repeated \(w_{.h}\) times and \(\mu _{kh} = 0\) for all \(k\ne h\), are assumed to be orthonormal. The parameter \(\kappa _k\) denotes the concentration of the kth distribution. The proportion of points \({\mathbf {x}}_i\) generated from the kth component is denoted by the parameter \(\pi _k\), such that \(\sum _k \pi _k = 1\) and \(\pi _k>0\), \(\forall k \in \{1,\ldots ,g\}\). The complete data log-likelihood is thereby given by

$$\begin{aligned} \begin{aligned} L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}})&= \sum _{k}{z_{.k}\log {\pi _k}} + \sum _{k}{z_{.k}\log (c_d(\kappa _k))}\\&\quad + \sum _{i,k}z_\mathrm{ik} \kappa _k (\varvec{\mu }_{k}^{{\mathbf {w}}})^\top {\mathbf {x}}_{i}. \end{aligned} \end{aligned}$$

Assume that all the mixing proportions are equal, i.e., \(\pi _k = \frac{1}{g}, \;\forall k\); thanks to the conscience mechanism, this assumption does not penalize the quality of the clustering. For high dimensionality, i.e., large order \(r = d/2-1\), and a small \(\kappa _k\) (due to sparsity), we have \( 4(r+1) + \kappa _k^{2} \approx 4(r+1)\) and therefore \( \log {c_d(\kappa _k)} \approx -\frac{d}{2}\log {2\pi } - \log {c} \), where \(c = \frac{4(r+1)}{2^{r+2}(r+1)!}\); for details, the reader can refer to Salah and Nadif (2017a). Thus, \(L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}})\) becomes

$$\begin{aligned} L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}}) = \sum _{i,k}z_\mathrm{ik} \kappa _k (\varvec{\mu }_{k}^{{\mathbf {w}}})^\top {\mathbf {x}}_{i} + \mathrm{constant}. \end{aligned}$$
(1)
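As a sanity check of this approximation (a sketch under the stated small-\(\kappa \), large-d assumptions, not the authors' code), one can compare the exact \(\log c_d(\kappa )\), computed from the ascending series of the modified Bessel function, with the constant \(-\frac{d}{2}\log 2\pi - \log c\):

```python
import numpy as np
from scipy.special import gammaln

def log_bessel_i(r, kappa, n_terms=30):
    """log I_r(kappa) via its ascending series (accurate for small kappa)."""
    m = np.arange(n_terms)
    log_terms = (2 * m + r) * np.log(kappa / 2) - gammaln(m + 1) - gammaln(m + r + 1)
    return np.logaddexp.reduce(log_terms)

def log_cd_exact(kappa, d):
    r = d / 2 - 1
    return r * np.log(kappa) - (d / 2) * np.log(2 * np.pi) - log_bessel_i(r, kappa)

def log_cd_approx(d):
    r = d / 2 - 1
    # log c with c = 4(r+1) / (2^{r+2} (r+1)!)
    log_c = np.log(4 * (r + 1)) - ((r + 2) * np.log(2) + gammaln(r + 2))
    return -(d / 2) * np.log(2 * np.pi) - log_c

d = 5000                                   # high-dimensional, as for document-term data
for kappa in (0.01, 0.1, 1.0):             # small concentrations (sparse data)
    print(kappa, log_cd_exact(kappa, d), log_cd_approx(d))
```

The two values agree closely, which is why \(\log c_d(\kappa _k)\) can be absorbed into the constant of Eq. (1).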

The concentration parameter \(\kappa _k\) is made inversely proportional to the square root of the number of elements in cluster k, i.e., \(\kappa _k = 1/\sqrt{z_{.k}}\). The row assignments are then obtained by maximizing a weighted Skmeans-like criterion in which the weights \(1/\sqrt{z_{.k}}\) (\(k \in \{1,\ldots ,g\}\)) discourage the absorption of new objects by larger clusters. The same holds for the column assignments, where \(w_{.k}\) is the cardinality of the kth column cluster and \(\varvec{\mu }_k^{{\mathbf {z}}}=(\mu _{k1},\ldots ,\mu _{k1},\ldots ,\mu _{kg},\ldots ,\mu _{kg})^\top \) is its n-dimensional centroid. To sum up, following Salah and Nadif (2017a) we have,

$$\begin{aligned} \begin{aligned} L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}})&\equiv \sum _{i,k}z_\mathrm{ik}\frac{1}{\sqrt{z_{.k}}} (\varvec{\mu }_{k}^{{\mathbf {w}}})^\top {\mathbf {x}}_{i}\\&\equiv \sum _{j,k} w_\mathrm{jk} \frac{1}{\sqrt{w_{.k}}}(\varvec{\mu }_{k}^{{\mathbf {z}}})^\top {\mathbf {x}}^{j}. \end{aligned} \end{aligned}$$
(2)

\(A\equiv B\) means that optimizing A is equivalent to optimizing B. Maximizing (2) with respect to the mean directions yields, for \(\varvec{\mu }_k^{{\mathbf {w}}}\), \(\mu _{kh} = 1/\sqrt{w_{.k}}\) if \(k=h\) and \(\mu _{kh} = 0\) for all \(k\ne h\); similarly, \(\varvec{\mu }_k^{{\mathbf {z}}}\) is deduced using \(1/\sqrt{z_{.k}}\). Note that \(\varTheta \) is now reduced to \(\varvec{\mu }^{{\mathbf {w}}}\) and \(\varvec{\mu }^{{\mathbf {z}}}\), the centers of the row and column clusters.

The authors derived a co-clustering algorithm that we refer to as Directional Co-clustering with a Conscience (DCC), tailored to high-dimensional sparse data (Alg. 1). The DCC algorithm intertwines row and column clusterings at each step so as to optimize \(L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}})\). Integrating the conscience mechanism makes it possible to avoid highly skewed solutions with empty or very small/large clusters. Applied to unbalanced document-term matrices, DCC proves more suitable than most existing co-clustering approaches for handling directional data distributed on the surface of a unit hypersphere.

Algorithm 1 DCC

4 Matrix view of the DCC model

In this section, we propose a matrix formulation of DCC. To this end, we first make use of the matrix formulation of \(\varvec{\mu }^{{\mathbf {w}}} = (\varvec{\mu }^{{\mathbf {w}}}_1,\ldots ,\varvec{\mu }^{{\mathbf {w}}}_k,\ldots ,\varvec{\mu }^{{\mathbf {w}}}_g) \in {\mathbb {R}}^{d \times g}\) and \(\varvec{\mu }^{{\mathbf {z}}} = (\varvec{\mu }^{{\mathbf {z}}}_1,\ldots ,\varvec{\mu }^{{\mathbf {z}}}_k,\ldots ,\varvec{\mu }^{{\mathbf {z}}}_g) \in {\mathbb {R}}^{n \times g}\) . Let us consider the binary classification matrices \({\mathbf {Z}}\in \{0,1\}^{n \times g}\) and \({\mathbf {W}}\in \{0,1\}^{d \times g}\), where the cluster sizes of \({\mathbf {Z}}\) and \({\mathbf {W}}\) are on the diagonal of \({\mathbf {D}}_{\mathbf {z}}={\mathbf {Z}}^\top {\mathbf {Z}}\) and \({\mathbf {D}}_{\mathbf {w}}={\mathbf {W}}^\top {\mathbf {W}}\), respectively. We therefore have

$$\begin{aligned} \varvec{\mu }^{{\mathbf {w}}}={\mathbf {W}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}= {\widetilde{{\mathbf {W}}}} \quad \text{ and } \quad \varvec{\mu }^{{\mathbf {z}}}={\mathbf {Z}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}= {\widetilde{{\mathbf {Z}}}}. \end{aligned}$$
(3)
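Continuing the toy sketch of Sect. 2.1 (with the hypothetical variables Z and W from that snippet), the scaled matrices of (3) can be formed directly; their columns are orthonormal, a property used again in Sect. 5.3.

```python
import numpy as np

Dz_inv_sqrt = np.diag(1.0 / np.sqrt(Z.sum(axis=0)))   # D_z^{-1/2}
Dw_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=0)))   # D_w^{-1/2}
Z_tilde = Z @ Dz_inv_sqrt                             # mu^z in Eq. (3)
W_tilde = W @ Dw_inv_sqrt                             # mu^w in Eq. (3)
assert np.allclose(Z_tilde.T @ Z_tilde, np.eye(Z.shape[1]))   # Z~^T Z~ = I
assert np.allclose(W_tilde.T @ W_tilde, np.eye(W.shape[1]))   # W~^T W~ = I
```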

Using the above matrix formulations, and given a document-term matrix \({\mathbf {X}}\), the optimization of the complete data log-likelihood \(L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}})\) in (2) leads to

$$\begin{aligned} \begin{aligned} \sum _{i,k} z_\mathrm{ik}\frac{1}{\sqrt{z_{.k}}} (\varvec{\mu }_{k}^{{\mathbf {w}}})^\top {\mathbf {x}}_{i}&\equiv Tr({\mathbf {Z}}^\top {\mathbf {X}}{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5})\\&\equiv Tr({\mathbf {W}}^\top {\mathbf {X}}^\top {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}). \end{aligned} \end{aligned}$$
(4)

By virtue of (3), the DCC update formulas can be rewritten in matrix form:

$$\begin{aligned} \begin{aligned}&{\mathbf {Z}}= \mathbf{Binarize} \left( {\mathbf {X}}{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\right) \\&\text{ and }\\&{\mathbf {W}}= \mathbf{Binarize} \left( {\mathbf {X}}^\top {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\right) , \end{aligned} \end{aligned}$$
(5)

where Binarize\(({\mathbf {B}})\) means that, for each row \(i\), \(b_{ik}=1\) if \(k=\arg \max _{k'} b_{ik'}\) and \(b_{ik}=0\) otherwise.

The update rules show the mutual interaction between the set of documents and the set of words. If a word w is common to many documents associated with a co-cluster \({\mathbf {C}}_i\), then the word w will be associated with the co-cluster \({\mathbf {C}}_i\). Conversely, if a document contains many words that are associated with a co-cluster \({\mathbf {C}}_i\), then the document will be associated with the co-cluster \({\mathbf {C}}_i\). To find the desired solution for the partitions \({\mathbf {Z}}\) and \({\mathbf {W}}\), we can alternate the two rules (5) until a fixed point is reached.
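The following Python sketch illustrates this fixed-point alternation (Eqs. (3) and (5)). It is a minimal illustration, not the authors' DCC implementation; it assumes X is a row-normalized (unit L2-norm) document-term array and ignores the stochastic column assignments and stopping tests of the actual algorithm.

```python
import numpy as np

def binarize(scores):
    """Row-wise hard assignment: b_ik = 1 iff k = argmax_k' b_ik'."""
    B = np.zeros_like(scores)
    B[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1.0
    return B

def dcc_sketch(X, g, n_iter=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Z = np.eye(g)[rng.integers(0, g, n)]           # random initial row partition
    W = np.eye(g)[rng.integers(0, g, d)]           # random initial column partition
    for _ in range(n_iter):
        # Conscience weights 1/sqrt(cluster size); the max(., 1) guards empty clusters.
        Dz = np.diag(1.0 / np.sqrt(np.maximum(Z.sum(axis=0), 1)))
        Dw = np.diag(1.0 / np.sqrt(np.maximum(W.sum(axis=0), 1)))
        Z = binarize(X @ (W @ Dw) @ Dz)            # Eq. (5): Z <- Binarize(X W~ D_z^{-1/2})
        Dz = np.diag(1.0 / np.sqrt(np.maximum(Z.sum(axis=0), 1)))
        W = binarize(X.T @ (Z @ Dz) @ Dw)          # Eq. (5): W <- Binarize(X^T Z~ D_w^{-1/2})
    return Z, W
```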

5 Regularized bi-directional co-clustering

Text data co-clustering relies on the duality between the document and word spaces, i.e., documents can be grouped based on their distribution with respect to words, while words can be grouped based on their distribution with respect to documents. Existing co-clustering algorithms generally rely on the input document-term matrix \({\mathbf {X}}\). While some of them also consider pure word–word semantic correlations, as proposed by Salah et al. (2018), existing co-clustering methods fail to consider side information arising from both word–word semantic correlations and document–document similarities. To fill this gap, we propose a Regularized Bi-directional Co-clustering (RBDCo) based on an appropriate matrix formulation. We construct two similarity matrices, \({\mathbf {S}}_r\) for similarities in document content and \({\mathbf {S}}_c\) for semantic correlations between words (see Sect. 5.5), in order to exploit hidden structures in documents and words. Our co-clustering method is then formulated as an iterative matrix multiplication process with the two similarity matrices as regularizers, which means that the partitions of documents and words are smoothed with respect to document similarities and semantic correlations of words.

Formally, let us consider the block matrix \(\left[ {\mathbf {Z}}~ {\mathbf {W}}\right] ^\top \). Exploiting the off-diagonal block structure of \(\left[ \begin{array}{cc} 0&{} {\mathbf {X}}\\ {\mathbf {X}}^{\top }&{} 0 \\ \end{array} \right] \), we can write the update rules of the aforementioned DCC as an iterative matrix multiplication procedure based on the appropriate block matrices:

$$\begin{aligned} \left[ \begin{array}{c} {\mathbf {Z}}\\ {\mathbf {W}}\\ \end{array} \right] \leftarrow \left[ \begin{array}{cc} 0&{} {\mathbf {X}}\\ {\mathbf {X}}^{\top }&{} 0 \\ \end{array} \right] \left[ \begin{array}{c} {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ {\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ \end{array} \right] = \left[ \begin{array}{c} {\mathbf {X}}{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ {\mathbf {X}}^\top {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ \end{array} \right] . \end{aligned}$$
(6)

This formulation clearly shows how DCC utilizes the duality between document and word spaces. The document clustering \({\mathbf {Z}}\) is derived as a weighted projection of the data matrix \({\mathbf {X}}\) on the subspace spanned by the word partition \({\mathbf {W}}\). Similarly, the word partition is derived as a weighted projection of the data matrix \({\mathbf {X}}\) on the subspace spanned by the document partition \({\mathbf {Z}}\).

5.1 RBDCo method

We propose a regularized bi-directional data co-clustering method, RBDCo, that draws advantage from our block matrix formulation of DCC (Eq. 6) and harnesses two regularized data matrices, \({\mathbf {M}}_{\mathbf {z}}\) and \({\mathbf {M}}_{\mathbf {w}}\), with values in \(\{ {\mathbf {X}}, {\mathbf {S}}_r {\mathbf {X}}, {\mathbf {X}}{\mathbf {S}}_c, {\mathbf {S}}_r{\mathbf {X}}{\mathbf {S}}_c\}\) (see Sect. 5.5). The objective of RBDCo is to optimize the following trace criterion:

$$\begin{aligned} J_{RBDCo}\equiv & {} \frac{1}{2} Tr\left( \left[ \begin{array}{c} {\mathbf {Z}}\\ {\mathbf {W}}\\ \end{array} \right] ^\top \left[ \begin{array}{cc} 0&{} {\mathbf {M}}_{\mathbf {z}}\\ {\mathbf {M}}_{\mathbf {w}}^{\top }&{} 0 \\ \end{array} \right] \left[ \begin{array}{c} {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ {\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ \end{array} \right] \right) \nonumber \\\equiv & {} \frac{1}{2} \left[ Tr\left( {\mathbf {Z}}^\top {\mathbf {M}}_{\mathbf {z}}{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\right) +Tr\left( {\mathbf {W}}^\top {\mathbf {M}}_{\mathbf {w}}^\top {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\right) \right] \nonumber \\\equiv & {} \frac{1}{2} Tr\left( {\widetilde{{\mathbf {Z}}}}^\top ({\mathbf {M}}_{\mathbf {z}}+{\mathbf {M}}_{\mathbf {w}}){\widetilde{{\mathbf {W}}}}\right) . \end{aligned}$$
(7)

The data co-clustering task is carried out by iteratively computing \({\mathbf {Z}}\) and \({\mathbf {W}}\) based on the interplay between the two updating rules derived from the maximization of the objective criterion \(J_{RBDCo}\),

$$\begin{aligned} \left[ \begin{array}{c} {\mathbf {Z}}\\ {\mathbf {W}}\\ \end{array} \right] \leftarrow \left[ \begin{array}{cc} 0&{} {\mathbf {M}}_{\mathbf {z}}\\ {\mathbf {M}}_{\mathbf {w}}^{\top }&{} 0 \\ \end{array} \right] \left[ \begin{array}{c} {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ {\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ \end{array} \right] = \left[ \begin{array}{c} {\mathbf {M}}_{\mathbf {z}}{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ {\mathbf {M}}_{\mathbf {w}}^\top {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ \end{array} \right] . \end{aligned}$$
(8)

If we set \({\mathbf {M}}_{\mathbf {z}}={\mathbf {M}}_{\mathbf {w}}={\mathbf {S}}_r {\mathbf {X}}{\mathbf {S}}_c\), this leads to,

$$\begin{aligned} \left[ \begin{array}{c} {\mathbf {Z}}\\ {\mathbf {W}}\\ \end{array} \right] \leftarrow \left[ \begin{array}{c} {\mathbf {S}}_r {\mathbf {X}}{\mathbf {S}}_c{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5} \\ {\mathbf {S}}_c {\mathbf {X}}^\top {\mathbf {S}}_r {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5} \\ \end{array} \right] . \end{aligned}$$
(9)

The RBDCo updating rules in (8) mutually exploit the duality of the documents and words and reinforce their joint clustering with bi-directional multiplicative regularizations using \({\mathbf {S}}_c\) and \({\mathbf {S}}_r\). By generating explicit assignments of words, RBDCo produces interpretable descriptions of the resulting co-clusters. In addition, by iteratively alternating between the two updating rules, RBDCo performs an implicit adaptive word selection at each iteration and flexibly measures the distances between documents. It therefore works well with high-dimensional sparse data. The conscience mechanism embedded in RBDCo also means that it performs well with unbalanced document-term data (see Sect. 6). Algorithm 2 details the alternating procedure of RBDCo.

In the case of a symmetric regularization, we set \({\mathbf {M}}_{\mathbf {z}}={\mathbf {M}}_{\mathbf {w}}={\mathbf {M}}\), i.e., the same regularization is applied to the update rules for both \({\mathbf {Z}}\) and \({\mathbf {W}}\). The objective of RBDCo is then reduced to

$$\begin{aligned} J_{RBDCo}\equiv & {} Tr\left( {\mathbf {Z}}^\top {\mathbf {M}}{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\right) . \end{aligned}$$
(10)

If \({\mathbf {M}}_{\mathbf {z}}={\mathbf {M}}_{\mathbf {w}}={\mathbf {X}}\), then RBDCo reduces to the particular case DCC. In fact, comparing (8) and (6), it is easy to see that RBDCo generalizes DCC, since DCC is RBDCo with all similarity matrices equal to \({\mathbf {I}}\).

Algorithm 2 RBDCo
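A minimal sketch of the corresponding alternating updates (Eq. 8) is given below. It reuses `binarize` from the DCC sketch in Sect. 4; the matrices \({\mathbf {M}}_{\mathbf {z}}\) and \({\mathbf {M}}_{\mathbf {w}}\) are chosen according to Table 1, and the initialization and stopping details are illustrative rather than those of Algorithm 2.

```python
def rbdco_sketch(M_z, M_w, g, n_iter=50, seed=0):
    n, d = M_z.shape
    rng = np.random.default_rng(seed)
    Z = np.eye(g)[rng.integers(0, g, n)]
    W = np.eye(g)[rng.integers(0, g, d)]
    for _ in range(n_iter):
        Dz = np.diag(1.0 / np.sqrt(np.maximum(Z.sum(axis=0), 1)))
        Dw = np.diag(1.0 / np.sqrt(np.maximum(W.sum(axis=0), 1)))
        Z = binarize(M_z @ (W @ Dw) @ Dz)          # Z <- Binarize(M_z W~ D_z^{-1/2})
        Dz = np.diag(1.0 / np.sqrt(np.maximum(Z.sum(axis=0), 1)))
        W = binarize(M_w.T @ (Z @ Dz) @ Dw)        # W <- Binarize(M_w^T Z~ D_w^{-1/2})
    return Z, W

# Examples of regularization schemes (with S_r and S_c built as in Sect. 5.5):
# DCC:                       rbdco_sketch(X, X, g)
# Symmetric (Eq. 9):         rbdco_sketch(S_r @ X @ S_c, S_r @ X @ S_c, g)
# Asymmetric cross (Eq. 12): rbdco_sketch(X @ S_c, S_r @ X, g)
```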

5.2 A generalized regularization framework

RBDCo offers a highly flexible framework in the context of text data co-clustering for the integration of supplementary information embedded in matrices that encapsulate similarities between documents and semantic correlations between words. We distinguish two types of regularization: (i) symmetric regularization, which consists in the application of the same regularization for the update of \({\mathbf {Z}}\) and \({\mathbf {W}}\) (\({\mathbf {M}}_{\mathbf {z}}={\mathbf {M}}_{\mathbf {w}}\)), and (ii) asymmetric regularization, which considers different regularizations for the update of \({\mathbf {Z}}\) and \({\mathbf {W}}\) (\({\mathbf {M}}_{\mathbf {z}}\ne {\mathbf {M}}_{\mathbf {w}}\)). Table 1 summarizes the different symmetric and asymmetric configurations covered by RBDCo.

Table 1 Description of RBDCo regularization schemes

The different regularization schemes described in Table 1 highlight the flexibility of the proposed model and the connections with other approaches that can be derived from it (Sect. 5.3). In our study, we indicate and justify the choice of the model retained for the document-term data on which we focus (see Particular cases and Sect. 6.3.1). For other types of data, the user may favor one model over another. Automatic model selection could be an interesting direction for future study.

Particular cases The high degree of flexibility offered by RBDCo for multiplicative bi-directional regularizations gives rise to a variety of versions. For instance, if the identity matrix is assigned to the right-hand side of the regularization matrices \({\mathbf {M}}_{\mathbf {z}}\) and \({\mathbf {M}}_{\mathbf {w}}^\top \), we obtain the asymmetric uncross case:

$$\begin{aligned} \left[ \begin{array}{c} {\mathbf {Z}}\\ {\mathbf {W}}\\ \end{array} \right] \leftarrow \left[ \begin{array}{cc} 0&{} {\mathbf {S}}_r{\mathbf {X}}\\ {\mathbf {S}}_c{\mathbf {X}}^{\top }&{} 0 \\ \end{array} \right] \left[ \begin{array}{c} {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ {\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ \end{array} \right] = \left[ \begin{array}{c} {\mathbf {S}}_r{\mathbf {X}}{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5} \\ {\mathbf {S}}_c{\mathbf {X}}^\top {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ \end{array} \right] .\nonumber \\ \end{aligned}$$
(11)

Similarly, if the identity matrix is assigned to the left-hand side of the \({\mathbf {M}}_{\mathbf {z}}\) and \({\mathbf {M}}_{\mathbf {w}}^\top \) regularization matrices, we obtain the asymmetric cross case:

$$\begin{aligned} \left[ \begin{array}{c} {\mathbf {Z}}\\ {\mathbf {W}}\\ \end{array} \right] \leftarrow \left[ \begin{array}{cc} 0&{} {\mathbf {X}}{\mathbf {S}}_c \\ {\mathbf {X}}^{\top }{\mathbf {S}}_r &{} 0 \\ \end{array} \right] \left[ \begin{array}{c} {\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ {\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ \end{array} \right] = \left[ \begin{array}{c} {\mathbf {X}}{\mathbf {S}}_c{\widetilde{{\mathbf {W}}}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\\ {\mathbf {X}}^\top {\mathbf {S}}_r{\widetilde{{\mathbf {Z}}}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\\ \end{array} \right] .\nonumber \\ \end{aligned}$$
(12)

This second particular case, RBDCo\(_{[S_c,S_r]}\), usually produces the best performance with document-text data (see Sect. 6). Here, row/document clustering \({\mathbf {Z}}\) is regularized with the word co-occurrence information \({\mathbf {S}}_c\), and column/word clustering \({\mathbf {W}}\) is regularized with the document content similarities \({\mathbf {S}}_r\). This cross-regularization is the most natural bi-directional regularization, reflecting the iterative alternating projections of words in the document space, and vice versa.

5.3 Connection to matrix decomposition

5.3.1 Connection to NMF

Basically, \( {\widetilde{{\mathbf {Z}}}}={\mathbf {Z}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\) denotes the likelihood of documents being associated with document clusters, and \( {\widetilde{{\mathbf {W}}}}={\mathbf {W}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\) the likelihood of words being associated with word clusters. The \({ ij}\)th entry of \( {\widetilde{{\mathbf {Z}}}} {\widetilde{{\mathbf {W}}}}^\top \) therefore indicates the possibility that the jth word will be present in the ith document, computed as the dot product of the ith row of \( {\widetilde{{\mathbf {Z}}}}\) and the jth row of \( {\widetilde{{\mathbf {W}}}}\). Hence, \( {\widetilde{{\mathbf {Z}}}} {\widetilde{{\mathbf {W}}}}^\top \) can be interpreted as an approximation of the original data \({\mathbf {X}}\). Our goal is then to find a \({\mathbf {Z}}\) and a \({\mathbf {W}}\) that minimize the squared error between \({\mathbf {X}}\) and its approximation \( {\widetilde{{\mathbf {Z}}}} {\widetilde{{\mathbf {W}}}}^\top \). From a Nonnegative Matrix Factorization (NMF) perspective (Lee and Seung 2001), we have

$$\begin{aligned} \min _{{\mathbf {Z}},{\mathbf {W}}}||{\mathbf {X}}- {\mathbf {Z}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}{\mathbf {D}}_{\mathbf {w}}^{-0.5}{\mathbf {W}}^\top ||_F^2\equiv & {} \min _{ {\widetilde{{\mathbf {Z}}}}, {\widetilde{{\mathbf {W}}}}}||{\mathbf {X}}- {\widetilde{{\mathbf {Z}}}} {\widetilde{{\mathbf {W}}}}^\top ||_F^2\\\equiv & {} \max _{ {\widetilde{{\mathbf {Z}}}}, {\widetilde{{\mathbf {W}}}}} Tr\left( {\widetilde{{\mathbf {Z}}}}^\top {\mathbf {X}}{\widetilde{{\mathbf {W}}}}\right) \\\equiv & {} \max _{\varTheta ,{\mathbf {Z}}} L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}}). \end{aligned}$$

It will be remarked that, by construction, both \( {\widetilde{{\mathbf {Z}}}}\) and \( {\widetilde{{\mathbf {W}}}}\) are non-negative and have orthonormal columns; we have \({\widetilde{{\mathbf {Z}}}}^\top {\widetilde{{\mathbf {Z}}}}={\mathbf {D}}_{\mathbf {z}}^{-0.5}{\mathbf {Z}}^\top {\mathbf {Z}}{\mathbf {D}}_{\mathbf {z}}^{-0.5} ={\mathbf {I}}\), and similarly \( {\widetilde{{\mathbf {W}}}}^\top {\widetilde{{\mathbf {W}}}}={\mathbf {D}}_{\mathbf {w}}^{-0.5}{\mathbf {W}}^\top {\mathbf {W}}{\mathbf {D}}_{\mathbf {w}}^{-0.5} ={\mathbf {I}}\). Our proposed generalized regularization framework RBDCo therefore allows us to see that, in its basic configuration, vMF model-based co-clustering with a conscience mechanism is in fact equivalent to a double orthogonal NMF applied to spherical data.

5.3.2 Link to NMTF

In a similar way, we can identify the link to Non-negative Matrix Tri-Factorization. Let us consider the weighting matrix \({\mathbf {D}}={\mathbf {D}}_{\mathbf {z}}^{-0.5} {\mathbf {D}}_{\mathbf {w}}^{-0.5}\), which is diagonal by construction and where each diagonal value \({\mathbf {D}}_{kk}=1/\sqrt{z_{.k}w_{.k}}\) is the inverse of the geometric mean of the document and word cluster sizes in block k. It follows that \(\min _{{\mathbf {Z}},{\mathbf {W}}}||{\mathbf {X}}- {\mathbf {Z}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}{\mathbf {D}}_{\mathbf {w}}^{-0.5}{\mathbf {W}}^\top ||_F^2\) is equivalent to

$$\begin{aligned} \min _{ {\mathbf {Z}}, {\mathbf {W}}, {\mathbf {D}}={\mathbf {D}}_{\mathbf {z}}^{-0.5} {\mathbf {D}}_{\mathbf {w}}^{-0.5}}||{\mathbf {X}}- {\mathbf {Z}}{\mathbf {D}}{\mathbf {W}}^\top ||_F^2\equiv & {} \max _{ {\mathbf {Z}}, {\mathbf {W}}, {\mathbf {D}}} Tr\left( {\mathbf {Z}}^\top {\mathbf {X}}{\mathbf {W}}{\mathbf {D}}\right) \\\equiv & {} \max _{\varTheta ,{\mathbf {Z}}} L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}}) \end{aligned}$$

which is also equivalent to the fast NMTF proposed by Wang et al. (2011), with an additional constraint on the centroid matrix \({\mathbf {D}}\) in order to meet the requirement of directional data.

5.3.3 Link to spectral co-clustering

If, on the other hand, we relax the non-negativity constraint on both \({\widetilde{{\mathbf {Z}}}}\) and \({\widetilde{{\mathbf {W}}}}\), we have

$$\begin{aligned} \max _{\varTheta ,{\mathbf {Z}}} L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}})\equiv & {} \max _{ {\widetilde{{\mathbf {Z}}}}^\top {\widetilde{{\mathbf {Z}}}}={\mathbf {I}},{\widetilde{{\mathbf {W}}}}^\top {\widetilde{{\mathbf {W}}}}={\mathbf {I}}} Tr\left( {\widetilde{{\mathbf {Z}}}}^\top {\mathbf {X}}{\widetilde{{\mathbf {W}}}}\right) , \end{aligned}$$

where \({\widetilde{{\mathbf {Z}}}}={\mathbf {Z}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}\) and \({\widetilde{{\mathbf {W}}}}={\mathbf {W}}{\mathbf {D}}_{\mathbf {w}}^{-0.5}\). It is easy to verify that \({\widetilde{{\mathbf {Z}}}}\) and \({\widetilde{{\mathbf {W}}}}\) satisfy the orthogonality constraint, i.e., \({\widetilde{{\mathbf {Z}}}}^\top {\widetilde{{\mathbf {Z}}}}={\mathbf {I}} \text{ and } {\widetilde{{\mathbf {W}}}}^\top {\widetilde{{\mathbf {W}}}}={\mathbf {I}}. \) This optimization problem can be transformed using Lagrange multipliers into an eigenvalue problem. Then, given \(svd({\mathbf {X}})={\widetilde{{\mathbf {Z}}}}\varSigma {\widetilde{{\mathbf {W}}}}^\top \), the discrete co-clustering is obtained by performing k-means on the concatenated data \(\left[ {\widetilde{{\mathbf {Z}}}}~ {\widetilde{{\mathbf {W}}}}\right] ^\top \). This is equivalent to the spectral co-clustering method proposed in (Dhillon and Modha 2001).
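As an illustration of this relaxed solution (a sketch relying on standard SciPy/scikit-learn routines, not a faithful reimplementation of the cited method), the leading singular vectors play the role of \({\widetilde{{\mathbf {Z}}}}\) and \({\widetilde{{\mathbf {W}}}}\) and are then discretized with k-means:

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def spectral_coclustering_sketch(X, g, seed=0):
    U, s, Vt = svds(X.astype(float), k=g)                  # leading singular triplets of X
    embedding = np.vstack([U, Vt.T])                       # stack row and column factors
    labels = KMeans(n_clusters=g, n_init=10,
                    random_state=seed).fit_predict(embedding)
    return labels[:X.shape[0]], labels[X.shape[0]:]        # document and word labels
```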

5.4 Link to block seriation

The basic idea of block co-clustering consists in modelling the simultaneous row and column partitions using a block seriation relation \({\mathbf {Q}}\) defined on \(I \times J\) (where I is the set of objects and J the set of attributes). Given that \({\mathbf {Q}}={\mathbf {Z}}{\mathbf {W}}^{T}\), the general term can be expressed as follows: \({\mathbf {q}}_\mathrm{ij}=1\) if object i is in the same block as attribute j, and \({\mathbf {q}}_\mathrm{ij}=0\) otherwise. Thus we have

$$\begin{aligned} {\mathbf {q}}_\mathrm{ij}=\sum _{k=1}^g{\mathbf {z}}_\mathrm{ik}{\mathbf {w}}_\mathrm{jk}=\left( {\mathbf {Z}}{\mathbf {W}}^\top \right) _\mathrm{ij}. \end{aligned}$$
(13)

The matrix \({\mathbf {Q}}\) represents a block seriation relation (see (Marcotorchino 1991) for further details) that must respect the following properties:

  • Binarity \({\mathbf {q}}_\mathrm{ij} \in \{0,1\}, \forall (i,j)\in I \times J.\)

  • Assignment constraints These constraints ensure the bijective correspondence between classes in two partitions, meaning that each class in the partition of I has one corresponding class in the partition of J, and vice versa. These constraints are expressed linearly as follows:

    $$\begin{aligned} \left\{ \begin{array}{ll} \sum _{j \in J}{\mathbf {q}}_\mathrm{ij} \ge 1&{} \hbox {} \forall i\in I\\ \sum _{i \in I}{\mathbf {q}}_\mathrm{ij}\ge 1 &{} \hbox {} \forall j\in J.\\ \end{array} \right. \end{aligned}$$
  • Triad impossible The role of these constraints is to ensure the disjoint structure of the blocks, which is expressed by the following system of inequalities:

    $$\begin{aligned} \left\{ \begin{array}{ll} {\mathbf {q}}_\mathrm{ij}+{\mathbf {q}}_{ij'}+{\mathbf {q}}_{i'j'}-{\mathbf {q}}_{i'j}-1 \le 1 &{} \hbox {} \\ {\mathbf {q}}_{i'j'}+{\mathbf {q}}_{i'j}+{\mathbf {q}}_\mathrm{ij}-{\mathbf {q}}_{ij'}-1 \le 1 &{} \hbox {} \\ {\mathbf {q}}_{i'j}+{\mathbf {q}}_\mathrm{ij}+{\mathbf {q}}_{ij'}-{\mathbf {q}}_{i'j'}-1 \le 1 &{} \hbox {} \\ {\mathbf {q}}_{ij'}+{\mathbf {q}}_{i'j'}+{\mathbf {q}}_{i'j}-{\mathbf {q}}_\mathrm{ij}-1 \le 1. &{} \hbox {} \\ \end{array} \right. \end{aligned}$$

These constraints also generalize transitivity for non-symmetric data. In the case where \(I=J\), it is easy to show that the block seriation relation \({\mathbf {Q}}\) becomes an equivalence relation, i.e., \({\mathbf {Q}}= {\mathbf {Z}}{\mathbf {Z}}^\top \) or \({\mathbf {Q}}= {\mathbf {W}}{\mathbf {W}}^\top \) .

It will be remarked that (13) is not balanced in terms of the cluster sizes for rows and columns, meaning that a cluster might become small when affected by outliers. For this reason, we propose a new scaled block seriation relation that considers both row and column cluster sizes:

$$\begin{aligned} \tilde{{\mathbf {q}}}_\mathrm{ij}=\sum _{k=1}^g \frac{{\mathbf {z}}_\mathrm{ik}{\mathbf {w}}_\mathrm{jk}}{\sqrt{{\mathbf {z}}_{.k}{\mathbf {w}}_{.k}}}=\sum _{k=1}^g{\tilde{{\mathbf {z}}}}_\mathrm{ik}{\tilde{{\mathbf {w}}}}_{jk} =\left( {\widetilde{{\mathbf {Z}}}}{\widetilde{{\mathbf {W}}}}^T\right) _\mathrm{ij}~. \end{aligned}$$
(14)
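Continuing the toy sketch of Sect. 2.1 (with the hypothetical Z, W, Z_tilde and W_tilde defined there), the two relations can be formed and compared directly:

```python
Q = Z @ W.T                     # Eq. (13): q_ij = 1 iff row i and column j share a block
Q_tilde = Z_tilde @ W_tilde.T   # Eq. (14): the same relation, scaled by 1/sqrt(z_.k w_.k)
```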

A new measure, which we call the scaled block seriation criterion, is defined as follows:

$$\begin{aligned} \min _{{\mathbf {Z}},{\mathbf {W}}}||{\mathbf {X}}- {\mathbf {Z}}{\mathbf {D}}_{\mathbf {z}}^{-0.5}{\mathbf {D}}_{\mathbf {w}}^{-0.5}{\mathbf {W}}^\top ||_F^2\equiv & {} \min _{ {\widetilde{{\mathbf {Z}}}}, {\widetilde{{\mathbf {W}}}}}||{\mathbf {X}}- {\widetilde{{\mathbf {Z}}}} {\widetilde{{\mathbf {W}}}}^\top ||_F^2\\\equiv & {} \min _{{\widetilde{{\mathbf {Q}}}}}||{\mathbf {X}}- {\widetilde{{\mathbf {Q}}}}||_F^2\\\equiv & {} \max _{ {\widetilde{{\mathbf {Q}}}}} Tr\left( {\mathbf {X}}{\widetilde{{\mathbf {Q}}}}^\top \right) \\\equiv & {} \max _{\varTheta ,{\mathbf {Z}}} L_c(\varTheta |{\mathbf {X}},{\mathbf {Z}}). \end{aligned}$$

This is a scaled variant of the block seriation method (Marcotorchino 1991).

5.5 RBDCo regularization matrices

The regularization matrices \({\mathbf {S}}_c\) and \({\mathbf {S}}_r\) are built from the original document-term matrix \({\mathbf {X}}\in {\mathbb {R}}^{n \times d}\). We first consider \({\mathbf {S}}_c\), for which we use a nonlinear transformation of the word co-occurrences, namely the Pointwise Mutual Information (\(\mathrm{PMI}\)). It must be emphasized that the \(\mathrm{PMI}\) has been shown to be strongly correlated with human assessment of word relatedness (Newman et al. 2009; Role and Nadif 2011). In other contexts, however, the user can easily introduce their own specific side information about words or documents. The \(\mathrm{PMI}\) between words \(w_i\) and \(w_j\) is defined as \(\log \left( p(w_i,w_j)/(p(w_i)p(w_j))\right) \). Assuming that the documents are the context in which words co-occur, and using the matrix \({\mathbf {C}}={\mathbf {X}}^\top {\mathbf {X}}\), we can empirically estimate the \(\mathrm{PMI}\) as follows:

$$\begin{aligned} \mathrm{PMI_C}(w_i, w_j) = \log \frac{c_{ij} \times c_{..}}{c_{i.}c_{.j}}, \end{aligned}$$
(15)

where \(c_{..}=\sum _{ij}c_{ij}\), \(c_{i.}=\sum _{j}c_{ij}\) and \(c_{.j}=\sum _{i}c_{ij}\). \(\mathrm{PMI}\) values can be positive or negative. Positive values indicate that a word pair co-occurs more often than by chance. Negative values are harder to interpret, since they would seem to indicate word pairs that co-occur less often than by chance. A generally accepted approximation consists in replacing all negative values with 0, giving the Positive Pointwise Mutual Information (\(\mathrm{PPMI}\)). One advantage of the \(\mathrm{PPMI}\) is that it reduces the density of the \(\mathrm{PMI}\) matrix. In RBDCo, we take the \(\mathrm{PPMI}\) matrix computed from \({\mathbf {C}}\) as our word regularization matrix \({\mathbf {S}}_c\). Similarly, we can compute a matrix \({\mathbf {R}}={\mathbf {X}}{\mathbf {X}}^\top \). By virtue of (15), we can define a \(\mathrm{PMI_R}(d_i,d_j)\) between documents \(d_i\) and \(d_j\) that reflects the co-occurrence of two documents in the latent space of words. Just as in the case of \({\mathbf {S}}_c\), we take \({\mathbf {S}}_r\) to be the \(\mathrm{PPMI}\) matrix computed from \({\mathbf {R}}\).
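A minimal sketch of this construction is given below (assuming \({\mathbf {X}}\) is available as a dense NumPy array; the small epsilon guarding the logarithm is an implementation choice, not part of the text).

```python
import numpy as np

def ppmi(C, eps=1e-12):
    """Positive PMI of a co-occurrence matrix C (Eq. 15 with negatives clipped to 0)."""
    total = C.sum()
    row = C.sum(axis=1, keepdims=True)                    # c_i.
    col = C.sum(axis=0, keepdims=True)                    # c_.j
    pmi = np.log((C * total + eps) / (row * col + eps))   # log(c_ij c_.. / (c_i. c_.j))
    return np.maximum(pmi, 0.0)

S_c = ppmi(X.T @ X)   # word-word PPMI: regularizer for the word dimension
S_r = ppmi(X @ X.T)   # document-document PPMI: regularizer for the document dimension
```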

We have chosen to make use of \(\mathrm{PPMI}\) for RBDCo regularization matrices, since these matrices are very general and suitable for incorporating side similarity information. They can also be computed quite easily from the original data matrix. However, other document or word embeddings obtained via external methods might also be used (e.g., Word2Vec (Mikolov et al. 2013), Doc2Vec (Le and Mikolov 2014)).

6 Experimental analysis

We now present our extensive experimental results, which show the good performance of our method across a wide range of real-world text datasets. We first compare several variants of our RBDCo approach (Sect. 6.3.1). We then evaluate the best RBDCo variant in relation to baseline clustering and co-clustering methods (Sect. 6.3.2). Specifically, the clustering algorithms that we consider are k-means, spectral clustering (Spec), Non-Negative Matrix Factorization (NMF), and spherical k-means (Skmeans). It is generally recognized that Skmeans in particular is well suited to high-dimensional sparse text data. The baseline co-clustering algorithms are NMTF, DCC, and CoClustMod. The latter, CoClustMod, is a recent graph modularity-based co-clustering algorithm proposed by Ailem et al. (2016), which was shown, through extensive experiments on text datasets, to outperform several other established co-clustering methods designed for the same task, including the well-known spectral co-clustering (Dhillon and Modha 2001) and ITCC (Dhillon 2003). Finally, we demonstrate the effectiveness of RBDCo in comparison with a competitive regularized text co-clustering method, WC-NMTF (Salah et al. 2018). Although WC-NMTF is a regularized co-clustering approach, it is based on an additive unidirectional regularization, where only word co-occurrence is used. By contrast, RBDCo proposes a bi-directional multiplicative regularization and embeds a conscience mechanism that makes it able to handle strongly unbalanced textual data. k-means, Spec and NMF come from the scikit-learn Python package, and Skmeans and CoClustMod from the coclust Python package. We implemented RBDCo, DCC and WC-NMTF in Python.

6.1 Benchmark datasets

We analyzed eight benchmark datasets widely used for document clustering purposes, namely SPORTS, TR45, LA12, CLASSIC4, CSTR, OHSCALE, PUBMED5, and CLASSIC3. Each dataset can be viewed as a contingency matrix where the coefficients \(x_\mathrm{ij}\) indicate the number of occurrences of word j in document i. Together, these datasets contain a number of different challenging situations, including different degrees of cluster balance, diverse cluster sizes, and various degrees of cluster overlap.

Table 2 Description of Datasets

Table 2 provides an overview of the important characteristics of the datasets sorted in increasing order of their Balance coefficient, which is the ratio of the smallest cluster size to the largest cluster size. As is frequently the case in document-term co-clustering, within these benchmark datasets labels are known only for the documents, and not for the words. However, given that the word partition is inherently linked to the document partition, we would expect the quality of the document clustering to be informative about the quality of the word clustering.

6.2 Experimental settings and evaluation

In all our experiments, the document-term count matrix is normalized using the TF-IDF weighting scheme (term frequency times inverse document frequency), as implemented in the scikit-learn Python package. The results are averaged over 20 different runs. For RBDCo, each run uses 10 different initializations and at most 100 iterations. Specifically, the final RBDCo co-clustering is automatically selected as the one with the best criterion value (Eq. 7) among the different initializations. To avoid poor local solutions that could be produced by early hard word assignments in the RBDCo iterations, we perform stochastic column assignments during the first 70 iterations, as described in (Salah and Nadif 2019). Whenever applicable, the approaches that RBDCo is compared with were also run with 10 initializations and at most 100 iterations.

We evaluate the document clustering quality of RBDCo using two measures that are widely used for assessing the similarity between the estimated clustering and the true clustering. The measures are Normalized Mutual Information (NMI) (Strehl and Ghosh 2003) and Adjusted Rand Index (ARI) (Hubert and Arabie 1985; Steinley 2004). Specifically, NMI evaluates to what extent the estimated clustering is informative about the known clustering, and ARI quantifies the agreement between the estimated clustering and the true labels. NMI is less sensitive than ARI to cluster splitting or merging.
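For reference, the preprocessing and the two evaluation measures correspond to standard scikit-learn calls; the sketch below assumes hypothetical `counts`, `true_labels` and `pred_labels` arrays and is not the authors' experimental script.

```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

X = TfidfTransformer(norm="l2").fit_transform(counts)      # TF-IDF, unit L2-norm rows
# ... run RBDCo (or a baseline) to obtain pred_labels for the documents ...
nmi = normalized_mutual_info_score(true_labels, pred_labels)
ari = adjusted_rand_score(true_labels, pred_labels)
```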

6.3 Empirical results on document clustering

6.3.1 Comparing RBDCo variants

We first compare the four RBDCo versions that incorporate information on both the document and the word dimensions, namely RBDCo\(_{[S_c, S_r]}\), RBDCo\(_{[S_r, S_c]}\), RBDCo\(_{[S_c, S_c]}\) and RBDCo\(_{[S_r, S_r]}\) (see Table 1 for details on RBDCo schemes). Table 3 summarizes the NMI and ARI evaluations for these versions on all the benchmark datasets.

Table 3 Mean±sd clustering NMI and ARI on \(\mathrm{documents}\times \mathrm{terms}\) matrices for RBDCo variants

These four RBDCo schemes give good NMI and ARI results, with a standard deviation of zero for almost all datasets. Overall, RBDCo\(_{[S_c,S_r]}\) is the most effective variant. As already mentioned (Sect. 5.2), this version has the most natural bi-directional regularization for co-clustering. SPORTS is an exception: for this dataset it is the uncross bi-directional regularization RBDCo\(_{[{\mathbf {S}}_r,{\mathbf {S}}_c]}\) that provides the best results (Table 3, first row). This may be explained by its high degree of document cluster imbalance. Among all the datasets, SPORTS has the lowest balance coefficient (0.036) and the lowest ratio of minimum to expected cluster size (\(\mathrm{RME}=0.099\)), i.e., the ratio of the smallest cluster size to the expected cluster size n/g, where g is the number of clusters. SPORTS also has the greatest standard deviation in cluster sizes (\(\mathrm{SDCS}=1253.01\)), defined as \(\{1/(g-1)\sum ^g_{k=1}(n_k-n/g)^2 \}^{0.5}\), where n is the total number of documents and \(n_k\) is the cardinality of the kth cluster. This dataset therefore requires that the document similarity \({\mathbf {S}}_r\) be applied to the document dimension rather than to the word dimension in order to achieve a significantly higher NMI (0.71) and ARI (0.68).

Below, in the light of these results, we will consider only the cross-regularized RBDCo scheme RBDCo\(_{[S_c, S_r]}\), where the document clustering \({\mathbf {Z}}\) is regularized with word co-occurrence information \({\mathbf {S}}_c\), and the word clustering \({\mathbf {W}}\) is regularized with document content similarity \({\mathbf {S}}_r\).

6.3.2 Evaluating RBDCo against baselines

Table 4 is a synopsis of our results for RBDCo and the other clustering/co-clustering methods, comparing their NMI and ARI values across the various benchmark datasets. The first thing to notice is that RBDCo clearly outperforms the standard or competitive co-clustering methods shown in the rightmost three columns of Table 4 (see also Average ranks). In contrast to DCC, CoClustMod and NMF, RBDCo uses two regularization terms, specifically a word correlation matrix \({\mathbf {S}}_c\) and a document similarity matrix \({\mathbf {S}}_r\). The superior performance of RBDCo can therefore be attributed to these regularization terms. The ARI metric is generally more sensitive than NMI to cluster merging or splitting. Nevertheless, RBDCo performs well on both NMI and ARI, even for highly unbalanced datasets (such as TR45 and SPORTS).

Table 4 RBDCo baseline comparisons with clustering and co-clustering approaches
Fig. 2 NMI and ARI comparison of RBDCo\(_{[S_c,S_c]}\), WC-NMTF and DCC on \(\mathrm{documents}\times \mathrm{terms}\) matrices

Fig. 3 Original datasets (a and c) and reorganized versions (b and d) using RBDCo\(_{[S_c,S_r]}\)

As expected, our co-clustering approach generally outperforms baseline clustering methods for document clustering, with a higher mean margin over NMF, k-means and Spec than over the co-clustering approaches. It will be remarked that Skmeans performs well on two benchmark datasets, PUBMED5 and CLASSIC3, with a small mean margin of 0.02 for NMI and 0.01 for ARI. However, our approach significantly outperforms Skmeans on unbalanced text datasets (from SPORTS to OHSCALE), with a mean margin of 0.06 for NMI and 0.11 for ARI (Table 4, first two columns). These good results against a strong competitor like Skmeans are an indication of how well RBDCo is able to deal with a wide range of textual datasets, and in particular with text data containing very small or very large (co-)clusters (see Table 4, Average ranks). It should also be remembered that RBDCo provides a simultaneous clustering of documents and words and hence ensures the identification of the main document cluster topics, while Skmeans gives only a one-sided clustering without automatic association between documents and words.

Table 5 Top frequent terms with RBDCo \(_{[S_c, S_r]}\) and average PPMI within (w/i) and between (b/w) co-clusters
Table 6 \(\mathrm{NPMI}_i\) scores within (w/i) and between (b/w) top frequent word clusters

6.3.3 Bi-directional word-based regularization

We will now compare RBDCo with a competitive regularized method, namely WC-NMTF. WC-NMTF uses an additive unidirectional regularization, where only word co-occurrence is used on the partition of columns to obtain block diagonal co-clusters. A comparison of RBDCo\(_{[{\mathbf {S}}_c,{\mathbf {S}}_c]}\) with WC-NMTF allows us to evaluate the advantage to be derived from multiplicative bi-directional word similarity regularization (i.e., \({\mathbf {S}}_c\) applied to both the columns and the rows of \({\mathbf {X}}\)). To further enrich our comparison, we also consider the DCC evaluations. As detailed in Sect. 3, DCC, just like RBDCo, incorporates a conscience mechanism, but it does not use any regularization. Figure 2 gives the NMI and ARI measures for RBDCo\(_{[S_c,S_c]}\), WC-NMTF, and DCC, on all datasets. These results clearly show the better performance of RBDCo. In particular, RBDCo improves the PUBMED5 document partitioning by almost a factor of two relative to WC-NMTF, even though the bi-directional regularization uses only word similarity information.

6.4 Empirical results on word clustering

6.4.1 Block diagonal co-clustering

RBDCo partitions the data in diagonal document-term co-clusters, resulting in a document and a word clustering. Figure 3b, d shows the structures revealed by RBDCo\(_{[S_c, S_r]}\) for PUBMED5 and CLASSIC4 (dots indicate strong TF-IDF weights). While the original PUBMED5 matrix does not have any explicit structure (Fig. 3a), RBDCo proposes a very clear co-clustering of the documents and terms (Fig. 3b; \(\mathrm{NMI} = 0.91, \mathrm{ARI} = 0.94\)). Figure 3d shows the structure uncovered by RBDCo for CLASSIC4 (\(\mathrm{NMI} = 0.77, \mathrm{ARI} = 0.78\)).

The words that occur most frequently within a co-cluster \(C_i\) are usually considered to be the most representative terms for that co-cluster. The ranking of these top terms obtained with RBDCo is given in Table 5 for PUBMED5 and CLASSIC4. We can see that these terms make the partitioning readily interpretable. Most importantly, they can easily be linked to the topics of the true document classes. PUBMED5 is composed of five document classes, namely Age-related Macular Degeneration (AMD), Otitis, Kidney Stones, Hay Fever and Migraine. RBDCo clearly uncovers the associated topic words. Similarly, for CLASSIC4 RBDCo gives top terms that are highly indicative of the true document classes, namely CISI (information retrieval), CACM (computing machinery), MEDLINE (medical) and CRANFIELD (aeronautical systems). We recall that PUBMED5 and CLASSIC4 contain stemmed terms, so, for example, we have 'ey' rather than 'eye', and 'studi' rather than 'study'.

6.4.2 Quantifying the quality of word clusters

Assessing the quality of word clustering is challenging, since the benchmark datasets commonly used in text document co-clustering provide the true document labels only. To assess the quality of the incorporation of word similarity information on the word partitioning, we first focus on the PPMI score. We then propose an enhanced version of the NPMI (Normalized Pointwise Mutual Information) score, to quantify the quality of the word clustering.

PPMI-score assessment We first consider the ten most frequent terms in each word cluster as the top terms. Table 5 gives the average pairwise PPMI values for the ten most frequent terms within and between word clusters. A random average is also given as a reference. This is an average pairwise PPMI over 100 random groups of ten words drawn from the 1000 most frequent words in the whole corpus. Interestingly, for both the PUBMED5 and the CLASSIC4 datasets, the within-cluster PPMI average is much higher than the between-cluster PPMI average, indicating that RBDCo makes effective use of the PPMI regularization information given as input throughout its alternating iterations in a way that ultimately favors the grouping of semantically related words. The average pairwise PPMI between RBDCo word clusters is even lower than the average pairwise PPMI for random groups of words (\(\mathrm{PPMI}_{random} \sim 0.2\)), indicating a strong separation between clusters. Word clusters with a less specific vocabulary have a lower PPMI average (e.g., Migraine from PUBMED5: Table 5). Although informative, these average PPMI evaluations on the most frequent words cannot properly assess the quality of the word clusters. One drawback of the PPMI value is that it is unbounded. Furthermore, word frequency might not be the most appropriate ranking score for identifying representative words. As an example, the terms wa and group can be found among the most frequent terms in the AMD word cluster for PUBMED5. In addition, word clusters might also contain different subtopics that are only indirectly related to each other. Considering the average PPMI over all pairs of words therefore lowers the global score and leads to the spurious conclusion that the word cluster is not coherent.

NPMI-score assessment We propose evaluating the word cluster coherence using the Normalized PMI (NPMI) via a k-nn-like (k nearest-neighbor) approach. The NPMI ranges between \(-1\) and \(+1\) and is formally defined as \(\mathrm{NPMI}(w_i, w_j) =\mathrm{PMI}(w_i, w_j)/\left( -\log p(w_i, w_j)\right) .\) For each word, we propose computing an \(\mathrm{NPMI}_i\) score defined as

$$\begin{aligned} \mathrm{NPMI}_i=\frac{1}{|\varOmega _i|}\sum \limits _{w_j \in \varOmega _i}\mathrm{NPMI}(w_i,w_j) \end{aligned}$$
(16)

where \(\varOmega _i\) is the set of k words \(w_j\) having the highest NPMI score with \(w_i\). \(\mathrm{NPMI}_i\) quantifies the degree to which a word belongs to a cluster, based on its relationships with its k closest NPMI neighbors. \(\mathrm{NPMI}_i\) scores can be computed within or between word clusters. Table 6 gives the top \(\mathrm{NPMI}_i\) words for PUBMED5 word clusters (\(k=5\) neighbors among the 30 most frequent terms). Probabilities are derived from the whole English Wikipedia, using an NPMI implementation proposed by Röder et al. (2015). The \(\mathrm{NPMI}_i\) scores are therefore independent of the input document-term data and regularization matrices. The top \(\mathrm{NPMI}_i\) terms contain new meaningful words rather than simply the most frequent terms. It can be seen that triptan, which is a drug specific to migraine, is in third position in the Migraine cluster. The top Hay Fever terms now include immunotherapy, a seasonal allergy treatment, and oesinophil, a marker in seasonal allergic rhinitis. The word pneumonia can be found in the top Otitis terms, reflecting the fact that Streptococcus pneumoniae is the most common microbial agent found in otitis. Finally, the top AMD terms include diabet, an AMD risk factor, edema, a symptom of macular degeneration, and inject, which corresponds to intravitreal injection, a treatment for AMD (intravitr is found in 14th position with \(\mathrm{NPMI}_i=0.21\)).
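A minimal sketch of the \(\mathrm{NPMI}_i\) computation (Eq. 16) is shown below; `npmi` is assumed to be a precomputed symmetric matrix of pairwise NPMI values for the words of one cluster (estimated, as in the text, from an external corpus such as Wikipedia), and k is the neighborhood size.

```python
import numpy as np

def npmi_i_scores(npmi, k=5):
    """NPMI_i of Eq. (16): mean NPMI of each word with its k highest-NPMI neighbors."""
    N = npmi.astype(float).copy()
    np.fill_diagonal(N, -np.inf)             # exclude self-pairs
    top_k = np.sort(N, axis=1)[:, -k:]       # k largest NPMI values per word
    return top_k.mean(axis=1)
```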

Figure 4 gives more insight into the relationships between the top \(\mathrm{NPMI}_i\) words in the case of AMD. The color of the vertices reflects the \(\mathrm{NPMI}_i\) word score, with warmer colors corresponding to higher scores, and the thickness of the edges represents the strength of the pairwise \(\mathrm{NPMI}\) coefficient. The most important word in terms of \(\mathrm{NPMI}_i\) is macular. Directly related to this word are words that generally define the disease, namely amd and degen for Age-Related Macular Degeneration, and neovascular for advanced neovascular AMD, which is a serious type of AMD. In the upper part of the graph, we have eye-related vocabulary (e.g., acuit(i)y, visual). On the right we see the words risk, factor, and associ, all of which are linked to diabet, an AMD risk factor. The intravitr inject bigram is directly related to edema and diabet. This makes sense, given that intravitreal injection is a treatment for diabetic macular edema. All in all, our results based on the \(\mathrm{NPMI}_i\) score and \(\mathrm{NPMI}\) coefficient indicate that the word clusters obtained with RBDCo are highly coherent (see Sect. 6.4.3 for the remaining word clusters on PUBMED5).

Fig. 4 \(\mathrm{NPMI}\) graph of the RBDCo AMD word cluster on PUBMED5

Figure 5 shows the average of the top ten \(\mathrm{NPMI}_i\) terms within and between PUBMED5 word clusters obtained using RBDCo. The within average \(\mathrm{NPMI}_i\) score is seen to be higher than the between average \(\mathrm{NPMI}_i\) score. The cluster with the strongest \(\mathrm{NPMI}_i\) with other clusters is the Migraine word cluster, for which the top terms are related to the topic but not specific to it.

Fig. 5 Mean top \(\mathrm{NPMI}_i\) for RBDCo clusters on PUBMED5

6.4.3 RBDCo word cluster graphs on PUBMED5

We now discuss the coherence of the word cluster graphs obtained with \(\mathrm{RBDCo}_{\left[ S_c,S_r\right] }\) on PUBMED5. The graphs are constructed from the pairwise \(\mathrm{NPMI}\) coefficients between the terms that have the highest individual \(\mathrm{NPMI}_i\) scores (Eq. 16). The color of the vertices reflects the \(\mathrm{NPMI}_i\) word score, with warmer colors corresponding to higher scores, and the thickness of the edges represents the strength of the pairwise \(\mathrm{NPMI}\) value.

Otitis (Fig. 6). We observe that otitis is logically related to ear and media, with otitis media (OM) being a group of inflammatory diseases of the middle ear. The two main types are acute otitis media and otitis media with effusion. Our graph relates acute and effusion to otitis. Other words generally used to characterize otitis can also be found in the graph, such as recurrent and chronic. Furthermore, S.pneumoniae and H.influenzae are the most common causes of OM. S.pneumoniae is also the main cause of recurrent infections and postinfectious complications. This is fully coherent with the proposed graphical representation.

Fig. 6 \(\mathrm{NPMI}\) graph of the RBDCo Otitis word cluster on PUBMED5

Migraine (Fig. 7). The most important word in this graph with respect to its \(\mathrm{NPMI}_i\) score is migraine, which is strongly related to headache and aura. Aura is a neurological phenomenon that can accompany migraine, manifesting itself in the form of visual, sensory, and motor disturbances. As for triptan, this refers to a family of drugs that have been clinically assessed as being effective in the treatment of pain. We can find all these terms related in our graph, in addition to sumatriptan, a common migraine medication. General headache qualifiers (e.g., severe, pain) are also present.

Fig. 7 \(\mathrm{NPMI}\) graph of the RBDCo Migraine word cluster on PUBMED5

Kidney Stones (Fig. 8). The kidnei(y) stones or renal calculi graph is enriched with urinary-tract-related vocabulary, such as ureter, urine, uric and urinari(y). The graph also contains common bigrams including urinari tract and uric acid. The kidney has a clearance function that requires the excretion (excret) of certain metabolites (metabol) by our organism. In addition, the definition of kidney stones appears on the right: they are solid masses made of crystals, such as calcium oxalate stones. Finally, we note the presence of shock wave lithotripsi(y), the most common treatment for kidney stones.

Fig. 8 \(\mathrm{NPMI}\) graph of the RBDCo Kidney Stones word cluster on PUBMED5

Hay fever (Fig. 9). Hay fever, also known as allergic rhinitis, is a seasonal allergi(y) caused in large part by pollen. These terms are related in the graph, with pollen linked to allergen. The words intranasal and nasal refer to common hay fever medications (e.g., corticosteroid nasal spray, intranasal antihistamine). The term immunotherapi(y), also to be seen in the graph, is a treatment for hay fever involving a desensitization through doses of certain allergens (e.g., grass and pollen). Our graphical representation also contains oesinophil and cell, which are linked. In fact, oesinophil is a specialized cell within the immune system that helps promote inflammation and plays a key role in the symptoms of asthma and allergies. Interestingly, the terms allergi, rhiniti and asthma are linked to symptom. In fact, the asthma and allergic rhinitis symptoms are so close that people with asthma may not recognize that they also have allergic rhinitis.

Fig. 9 \(\mathrm{NPMI}\) graph of the RBDCo Hay Fever word cluster on PUBMED5

7 Conclusion

We proposed a flexible general framework, RBDCo, for text data matrix co-clustering. RBDCo derives from a von Mises–Fisher model-based co-clustering suitable for data that is high-dimensional, sparse, and unbalanced. Specifically, we defined our model with a matrix formulation suitable for the incorporation of complementary information to improve the co-clustering. Under some constraints, this formulation bears a close relationship to the well-known Spherical k-means, NMF and NMTF. Our approach utilizes the directional nature of text data and outperforms existing methods by using bi-directional multiplicative regularizations to incorporate side information on the document and word dimensions. Our experiments demonstrate the good performance of RBDCo, its robustness to random initialization, and its capabilities in terms of co-clustering quality and interpretability. Although all versions of RBDCo give good results, our experiments suggest that RBDCo\(_{[{\mathbf {S}}_c,{\mathbf {S}}_r]}\) should be preferred. The proposed bi-directional regularization may also be seen as a means of performing semi-supervised co-clustering. The regularization matrices might contain side expert information on the data to be partitioned, thus allowing a co-clustering that does not depend only on the input data.

In our study, we assumed that the number of co-clusters is known. In practice, the number of clusters is often unknown and needs to be determined by the user. Assessing the number of clusters is, however, not straightforward and remains one of the biggest challenges in co-clustering. Unfortunately, in our approach we cannot rely on the well-established statistical theory of model selection, since our algorithm is not based on the maximization of the likelihood or, more precisely, of the complete-data likelihood. However, based on the vMF mixture model designed for co-clustering, Salah and Nadif (2019) showed that the Akaike information criterion (AIC) (Akaike 1998) and AIC3 (Bozdogan 2000) are effective. They also showed that these criteria give better results than the versions of the Bayesian information criterion (BIC) (Schwarz 1978) and the integrated classification likelihood (ICL) (Keribin et al. 2015) derived from latent block models (Govaert and Nadif 2003, 2005). As Salah and Nadif pointed out in their paper, this is due to the effective number of free parameters in the vMF mixture model. Inspired by this result, we intend to address this issue in future work.

Going further, we also plan to measure the impact of each matrix \({\mathbf {S}}_r\) and \({\mathbf {S}}_c\) on the objective function by assigning a different weight to each of the two matrices.