1 Introduction

In the era of big data, there is a greater need than ever for techniques that simultaneously group the objects and the features of a data set, making large data sets easier to handle and interpret. Co-clustering techniques serve exactly this purpose [1]. Several co-clustering methods have been proposed in the past decades and applied to many fields, such as collaborative filtering recommendation [2, 3], biological science [4, 5], multimedia data analysis [6, 7], and natural language processing [1, 8].

With the rapid development of data science, an increasing amount of high-order heterogeneous data has been generated, containing multiple types of objects and the relationships between them [9]. Such high-order heterogeneous data provide richer information than homogeneous data [10], and properly combining the information from the multiple object types can improve clustering performance [11]. Hence, it is essential to develop high-order co-clustering models to solve the co-clustering problem on high-order heterogeneous data [12]. Researchers have developed many high-order co-clustering methods based on different theories, such as graph partitioning [13], information theory [14], matrix factorization [15], and so on.

Although these methods differ in the techniques they employ for co-clustering high-order heterogeneous data, they all simultaneously generate a flat partitioning of every type of object data, so the relationships between the clusters cannot be captured. Motivated by this observation, this paper proposes a Hierarchical High-order Co-Clustering algorithm by maximizing Modularity (MHCoC). MHCoC adopts modularity in the co-clustering context as its quality measure, and an efficient iterative algorithm is designed to optimize the objective function. In each iteration, MHCoC uses a top-down strategy to split a co-cluster, and finally a tree-like clustering result is obtained for each type of object data. Experiments show that the method is effective and outperforms comparable methods.

The main contributions of this paper are summarized as follows:

(1). We present a modularity-based co-clustering objective function that effectively merges the information contained in high-order heterogeneous data.

(2). An alternating iterative method is designed to optimize the objective function in an interplay manner, simultaneously generating a tree-like clustering result for each type of object data.

(3). Extensive experiments have been conducted on several real-world datasets, demonstrating that the proposed algorithm outperforms existing high-order co-clustering algorithms.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 briefly introduces modularity in the co-clustering context and presents co-clustering on star-structured high-order heterogeneous data. The technical details of our approach are provided in Sect. 4. In Sect. 5, we present and analyze the results of a comprehensive set of experiments. Finally, Sect. 6 concludes the paper.

2 Related work

Before presenting our proposed method, we review its background, including co-clustering, high-order clustering, and hierarchical clustering. We mainly discuss the approaches most relevant to our work, and then briefly explore other techniques used in it.

2.1 Co-clustering

Co-clustering, also called bi-clustering or block clustering, aims to cluster both rows and columns of a data matrix simultaneously. Compared with traditional one-sided clustering, co-clustering algorithms show more powerful performance [16], and they have become one of the important approaches to clustering high-dimensional sparse data [17]. Co-clustering can be used for an extensive range of applications [18]. Some notable works are summarized as follows. Dhillon [19] poses the clustering problem as a graph partitioning problem and gives a spectral algorithm based on the normalized cut that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield a good bi-partitioning. Another notable co-clustering algorithm is the block diagonal clustering algorithm described in [20], which produces a block diagonal matrix of the given binary document-term matrix by minimizing the squared error between the original data and its approximation. More recently, another attempt to use network-related criteria for block diagonal structure is the spectral co-clustering method that maximizes a generalization of modularity [21]; compared with methods based on other theories, such as nonnegative matrix factorization or tri-factorization, this algorithm appears to perform better. Although such algorithms optimize criteria originally used in the study of networks, they do so through a spectral relaxation of the discrete optimization problem. Ailem et al. [1] develop an algorithm that obtains even better co-clusters by directly maximizing modularity with an iterative alternating optimization procedure.

Inspired by the above work, and in order to take full advantage of modularity-based co-clustering for high-order heterogeneous data, we develop our approach by extending modularity to the high-order clustering task.

2.2 High-order co-clustering

Although many clustering methods have been developed, they cannot be directly applied to cluster high-order heterogeneous data. Properly combining information from high-order heterogeneous data can improve learning performance, which has led to the emergence of high-order clustering techniques [11]. High-order clustering has drawn a large amount of attention in different research fields, such as recommender systems [22], social network mining [23], natural language processing [24], biomedicine [25] and multimedia analysis [26]. In recent years it has been studied extensively, and researchers have designed many high-order co-clustering methods based on different theories. Graph partitioning-based methods [27] take the samples and the different feature spaces as nodes and the data values as edges, and cast clustering as an optimal graph partitioning problem. Information theory-based methods [28] perform clustering by maximizing the mutual information between the object-cluster and feature-cluster random variables. Authority-based clustering algorithms [29] derive an authoritative ranking of samples and features and apply it to clustering. Correlation measure-based methods [9] use predictive ability to estimate the correlation between objects and features, and partition strongly correlated objects and features into the same cluster. Fuzzy-based methods maximize the degree of aggregation inside each cluster, and the membership degrees of objects and features are solved iteratively [30]. Matrix factorization-based methods reduce each high-dimensional feature space to a low-dimensional space [31] for clustering.

These approaches, although based on different theories, do not involve network-related measures. Moreover, they all generate a flat partitioning of every type of object data, whereas our method is a hierarchical high-order co-clustering method.

2.3 Hierarchical clustering

Top-down hierarchical clustering is a recursive partitioning of a dataset into successively smaller clusters [32]. It does not require a fixed number of clusters and provides richer information at all levels of granularity, displayed simultaneously in an intuitive form; importantly, there are many fast and easy-to-implement algorithms commonly used in practice to find a tree-shaped cluster structure [33]. Several hierarchical clustering methods have been proposed and applied to many fields, including image and text classification [34], anomaly detection [35], bioinformatics [36], person re-identification [37] and more. In summary, a taxonomy structure can be more beneficial than a flat partition in many real applications [8].

In order to obtain tree-like hierarchical clustering results and reveal the relationship between clusters, MHCoC adopts a top-down strategy to perform a greedy splitting process.

2.4 Summary

By integrating co-clustering, high-order clustering and hierarchical clustering, this paper proposes a Hierarchical High-order Co-Clustering algorithm by maximizing Modularity (MHCoC). MHCoC merges the information of high-order heterogeneous data and produces a tree-like partition for each type of object data.

3 Preliminaries

In this section, we first review modularity in the co-clustering context and then present star-structured high-order heterogeneous data.

3.1 Modularity for co-clustering

Modularity has been widely used in many fields and has shown outstanding effectiveness. It is a commonly used measure of the quality of community detection, and it can also be used as a quality criterion in graph clustering. By rewriting its formula, modularity can serve as a quality measure for block diagonal co-clustering, in which objects and features have the same number of clusters [1]. In this way, while obtaining the clustering results of the objects, the feature clusters that characterize them can be discovered at the same time.

Let A ∊ Rm×n be a two-dimensional contingency table or co-occurrence matrix, whose rows and columns correspond to the object set O = {o1,o2,…,oi,…,om} and the feature set F = {f1,f2,…,fj,…,fn}, respectively. The co-clustering problem is to simultaneously partition the object set O and the feature set F into k object clusters {r1,r2,…,rl,…,rk} and k feature clusters {c1,c2,…,cl,…,ck}, where 2 ≤ k ≪ min{m,n}. For this, we define the object index matrix U ∊ {0,1}m×k with entries uil = 1 if object i belongs to object cluster l and uil = 0 otherwise. The feature index matrix V ∊ {0,1}n×k is defined similarly to U. The relationship matrix is defined as C = UVT, whose entry cij = 1 when object i and feature j belong to an object cluster and a feature cluster with the same index, and cij = 0 otherwise, i.e. \(c_{ij} = \sum\nolimits_{l = 1}^{k} {u_{il} v_{jl} }\). The modularity can then be formulated as follows in the co-clustering context:

$$ Q({\mathbf{A}},{\mathbf{C}}) = Q({\mathbf{A}},{\mathbf{UV}}^{\mathrm{ T}} ) = \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{n} {\sum\limits_{l}^{k} {\left( {a_{ij} - \frac{{a_{i \cdot } a_{ \cdot j} }}{{a_{ \cdot \cdot } }}} \right)} } } u_{il} v_{jl} $$
(1)

where \(a_{ \cdot \cdot } = \sum\nolimits_{i,j} {a_{ij} }\), \(a_{i \cdot } = \sum\nolimits_{j} {a_{ij} }\) and \(a_{ \cdot j} = \sum\nolimits_{i} {a_{ij} }\). For convenience, we define the matrix X ∊ Rm×n with elements \(x_{ij} = \frac{{a_{i \cdot } a_{ \cdot j} }}{{a_{ \cdot \cdot } }}\); the modularity can then also be expressed in the following form:

$$ Q({\mathbf{A}},{\mathbf{C}}) = \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}} - {\mathbf{X}}} \right)^{\mathrm{ T}} {\mathbf{UV}}^{{\text{T}}} } \right] $$
(2)
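To make Eq. (2) concrete, the following minimal NumPy sketch (the function name is ours, not part of the paper) evaluates the modularity for a given co-occurrence matrix and a pair of binary index matrices:

```python
import numpy as np

def coclustering_modularity(A, U, V):
    """Modularity Q(A, UV^T) of Eq. (2) for block diagonal co-clustering.

    A: (m, n) nonnegative co-occurrence matrix.
    U: (m, k) binary object index matrix (one 1 per row).
    V: (n, k) binary feature index matrix (one 1 per row).
    """
    a_tot = A.sum()
    row = A.sum(axis=1, keepdims=True)   # a_{i.}
    col = A.sum(axis=0, keepdims=True)   # a_{.j}
    X = row @ col / a_tot                # independence term x_{ij}
    # Q = Trace[(A - X)^T U V^T] / a_..
    return float(np.trace((A - X).T @ U @ V.T) / a_tot)
```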

3.2 Star-structured high-order heterogeneous data

High-order heterogeneous data is a data structure containing multiple types of object data. In this structure, there is a central type of object that connects the other types. Take an academic paper system as an example: papers are the central data type, while the other data types can be seen as feature spaces used to represent papers. The goal is then to identify groups of authors who usually write papers on a certain series of topics, with a certain list of terms, and submit them to a certain kind of conference [14].

Figure 1 shows the high-order heterogeneous data \(A = \cup {\kern 1pt} {\mathbf{A}}^{(s)}\) (s = 1,2,…,N), which is composed of the central object set O = {o1,o2,…,om} and its multiple feature spaces \(F^{(1)} = \left\{ {f_{1}^{(1)} , \, f_{2}^{(1)} ,...,f_{{n^{(1)} }}^{(1)} } \right\}\), \(F^{(2)} = \left\{ {f_{1}^{(2)} , \, f_{2}^{(2)} ,...,f_{{n^{(2)} }}^{(2)} } \right\}\), …, \(F^{(N)} = \left\{ {f_{1}^{(N)} , \, f_{2}^{(N)} ,...,f_{{n^{(N)} }}^{(N)} } \right\}\), where \(f_{j}^{(s)}\) is the j-th feature in the s-th feature space \(F^{(s)}\), and \(a_{ij}^{(s)}\) is the value of object oi on feature \(f_{j}^{(s)}\).

Fig. 1 Star-structured high-order heterogeneous data
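As a concrete (hypothetical) illustration, such star-structured data can simply be represented as a list of co-occurrence matrices that share their row dimension; the dimensions and values below are illustrative only:

```python
import numpy as np

# Hypothetical star-structured data: m = 4 central objects (e.g. papers) described
# in three feature spaces (e.g. terms, authors, venues).
m = 4
rng = np.random.default_rng(0)
A1 = rng.integers(0, 3, size=(m, 6))   # object x term co-occurrence counts
A2 = rng.integers(0, 2, size=(m, 3))   # object x author incidence
A3 = rng.integers(0, 2, size=(m, 2))   # object x venue incidence
A_list = [A1, A2, A3]                  # A = union of A^(s); matrices share their rows
```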

3.3 Problem definition

The goal of the MHCoC algorithm is to obtain the hierarchical clustering result R = {R1,R2,…,Rh,…,RH} of the central data O and the hierarchical clustering results \(C^{(s)} = \left\{ {C_{1}^{(s)} ,C_{2}^{(s)} , \ldots ,C_{h}^{(s)} , \ldots C_{H}^{(s)} } \right\}\) (s = 1,2,…,N) of the feature spaces F(s) (s = 1,2,…,N). Rh is the clustering result of O at the h-th level of the hierarchy, a set of h sample clusters, that is, Rh = {rh1,rh2,…,rhh}, and \(C_{h}^{(s)} = \left\{ {c_{h1}^{(s)} ,c_{h2}^{(s)} , \ldots ,c_{hh}^{(s)} } \right\}\) is the h-th level clustering result of F(s). The co-cluster formed by rhl and \(c_{hl}^{(s)}\) is denoted \(rc_{hl}^{(s)}\), and \(RC_{hl} = \{ rc_{hl}^{(1)} ,rc_{hl}^{(2)} , \ldots ,rc_{hl}^{(N)} \}\) is the set of the \(rc_{hl}^{(s)}\) over all feature spaces.

4 MHCoC algorithm

In this section, we first propose the modularity-based objective function and then develop a greedy divisive algorithm to optimize the objective function.

4.1 Objective function of MHCoC

The MHCoC algorithm adopts modularity in the co-clustering context as its quality measure. However, if we co-cluster every feature space independently, the clustering results obtained for the central type of data will very likely differ across feature spaces; in other words, the individual clustering results of the central data will not match in most cases. What we actually seek is a set of co-clusterings, each of which need not be locally optimal, whose clustering results on the central type coincide and whose overall partitioning is globally optimal under a common objective function. Therefore, the modularity-based objective function of MHCoC is defined as follows:

$$ \begin{gathered} Q(A,C) = \sum\limits_{s} {Q({\mathbf{A}}^{\left( s \right)} ,{\mathbf{UV}}^{{\left( s \right)}{\text{T}}} )} \\ { = }\sum\limits_{s = 1}^{N} {\frac{1}{{a_{ \cdot \cdot }^{\left( s \right)} }}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{{n^{\left( s \right)} }} {\sum\limits_{l = 1}^{k} {\left( {a_{ij}^{\left( s \right)} - \frac{{a_{i \cdot }^{\left( s \right)} a_{ \cdot j}^{\left( s \right)} }}{{a_{ \cdot \cdot }^{\left( s \right)} }}} \right)} } } } u_{il} v_{jl}^{\left( s \right)} \\ \end{gathered} $$
(3)

where \(A = \cup {\kern 1pt} {\mathbf{A}}^{(s)}\) (s = 1,2,…,N) is the star-structured high-order heterogeneous data, and A(s) is the co-occurrence matrix of O and F(s). \(a_{ \cdot \cdot }^{\left( s \right)} = \sum\nolimits_{i,j} {a_{ij}^{\left( s \right)} }\) is the sum of all entries of A(s), \(a_{i \cdot }^{\left( s \right)} = \sum\nolimits_{j} {a_{ij}^{\left( s \right)} }\) is the sum of the entries in its i-th row, and \(a_{ \cdot j}^{\left( s \right)} = \sum\nolimits_{i} {a_{ij}^{\left( s \right)} }\) is the sum of the entries in its j-th column. C(s) = UV(s)T is the relationship matrix, where U is the sample index matrix and V(s) is the feature index matrix of the s-th feature space. We further define a matrix X(s) ∊ Rm×n(s) with entries \(x_{ij}^{\left( s \right)} = \frac{{a_{i \cdot }^{\left( s \right)} a_{ \cdot j}^{\left( s \right)} }}{{a_{ \cdot \cdot }^{\left( s \right)} }}\). The objective function can also be expressed in the following form:

$$ \begin{gathered} Q(A,C) = \sum\limits_{s} {Q({\mathbf{A}}^{\left( s \right)} ,{\mathbf{C}}^{\left( s \right)} )} \\ = \sum\limits_{s} {\frac{1}{{a^{\left( s \right)}_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}}^{\left( s \right)} - {\mathbf{X}}^{\left( s \right)} } \right)^{\mathrm{ T}} {\mathbf{UV}}^{{\left( s\right)}{\mathrm{ T}} }} \right]} \\ \end{gathered} $$
(4)
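A minimal sketch of this objective (our illustration, not the authors' implementation) sums the per-feature-space modularities computed with a shared object index matrix U:

```python
import numpy as np

def high_order_modularity(A_list, U, V_list):
    """Objective of Eq. (4): sum over feature spaces of
    Trace[(A^(s) - X^(s))^T U V^(s)T] / a^(s)_.. with a shared U."""
    total = 0.0
    for A, V in zip(A_list, V_list):
        a_tot = A.sum()
        X = np.outer(A.sum(axis=1), A.sum(axis=0)) / a_tot  # x_{ij} = a_{i.} a_{.j} / a_..
        total += np.trace((A - X).T @ U @ V.T) / a_tot
    return total
```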

4.2 MHCoC algorithm

MHCoC uses a top-down strategy to perform hierarchical co-clustering while optimizing the objective function. For ease of understanding, we use an example to describe the MHCoC process. Figure 2 shows a toy example containing the central data type and three feature spaces. MHCoC first takes O as the unique cluster r11 of the first-level sample clustering result R1, and takes each F(s) as the unique cluster \(c_{11}^{(s)}\) of the first-level feature clustering result \(C_{1}^{(s)}\). MCC is then used to process \(RC_{11} = \left\{ {rc_{11}^{(s)} \left| {s = 1,2,3} \right.} \right\}\); that is, the co-cluster is split into two sub-co-clusters in each of the three feature spaces, and this split maximizes the modularity increase ΔQ(A11,C11). Accordingly, the second-level sample clustering result R2 = {r21,r22} and the second-level feature clustering results \(C_{2}^{(s)} = \left\{ {c_{21}^{(s)} ,c_{22}^{(s)} } \right\}\) (s = 1,2,3) are generated. Then, for the two co-cluster sets RC21 and RC22 of the second level, MCC is used to find the optimal partition and compute the corresponding modularity increase. Assuming that the modularity increase of RC21 is larger, each co-cluster in RC21 is split into two sub-co-clusters, and the other co-clusters are retained. The third-level co-clustering results R3 = {r31,r32,r33} and \(C_{3}^{(s)} = \left\{ {c_{31}^{(s)} ,c_{32}^{(s)} ,c_{33}^{(s)} } \right\}\) (s = 1,2,3) are thus generated, and the layer-by-layer splitting continues with this top-down strategy. Finally, the MHCoC algorithm returns R = {R1,R2,…,Rh,…,RH} and \(C^{(s)} = \left\{ {C_{1}^{(s)} ,C_{2}^{(s)} , \ldots ,C_{h}^{(s)} , \ldots C_{H}^{(s)} } \right\}\) (s = 1,2,…,N), as shown in Fig. 3. The algorithm is described in more detail in Algorithm 1.

Algorithm 1 MHCoC
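A minimal Python sketch of this top-down strategy is given below (our own illustration, not the authors' implementation). The helper `mcc_split` stands for the MCC routine of Sect. 4.3 and is assumed to return the modularity gain of the best binary split of a leaf co-cluster together with the two resulting sub-co-clusters; the data structures are a simplification chosen for readability.

```python
def mhcoc(A_list, H):
    """Top-down MHCoC sketch: repeatedly split the leaf co-cluster whose MCC
    split gives the largest modularity increase.

    A_list : co-occurrence matrices A^(1..N) sharing their rows (the objects).
    H      : number of hierarchy levels to produce.
    Returns the list of leaf co-clusters after every split (the tree R, C^(s)).
    """
    m = A_list[0].shape[0]
    # one initial co-cluster holding all objects and, per feature space, all features
    leaves = [(list(range(m)), [list(range(A.shape[1])) for A in A_list])]
    history = [list(leaves)]
    for _ in range(H - 1):
        # candidate MCC split of every current leaf co-cluster
        candidates = [(mcc_split(A_list, rows, cols), idx)
                      for idx, (rows, cols) in enumerate(leaves)]
        (gain, left, right), best = max(candidates, key=lambda c: c[0][0])
        if gain <= 0:
            break                                # no split increases modularity
        leaves[best:best + 1] = [left, right]    # replace the leaf by its children
        history.append(list(leaves))
    return history
```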
Fig. 2 MHCoC algorithm process example

Fig. 3 Hierarchical clustering results

4.3 MCC algorithm

For a given data matrix A and relationship matrix C, the following two propositions hold:

Proposition 1

$$ \begin{gathered} Q({\mathbf{A}},{\mathbf{C}}) = \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}} - {\mathbf{X}}} \right)^{{\text{T}}} {\mathbf{UV}}^{{\text{T}}} } \right] \\ = \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}}^{{\mathbf{V}}} - {\mathbf{X}}^{{\mathbf{V}}} } \right)^{{\text{T}}} {\mathbf{U}}} \right]{ = }Q({\mathbf{A}}^{{\mathbf{V}}} ,{\mathbf{U}}) \\ \end{gathered} $$

The matrices AV and XV are computed from the matrices A and X according to the column index matrix V:

$$ {\mathbf{A}}^{{\mathbf{V}}} : = \left\{ {a_{il}^{{\mathbf{V}}} = \sum\nolimits_{j = 1}^{n} {v_{jl} a_{ij} } ;i = 1, \ldots ,m;l = 1, \ldots ,k} \right\}, $$
$$ {\mathbf{X}}^{{\mathbf{V}}} : = \left\{ {x_{il}^{{\mathbf{V}}} = \frac{{a_{i \cdot } a_{ \cdot l}^{{\mathbf{V}}} }}{{a_{ \cdot \cdot } }};i = 1, \ldots ,m;l = 1, \ldots ,k} \right\}, $$

where \(a_{ \cdot l}^{{\mathbf{V}}} { = }\sum\nolimits_{j = 1}^{n} {v_{jl} a_{ \cdot j} }\).
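As a quick illustration of Proposition 1 (a sketch under the paper's notation; the helper name is ours), A^V and X^V reduce to simple matrix products, so Q(A, UV^T) can be evaluated from m × k reduced matrices once V is fixed:

```python
import numpy as np

def reduce_by_columns(A, V):
    """A^V and X^V of Proposition 1: aggregate the columns of A according to the
    binary column index matrix V."""
    a_tot = A.sum()
    A_V = A @ V                                              # a^V_{il} = sum_j v_{jl} a_{ij}
    X_V = np.outer(A.sum(axis=1), A_V.sum(axis=0)) / a_tot   # x^V_{il} = a_{i.} a^V_{.l} / a_..
    return A_V, X_V
```

With these, Q(A, UV^T) = Trace[(A^V − X^V)^T U] / a··, so the row assignments can be updated without touching all n original columns.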

Proof

$$ \begin{aligned} Q({\mathbf{A}},{\mathbf{C}}) &= \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}} - {\mathbf{X}}} \right)^{\mathrm{T}} {\mathbf{UV}}^{\mathrm{T}} } \right] \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{n} {\sum\limits_{l = 1}^{k} {\left( {a_{ij} - \frac{{a_{i \cdot } a_{ \cdot j} }}{{a_{ \cdot \cdot } }}} \right)} } } u_{il} v_{jl} \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{i = 1}^{m} {\sum\limits_{l = 1}^{k} {u_{il} \sum\limits_{j = 1}^{n} {v_{jl} \left( {a_{ij} - \frac{{a_{i \cdot } a_{ \cdot j} }}{{a_{ \cdot \cdot } }}} \right)} } } \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{i = 1}^{m} {\sum\limits_{l = 1}^{k} {u_{il} \left( {\sum\limits_{j = 1}^{n} {v_{jl} a_{ij} } - \frac{{a_{i \cdot } }}{{a_{ \cdot \cdot } }}\sum\limits_{j = 1}^{n} {v_{jl} a_{ \cdot j} } } \right)} } \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{i = 1}^{m} {\sum\limits_{l = 1}^{k} {\left( {a_{il}^{{\mathbf{V}}} - \frac{{a_{i \cdot } a_{ \cdot l}^{{\mathbf{V}}} }}{{a_{ \cdot \cdot } }}} \right)u_{il} } } \\ &= \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}}^{{\mathbf{V}}} - {\mathbf{X}}^{{\mathbf{V}}} } \right)^{\mathrm{T}} {\mathbf{U}}} \right] = Q({\mathbf{A}}^{{\mathbf{V}}} ,{\mathbf{U}}) \end{aligned} $$

Proposition 2

$$ \begin{gathered} Q({\mathbf{A}},{\mathbf{C}}) = \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}} - {\mathbf{X}}} \right)^{\mathrm{ T}} {\mathbf{UV}}^{\mathrm{ T}} } \right] \hfill \\ = \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}}^{{\mathbf{U}}} - {\mathbf{X}}^{{\mathbf{U}}} } \right)^{\mathrm{ T}} {\mathbf{V}}} \right] = Q({\mathbf{A}}^{{\mathbf{U}}} ,{\mathbf{V}}). \hfill \\ \end{gathered} $$

The matrices AU and XU are computed from the matrices A and X according to the row index matrix U:

$$ {\mathbf{A}}^{{\mathbf{U}}} : = \left\{ {a_{jl}^{{\mathbf{U}}} = \sum\nolimits_{i = 1}^{m} {u_{il} a_{ij} } ;j = 1, \ldots ,n;l = 1, \ldots ,k} \right\}, $$
$$ {\mathbf{X}}^{{\mathbf{U}}} : = \left\{ {x_{jl}^{{\mathbf{U}}} = \frac{{a_{ \cdot j} a_{ \cdot l}^{{\mathbf{U}}} }}{{a_{ \cdot \cdot } }};j = 1, \ldots ,n;l = 1, \ldots ,k} \right\}, $$

where \(a_{ \cdot l}^{{\mathbf{U}}} = \sum\nolimits_{i = 1}^{m} {u_{il} a_{i \cdot } }\).

Proof

$$ \begin{aligned} Q({\mathbf{A}},{\mathbf{C}}) &= \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}} - {\mathbf{X}}} \right)^{\mathrm{T}} {\mathbf{UV}}^{\mathrm{T}} } \right] \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{n} {\sum\limits_{l = 1}^{k} {\left( {a_{ij} - \frac{{a_{i \cdot } a_{ \cdot j} }}{{a_{ \cdot \cdot } }}} \right)} } } u_{il} v_{jl} \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{j = 1}^{n} {\sum\limits_{l = 1}^{k} {v_{jl} \sum\limits_{i = 1}^{m} {u_{il} \left( {a_{ij} - \frac{{a_{i \cdot } a_{ \cdot j} }}{{a_{ \cdot \cdot } }}} \right)} } } \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{j = 1}^{n} {\sum\limits_{l = 1}^{k} {v_{jl} \left( {\sum\limits_{i = 1}^{m} {u_{il} a_{ij} } - \frac{{a_{ \cdot j} }}{{a_{ \cdot \cdot } }}\sum\limits_{i = 1}^{m} {u_{il} a_{i \cdot } } } \right)} } \\ &= \frac{1}{{a_{ \cdot \cdot } }}\sum\limits_{j = 1}^{n} {\sum\limits_{l = 1}^{k} {\left( {a_{jl}^{{\mathbf{U}}} - \frac{{a_{ \cdot j} a_{ \cdot l}^{{\mathbf{U}}} }}{{a_{ \cdot \cdot } }}} \right)v_{jl} } } \\ &= \frac{1}{{a_{ \cdot \cdot } }}Trace\left[ {\left( {{\mathbf{A}}^{{\mathbf{U}}} - {\mathbf{X}}^{{\mathbf{U}}} } \right)^{\mathrm{T}} {\mathbf{V}}} \right] = Q({\mathbf{A}}^{{\mathbf{U}}} ,{\mathbf{V}}) \end{aligned} $$

According to the above two propositions, we can iteratively update the column index matrices according to the latest row index matrix, or update the row index matrix according to the latest column index matrices.

In each iteration, using proposition 1, the row index matrix U is first updated according to the current column index matrices V = {V(s)| s = 1,2,…,N} to maximize Q(AV,U):

$$ \begin{gathered} {\mathbf{U}} = \arg \max_{{{\mathbf{U}}*}} Q\left( {{\mathbf{A}}^{V} ,{\mathbf{U}}} \right) \hfill \\ = \arg \max_{{{\mathbf{U}}*}} \sum\limits_{s = 1}^{N} {Trace\left( {{\mathbf{A}}^{{{\mathbf{V}}(s)}} - {\mathbf{X}}^{{{\mathbf{V}}(s)}} } \right)^{\mathrm{ T}} {\mathbf{U}}*} \hfill \\ \end{gathered} $$
(5)

where the entries of the matrix AV are \(a_{il}^{V} = \sum\limits_{s = 1}^{N} {a_{il}^{{{\mathbf{V}}(s)}} }\) and the entries of the matrix XV are \(x_{il}^{V} = \sum\limits_{s = 1}^{N} {x_{il}^{{{\mathbf{V}}(s)}} }\). Then, according to the updated row index matrix U, each column index matrix V(s) (s = 1,2,…,N) is updated using Proposition 2 to maximize Q(AU(s),V(s)):

$$ \begin{gathered} {\mathbf{V}}^{(s)} = \arg \max_{{{\mathbf{V}}^{(s)*} }} Q\left( {{\mathbf{A}}^{{{\mathbf{U}}(s)}} ,{\mathbf{V}}^{(s)} } \right) \\ \quad \,\, = \arg \max_{{{\mathbf{V}}^{(s)*} }} Trace\left( {{\mathbf{A}}^{{{\mathbf{U}}(s)}} - {\mathbf{X}}^{{{\mathbf{U}}(s)}} } \right)^{{\text{T}}} {\mathbf{V}}^{(s)*} \\ \end{gathered} $$
(6)

where the entries of A^{U(s)} are \(a_{jl}^{{{\mathbf{U}}(s)}} = \sum\nolimits_{i = 1}^{m} {u_{il} a_{ij}^{(s)} }\) and the entries of X^{U(s)} are \(x_{jl}^{{{\mathbf{U}}(s)}} = \frac{{a_{ \cdot j}^{(s)} a_{ \cdot l}^{{{\mathbf{U}}(s)}} }}{{a_{ \cdot \cdot }^{(s)} }}\), computed from A(s) and U as in Proposition 2. The modularity increase ΔQ(A,C) is computed after each iteration. Since MCC processes only one co-cluster set, the global modularity increase is generated only by the split of this group of co-clusters:

$$ \begin{aligned} \Delta Q\left( {A,C} \right) &= \sum\limits_{s = 1}^{N} {\frac{1}{{a_{ \cdot \cdot }^{\left( s \right)} }}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{{n^{\left( s \right)} }} {\sum\limits_{l = 1}^{k} {\left( {a_{ij}^{\left( s \right)} - \frac{{a_{i \cdot }^{\left( s \right)} a_{ \cdot j}^{\left( s \right)} }}{{a_{ \cdot \cdot }^{\left( s \right)} }}} \right)} } } } u_{il} v_{jl}^{\left( s \right)} \\ &\quad - \sum\limits_{s = 1}^{N} {\frac{1}{{a_{ \cdot \cdot }^{\left( s \right)} }}\sum\limits_{i = 1}^{m} {\sum\limits_{j = 1}^{{n^{\left( s \right)} }} {\left( {a_{ij}^{\left( s \right)} - \frac{{a_{i \cdot }^{\left( s \right)} a_{ \cdot j}^{\left( s \right)} }}{{a_{ \cdot \cdot }^{\left( s \right)} }}} \right)} } } \end{aligned} $$
(7)

Taking RC21 in Fig. 2 as an example, the following steps illustrate the execution process of MCC:

$$ {\mathbf{A}}_{21}^{(1)} { = }\left( {\begin{array}{*{20}c} 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ \end{array} } \right)\,{\mathbf{A}}_{21}^{(2)} { = }\left( {\begin{array}{*{20}c} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ \end{array} } \right)\,{\mathbf{A}}_{21}^{(3)} { = }\left( {\begin{array}{*{20}c} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 & 0 \\ \end{array} } \right) $$

Initialization of \({\mathbf{V}}_{hl}^{(s)}\)

$$ {\mathbf{V}}_{21}^{(1)} { = }\left( {\begin{array}{*{20}c} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ \end{array} } \right)\,{\mathbf{V}}_{21}^{(2)} { = }\left( {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ \end{array} } \right)\,{\mathbf{V}}_{21}^{(3)} { = }\left( {\begin{array}{*{20}c} 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ \end{array} } \right) $$

Compute \({\mathbf{A}}_{hl}^{V} { - }{\mathbf{X}}_{hl}^{V}\) according to \({\mathbf{V}}_{hl}^{(s)}\)

$$ {\mathbf{A}}_{21}^{{{\mathbf{V}}(1)}} - {\mathbf{X}}_{21}^{{{\mathbf{V}}(1)}} { = }\left( {\begin{array}{*{20}c} {0.2} & {{ - }0.2} \\ {{ - }0.2} & {0.2} \\ {0.2} & {{ - }0.2} \\ {{ - }0.2} & {0.2} \\ \end{array} } \right)\,{\mathbf{A}}_{21}^{{{\mathbf{V}}(2)}} - {\mathbf{X}}_{21}^{{{\mathbf{V}}(2)}} { = }\left( {\begin{array}{*{20}c} 1 & {{ - }1} \\ {{ - }1} & 1 \\ 1 & {{ - }1} \\ {{ - }1} & 1 \\ \end{array} } \right) $$
$$ {\mathbf{A}}_{21}^{{{\mathbf{V}}(3)}} - {\mathbf{X}}_{21}^{{{\mathbf{V}}(3)}} { = }\left( {\begin{array}{*{20}c} {0.2} & {{ - }0.2} \\ {{ - }0.2} & {0.2} \\ {0.2} & {{ - }0.2} \\ {{ - }0.2} & {0.2} \\ \end{array} } \right){\mathbf{A}}_{21}^{V} - {\mathbf{X}}_{21}^{V} { = }\left( {\begin{array}{*{20}c} {1.4} & {{ - }1.4} \\ {{ - }1.4} & {1.4} \\ {1.4} & {{ - }1.4} \\ {{ - }1.4} & {1.4} \\ \end{array} } \right) $$

Compute Uhl

$$ {\mathbf{U}}_{21} { = }\left( {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ \end{array} } \right) $$

Compute \({\mathbf{A}}_{hl}^{{{\mathbf{U}}(s)}} { - }{\mathbf{X}}_{hl}^{{{\mathbf{U}}(s)}}\) according to Uhl

$$ {\mathbf{A}}_{21}^{{{\mathbf{U}}(1)}} - {\mathbf{X}}_{21}^{{{\mathbf{U}}(1)}} { = }\left( {\begin{array}{*{20}c} {0.8} & {{ - }0.8} \\ {{ - }1.2} & {1.2} \\ {0.8} & {{ - }0.8} \\ {0.8} & {{ - }0.8} \\ {{ - }1.2} & {1.2} \\ \end{array} } \right) $$
$$ {\mathbf{A}}_{21}^{{{\mathbf{U}}(2)}} - {\mathbf{X}}_{21}^{{{\mathbf{U}}(2)}} { = }\left( {\begin{array}{*{20}c} 1 & {{ - }1} \\ {{ - }1} & 1 \\ 1 & {{ - }1} \\ {{ - }1} & 1 \\ \end{array} } \right) $$
$$ {\mathbf{A}}_{21}^{{{\mathbf{U}}(3)}} - {\mathbf{X}}_{21}^{{{\mathbf{U}}(3)}} { = }\left( {\begin{array}{*{20}c} {1.2} & {{ - }1.2} \\ {{ - 0}{\text{.8}}} & {0.8} \\ {{ - 0}{\text{.8}}} & {0.8} \\ {{ - 0}{\text{.8}}} & {0.8} \\ {1.2} & {{ - }1.2} \\ \end{array} } \right) $$

Compute \({\mathbf{V}}_{hl}^{(s)}\)

$$ {\mathbf{V}}_{21}^{(1)} { = }\left( {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ \end{array} } \right){\mathbf{V}}_{21}^{(2)} { = }\left( {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ \end{array} } \right){\mathbf{V}}_{21}^{(3)} { = }\left( {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ \end{array} } \right) $$

Compute ΔQ(Ahl,Chl)

$$ \begin{aligned} \Delta Q\left( {A_{21} ,C_{21} } \right) &= \sum\limits_{s = 1}^{3} {\frac{1}{{a_{ \cdot \cdot }^{\left( s \right)} }}\sum\limits_{i = 1}^{4} {\sum\limits_{j = 1}^{{n^{(s)} }} {\sum\limits_{l = 1}^{2} {\left( {a_{ij}^{\left( s \right)} - \frac{{a_{i \cdot }^{\left( s \right)} a_{ \cdot j}^{\left( s \right)} }}{{a_{ \cdot \cdot }^{\left( s \right)} }}} \right)} } } u_{il} v_{jl}^{\left( s \right)} } \\ &\quad - \sum\limits_{s = 1}^{3} {\frac{1}{{a_{ \cdot \cdot }^{\left( s \right)} }}\sum\limits_{i = 1}^{4} {\sum\limits_{j = 1}^{{n^{(s)} }} {\left( {a_{ij}^{\left( s \right)} - \frac{{a_{i \cdot }^{\left( s \right)} a_{ \cdot j}^{\left( s \right)} }}{{a_{ \cdot \cdot }^{\left( s \right)} }}} \right)} } } \\ &= 1.46 \end{aligned} $$

The procedure then iterates. This alternating co-clustering is described in more detail in Algorithm 2.

Algorithm 2 MCC
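The sketch below shows one possible implementation of the alternating updates of Eqs. (5) and (6); the random initialization, the per-space 1/a·· weighting (used to stay consistent with Eq. (3)), and the convergence test are our assumptions rather than the authors' exact choices.

```python
import numpy as np

def one_hot(labels, k):
    """Binary index matrix from a label vector."""
    return np.eye(k, dtype=float)[labels]

def mcc(A_list, k=2, max_iter=50, seed=0):
    """Alternating MCC sketch: split one co-cluster set into k parts by
    maximizing the summed modularity of Eq. (3).

    A_list : co-occurrence matrices A^(1..N) restricted to the co-cluster
             being split (they share their rows).
    Returns the object index matrix U and the feature index matrices V^(s).
    """
    rng = np.random.default_rng(seed)
    m = A_list[0].shape[0]
    U = one_hot(rng.integers(k, size=m), k)
    V_list = [one_hot(rng.integers(k, size=A.shape[1]), k) for A in A_list]
    for _ in range(max_iter):
        # Eq. (5): update U from the summed, column-aggregated matrices
        S = np.zeros((m, k))
        for A, V in zip(A_list, V_list):
            a_tot = A.sum()
            A_V = A @ V
            X_V = np.outer(A.sum(axis=1), A_V.sum(axis=0)) / a_tot
            S += (A_V - X_V) / a_tot
        U_new = one_hot(S.argmax(axis=1), k)        # best cluster per object row
        # Eq. (6): update each V^(s) from the row-aggregated matrices
        V_new = []
        for A in A_list:
            a_tot = A.sum()
            A_U = A.T @ U_new
            X_U = np.outer(A.sum(axis=0), A_U.sum(axis=0)) / a_tot
            T = (A_U - X_U) / a_tot
            V_new.append(one_hot(T.argmax(axis=1), k))
        if np.array_equal(U_new, U) and all(np.array_equal(a, b)
                                            for a, b in zip(V_new, V_list)):
            break                                    # assignments stable: converged
        U, V_list = U_new, V_new
    return U, V_list
```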

5 Experiments and results analysis

In this section, we perform experiments on four real-world datasets to evaluate our proposed method. We aim to answer the following research questions:

• RQ1: How does MHCoC perform as compared with state-of-the-art homogeneous co-clustering methods?

• RQ2: Compared with existing high-order heterogeneous clustering methods, how does MHCoC perform?

• RQ3: What are the benefits of tree-like clustering results generated from MHCoC?

In the following parts, we will first present the experimental settings and then answer the above research questions one by one.

5.1 Experimental setting

5.1.1 Datasets description

To evaluate our approach comprehensively and objectively, we use four real-world datasets spanning different domains.

Reuters is a multilingual dataset consisting of 2000 samples in 5 languages. We use the subset written in English, while the other four feature spaces are the corresponding translations into four different languages.

The USPS dataset consists of features of handwritten digits; it contains 2000 examples uniformly distributed over 10 classes, and three types of features are used.

20Newsgroup contains about 20,000 news documents from 20 different newsgroups, with each newsgroup containing 1000 documents. Two types of features are used.

The Corel5k dataset contains 5000 images, each with text annotations and image segmentation. We construct a high-order heterogeneous dataset consisting of images, words and blobs. We summarize the statistics of the datasets in Table 1.

Table 1 Datasets

5.1.2 Evaluation protocols

To evaluate the clustering performance, we focus primarily on the Macro-F1 score and the NMI value, which measure how well each algorithm recovers the ground truth clusters. Specifically, given a set of algorithmic clusters C and the ground truth clusters S, the Macro-F1 score is computed by averaging the F1 score of the best match between each ground truth cluster and the algorithmic clusters:

$$ {\text{Macro}} - F_{1} = \frac{1}{\left| S \right|}\sum\limits_{s \in S} {{\text{F}}_{1} \left( s \right)} $$
(8)

The F1 score of a single ground truth cluster s is computed as the harmonic mean of P(Precision) and R(Recall):

$$ {\text{F}}_{1} \left( s \right) = \frac{2 \times P \times R}{{P + R}} $$
(9)

where Precision P and Recall R can be calculated as follows:

$$ P = \frac{TP}{{TP + FP}} $$
(10)
$$ R = \frac{TP}{{TP + FN}} $$
(11)

TP, TN, FP and FN denote True Positives (positive samples correctly judged as positive), True Negatives, False Positives and False Negatives, respectively.

The NMI value is calculated as:

$$ {\text{NMI}} = \frac{{I\left( {S,C} \right)}}{{\left( {H\left( S \right) + H\left( C \right)} \right)/2}} $$
(12)

where the numerator is the mutual information of S and C, and the denominator is the average of the entropies of S and C:

$$ I\left( {S,C} \right) = \sum\limits_{{s_{i} \in S}} {\sum\limits_{{c_{j} \in C}} {p\left( {s_{i} \cap {\text{c}}_{j} } \right)} } \log \frac{{p\left( {s_{i} \cap {\text{c}}_{j} } \right)}}{{p\left( {s_{i} } \right)p\left( {c_{j} } \right)}} $$
(13)
$$ H\left( S \right) = - \sum\limits_{{s_{i} \in S}} {p\left( {s_{i} } \right)} \log p\left( {s_{i} } \right) $$
(14)

Since the clustering algorithms involved in these experiments involve randomness, each algorithm is executed 30 times on each dataset, and the average of the 30 clustering results is used for the comparative analysis.

5.2 Performance comparison with homogeneous co-clustering methods (RQ1)

We first compare our proposed method against competing homogeneous co-clustering approaches: ITCC [38], a classical information-theoretic co-clustering algorithm that maximizes the mutual information between the clustered random variables subject to constraints on the numbers of row and column clusters; Coclus [1], a modularity-based block diagonal co-clustering algorithm that performs automatic feature reduction during co-clustering; NMTFCoS [39], a non-negative matrix tri-factorization model based on co-sparsity regularization whose objective is to minimize the loss function of the matrix tri-factorization; and DCC [17], a deep co-clustering model that jointly optimizes the parameters of a deep autoencoder and a mixture model in an end-to-end fashion on both the instance and the feature spaces.

For the above four baseline methods, all the features from each feature space are simply combined into one feature space. The input parameters of the compared methods are set as suggested by the authors in their papers or shared code. The clustering performance of each method is reported in Tables 2 and 3. We make the following observations:

  • ITCC achieves the poorest performance on all datasets. This indicates that maximizing the mutual information between the clustered random variables is insufficient to capture the complex relations among different feature spaces, which limits its performance. NMTFCoS consistently outperforms ITCC, verifying that incorporating the co-sparsity regularization improves the clustering results. Compared with ITCC and NMTFCoS, the performance of Coclus demonstrates the effectiveness of modularity for co-clustering. DCC reflects the strength of deep learning, but as a method designed for homogeneous data it still has limitations when dealing with heterogeneous data.

  • As expected, MHCoC consistently yields the best performance on all datasets. By handling the multiple feature spaces under a unified framework, MHCoC is capable of exploring the high-order heterogeneous features in an explicit way, whereas the baseline methods simply merge all the features to guide the clustering process. This verifies the importance of properly capturing heterogeneous features in the objective function.

Table 2 NMI comparison of the homogeneous co-clustering methods
Table 3 Macro-F1 comparison of the homogeneous co-clustering methods

In summary, although the four strong homogeneous co-clustering baselines are based on different theories, their performance degenerates when dealing with heterogeneous data.

5.3 Performance comparison with high-order heterogeneous clustering methods (RQ2)

To answer RQ2, we compare the proposed approach against four high-order heterogeneous clustering methods: CBGC [13], a consistent bipartite graph co-partitioning method based on semi-definite programming; CIT [14], which extends the information-theoretic co-clustering algorithm to the high-order co-clustering problem; O-NMTF [40], which employs Nonnegative Matrix Tri-Factorization (NMTF) to simultaneously cluster different types of data using the inter-type relationships and incorporates intra-type information through manifold regularization; and CoStar [9], which uses a local search approach to optimize a cross-association measure for contingency tables that evaluates the strength of the relationship between two categorical variables.

Likewise, the input parameters of the compared methods are set as suggested by the authors in their papers or shared code. For the Reuters and USPS datasets, CIT uses the first two features of each dataset. The clustering performance comparisons on the experimental datasets are shown in Tables 4 and 5. From the results, we observe the following:

  • CBGC and CIT achieve worse performance than the other baselines on all datasets. This is reasonable since they both model the high-order co-clustering problem as a consistent fusion of pair-wise co-clustering sub-problems.

  • O-NMTF achieves better results in terms of both metrics, showing that enforcing the orthogonality of the factors can lead to better results. CoStar shows similar results, as it benefits from the τ measure used to compare co-clusterings of different sizes. Furthermore, compared with the first two methods, these two algorithms demonstrate that handling the multiple feature spaces under a unified framework is the better strategy.

  • Generally speaking, MHCoC achieves better performance than the other baselines over all datasets, sometimes by a large margin, which confirms the effectiveness of MHCoC in co-clustering high-order heterogeneous data. We mainly attribute the improvement to the advantage of modularity and to the designed objective function, which properly combines the information from the high-order heterogeneous data. Moreover, MHCoC performs implicit subspace selection during cluster partitioning, and thus shows better performance for high-order co-clustering of high-dimensional sparse data.

Table 4 NMI comparison of the heterogeneous clustering methods
Table 5 Macro-F1 comparison of the heterogeneous clustering methods

5.4 Hierarchical co-clustering analysis (RQ3)

In this section, we attempt to demonstrate how hierarchical co-clustering facilitates the understanding of clustering results. To this end, we select five newsgroups with different relationships from the 20NG dataset to form the NG5 dataset. The five newsgroups are ‘talk.politics.mideast’, ‘sci.space’, ‘comp.graphics’, ‘rec.motorcycles’ and ‘rec.sport.baseball’. In NG5, ‘sci.space’ and ‘comp.graphics’ are similar, as are ‘rec.motorcycles’ and ‘rec.sport.baseball’, but there are obvious differences between these two groups. Besides, ‘talk.politics.mideast’ shares no obvious relationship with the other classes. Therefore, the NG5 dataset can representatively and intuitively show how the hierarchical strategy reflects the relationships between clusters. We first run MHCoC on NG5 to obtain 5 clusters, and then continue splitting into 20 clusters.

The hierarchical object clustering results are shown in Fig. 4. The root consists of the entire sample set. After four splits, the five clusters represented by rectangles with thickened borders are obtained, and the circles below indicate the results of continued splitting. Each number indicates which class a cluster corresponds to. Intuitively, the clusters ‘comp.graphics’ and ‘sci.space’ descend from the same cluster in the previous layer, and so do ‘rec.motorcycles’ and ‘rec.sport.baseball’. This structure has practical meaning and meets our expectations: both ‘comp.graphics’ and ‘sci.space’ involve science and technology, while ‘rec.motorcycles’ and ‘rec.sport.baseball’ belong to the entertainment category, and the difference between these two groups is relatively large. In addition, it can be seen from Fig. 4 that during the first split, ‘talk.politics.mideast’, which has no obvious relationship with the other classes, falls into the same cluster as ‘sci.space’ and ‘comp.graphics’, and is split from them afterwards. Although some samples of ‘talk.politics.mideast’ later appear in the ‘rec.motorcycles’ branch that was separated from it at the first split, the results suggest that ‘talk.politics.mideast’ has more in common with ‘comp.graphics’ and ‘sci.space’.

Fig. 4 Visualization of the hierarchical cluster structure

In summary, we find that compared with a flat cluster structure, the hierarchical cluster structure is more conducive to revealing the implicit relationships between clusters.

We then take the word feature space as an example. Table 6 shows the first five words of the word cluster corresponding to each sample cluster, ranked according to the method in [41]. It can be seen intuitively that each word cluster clearly characterizes its object cluster. This shows that block diagonal co-clustering includes an implicit hard subspace selection process, and this dimensionality reduction improves both the clustering quality and the efficiency of the algorithm.

Table 6 Keywords in different clusters

6 Conclusion

In this paper, we propose a hierarchical high-order co-clustering algorithm by maximizing modularity. Our approach merges the information of multiple types of object data and optimizes a modularity-based objective function by performing high-order co-clustering; finally, tree-like hierarchical high-order co-clustering results are obtained. The effectiveness of the proposed method is demonstrated on real-world datasets. Future work will focus on overlapping hierarchical high-order co-clustering algorithms.