1 Introduction

Constrained low rank approximation (CLRA), such as nonnegative matrix factorization (NMF), has played an important role in data analytics, providing a foundational framework for formulating key analytics tasks such as text clustering, graph clustering, and recommendation systems [16,17,18,19]. In this paper, we propose a joint NMF algorithm that jointly optimizes the standard NMF objective for content clustering and the Symmetric NMF (SymNMF) objective for graph clustering. Detailed discussions of NMF and SymNMF can be found in [14, 15] and [19], respectively. The goal is to simultaneously cluster data sets that contain both content and connection structure, utilizing both information sources to obtain higher quality clustering results. This type of fusion can be done at the data level (early fusion) or at the result level (late fusion). An advantage of NMF and SymNMF is that both are formulated within the single framework of CLRA; therefore, we can naturally design a joint objective function that achieves fusion at the objective function level, as we illustrate in a later section.

Numerous data sets contain both text content and connection structure. For example, in a data set of research papers or patents, the papers or patents have text content, and the citations or co-author relationships define the connection structure; in a data set of emails, the email messages have text content, and the sender–recipient relations define a hypergraph structure, since one email may have multiple recipients. When the connection structure is represented as edges in a graph, in the former case the text content is associated with graph nodes, while in the latter case the text content is associated with hypergraph edges. A hybrid clustering method is designed to utilize both content and connection structure information, thus taking advantage of the full information provided in the data.

Many methodologies exist for data clustering, but our CLRA framework offers multiple advantages. The proposed method is simple to implement based on an existing numerical routine for solving nonnegativity constrained least squares (NLS) problems, and it is widely applicable. The proposed method can also provide valuable insights about the data when there is not enough knowledge about the underlying data model or when one desires only a quick glance at the results. In fact, in the areas of text and graph clustering, CLRA methods (NMF and SymNMF) have been demonstrated to have superior performance in terms of speed and accuracy [17,18,19]. The two CLRA-based methods, NMF for content clustering and SymNMF for graph clustering, share the same underlying matrix factorization framework and can therefore be merged at the objective function level. Moreover, the result is easily interpretable as a clustering result without requiring an additional clustering step, unlike many spectral methods.

In this paper we discuss data with associated text content and connection structure. In addition to text content, other types of information may also be associated with connection structure, such as images and attributes that appear in structured data like a person’s age and gender, etc. Our hybrid clustering method can naturally extend to other content information as long as the raw data can be encoded as nonnegative vectors.

This paper is organized as follows: We start with a simpler case where the text content can be associated with graph nodes, and extend the idea to the case where we need a hypergraph for connection representation and the text content is associated with hypergraph edges (Sect. 2). We then summarize some related work in Sect. 3. We have conducted extensive experiments using patent citation data sets and two other types of data sets to show the effectiveness of our method (Sect. 4). In addition to demonstrating improvements of clustering quality, we list several potential applications of our hybrid clustering approach, including citation recommendation on a paper data set and the application of our hypergraph extension to an Enron email data set (Sect. 5). Discussions and conclusions can be found in Sect. 6.

2 Hybrid clustering via joint NMF

We have designed fast, scalable algorithms for some variants of NMF for key data analytics problems [3, 15, 17]. Currently, among the fastest algorithms for hierarchical and flat (non-hierarchical) topic modeling and clustering that also produce consistently high quality solutions are HierNMF2 and FlatNMF2, which are available in our open source C++ software package SmallK (http://smallk.github.io/). SmallK also includes Python drivers, pysmallk, that allow seamless integration of SmallK into existing Python applications.

First we assume that the text content is associated with the graph nodes. For example, a collection of research papers or patents can be represented in a graph where the content information of each paper or patent is a graph node and the citation information provides the graph connection information. We assume that a data set’s text information is represented in a nonnegative matrix \(X\in \mathbb {R}_{+}^{m\times n}\) and the graph structure is represented in a nonnegative symmetric matrix \(S \in \mathbb {R}_+^{n \times n}\), where m is the number of features, the columns of X represent the n data items, the (i,j)-th element of S represents a relationship such as similarity between the i-th and j-th data items, and \(\mathbb {R}_+\) denotes the set of nonnegative real numbers. Then the NMF formulation for text clustering/topic modeling [16] is

$$\begin{aligned} \min _{W\ge 0, H\ge 0}\Vert X-WH\Vert _F \end{aligned}$$
(1)

and the SymNMF formulation for graph clustering [18, 19] is

$$\begin{aligned} \min _{H\ge 0} \Vert S-H^TH\Vert _F \end{aligned}$$
(2)

where \(W\in \mathbb {R}_{+}^{m\times k}\) and \(H\in \mathbb {R}_{+}^{k\times n}\), and a given integer k, which is typically much smaller than m or n, represents the reduced dimension, i.e., the number of clusters, communities, or topics [14]. In (1), each column of W, subject to some scaling, is regarded as the representative of a cluster or a topic in the document collection. The matrix H can be seen as a low rank (rank k) representation of the data points, since each data item in X can be explained by an additive linear combination of the representative columns in W, i.e., the columns of H are approximate coordinates of the data items in X with the columns of W as basis vectors. Similarly, in (2), H is a low rank representation of the nodes in the graph. Such a low rank approximation also gives us k clusters, since \(H_{i,j}\) can be seen as a measurement of the strength with which the j-th data item belongs to the i-th cluster. Therefore, each column of H gives the soft clustering assignment information. By taking the row index with the maximum value in each column vector of H as the cluster index of each data item, one can also perform hard clustering [14, 15].
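To make the hard clustering rule concrete, the following is a minimal Python sketch (our own illustration, not part of the SmallK implementation); the argmax over each column of H implements the rule described above:

```python
import numpy as np

# H is a k x n nonnegative matrix from NMF or SymNMF; column j holds the
# soft membership strengths of data item j over the k clusters.
def hard_clustering(H):
    # The cluster index of item j is the row index of the largest entry
    # in column j of H.
    return np.argmax(H, axis=0)

# Example with k = 3 clusters and n = 4 items.
H = np.array([[0.9, 0.1, 0.0, 0.2],
              [0.1, 0.7, 0.2, 0.6],
              [0.0, 0.2, 0.8, 0.2]])
print(hard_clustering(H))  # [0 1 2 1]
```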

The hybrid clustering method we propose finds a low rank representation that simultaneously represents the text content and the graph structure of the data items by jointly optimizing the combined NMF and SymNMF objective functions:

$$\begin{aligned} \min _{W\ge 0, H\ge 0} \alpha _1|| X-WH ||_F^2 + \alpha _2 || S-H^T H ||_F^2, \end{aligned}$$
(3)

where \(\alpha _1\ge 0\) and \(\alpha _2 \ge 0\) are weighting parameters. By adjusting the parameters \(\alpha _i\), we can emphasize one term over the other. In the extreme cases, one of the \(\alpha _i\) can be set to zero: when \(\alpha _2=0\), we are concerned only with the content; when \(\alpha _1=0\), we pay attention only to the structural information and ignore the content. Excluding these special cases, we can assume \(\alpha _1=1\) without loss of generality, and Eq. (3) becomes

$$\begin{aligned} \min _{W\ge 0, H\ge 0} || X-WH ||_F^2 + \alpha || S-H^T H ||_F^2, \end{aligned}$$
(4)

with \(\alpha \ge 0\) as the weighting parameter.

Now we extend our method to hypergraphs where the text content is associated with hypergraph nodes. Once this is done, it is natural to extend the method further to cases where text is associated with graph or hypergraph edges, owing to the duality between the edges and nodes of a hypergraph and the fact that a graph can be treated as a special case of a hypergraph.

A hypergraph \(\mathscr {H}\) is a pair \(\mathscr {H} = (\mathscr {V},\mathscr {E})\), where \(\mathscr {V}=\{v_1,\ldots ,v_m\}\) is the set of vertices and \(\mathscr {E}=\{e_1,\ldots ,e_n:\,e_i\subset \mathscr {V}\}\) is the set of hyperedges. Unlike a graph edge, a hyperedge \(e_i\) may connect more than two vertices. Such a hypergraph \(\mathscr {H}\) can be represented by an incidence matrix \(M=(m_{ij})\in \mathbb {R}^{m\times n}\), where

$$\begin{aligned} m_{ij} = {\left\{ \begin{array}{ll} 1,&{} v_i\in e_j;\\ 0,&{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

The dual hypergraph \(\mathscr {H}^*\) is the hypergraph corresponding to the incidence matrix \(M^T\).

Assume there is a k-way partition of the vertices \((\mathscr {V}_1,\ldots ,\mathscr {V}_k)\), where \(\mathscr {V}_1 \cup \cdots \cup \mathscr {V}_k=\mathscr {V}\) and \(\mathscr {V}_i \cap \mathscr {V}_j=\emptyset \) for all \(1\le i\ne j\le k\). Define the matrix \(H=(h_{ij})\in \mathbb {R}^{k\times m}\) as

$$\begin{aligned} h_{ij} = \frac{[v_j\in \mathscr {V}_i]}{\sqrt{d_v(j)}\left( \displaystyle \sum _{v_l\in \mathscr {V}_i}\frac{1}{d_v(l)}\right) ^{1/2}} \end{aligned}$$
(5)

which is a normalized partition indicator matrix where

$$\begin{aligned}{}[v_j\in \mathscr {V}_i] = {\left\{ \begin{array}{ll} 1, &{} v_j\in \mathscr {V}_i;\\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

and \(d_v(l) = \sum _{j=1}^n m_{lj}\) is the degree of vertex \(v_l\). It is shown in [35] that the following optimization problem

$$\begin{aligned} \max _H \mathrm{tr}(HSH^T) \end{aligned}$$
(6)

is equivalent to minimizing the hypergraph normalized cut as defined in [35], where

$$\begin{aligned} S=D_v^{-1/2}MD_e^{-1}M^TD_v^{-1/2} \end{aligned}$$
(7)

is symmetric,

$$\begin{aligned} D_v=\mathrm{diag}(d_v(1),\ldots ,d_v(m)), \quad D_e=\mathrm{diag}(d_e(1),\ldots ,d_e(n)), \end{aligned}$$

and \(d_e(l) = \sum _{i=1}^m m_{il}\) is the degree of edge \(e_l\). Following the same argument as in [18], it can be shown that (6) is equivalent to \(\min _{H} \Vert S-H^TH\Vert _F^2\), and by relaxing the constraint (5) to \(H\ge 0\), we obtain the objective function of SymNMF. Therefore, in the case of a hypergraph, we can use the matrix S defined in Eq. (7) as the similarity matrix in Eq. (4).
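As an illustration, the following sketch computes the similarity matrix of Eq. (7) from an incidence matrix M. It is a dense NumPy version for small examples; a practical implementation would use sparse matrices, and we assume all vertex and edge degrees are positive, as is the case after extracting the largest connected component:

```python
import numpy as np

def hypergraph_similarity(M):
    # Compute S = D_v^{-1/2} M D_e^{-1} M^T D_v^{-1/2} as in Eq. (7).
    d_v = M.sum(axis=1)                        # vertex degrees d_v(l)
    d_e = M.sum(axis=0)                        # edge degrees d_e(l)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
    De_inv = np.diag(1.0 / d_e)
    return Dv_inv_sqrt @ M @ De_inv @ M.T @ Dv_inv_sqrt

# Example: 4 vertices, 2 hyperedges, e_1 = {v1, v2, v3}, e_2 = {v3, v4}.
M = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
S = hypergraph_similarity(M)
assert np.allclose(S, S.T)                     # S is symmetric
```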

There are many ways to find a solution for the objective function (4). Theoretically, a Newton-like algorithm can be developed to solve (4) directly. However, as pointed out in [19], a Newton-like algorithm cannot exploit the sparsity of X and S for speedup, because the matrices \(X-WH\) and \(S-H^TH\) need to be computed explicitly, which destroys the sparsity. On the other hand, an alternating nonnegative least squares (ANLS) algorithm can be sped up by exploiting sparsity. To apply an ANLS-like algorithm that can utilize the sparse nature of text documents and associated networks, we propose reformulating (4) in the following form with a penalty term

$$\begin{aligned} \min _{W,H,\tilde{H}\ge 0} || X-WH ||_F^2 + \alpha || S-\tilde{H}^T H ||_F^2 + \beta \Vert \tilde{H}-H\Vert _F^2, \end{aligned}$$
(8)

where \(\tilde{H}\in \mathbb {R}_+^{k\times n}\) and \(\beta \ge 0\) is the regularization parameter. This reformulation is motivated by our earlier work and yields an algorithm based on the block coordinate descent (BCD) scheme, in which each subproblem is a nonnegativity constrained least squares (NLS) problem for which we have developed a highly efficient algorithm and optimized open-source software [7]. Eq. (8) can then be solved with a 3-block coordinate descent (BCD) scheme, i.e. by minimizing the objective function with respect to W, \(\tilde{H}\), and H in turn. Specifically, we solve the following three subproblems:

$$\begin{aligned}&\min _{W\ge 0} \Vert H^TW^T-X^T\Vert _F\end{aligned}$$
(9)
$$\begin{aligned}&\min _{\tilde{H}\ge 0} \left\| \begin{bmatrix} \sqrt{\alpha } H^T\\ \sqrt{\beta } I_k \end{bmatrix}\tilde{H} - \begin{bmatrix} \sqrt{\alpha } S\\ \sqrt{\beta } H \end{bmatrix} \right\| _F \end{aligned}$$
(10)
$$\begin{aligned}&\min _{H\ge 0}\left\| \begin{bmatrix} W\\\sqrt{\alpha }\tilde{H}^T\\\sqrt{\beta } I_k \end{bmatrix} H - \begin{bmatrix} X\\\sqrt{\alpha }S\\\sqrt{\beta } \tilde{H} \end{bmatrix} \right\| _F \end{aligned}$$
(11)

where each subproblem is simply a nonnegative least squares (NLS) problem, which is convex. Thus, an active-set-based algorithm can find the optimal solution in a finite number of operations and ensures that the solution is in the feasible region. The active-set-based algorithm has an additional advantage when dealing with a rank-deficient input matrix: it does not run into a subproblem whose passive set indices involve linearly dependent vectors. This has profound implications for real-world applications such as chemical detection, where false negatives and false positives can increase dramatically in the presence of rank deficiency. For details, see [6]. The above three-block BCD algorithm converges to a stationary point according to Bertsekas’ theorem [1]. The identity submatrices \(I_k\) in the above equations make the subproblems better conditioned than those in the standard NMF, which uses a two-block BCD that alternately updates W and H. We solve each NLS problem using the block principal pivoting (BPP) algorithm [15]. Theoretically, to force H to be identical to \(\tilde{H}\), the value of the parameter \(\beta \) has to approach infinity. This problem has been studied extensively, and we use a scheme similar to that proposed in [32]. It should be pointed out that [15] shows that algorithms based on the BCD framework have guaranteed convergence to a stationary point, whereas popular and easy-to-implement algorithms such as multiplicative updating (MU) may not converge. In addition, extensive experiments show that the BPP method is faster and more accurate than MU.
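For concreteness, the following is a minimal dense-matrix sketch of the 3-block BCD scheme for Eq. (8). It substitutes SciPy's active-set NNLS solver for the much faster BPP solver we actually use, runs for a fixed number of iterations without a convergence check, and does not exploit sparsity; the default values of \(\alpha \) and \(\beta \) follow the choices described in Sect. 4:

```python
import numpy as np
from scipy.optimize import nnls

def nls(A, B):
    # Solve min_{Z >= 0} ||A Z - B||_F column by column. SciPy's active-set
    # NNLS stands in here for the block principal pivoting (BPP) solver.
    return np.column_stack([nnls(A, B[:, j])[0] for j in range(B.shape[1])])

def joint_nmf(X, S, k, alpha=None, beta=None, max_iter=100, seed=0):
    # Sketch of the 3-block BCD for Eq. (8): update W, \tilde{H}, H in turn.
    if alpha is None:   # default weights as described in Sect. 4
        alpha = np.linalg.norm(X, 'fro')**2 / np.linalg.norm(S, 'fro')**2
    if beta is None:
        beta = alpha * np.abs(S).max()
    n = X.shape[1]
    rng = np.random.default_rng(seed)
    H = rng.random((k, n))
    Ht = H.copy()                                   # \tilde{H}
    I = np.eye(k)
    sa, sb = np.sqrt(alpha), np.sqrt(beta)
    for _ in range(max_iter):
        W = nls(H.T, X.T).T                                   # Eq. (9)
        Ht = nls(np.vstack([sa * H.T, sb * I]),
                 np.vstack([sa * S, sb * H]))                 # Eq. (10)
        H = nls(np.vstack([W, sa * Ht.T, sb * I]),
                np.vstack([X, sa * S, sb * Ht]))              # Eq. (11)
    return W, H
```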

3 Related work

The use of joint matrix factorization for clustering can also be seen in [11, 21, 29], all of which consider clustering using information from different sources. The method in [29] also performs hybrid clustering of connection structure and content data. Their formulation (2JointMF) is \(\min _{W_0,W_1,H} \Vert A-W_0 H \Vert _F^2 + \Vert X-W_1H\Vert _F^2 + \lambda \sum _{j=1}^n \Vert H(:,j)\Vert _1^2 + \eta (\Vert W_0\Vert _F^2 + \Vert W_1\Vert _F^2) \), with the constraints \(H \ge 0\) and \(0 \le W_0H \le 1\), where A is the adjacency matrix of the graph (possibly an asymmetric matrix representing a directed graph) and X is the feature-data matrix. Our JointNMF differs from 2JointMF in the following ways: (1) JointNMF has nonnegativity constraints on all matrix factors, while 2JointMF has a nonnegativity constraint on H only. The nonnegativity constraints on all factors usually lead to better interpretations of the result; for example, the W factor in our formulation can be interpreted as topic vectors due to its nonnegativity. (2) 2JointMF factors the graph matrix in a way similar to factoring a general feature matrix, while the symmetric factorization in JointNMF acknowledges the symmetric similarity relation encoded in a graph and also relates to minimizing the normalized cut. (3) The constraint \(0 \le W_0H \le 1\) makes 2JointMF computationally much more difficult [12, 13]. The algorithm in [21] also jointly minimizes several NMF objectives; however, it does not consider graph information, and therefore SymNMF is not in its formulation. Although the objective function in [11] looks very similar to what we propose in this paper, the matrices in the formulas have different meanings, and their formulation is used only in the context of graph clustering.

Some other methods for hybrid clustering (of graph and node content) can be summarized into the following categories: (1) Generative models [2, 4, 9, 10, 22, 25]. These algorithms learn a latent cluster indicator for each node, based on which all the content and links are generated. Such a latent cluster indicator could be a vector that measures how likely a node belongs to each cluster (for soft clustering), or a single variable that assigns a node to a specific cluster (hard clustering). (2) Discriminative models [34]. The authors of [34] argue that generative models fail to consider additional factors that could affect the community memberships and fail to isolate the content that is irrelevant to community memberships. They propose a discriminative algorithm, PCL-DC, to overcome these two shortcomings. (3) Topic modeling with network regularization [24, 28]. These methods start with the objective function of a topic modeling method and add the graph related part as a regularization term. (4) Augmenting the graph with content information [26]. This method reduces the hybrid clustering problem to a graph clustering problem by augmenting the graph with new vertices and edges that reflect the text content. (5) Entropy based [5]. This method jointly minimizes the entropy of document clusters and maximizes the modularity of graph clusters. (6) Cluster ensembles [27]. This algorithm assembles the result of a graph clustering algorithm and the result of a document clustering algorithm into a combined result (late fusion). (7) Cluster selection [8]. This method selects a graph clustering algorithm when the graph has clear structure information and selects content-only algorithms when the graph has ambiguous structure information.

4 Clustering US patent, BlogCatalog and Flickr data

All experiments were performed on a server with two Intel(R) Xeon(R) E5-2680 v3 CPUs and 377 GB of memory.

The main data set used for the experiments is the US patent claim and citation data from PatentsView. Some advantages of using US patents as a data source are: (1) the openness, centralized management, and availability of a relatively structured data format make the patent data easy to obtain and process; (2) the abundance of the patent database ensures enough samples for study; (3) patents are carefully assigned classification labels that are examined by patent examiners, so the classification information can be used as a relatively reliable ground truth.

We use the Cooperative Patent Classification (CPC) system, where each classification label has the scheme illustrated in Fig. 1. We select 13 CPC classes (A22, A42, B06, B09, B68, C06, C13, C14, C40, D02, D10, F22, Y04) and use the patents under each class to construct 13 different data sets. For each data set, we first construct the term-document matrix representing the patent claims and the graph adjacency matrix representing the patent citation relations. Our algorithm requires a symmetric adjacency matrix, and therefore we treat the citation graph as undirected by ignoring the edge directions. We then clean the data by removing terms that appear very infrequently and documents that are too short or duplicated, and extract the largest connected component of the graph. Finally, we apply tf-idf to the term-document matrix and normalize its columns to have unit 2-norm, obtaining the matrix X, and let S be \(D^{-1/2}AD^{-1/2}\), where \(A\in \mathbb {R}^{n\times n}\) is the adjacency matrix, \(D=\mathrm{diag}(d_1,\ldots ,d_n)\), and \(d_i=\sum _{j=1}^nA_{ij}\) is the degree of vertex i. We use CPC groups as ground truth clusters. Some statistics about these data sets (after cleaning) are listed in Table 1.
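The final normalization step can be sketched as follows (dense NumPy for clarity; `X_tfidf` denotes the tf-idf weighted term-document matrix, and the graph is assumed connected so that all degrees are positive):

```python
import numpy as np

def normalize_inputs(X_tfidf, A):
    # Scale each document (column) of X to unit 2-norm.
    X = X_tfidf / np.linalg.norm(X_tfidf, axis=0, keepdims=True)
    # Symmetric normalization of the adjacency: S = D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)                     # vertex degrees d_i
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A @ D_inv_sqrt
    return X, S
```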

Fig. 1 An example classification label in the CPC scheme

Table 1 Some statistics of US patent data sets

To verify our algorithm on other types of data, we also use the BlogCatalog data set from [30] and the Flickr data set from [31]. These data sets have users as graph nodes and represent user commenting and friendship relations as graph edges. The content comes from user generated keywords/tags that are used to describe their blog articles (BlogCatalog) or photos (Flickr), which is different from traditional text content. The ground truth clusters of BlogCatalog data set are defined by categories of each blog and the ones for the Flickr data set are defined by user groups. We apply the same preprocessing as for the US patent data sets. Some statistics regarding these two data sets (after preprocessing) are listed in Table 2.

Table 2 Some statistics of BlogCatalog and Flickr data sets

We now define the measures for the evaluation of the clustering results. Assume we computed k clusters \(A_1,\ldots ,A_k\) and the ground truth has \(k'\) clusters \(B_1,\ldots ,B_{k'}\). We compute the confusion matrix \(C=(c_{ij})_{k\times k'}\), where \(c_{ij}=|A_i\cap B_j|\). Then we define the average \(F_1\) score [33] as

$$\begin{aligned} F_1=\frac{1}{2}\left( \frac{1}{k}\sum _{i=1}^k\max _jF_1(A_i,B_j) + \frac{1}{k'}\sum _{j=1}^{k'}\max _iF_1(B_j,A_i) \right) \end{aligned}$$

where

$$\begin{aligned} F_1(A_i,B_j)=F_1(B_j,A_i)=\frac{2c_{ij}}{|A_i|+|B_j|} \end{aligned}$$

This score measures how well an algorithm can recover the ground truth clusters. We also use another measure called the Rand index [23], which measures how well an algorithm can predict the connections among data items. Assume there are n data items in total. For each of the \(n(n-1)/2\) pairs of data items, we say the two items are c-connected if they belong to the same cluster; otherwise we call them c-disconnected (the prefix c is added to distinguish this notion from connectivity in graph theory). A clustering result can thus be treated as a prediction of the c-connectivity of each pair of data items. A prediction regarding one pair of data items falls into one of four cases, true positive (TP), true negative (TN), false positive (FP), or false negative (FN), according to the rules listed in Table 3. Then the Rand index can be defined as

$$\begin{aligned} RI= \frac{\#TP + \#TN}{\#TP+\#TN+\#FN+\#FP} = \frac{\#TP + \#TN}{n(n-1)/2} \end{aligned}$$
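Both measures are straightforward to compute from hard cluster assignments. A minimal sketch (assuming `pred` and `truth` are integer NumPy label arrays in which every cluster index occurs at least once):

```python
import numpy as np

def average_f1(pred, truth):
    k, kp = pred.max() + 1, truth.max() + 1
    C = np.zeros((k, kp))                  # confusion matrix c_ij
    for p, t in zip(pred, truth):
        C[p, t] += 1
    sizes_a = C.sum(axis=1)                # |A_i|
    sizes_b = C.sum(axis=0)                # |B_j|
    F = 2 * C / (sizes_a[:, None] + sizes_b[None, :])
    return 0.5 * (F.max(axis=1).mean() + F.max(axis=0).mean())

def rand_index(pred, truth):
    # Fraction of the n(n-1)/2 pairs whose c-connectivity is predicted
    # correctly (the TP and TN pairs).
    same_pred = pred[:, None] == pred[None, :]
    same_truth = truth[:, None] == truth[None, :]
    iu = np.triu_indices(len(pred), k=1)
    return (same_pred == same_truth)[iu].mean()
```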
Table 3 Types of predictions

Fig. 2 Parameter sensitivity of PCL-DC and JointNMF. The parameter of PCL-DC is \(\lambda \) and the parameter of JointNMF is \(\alpha \)

We compare our algorithm with NMF and SymNMF, which have leading performance in text clustering and graph clustering, respectively. For hybrid clustering, we choose PCL-DC [34] as the comparison method based on its popularity and source code availability. While our method is based on nonnegative matrix factorization, PCL-DC is a probabilistic method that combines a conditional model for link analysis and a discriminative model for content analysis. Although we mentioned many other algorithms in Sect. 3, for those algorithms either the code is not available or the code is available but we encountered runtime errors during experimental tests. Both JointNMF and PCL-DC have parameters to set. For JointNMF, we let the default parameter be \(\alpha =\Vert X\Vert _F^2/\Vert S\Vert _F^2\), meaning a half-half balance between graph clustering and text clustering, and set \(\beta =\alpha \Vert S\Vert _{max}\), where \(\Vert S\Vert _{max}\) is the maximum absolute value of the elements of S. The authors of PCL-DC do not provide a method for specifying its regularization parameter \(\lambda \). Therefore, it is important to first study how changes in this parameter affect the algorithm’s performance. We found that for \(\lambda <1\), PCL-DC sometimes becomes extremely slow, such that it may take weeks to run over all the data sets (estimated based on sampled runs). Therefore, \(\lambda \) is varied within [1, 20]. In Fig. 2, we show how the average F1 score changes when \(\lambda \) varies in that range for the first four data sets listed in Table 1. The code of PCL-DC provides two models (a popularity link model and a productivity link model), which we label PCL-DC-1 and PCL-DC-2, respectively. The performance change of JointNMF when its parameter \(\alpha \) varies in the same range is also studied.

We observe that PCL-DC is either worse than JointNMF or very sensitive to its parameter, and we conclude that when \(\lambda \) exceeds a certain threshold (depending on the data), there is a large drop in clustering quality. Therefore, to obtain a tolerable run time with fair clustering quality, \(\lambda =1\) is chosen for the comparison experiments. The results of the comparison are listed in Tables 4, 5 and 6, where each value is the average over 10 runs, and bold font denotes the best results.

Table 4 Comparison of average F1 scores
Table 5 Comparison of rand index
Table 6 Comparison of run time (seconds)

Using these patent data sets, from our experiments it can be observed that: (1) JointNMF usually has the best average F1 scores, and its average F1 score is almost always better than that of NMF or SymNMF alone; (2) JointNMF and SymNMF have the best Rand index; (3) SymNMF is usually the fastest algorithm; (4) the run time varies in a very different pattern between the NMF-based methods and PCL-DC. The algorithms for both the NMF-based methods and PCL-DC are iterative. For the NMF-based methods, the run time of each iteration is linear with respect to the data size (e.g. the number of nodes and edges) and cubic with respect to the number of clusters [17]. For PCL-DC, the run time of each iteration is linear with respect to both the data size and the number of clusters [34]. If we compare Table 6 with Table 1, we can observe that for the NMF methods the number of clusters does dominate the run time, but for PCL-DC the run time is rather unpredictable, which may suggest that the convergence behavior of PCL-DC is not consistent over different data sets. On the BlogCatalog and Flickr data sets, which have different kinds of content and graph edges, the performance varies depending on the data. However, the performance of JointNMF is comparable to the best method, with the exception of the run time on the Flickr data set.

In conclusion, for the patent data sets, based on content and citations, JointNMF produces better quality solutions for clustering; for the prediction of pairwise connections, both JointNMF and SymNMF perform well; speed-wise, JointNMF is not the fastest but is comparable to the other methods. On other types of data, the performance of each method varies, and JointNMF generates comparable results. The JointNMF method has other advantages: its parameter has an explicit meaning (the weight between text and graph information), the clustering quality is not very sensitive to the parameter setting, and its default parameter works very well.

5 Other applications

In this section we present additional applications of our JointNMF framework beyond clustering, namely citation recommendation for papers/patents and activity/leader detection in an organization.

5.1 Citation recommendation

When applied to papers/patents with citations or web pages with hyperlinks, the formulation (4) can also be understood as finding a basis W for the text space such that, under this basis, the representations (coordinates) of the documents also reflect their linkage information. Therefore, when we express a new vector \(\varvec{x}\) in the text space using the basis W, i.e. find a vector \(\varvec{h}\) that solves the following optimization problem

$$\begin{aligned} \min _{\varvec{h}\ge 0}\Vert \varvec{x}-W\varvec{h}\Vert _2, \end{aligned}$$
(12)

we can use the closeness of \(\varvec{h}\) to the column vectors in H to decide how likely the new document represented by \(\varvec{h}\) is to cite some of the documents in H. For example, one can recommend that a new document cite the i-th original document if the i-th entry of \(H^T\varvec{h}\) is larger than a certain threshold. Since the matrix S is approximated by \(H^TH\), we can treat \(H^T\varvec{h}\) as an augmented column of S, representing the relation between the original documents and the new document. Another method is to set a threshold on the cosine similarity between \(\varvec{h}\) and the column vectors in H. As we will observe, each method has its advantages.
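A minimal sketch of the two scoring rules, assuming W and H have already been computed by JointNMF (SciPy's NNLS solves (12); the small epsilon guarding against zero norms is our own addition):

```python
import numpy as np
from scipy.optimize import nnls

def citation_scores(W, H, x):
    # Eq. (12): represent the new document x in the basis W.
    h, _ = nnls(W, x)
    # Rule 1: entries of H^T h, an augmented column of S ~ H^T H.
    inner = H.T @ h
    # Rule 2: cosine similarity between h and each column of H.
    cosine = inner / (np.linalg.norm(H, axis=0) * np.linalg.norm(h) + 1e-12)
    return inner, cosine

# Recommend citing document i when inner[i] (or cosine[i]) exceeds a threshold.
```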

For this task, we use the paper title/abstract and citation data cit-HepTh from SNAP [20], which contains 27,770 papers from January 1993 to April 2003 in the hep-th (high energy physics - theory) section of arXiv. Note that this is a different task from clustering, and therefore the data preprocessing procedure is a little different: the raw adjacency matrix is used for S (i.e. \(S=A\)). The normalized version \(D^{-1/2}AD^{-1/2}\) is related to minimizing the normalized cut [18] and is therefore good for clustering; here, however, the raw adjacency matrix is a better indicator of citations, serving as an input that the algorithm learns from rather than a basis for clustering.

To evaluate our method, the data is separated into training and test sets by treating papers published earlier than 2003 as the training set and papers published in 2003 as the test set. We use JointNMF to learn a matrix W from the document and citation relations in the training set, and then make predictions of citations for documents in the test set and compare the predictions with the actual citations.

To verify that the W computed by our algorithm indeed reflects the network structure better, we also design several baseline methods. A naive method is to predict citations based on the number of words shared by two documents. One method based on NMF is to learn the matrix W used in (12) by NMF alone, i.e. \(\min _{W\ge 0, H\ge 0}\Vert X_{train}-WH\Vert _F\). Another method based on NMF is to directly learn the \(\varvec{h}\) vector in (12) using \(\min _{W,H,\varvec{h}\ge 0}\Vert [X_{train},\varvec{x}]-W[H,\varvec{h}]\Vert _F\). For these two NMF-based methods, once the matrix W or the vector \(\varvec{h}\) is obtained, the remaining steps for making predictions are the same as for JointNMF. In this subsection, we denote these two NMF-based methods as NMF-1 and NMF-2, respectively.

For both prediction methods (computing the inner product \(H^T\varvec{h}\) or computing cosine similarity scores), a threshold is needed. Instead of evaluating these algorithms at a fixed threshold, we show the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various threshold values. In general, the closer the curve is to the upper left corner of the graph, the better the algorithm's results; quantitatively, the larger the area under the curve (AUC), the better.
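This evaluation can be sketched with scikit-learn (`scores` and `cited` are hypothetical placeholders for the predicted scores of candidate citation pairs and the corresponding 0/1 actual citations):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

scores = np.array([0.9, 0.1, 0.4, 0.8, 0.05])   # predicted citation scores
cited = np.array([1, 0, 0, 1, 0])               # actual citations (0/1)
fpr, tpr, thresholds = roc_curve(cited, scores)
print(auc(fpr, tpr))   # area under the ROC curve; larger is better
```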

Paper abstracts are used to extract the text content. The experimental results are shown in Fig. 3. Some observations are: when cosine similarity is used, JointNMF makes the overall best predictions, and when the inner product is used, at certain threshold values JointNMF can achieve a relatively high true positive rate with a very low false positive rate. One can choose which method to use based on the requirements. A heuristic explanation of the difference between using cosine similarity and the inner product is as follows: cosine similarity ignores the amount of shared content while the inner product does not. Thus, the low false positive rate in the right sub-figures of Figs. 3 and 4 suggests that if two papers' abstracts/titles share a large amount of content (in the sense of the bag-of-words model), it is very likely that one paper cites the other.

Fig. 3 ROC curves for citation recommendation algorithms applied to paper abstract and citation data. The left uses cosine similarity for the prediction, while the right uses the inner product

The experiments are repeated using only paper titles as text content; similar results are observed, as shown in Fig. 4. From the results we can observe that even with very little text information (such as paper titles), our method still works well.

Fig. 4 ROC curves for citation recommendation algorithms applied to paper title and citation data. The left uses cosine similarity for the prediction, while the right uses the inner product

5.2 Activity and leader detection from Enron email data

In an organization where various groups of people work on different subjects and engage in different activities, JointNMF can be used to detect such group structure, reveal the working subjects/activities, and find administrators/leaders in the organization. We assume that (1) within-group communications (e.g. emails) reflect the subjects the group is working on and the activities it is engaged in, and (2) people involved in multiple groups are likely to hold higher positions in the organization, since they may be in charge of those groups. Each communication can be seen as a hypergraph edge that connects all the people involved in the communication, and the communication content is the text associated with that edge. Clustering the text data can distinguish and identify different working subjects/activities, and clustering the graph data can divide people into workgroups. JointNMF utilizes both types of data simultaneously and can therefore distinguish different groups of people working on the same subject as well as different subjects worked on by the same group of people. After clustering, one can count and compare the number of groups/clusters each person belongs to: the more groups a person belongs to, the more likely the person is in a leadership or administrative position.

We use a subset of the Enron email data extracted by a group from UC Berkeley, containing 1702 emails. First we construct the term-document matrix from the email content and the hypergraph incidence matrix from the email sender/recipient relations. The hypergraph has Enron employees as vertices and their emails as edges, and a vertex is connected by an edge if and only if the corresponding employee is the sender or a recipient of the corresponding email. After that, we clean the data by removing terms that appear very infrequently and emails that are too short or duplicated, and extract the largest connected component of the hypergraph. The tf-idf transformation is then applied to the term-document matrix and its columns are normalized to have unit 2-norm, obtaining the matrix X. S is computed using (7), in which M is the incidence matrix of the dual hypergraph. Finally, we apply JointNMF with \(\alpha = \Vert X\Vert _F^2/\Vert S\Vert _F^2\) and \(\beta =\alpha \Vert S\Vert _{max}\) to find 20 clusters. Note that since the dual hypergraph is used, the resulting clusters are clusters of emails rather than clusters of employees. To induce clusters of employees, one simply places the employees involved in the same cluster of emails into one employee cluster. In this way, we can induce overlapping employee clusters from non-overlapping email clusters. An employee is said to have j memberships if the employee belongs to j clusters. The number of memberships is counted for each employee, and the frequency of each number is listed in Table 7.
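The induction of overlapping employee clusters and the membership counts can be sketched as follows (M is the employees-by-emails incidence matrix and `email_cluster` holds the hard cluster index of each email; both variable names are our own):

```python
import numpy as np

def membership_counts(M, email_cluster, k=20):
    counts = np.zeros(M.shape[0], dtype=int)
    for c in range(k):
        emails_in_c = np.where(email_cluster == c)[0]
        # An employee joins employee-cluster c if involved in any email of c.
        members = np.where(M[:, emails_in_c].sum(axis=1) > 0)[0]
        counts[members] += 1
    return counts   # employees with large counts are candidate leaders
```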

Table 7 Frequency of number of memberships
Table 8 Employees that have j memberships (\(j\ge 6\)) and their positions at Enron

We examined the employees that had at least 6 memberships using online news sources and found that they all held relatively high positions at Enron. Their names and positions are listed in Table 8. To see the effect of our algorithm on topic modeling, we list some topic keywords for each cluster in Table 9. It can be observed that some emails are communications about or with other companies and regulatory agencies (clusters 0, 3, 19); some are about administrative tasks or daily work (5, 7, 8, 13, 15, 16, 18); some are about legal issues (6, 10); and some are related to the California energy crisis (2, 11).

Table 9 Topic keywords of clusters

6 Conclusions and discussions

With a simple CLRA formulation in (4), JointNMF is able to solve a variety of problems. The basic application of JointNMF is to cluster hybrid data with both content and connection structure, where the connection structure can be either a graph or a hypergraph, and the content can be associated with either the hypergraph nodes or the edges. When X is any nonnegative feature-data matrix and S is a nonnegative data-data similarity matrix, the JointNMF formulation (4) naturally applies without any modification. When there are multiple feature-data matrices \(X_1,\ldots ,X_p\) and multiple similarity matrices \(S_1,\ldots ,S_q\), one can extend (4) to

$$\begin{aligned} \min _{W_i\ge 0, H\ge 0} \sum _{i=1}^p\alpha _i|| X_i-W_iH ||_F^2 + \sum _{j=1}^q\gamma _j || S_j-H^T H ||_F^2 \end{aligned}$$

JointNMF can also be applied to predict paper/patent citations and detect activities and leaders in an organization.

As a hybrid clustering method, JointNMF, with its easy-to-set parameters, successfully improves cluster quality over content-only and connection-only clustering algorithms. It also outperforms one of the leading hybrid clustering methods in terms of average F1 score and Rand index.

Although the current default parameters (\(\alpha =\Vert X\Vert _F^2/\Vert S\Vert _F^2\) and \(\beta =\alpha \Vert S\Vert _{max}\)) for JointNMF are usually good enough, we noticed in our experimental results that they are not optimal. We plan to study this further in future research to better understand these parameter values.

Our next research effort, in addition to those noted above, is to accelerate the JointNMF algorithm using a divide-and-conquer approach, as in [17]. In our experiments, JointNMF also demonstrates excellent potential for predicting paper/patent citations and for detecting activities and leaders in an organization. The application of JointNMF to citation recommendation and activity/leader detection will be explored further, and more experimental results on additional data sets will serve to compare JointNMF with other algorithms in these two important areas, which have many applications in critical domains such as organized crime detection.