1 Introduction

Dimension reduction and clustering are two of the key tasks in machine learning and data analytics. Suppose a collection of \(n\) data items with \(m\) features is represented in a matrix \(X \in \mathbb {R}^{m \times n}\). In a low rank approximation, we are given a desired reduced rank \(k\) which is typically much smaller than \(m\) and \(n\), and we are to find \(C \in \mathbb {R}^{m \times k}\) and \(G \in \mathbb {R}^{n \times k}\) such that the difference between \(X\) and the product \( CG ^T\) is minimized, i.e.

$$\begin{aligned} X \approx CG ^T. \end{aligned}$$
(1)

This minimization problem can be formulated using various difference or distance measures. In this paper, we will focus on the Frobenius norm based minimization, i.e.

$$\begin{aligned} \min _{C, G} \Vert X - CG ^T\Vert _F. \end{aligned}$$
(2)

In nonnegative matrix factorization (NMF), nonnegativity is imposed on the factors \(C\) and \(G\), i.e. we are to solve

$$\begin{aligned} \min _{C \ge 0, G \ge 0} \Vert X- CG ^T\Vert _F^2, \end{aligned}$$
(3)

where \(C \in \mathbb {R}_+^{m \times k},\,G \in \mathbb {R}_+^{n \times k}\), and \(\mathbb {R}_+\) denotes the set of nonnegative real numbers. The NMF can be defined for any matrix, but it makes more sense to consider NMF when the matrix \(X\) is nonnegative. Throughout this paper, we will assume that \( X \in \mathbb {R}_+^{m \times n}\) is nonnegative.

NMF has been shown to be an effective method in numerous applications [11, 20, 25, 29, 35]. In this paper, we will focus on the role of NMF as a clustering method. Note that NMF is posed as a constrained low rank approximation method, and accordingly is a method for dimension reduction. However, dimension reduction and clustering are closely related. The following interpretation of the results of the low rank approximation illustrates this point: the columns of \(C\) form a new basis for the reduced \(k\)-dimensional space for \(X\), and each column of \(G^T\) provides the \(k\)-dimensional representation of the corresponding column of \(X\) in the space spanned by the columns of \(C\).

In the case of the singular value decomposition (SVD), the columns of \(C\) are ordered so that the first column is the most dominant vector (the leading left singular vector), which captures the largest variation in the data, the next column is the second most dominant vector and is orthogonal to the leading singular vector, and so on. Therefore, the columns of \(C\) do not “equally” represent the column space spanned by the data matrix \(X\). In addition, the two factors \(C\) and \(G^T\) can have negative elements, and thus it is difficult to interpret the \(i\)-th column of \(G^T\) as a “proportion distribution” describing how much of each basis vector in \(C\) the \(i\)-th data item contains. On the other hand, the columns of \(C\) in NMF cannot have negative entries, and accordingly cannot “cancel out” directions that the more dominant columns of \(C\) may represent. Accordingly, the columns of \(C\) more or less “equally” represent the data set, and the \(i\)-th column of the factor \(G^T\) can be viewed as a distribution describing how much of each column of \(C\) the \(i\)-th data item contains. Since we can use \(G^T\) to derive an assignment of the \(n\) data points into \(k\) groups, clustering can be viewed as a special type of dimension reduction. NMF gives a soft clustering result as explained above, but we can also interpret the result as a hard clustering by assigning the \(i\)-th data point to the \(j\)-th cluster when the largest element among all components of the \(i\)-th column of \(G^T\) lies in the \(j\)-th position. For example, when NMF is applied to document clustering, the basis vectors in \(C\) represent \(k\) topics, and the coefficients in the \(i\)-th column of \(G^T\) indicate the degrees of membership for \(x_i\), the \(i\)-th document. NMF is well-known for the interpretability of the latent space it finds [38].
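For illustration, the soft-to-hard conversion just described amounts to an argmax over each column of \(G^T\). The following minimal Python/NumPy sketch (the function name and values are illustrative, not from the paper) makes this concrete:

```python
import numpy as np

def hard_clusters_from_nmf(G):
    """Derive hard cluster labels from the NMF coefficient factor G.

    G is the n-by-k nonnegative factor in X ~ C G^T, so the i-th column of
    G^T (the i-th row of G) holds the coefficients of the i-th data item
    over the basis vectors in C."""
    # Assign item i to the cluster j with the largest coefficient.
    return np.argmax(G, axis=1)

# Toy example with 4 items and k = 2 clusters.
G = np.array([[0.9, 0.1],
              [0.7, 0.2],
              [0.0, 0.8],
              [0.1, 0.6]])
print(hard_clusters_from_nmf(G))  # -> [0 0 1 1]
```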

Another way to illustrate the capability of NMF as a clustering method is by observing its relationship to the objective function of the classical K-means clustering, which is arguably the most commonly used clustering method:

$$\begin{aligned} \min \sum _{i=1}^n \Vert x_i - c_{g_i}\Vert _2^2, \end{aligned}$$
(4)

where \(x_1,\ldots ,x_n\) are the columns of \(X,\,c_1,\ldots ,c_k\) are the \(k\) centroids, and \(g_i=j\) when the \(i\)-th data point is assigned to the \(j\)-th cluster (\(1 \le j \le k\)). Consider K-means formulated as a dimension reduction problem [30]:

$$\begin{aligned} \min _{G \in \{0,1\}^{n \times k}, G \mathbf 1 _k = \mathbf 1 _n} \Vert X- CG ^T\Vert _F^2, \end{aligned}$$
(5)

where \(\mathbf 1 _k \in \mathbb {R}^{k \times 1}, \mathbf 1 _n \in \mathbb {R}^{n \times 1}\) are column vectors whose elements are all 1’s. In the formulation (5), columns of \(C\) are the cluster centroids, and the single nonzero element in each column of \(G^T\) indicates the clustering assignment.
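As a small numerical check of this reformulation, the sketch below (our own illustration with an arbitrary assignment; not a clustering algorithm) builds the indicator matrix \(G\) and the centroid matrix \(C\) for a fixed assignment and verifies that \(\Vert X- CG ^T\Vert _F^2\) coincides with the K-means objective (4):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 2, 6, 2
X = rng.random((m, n))
labels = np.array([0, 0, 0, 1, 1, 1])        # assignment g_i (illustrative)

# Indicator matrix G in {0,1}^{n x k} with exactly one 1 per row (G 1_k = 1_n).
G = np.zeros((n, k))
G[np.arange(n), labels] = 1.0

# Columns of C are the cluster centroids.
C = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(k)])

kmeans_obj = sum(np.linalg.norm(X[:, i] - C[:, labels[i]]) ** 2 for i in range(n))
lowrank_obj = np.linalg.norm(X - C @ G.T, 'fro') ** 2
print(np.isclose(kmeans_obj, lowrank_obj))   # True
```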

NMF as a clustering method has been shown to be superior to K-means on many types of data, including documents [64], images [8], and microarray data [27]. Although K-means and NMF have the same objective function \(\Vert X- CG ^T\Vert _F^2\) with different constraints, i.e. \(G \in \{0,1\}^{n \times k}, G \mathbf 1 _k = \mathbf 1 _n\) in the case of K-means, and \(C \ge 0\) and \(G \ge 0\) in the case of NMF, each has its best performance on different kinds of data sets. In order to apply NMF to the appropriate data sets, we must understand the limitations of its clustering capability. Most clustering methods have a clearly defined objective function to optimize, such as (5) and (3). However, it is difficult to formulate clustering mathematically so that the hidden patterns in the data are discovered [33]. Each clustering method has its own conditions under which it performs well. For example, K-means assumes that data points in each cluster follow a spherical Gaussian distribution [18]. In contrast, the NMF formulation (3) provides a better low-rank approximation of the data matrix \(X\) than the K-means formulation (5).

Since \(\text {rank}(X) \le \text {nonnegative-rank}(X)\) [4], when \(k \le \text {rank}(X)\) the low rank approximation by NMF attains a smaller objective function value when the columns of \(C\) (the cluster representatives) are linearly independent; hence it is in the best interest of NMF to produce linearly independent cluster representatives. This explains our earlier finding that NMF performs well when different clusters correspond to linearly independent vectors [34]. The following artificial example illustrates this point. See Fig. 1, where the two cluster centers lie along the same direction, and therefore the two centroid vectors are linearly dependent. While NMF still approximates all the data points well in this example, no two linearly independent vectors in a two-dimensional space can represent the two clusters shown in Fig. 1. Since K-means and NMF have different conditions under which each of them clusters well, they may generate very different clustering results in practice. Motivated by Fig. 1, we note that spherical K-means assumes that the data points in each cluster follow a von Mises–Fisher distribution [3], which is similar to the assumption underlying NMF.

Fig. 1
figure 1

An example with two ground-truth clusters, with different clustering results

Therefore, NMF, originally a dimension reduction method, is not always a preferred clustering method. The success of NMF as a clustering method depends on the underlying data set, and its greatest success has been in document clustering [17, 30, 41, 52, 55, 64]. In a document data set, data points are often represented as unit-length vectors [47] and embedded in a linear subspace. For a term-document matrix \(X\), a basis vector \(c_j\) is interpreted as the term distribution of a single topic. As long as the term distributions of the \(k\) topics are linearly independent, which is usually the case, NMF can extract the ground-truth clusters well. However, NMF has not been as successful in image clustering. For image data, it was shown that a collection of images tends to form multiple 1-dimensional nonlinear manifolds [60], one manifold for each cluster. This does not satisfy NMF’s assumption on cluster structures, and therefore NMF may not identify correct clusters.

In this paper, we study a more general formulation for clustering based on NMF, called Symmetric NMF (SymNMF), where an \(n \times n\) nonnegative and symmetric matrix \(A\) is given as an input instead of a nonnegative data matrix \(X\). The matrix \(A\) contains pairwise similarity values of a similarity graph, and is approximated by a lower rank matrix \( HH ^T\) instead of the product of two lower rank matrices \( CG ^T\). High-dimensional data such as documents and images are often embedded in a low-dimensional space, and the embedding can be extracted from their graph representation. We will demonstrate that SymNMF can be used for graph embedding and clustering and often performs better than spectral methods in terms of multiple evaluation measures for clustering.

The rest of this paper is organized as follows. In Sect. 2, we review previous work on nonnegative factorization of a symmetric matrix and introduce the novelty of the directions proposed in this paper. In Sect. 3, we present our new interpretation of SymNMF as a clustering method. In Sect. 4, we show the difference between SymNMF and spectral clustering in terms of their dependence on the spectrum. In Sects. 5 and 6, we propose two algorithms for SymNMF: A Newton-like algorithm and an alternating nonnegative least squares (ANLS) algorithm, and discuss their efficiency and convergence properties. In Sect. 7, we report experiment results on document and image clustering that illustrate that SymNMF is a competitive method for graph clustering. In Sect. 8, we apply SymNMF to image segmentation and show the unique properties of the obtained segments. In Sect. 9, we discuss future research directions.

2 Related work

In SymNMF, we look for the solution \(H \in \mathbb {R}_+^{n \times k}\),

$$\begin{aligned} \min _{H \ge 0} f(H)=\Vert A- HH ^T\Vert _F^2, \end{aligned}$$
(6)

given \(A \in \mathbb {R}_+^{n \times n}\) with \(A^T=A\) and \(k\). The integer \(k\) is typically much smaller than \(n\). In our graph clustering setting, \(A\) is called a similarity matrix: The \((i,j)\)-th entry of \(A\) is the similarity value between the \(i\)-th and \(j\)-th nodes in a similarity graph, or the similarity value between the \(i\)-th and \(j\)-th data items.

The above formulation has been studied in a number of previous papers. Ding et al. [15] transformed the formulation of NMF (3) to a symmetric approximation \(\Vert A- HH ^T\Vert _F^2\) where \(A\) is a positive semi-definite matrix, and showed that it has the same form as the objective function of spectral clustering. Li et al. [41] used this formulation for semi-supervised clustering where the similarity matrix was modified with prior information. Zass and Shashua [69] converted a completely positive matrix [5] to a symmetric doubly stochastic matrix \(A\) and used the formulation (6) to find a nonnegative \(H\) for probabilistic clustering. They also gave a reason why the nonnegativity constraint on \(H\) was more important than the orthogonality constraint in spectral clustering. He et al. [23] approximated a completely positive matrix directly using the formulation (6) with parallel update algorithms. In all of the above work, \(A\) was assumed to be a positive semi-definite matrix. Other related work that imposed additional constraints on \(H\) includes [2, 65, 66].

The SymNMF formulation has also been applied to non-overlapping and overlapping community detection in real networks [45, 50, 62, 71, 72]. For example, Nepusz et al. [50] proposed a formulation similar to (6) with sum-to-one constraints to detect soft community memberships; Zhang et al. [72] proposed a binary factorization model for overlapping communities and discussed the pros and cons of hard/soft assignments to communities. The adjacency matrix \(A\) involved in community detection is often an indefinite matrix.

Catral et al. [9] studied whether \(WH^T\) is symmetric and \(W=H\), when \(W\) and \(H\) are the global optimum for the problem \(\min _{W,H \ge 0} \Vert A-WH^T\Vert _F^2\) where \(A\) is nonnegative and symmetric. Ho [24] in his thesis related SymNMF to the exact SymNMF problem \(A= HH ^T\). Both of their theories were developed outside the context of graph clustering, and their topics are beyond the scope of this paper. Ho [24] also proposed a \(2n\)-block coordinate descent algorithm for (6). Compared to our two-block coordinate descent framework described in Sect. 6, Ho’s approach introduced a dense \(n \times n\) matrix which destroys the sparsity pattern in \(A\) and is not scalable.

Almost all the work mentioned above employed multiplicative update algorithms to optimize their objective functions with nonnegativity constraints. However, this type of algorithm does not have the property that every limit point is a stationary point [22, 42], and accordingly the solutions are not guaranteed to be local minima. In fact, the results of multiplicative update algorithms (e.g. [16]) satisfy only part of the KKT conditions; for example, the condition on the sign of the gradient vector may be violated. The three papers [24, 50, 71] that used gradient descent methods, which do reach stationary point solutions, reported experiments only on graphs with up to thousands of nodes.

In this paper, we study the formulation (6) from different angles:

  1. 1.

    We focus on a more general case where \(A\) is a symmetric indefinite matrix and represents a general graph. Examples of such an indefinite matrix include a similarity matrix for high-dimensional data formed by the self-tuning method [70] as well as the pixel similarity matrix in image segmentation [56]. Real networks have additional structure, such as scale-free properties [59], and are not considered in this work.

  2. 2.

    We focus on hard clustering and will give an intuitive interpretation of SymNMF as a graph clustering method. Hard clustering offers more explicit membership and easier visualization than soft clustering [72]. Unlike [15], we emphasize the difference between SymNMF and spectral clustering instead of their resemblance.

  3. 3.

    We propose two optimization algorithms that converge to stationary point solutions for SymNMF, namely a Newton-like algorithm and an ANLS algorithm. We also show that the new ANLS algorithm scales better to large data sets.

  4. 4.

    In addition to experiments on document and image clustering, we apply SymNMF to image segmentation using 200 images in the Berkeley Segmentation Data Set [1]. To the best of our knowledge, our work represents the first attempt at a thorough evaluation of nonnegativity-based methods for image segmentation.

Overall, we conduct a comprehensive study of SymNMF in this paper, covering its foundational justification as a clustering method, convergent and scalable algorithms, and real-life applications to text and image clustering as well as image segmentation. The Newton-like algorithm and some of the analysis of spectral clustering were first proposed in our previous work [34]. We include them in this paper for completeness.

3 Interpretation of SymNMF as a graph clustering method

Just as the nonnegativity constraint in NMF makes it interpretable as a clustering method, the nonnegativity constraint \(H \ge 0\) in (6) also gives a natural partitioning as well as interpretation of SymNMF. Now we provide an intuitive explanation of why this formulation is expected to extract cluster structures.

Figure 2 shows an illustrative example of SymNMF, where we have rearranged the rows and columns of \(A\) without loss of generality. If a similarity matrix has cluster structures embedded in it, several diagonal blocks (two diagonal blocks in Fig. 2) with large similarity values will appear. In order to approximate this similarity matrix with low-rank matrices and simultaneously extract cluster structures, we can approximate these diagonal blocks separately because each diagonal block indicates one cluster. As shown in Fig. 2, it is straightforward to use an outer product \(hh^T\) to approximate a diagonal block. Because \(h\) is a nonnegative vector, it serves as a cluster membership indicator: Larger values in \(h\) indicate stronger memberships to the cluster corresponding to the diagonal block. When multiple such outer products are added up together, they approximate the original similarity matrix, and each column of \(H\) represents one cluster. Due to the nonnegativity constraints in SymNMF, only “additive”, or “non-subtractive”, summation of rank-1 matrices is allowed to approximate both diagonal and off-diagonal blocks.

Fig. 2
figure 2

An illustration of SymNMF formulation \(\min _{H \ge 0} \Vert A- HH ^T\Vert _F^2\). Each cell is a matrix entry. Colored region has larger values than white region. Here \(n=7\) and \(k=2\). (Color figure online)
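This intuition can be reproduced numerically. The sketch below is a constructed toy example with the same sizes as Fig. 2 (\(n=7\), \(k=2\)) and hypothetical similarity values; it shows that two nonnegative rank-1 terms \(hh^T\), one per diagonal block, already capture most of \(A\):

```python
import numpy as np

# Similarity matrix with two diagonal blocks of sizes 4 and 3 (n = 7).
A = 0.05 * np.ones((7, 7))        # small off-diagonal similarities
A[:4, :4] += 0.9                  # large within-cluster similarities
A[4:, 4:] += 0.8

# Hand-picked nonnegative membership indicators, one column per cluster.
h1 = np.sqrt(0.95) * np.array([1, 1, 1, 1, 0, 0, 0.]).reshape(-1, 1)
h2 = np.sqrt(0.85) * np.array([0, 0, 0, 0, 1, 1, 1.]).reshape(-1, 1)
H = np.hstack([h1, h2])

rel_err = np.linalg.norm(A - H @ H.T, 'fro') / np.linalg.norm(A, 'fro')
print(rel_err)   # about 0.05: each h h^T approximates one diagonal block
```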

In contrast, Fig. 3 illustrates the result of low-rank approximation of \(A\) without nonnegativity constraints. In this case, when using multiple \(hh^T\) outer products to approximate \(A\), cancellations of positive and negative numbers are allowed. Without nonnegativity enforced on \(h\)’s, the diagonal blocks need not be approximated separately by each term \(hh^T\). The elements in a vector \(h\) can have any sign (\(+,0,-\)) and magnitude, though the summation of all \(hh^T\) terms, i.e. \( HH ^T\), can approximate the large diagonal blocks and small off-diagonal blocks well. Thus, \(h\) cannot serve as a cluster membership indicator. In this case, the rows of the low-rank matrix \(H\) contain both positive and negative numbers and can be used for graph embedding. In order to obtain hard clusters, we need to post-process the embedded data points, for example by applying K-means clustering. This reasoning is analogous to the contrast between NMF and SVD [38].

SymNMF is flexible in terms of choosing similarities between data points. We can choose any similarity measure that describes the cluster structure well. In fact, the formulation of NMF (3) can be related to SymNMF when \(A=X^TX\) in (6)  [15]. This means that NMF implicitly chooses inner products as the similarity measure, which is not always suitable for distinguishing different clusters.

4 SymNMF and spectral clustering

4.1 Objective functions

Spectral clustering represents a large class of graph clustering methods that rely on eigenvector computation [10, 51, 56]. Now we will show that spectral clustering and SymNMF are closely related in terms of the graph clustering objective but fundamentally different in optimizing this objective.

Many graph clustering objectives can be reduced to a trace maximization form [13, 36]:

$$\begin{aligned} \max \text {trace}(\tilde{H}^TA\tilde{H}), \end{aligned}$$
(7)

where \(\tilde{H} \in \mathbb {R}^{n \times k}\) (to be distinguished from \(H\) in the SymNMF formulation) satisfies \(\tilde{H}^T \tilde{H} = I, \tilde{H} \ge 0\); due to \(\tilde{H}^T \tilde{H} = I\), each row of \(\tilde{H}\) contains at most one positive entry. Clustering assignments can be drawn from \(\tilde{H}\) accordingly.

Fig. 3
figure 3

An illustration of \(\min \Vert A- HH ^T\Vert _F^2\) or \(\min _{H^TH=I} \Vert A- HH ^T\Vert _F^2\). Each cell is a matrix entry. Colored region has larger magnitudes than white region. Magenta cells indicate positive entries, green indicating negative. Here \(n=7\) and \(k=2\). (Color figure online)

Under the constraints on \(\tilde{H}^T \tilde{H} = I, \tilde{H} \ge 0\), we have [15]:

$$\begin{aligned}&\max \text {trace}\left( \tilde{H}^T A \tilde{H}\right) \nonumber \\&\quad \Leftrightarrow \min \text {trace}(A^TA) - 2\text {trace}\left( \tilde{H}^T A \tilde{H}\right) + \text {trace}(I) \nonumber \\&\quad \Leftrightarrow \min \text {trace}\left[ \left( A-\tilde{H}\tilde{H}^T\right) ^T\left( A-\tilde{H}\tilde{H}^T\right) \right] \nonumber \\&\quad \Leftrightarrow \min \Vert A-\tilde{H}\tilde{H}^T\Vert _F^2. \nonumber \end{aligned}$$

This objective function is the same as (6), except that the constraints on the low-rank matrices \(H\) and \(\tilde{H}\) are different. The constraint on \(\tilde{H}\) makes the graph clustering problem NP-hard [56], therefore a practical method relaxes the constraint to obtain a tractable formulation. In this respect, spectral clustering and SymNMF can be seen as two different ways of relaxation: While spectral clustering retains the constraint \(\tilde{H}^T\tilde{H}=I\), SymNMF retains \(\tilde{H} \ge 0\) instead. These two choices lead to different algorithms for optimizing the same graph clustering objective (7), which are shown in Table 1.

Table 1 Algorithmic steps of spectral clustering and SymNMF clustering

4.2 Spectral clustering and the spectrum

Normalized cut is a widely-used objective for spectral clustering [56]. Now we describe some scenarios where optimizing this objective may have difficulty in identifying correct clusters while SymNMF could be potentially better.

Although spectral clustering is a well-established framework for graph clustering, its success relies only on the properties of the leading eigenvalues and eigenvectors of the similarity matrix \(A\). It was pointed out in [51, 58] that the \(k\)-dimensional subspace spanned by the leading \(k\) eigenvectors of \(A\) is stable only when \(|\lambda _k(A)-\lambda _{k+1}(A)|\) is sufficiently large, where \(\lambda _i(A)\) is the \(i\)-th largest eigenvalue of \(A\). Now we show that spectral clustering could fail when this condition is not satisfied, even though the cluster structure is perfectly represented in the block-diagonal structure of \(A\). Suppose \(A\) is composed of \(k=3\) diagonal blocks, corresponding to three clusters:

$$\begin{aligned} A = \left[ \begin{matrix} A_1 &{} 0 &{} 0 \\ 0 &{} A_2 &{} 0 \\ 0 &{} 0 &{} A_3 \end{matrix} \right] . \end{aligned}$$
(8)

If we construct \(A\) as in the normalized cut, then each of the diagonal blocks \(A_1,A_2,A_3\) has a leading eigenvalue 1. We further assume that \(\lambda _2(A_i) < 1\) for all \(i=1,2,3\) in exact arithmetic. Thus, the three leading eigenvectors of \(A\) correspond to the diagonal blocks \(A_1,A_2,A_3\) respectively. However, when \(\lambda _2(A_1)\) and \(\lambda _3(A_1)\) are so close to 1 that they cannot be distinguished from \(\lambda _1(A_1)\) in finite precision arithmetic, it is possible that the computed eigenvalues \(\tilde{\lambda }_j(A_i)\) satisfy \(\tilde{\lambda }_1(A_1) > \tilde{\lambda }_2(A_1) > \tilde{\lambda }_3(A_1) > \max (\tilde{\lambda }_1(A_2), \tilde{\lambda }_1(A_3))\). In this case, three subgroups are identified within the first cluster, and the second and third clusters cannot be identified, as shown in Fig. 4, where all the data points in the second and third clusters are mapped to \((0,0,0)\). Therefore, eigenvectors computed in finite precision cannot always capture the correct low-dimensional graph embedding.

Fig. 4
figure 4

Three leading eigenvectors of the similarity matrix in (8) when \(\tilde{\lambda }_3(A_1) > \max (\tilde{\lambda }_1(A_2), \tilde{\lambda }_1(A_3))\). Here we assume that all the block diagonal matrices \(A_1, A_2, A_3\) have size \(3 \times 3\). Colored region has nonzero values. (Color figure online)
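The claim that each normalized-cut-weighted diagonal block contributes a leading eigenvalue of 1 can be verified numerically. The sketch below (with hypothetical cluster sizes and similarity values of our choosing) assembles a block-diagonal matrix as in (8) and prints its leading eigenvalues:

```python
import numpy as np
from scipy.linalg import block_diag, eigh

rng = np.random.default_rng(1)

def normalized_block(E):
    """Normalized-cut weighting D^{-1/2} E D^{-1/2} of a similarity block."""
    d_inv_sqrt = 1.0 / np.sqrt(E.sum(axis=1))
    return E * np.outer(d_inv_sqrt, d_inv_sqrt)

def random_cluster_block(size):
    # Dense nonnegative within-cluster similarities (hypothetical values).
    E = rng.uniform(0.5, 1.0, (size, size))
    return normalized_block((E + E.T) / 2)

A = block_diag(*[random_cluster_block(s) for s in (40, 30, 30)])
eigenvalues = eigh(A, eigvals_only=True)[::-1]    # descending order
print(eigenvalues[:5])  # three eigenvalues (numerically) equal to 1, one per block
```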

Now we demonstrate the above scenario using a concrete graph clustering example. Figure 5 shows (a) the original data points; (b) the embedding generated by spectral clustering; and (c, d) plots of the similarity matrix \(A\). Suppose the scattered points form the first cluster, and the two tightly-clustered groups correspond to the second and third clusters. We employ the widely-used Gaussian kernel [61] and normalized similarity values [56]:

$$\begin{aligned} e_{ij}&= \exp \left( -{\Vert x_i-x_j\Vert _2^2 \over \sigma ^2}\right) ,\nonumber \\ A_{ij}&= e_{ij}d_i^{-1/2}d_j^{-1/2}, \end{aligned}$$
(9)
Fig. 5
figure 5

A graph clustering example with three clusters (original data from [70]). a Data points in the original space. For illustration’s purpose, we use two-dimensional data points; however, the Gaussian kernel (9) maps them into a high-dimensional space. b 3-dimensional embedding of the data points as rows of three leading eigenvectors. c Block-diagonal structure of \(A\). d Block-diagonal structure of the submatrix of \(A\) corresponding to the two tightly-clustered groups in (a). Note that the data points in both a and b are marked with ground-truth labels

where \(x_i\)’s are the two-dimensional data points, \(d_i=\sum _{s=1}^n e_{is} (1 \le i \le n)\), and \(\sigma \) is a parameter set to 0.05 based on the scale of data points. In spectral clustering, the rows of the leading eigenvectors determine a mapping of the original data points, shown in Fig. 5b. In this example, the original data points are mapped to three unique points in a new space. However, the three points in the new space do not correspond to the three clusters in Fig. 5a. In fact, out of the 303 data points in total, 290 data points are mapped to a single point in the new space.

Let us examine the leading eigenvalues, shown in Table 2, where the fourth largest eigenvalue of \(A\) is very close to the third largest eigenvalue. This means that the second largest eigenvalue of a cluster, say \(\lambda _2(A_1)\), could easily be mistaken for one of \(\lambda _1(A_1),\,\lambda _1(A_2)\), and \(\lambda _1(A_3)\). The mapping of the original data points shown in Fig. 5b implies that the three largest computed eigenvalues all come from the first cluster. This example is a noisier case of the scenario in Fig. 4.

In contrast, we can see from Fig. 5c, d that the block-diagonal structure of \(A\) is clear, though the within-cluster similarity values are not on the same scale. Figure 6 shows the comparison of clustering results of spectral clustering and SymNMF in this case. SymNMF is able to separate the two tightly-clustered groups more accurately.

Table 2 Leading eigenvalues of the similarity matrix based on Fig. 5 with \(\sigma =0.05\)
Fig. 6
figure 6

Clustering results for the example in Fig. 5: a Spectral clustering, b SymNMF

4.3 A condition on SymNMF

How does the spectrum of \(A\) possibly influence SymNMF? We have seen that the solution of SymNMF relies on the block-diagonal structure of \(A\), thus it does not suffer from the situations in Sect. 4.2 where the eigengap between the \(k\)-th and \((k+1)\)-th eigenvalues is small. We will also see in later sections that algorithms for SymNMF do not depend on eigenvector computation. However, we do emphasize a condition that SymNMF must satisfy in order to make the formulation (6) valid. This condition is related to the spectrum of \(A\), specifically the number of nonnegative eigenvalues of \(A\). Note that \(A\) is only assumed to be symmetric and nonnegative, and is not necessarily positive semi-definite, therefore may have both positive and negative eigenvalues. On the other hand, in the approximation \(\Vert A- HH ^T\Vert _F,\, HH ^T\) is always positive semi-definite and has rank at most \(k\), therefore \( HH ^T\) would not be a good approximation if \(A\) has fewer than \(k\) nonnegative eigenvalues. We assume that \(A\) has at least \(k\) nonnegative eigenvalues when the given size of \(H\) is \(n \times k\).

This condition on \(A\) could be expensive to check. Here, by a simple argument, we claim that it is practically reasonable to assume that this condition is satisfied given a similarity matrix. Again, we use the similarity matrix \(A\) in (8) as an example. Suppose we know the actual number of clusters is three, and therefore \(H\) has size \(n \times 3\). Because \(A\) is nonnegative, each of \(A_1,A_2,A_3\) has at least one nonnegative eigenvalue according to the Perron–Frobenius theorem [4], and so \(A\) has at least three nonnegative eigenvalues. In a real data set, \(A\) may be much noisier, with small entries in the off-diagonal blocks of \(A\). The eigenvalues are not dramatically changed by a small perturbation of \(A\) according to matrix perturbation theory [58], hence \(A\) would also have at least \(k\) nonnegative eigenvalues if its noiseless version does. In practice, the number of positive eigenvalues of \(A\) is usually much larger than the number of negative eigenvalues, which is verified in our experiments.
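For moderate \(n\), the condition can also be checked directly; a minimal sketch (not part of the SymNMF algorithms themselves) is:

```python
import numpy as np

def has_k_nonnegative_eigenvalues(A, k, tol=1e-12):
    """Check that the symmetric matrix A has at least k nonnegative
    eigenvalues, so that a rank-k positive semi-definite factor H H^T
    can reasonably approximate it."""
    eigenvalues = np.linalg.eigvalsh(A)     # all eigenvalues, ascending order
    return int(np.sum(eigenvalues >= -tol)) >= k
```

Such a check requires a full eigendecomposition costing \(O(n^3)\), which is why in practice we rely on the perturbation argument above instead of an explicit check.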

5 A Newton-like algorithm for SymNMF

figure a

In this section, we will present an optimization algorithm to compute SymNMF where \(A\) is nonnegative and symmetric. The objective function in (6) is a fourth-order non-convex function with respect to the entries of \(H\), and has multiple local minima. For this type of problem, it is difficult to find a global minimum; thus a good convergence property we can expect is that every limit point is a stationary point [7]. We could directly apply standard gradient search algorithms, which lead to stationary point solutions; however, they suffer from either slow convergence or high computational cost.

5.1 Algorithm framework

First, we introduce our notations for clarity. Let \(H=[h_1,\ldots ,h_k] \in \mathbb {R}_+^{n \times k}\). A vector \(x\) of length \(nk\) is used to represent the vectorization of \(H\) by column, i.e. \(x=\text {vec}(H) = [h_1^T,\ldots ,h_k^T]^T \in \mathbb {R}_+^{nk \times 1}\). For simplicity, functions applied on \(x\) have the same notation as functions applied on \(H\), i.e. \(f(x) \equiv f(H)\). \([\cdot ]^+\) denotes the projection to the nonnegative orthant, i.e. replacing any negative element of a vector to be 0. Superscripts denote iteration indices, e.g. \(x^{(t)}=\text {vec}(H^{(t)})\) is the iterate of \(x\) in the \(t\)-th iteration. For a vector \(v,\,v_i\) denotes its \(i\)-th element. For a matrix \(M,\,M_{ij}\) denotes its \((i,j)\)-th entry; and \(M_{[i][j]}\) denotes its \((i,j)\)-th \(n \times n\) block, assuming that both the numbers of rows and columns of \(M\) are multiples of \(n\). \(M \succ 0\) refers to positive definiteness of \(M\). We define the projected gradient \(\nabla ^P f(x)\) at \(x\) as [43]:

$$\begin{aligned} \left( \nabla ^P f(x)\right) _i = \left\{ \begin{array}{lll} \left( \nabla f(x)\right) _i,&{}\quad \text {if}&{}\quad x_i > 0; \\ \left[ \left( \nabla f(x)\right) _i\right] ^+,&{}\quad \text {if}&{}\quad x_i = 0. \end{array}\right. \end{aligned}$$
(10)

Algorithm 1 describes a framework of gradient search algorithms applied to SymNMF, based on which we developed our Newton-like algorithm. This description does not specify iteration indices, but updates \(x\) in-place. The framework uses the “scaled” negative gradient direction as the search direction. Except for the scalar parameters \(\beta , \sigma , \mu \), the \(nk \times nk\) scaling matrix \(S^{(t)}\) is the only unspecified quantity. Table 3 lists two choices of \(S^{(t)}\) that lead to different gradient search algorithms: projected gradient descent (PGD) [43] and projected Newton (PNewton) [7].

Table 3 Comparison of PGD and PNewton for solving \(\min _{H \ge 0} \Vert A- HH ^T\Vert _F^2,\,H \in \mathbb {R}_+^{n \times k}\)

PGD sets \(S^{(t)} = I\) throughout all the iterations. It is a steepest descent method and does not scale the gradient using any second-order information. This strategy often suffers from the well-known zigzagging behavior and thus has a slow convergence rate [7]. On the other hand, PNewton exploits second-order information provided by the Hessian \(\nabla ^2 f(x^{(t)})\) as much as possible. PNewton sets \(S^{(t)}\) to be the inverse of a reduced Hessian at \(x^{(t)}\). The reduced Hessian with respect to index set \(R\) is defined as:

$$\begin{aligned} (\nabla _R^2 f(x))_{ij} = \left\{ \begin{array}{lll} \delta _{ij}, &{} \text { if }\quad i \in R \text { or } j \in R;\\ \left( \nabla ^2 f(x)\right) _{ij},&{} \text { otherwise}, \end{array}\right. \end{aligned}$$
(11)

where \(\delta _{ij}\) is the Kronecker delta. Both the gradient and the Hessian of \(f(x)\) can be computed analytically:

$$\begin{aligned} \nabla f(x)&= \text {vec} \left( 4( HH ^T-A)H\right) , \nonumber \\ \left( \nabla ^2 f(x)\right) _{[i][j]}&= 4 \left( \delta _{ij}( HH ^T-A) + h_jh_i^T + (h_i^Th_j) I_{n \times n} \right) . \nonumber \end{aligned}$$
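For reference, the gradient above and one PGD iteration (the \(S^{(t)}=I\) choice in Table 3, here with a fixed step size in place of the backtracking line search of Algorithm 1; the step size, problem sizes, and initialization below are illustrative choices of ours) can be sketched as follows:

```python
import numpy as np

def symnmf_objective(A, H):
    return np.linalg.norm(A - H @ H.T, 'fro') ** 2

def symnmf_gradient(A, H):
    # In matrix form: grad f(H) = 4 (H H^T - A) H; vec(.) of this is nabla f(x).
    return 4.0 * (H @ H.T - A) @ H

def projected_gradient_step(A, H, step=1e-3):
    # x <- [x - alpha * grad f(x)]^+, i.e. project back onto the nonnegative orthant.
    return np.maximum(H - step * symnmf_gradient(A, H), 0.0)

# Illustrative usage on a small random symmetric nonnegative matrix.
rng = np.random.default_rng(0)
A = rng.random((30, 30))
A = (A + A.T) / 2
H = 0.1 * rng.random((30, 3))
for _ in range(500):
    H = projected_gradient_step(A, H)
print(symnmf_objective(A, H))
```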

We introduce the definition of an index set \(\mathcal {E}\) that helps to prove the convergence of Algorithm 1 [7]:

$$\begin{aligned} \mathcal {E} = \left\{ i|0 \le x_i \le \epsilon ,(\nabla f(x))_i>0\right\} , \end{aligned}$$
(12)

where \(\epsilon \) depends on \(x\) and is usually small (\(0<\epsilon <0.01\)) [26]. In PNewton, \(S^{(t)}\) is formed based on the reduced Hessian \(\nabla _{\mathcal {E}}^2 f(x^{(t)})\) with respect to \(\mathcal {E}\). However, because the computation of the scaled gradient \(S^{(t)} \nabla f(x^{(t)})\) involves the Cholesky factorization of the reduced Hessian, PNewton has a very large computational complexity of \(O(n^3k^3)\), which is prohibitive. Therefore, we propose a Newton-like algorithm that exploits second-order information in an inexpensive way.

5.2 Improving the scaling matrix

The choice of the scaling matrix \(S^{(t)}\) is essential to any algorithm derived from the framework described in Algorithm 1. We propose two improvements in the choice of \(S^{(t)}\), yielding new algorithms for SymNMF. Our focus is to collect partial second-order information efficiently while still effectively guiding the scaling of the gradient direction. Thus, these improvements seek a tradeoff between convergence rate and computational complexity, with the goal of accelerating SymNMF algorithms overall.

Our design of new algorithms must guarantee convergence. Since the algorithm framework still follows Algorithm 1, we would like to know which property of the scaling matrix \(S^{(t)}\) is essential in the proof of the convergence of PGD and PNewton. This property is described by the following definition and lemma:

Definition 1

A scaling matrix \(S\) is diagonal with respect to an index set \(R\), if \(S_{ij}=0,\forall i \in R \text { and } j \ne i\) [6].

Lemma 1

Let \(S\) be a positive definite matrix which is diagonal with respect to \(\mathcal {E}\). If \(x\ge 0\) is not a stationary point, there exists \(\bar{\alpha }>0\) such that \(f\left( [x-\alpha S \nabla f(x)]^+\right) < f(x),\forall 0<\alpha <\bar{\alpha }\). (modified from [6])

Lemma 1 states the requirement on \(S^{(t)}\), which is satisfied by the choices of \(S^{(t)}\) in both PGD and PNewton. It guides our development of new ways to choose \(S^{(t)}\).

5.2.1 Improvement 1: fewer Hessian evaluations

A common method for reducing computation cost related to \(S^{(t)}\) is to periodically update \(S^{(t)}\) or evaluate \(S^{(t)}\) only at the 1st iteration (chord method) [26]. However, this method cannot be directly used in the framework of Algorithm 1, because \(S^{(t)}\) is not necessarily diagonal with respect to \(\mathcal {E}^{(t)}\) if \(\mathcal {E}^{(t)} \ne \mathcal {E}^{(1)}\), and the requirement for convergence is violated.

Our way to delay the update of \(S^{(t)}\) is to evaluate \(S^{(t)}\) only when \(\mathcal {E}^{(t)}\) changes. More precisely,

$$\begin{aligned} S^{(t)} = \left\{ \begin{array}{lll} S^{(t-1)},&{}\quad \text {if}\quad \mathcal {E}^{(t)} = \mathcal {E}^{(t-1)}; \\ \left( \nabla _{\mathcal {E}}^2 f(x^{(t)}) \right) ^{-1},&{}\quad \text {if} \quad \mathcal {E}^{(t)} \ne \mathcal {E}^{(t-1)} \\ &{}\quad \text {and } \nabla _{\mathcal {E}}^2 f(x^{(t)}) \succ 0; \\ I_{nk \times nk},&{}\quad \text { otherwise}.\\ \end{array}\right. \end{aligned}$$
(13)

Note that because \(f(x)\) is non-convex, we have to set \(S^{(t)} = I\) when \(\nabla _{\mathcal {E}}^2 f(x^{(t)})\) is not positive definite, which can be checked during its Cholesky factorization. We expect that this improvement can reduce the number of Hessian evaluations and Cholesky factorizations.

5.2.2 Improvement 2: cheaper Hessian evaluations

The second improvement in choosing \(S^{(t)}\) is inspired by the recently proposed coordinate gradient descent (CGD) method for solving covariance selection [68]. When CGD is directly applied to SymNMF, it updates one column of \(H\) in each iteration while the other columns are fixed, and the search direction is typically determined by solving a quadratic programming problem. The CGD method introduces additional overhead when determining the search direction; however, it implies a possibility of using second-order information without evaluating the entire Hessian.

Inspired by the incremental update framework of CGD, we propose to choose \(S^{(t)}\) to be a block-diagonal matrix in our batch update framework in Algorithm 1. Specifically,

$$\begin{aligned} S^{(t)}_{[i][j]} = \left\{ \begin{array}{lll} 0,&{}\quad \text {if}\quad i \ne j; \\ \left( \nabla _{\mathcal {E}}^2 f(x^{(t)})_{[i][j]} \right) ^{-1},&{}\quad \text {if } i=j \\ &{}\quad \text {and } \nabla _{\mathcal {E}}^2 f(x^{(t)})_{[i][j]} \succ 0; \\ I_{n \times n},&{}\quad \text { otherwise}.\\ \end{array}\right. \end{aligned}$$
(14)

Intuitively speaking, the \(i\)-th \(n \times n\) diagonal block of \(S^{(t)}\) corresponds to the variables in the \(i\)-th column of \(H\), and \(S^{(t)}\) only involves second-order information within each column of \(H\). This choice of \(S^{(t)}\) has two advantages over the choice in the PNewton algorithm: First, the computational complexity in each iteration is \(O(n^3k)\), much lower than the complexity of PNewton if \(k\) is not too small. Second, we can exploit partial second-order information even when the \(k\) diagonal blocks of \(\nabla _{\mathcal {E}}^2 f(x^{(t)})\) are not all positive definite, whereas PNewton requires the positive definiteness of the entire reduced Hessian, and hence of all the \(k\) diagonal blocks, as a necessary condition.
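A sketch of the resulting search direction computation is given below (our simplified illustration of (14): the index set \(\mathcal {E}\) of (12) and the reuse rule (13) are omitted, each of the \(k\) diagonal Hessian blocks is factorized independently, and a block falls back to the identity when it is not positive definite):

```python
import numpy as np

def block_newton_direction(A, H):
    """Scaled direction -S * grad f(x) with the block-diagonal S of (14).

    Simplified sketch: the index set E of (12) and the reuse rule (13)
    are ignored; each n-by-n diagonal Hessian block is handled separately."""
    n, k = H.shape
    R = H @ H.T - A                      # residual H H^T - A
    grad = 4.0 * R @ H                   # gradient; column i = partial wrt h_i
    direction = np.empty_like(H)
    for i in range(k):
        hi = H[:, i]
        # i-th n x n diagonal Hessian block (see Sect. 5.1):
        # 4 * ( (H H^T - A) + h_i h_i^T + (h_i^T h_i) I )
        block = 4.0 * (R + np.outer(hi, hi) + (hi @ hi) * np.eye(n))
        try:
            L = np.linalg.cholesky(block)          # fails if not positive definite
            y = np.linalg.solve(L, grad[:, i])
            direction[:, i] = -np.linalg.solve(L.T, y)
        except np.linalg.LinAlgError:
            direction[:, i] = -grad[:, i]          # fall back to S block = I
    return direction
```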

Our final strategy for solving SymNMF (6) is to combine Improvement 1 and Improvement 2. Note that the requirement on \(S^{(t)}\) described in Lemma 1 is satisfied in both of the improvements, and also in their combination. Thus, convergence is guaranteed in all of these variations.

6 An ANLS algorithm for SymNMF

In this section, we propose another optimization algorithm for SymNMF that converges to stationary points, a necessary condition for local minima. The algorithm is based on an alternative formulation of SymNMF in which it becomes straightforward to use the two-block coordinate descent framework that has been shown to be efficient for standard NMF [28, 29, 31, 32, 43].

6.1 Two-block coordinate descent framework

We first briefly review the two-block coordinate descent framework [28, 29, 32, 43] for standard NMF problems shown in (3):

$$\begin{aligned} \min _{C \ge 0, G \ge 0} \Vert X- CG ^T\Vert _F^2, \end{aligned}$$

which has our desired convergence property that every limit point is a stationary point. Separating the unknowns in \(C\) and \(G\) in the NMF formulation (3) into two blocks, we obtain the following subproblems:

  1. 1.

    Fix \(G\) and solve \(\min _{C \ge 0} \Vert GC^T-X^T\Vert _F^2\).

  2. 2.

    Fix \(C\) and solve \(\min _{G \ge 0} \Vert CG ^T-X\Vert _F^2\).

Each subproblem is a nonnegative least squares problem with multiple right-hand sides (NLS for short), and many efficient procedures have been developed to solve NLS, e.g. active-set method [28, 37], block pivoting [31, 32], PGD [43], etc. The key requirement in this framework is to obtain the optimal solution in each subproblem (see more discussions in [28]). This way, the original NMF formulation (3) has been reduced to an alternating NLS problem (ANLS for short).
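A minimal sketch of this framework for standard NMF is shown below (using scipy.optimize.nnls column by column as the exact NLS solver; practical implementations use the much faster active-set or block-pivoting solvers cited above):

```python
import numpy as np
from scipy.optimize import nnls

def solve_nls(C, X):
    """Solve min_{G >= 0} ||C G^T - X||_F column by column (exact NNLS)."""
    k = C.shape[1]
    G = np.zeros((X.shape[1], k))
    for j in range(X.shape[1]):
        G[j], _ = nnls(C, X[:, j])
    return G

def nmf_anls(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    C = rng.random((X.shape[0], k))
    for _ in range(iters):
        G = solve_nls(C, X)          # fix C, solve for G
        C = solve_nls(G, X.T)        # fix G, solve for C (transposed problem)
    return C, G
```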

6.2 A nonsymmetric formulation for SymNMF

In SymNMF, it is difficult to separate the \(nk\) unknowns in a straightforward way as in NMF, because the two factors \(H\) and \(H^T\) contain the same set of unknowns. We propose to re-formulate SymNMF in the context of NMF [24]:

$$\begin{aligned} \min _{W,H \ge 0} g(W,H) = \Vert A-WH^T\Vert _F^2 + \alpha \Vert W-H\Vert _F^2, \end{aligned}$$
(15)

where \(A\) still represents the \(n \times n\) similarity matrix, \(W,H\) are two low-rank factors of size \(n \times k\), and \(\alpha >0\) is a scalar parameter that controls the tradeoff between the approximation error and the difference between \(W\) and \(H\). Here we force the separation of unknowns by associating the two factors with two different matrices. If \(\alpha \) is large enough, the solutions for \(W\) and \(H\) will be close enough that the clustering results are not affected by whether \(W\) or \(H\) is used as the clustering assignment matrix.

The nonsymmetric formulation can be easily cast into the two-block coordinate descent framework after some restructuring. In particular, we have the following subproblems for (15):

$$\begin{aligned} \min _{W \ge 0}&\left\| \left[ \begin{matrix} H \\ \sqrt{\alpha } I_k \\ \end{matrix}\right] W^T - \left[ \begin{matrix} A \\ \sqrt{\alpha } H^T \\ \end{matrix}\right] \right\| _F,\end{aligned}$$
(16)
$$\begin{aligned} \min _{H \ge 0}&\left\| \left[ \begin{matrix} W \\ \sqrt{\alpha } I_k \\ \end{matrix}\right] H^T - \left[ \begin{matrix} A \\ \sqrt{\alpha } W^T \\ \end{matrix}\right] \right\| _F, \end{aligned}$$
(17)

where \(\mathbf 1 _k \in \mathbb {R}^{k \times 1}\) is a column vector whose elements are all 1’s, and \(I_k\) is the \(k \times k\) identity matrix. Note that we have assumed \(A=A^T\). Solving subproblems (16) and (17) in an alternate fashion will lead to a stationary point solution, as long as an optimal solution is returned for every NLS subproblem encountered. We simplify and summarize this algorithm in Algorithm 2.

figure b

If \(W\) and \(H\) are expected to indicate more distinct cluster structures, sparsity constraints on the rows of \(W\) and \(H\) can also be incorporated into the nonsymmetric formulation easily, by adding \(L_1\) regularization terms [27, 28]:

$$\begin{aligned} \min _{W,H \ge 0} \tilde{g}(W,H) = \Vert A-WH^T\Vert _F^2 + \alpha \Vert W-H\Vert _F^2 + \beta \sum _{i=1}^n \Vert w_i\Vert _1^2 + \beta \sum _{i=1}^n \Vert h_i\Vert _1^2, \end{aligned}$$
(18)

where \(\alpha ,\beta >0\) are regularization parameters, \(w_i,h_i\) are the \(i\)-th rows of \(W,H\) respectively, and \(\Vert \cdot \Vert _1\) denotes vector 1-norm. Consequently, the two subproblems for (18) in the two-block coordinate descent framework are:

$$\begin{aligned} \min _{W \ge 0}&\left\| \left[ \begin{matrix} H \\ \sqrt{\alpha } I_k \\ \sqrt{\beta } \mathbf 1 _k^T \end{matrix}\right] W^T - \left[ \begin{matrix} A \\ \sqrt{\alpha } H^T \\ 0 \end{matrix}\right] \right\| _F,\end{aligned}$$
(19)
$$\begin{aligned} \min _{H \ge 0}&\left\| \left[ \begin{matrix} W \\ \sqrt{\alpha } I_k \\ \sqrt{\beta } \mathbf 1 _k^T \end{matrix}\right] H^T - \left[ \begin{matrix} A \\ \sqrt{\alpha } W^T \\ 0 \end{matrix}\right] \right\| _F. \end{aligned}$$
(20)

We can even use just one \(L_1\) regularization term in (18), that is, \(\beta \sum _{i=1}^n \Vert w_i\Vert _1^2\) or \(\beta \sum _{i=1}^n \Vert h_i\Vert _1^2\), since \(W\) and \(H\) are sufficiently close; however, using one or two \(L_1\) regularization terms does not make much difference computationally.

6.3 Implementation

Now we describe an efficient implementation of the ANLS algorithm for SymNMF. Our algorithm reduces to solving the NLS problem in line 5 of Algorithm 2. Consider a form of NLS with simplified notation: \(\min _{G \ge 0} \Vert CG ^T-X\Vert _F^2\). In many algorithms for NLS, the majority of time cost comes from the computation of \(C^TC\) and \(X^TC\). For example, in the active-set method [28] and block pivoting method [31, 32], we need to form the normal equation:

$$\begin{aligned} C^T CG ^T = C^TX. \end{aligned}$$

In PGD [43], we need to compute the gradient:

$$\begin{aligned} \nabla _G = 2G(C^TC) - 2X^TC. \end{aligned}$$

For more details of these algorithms for NLS, please refer to the original papers [28, 31, 32, 43]. Our strategy to solve the NLS problem in Algorithm 2 is to precompute \(C^TC\) and \(X^TC\):

$$\begin{aligned} C^TC = W^TW + \alpha I_k, \quad X^TC = A^TW + \alpha W \end{aligned}$$

without forming \(X = \left[ \begin{matrix} A \\ \sqrt{\alpha }W^T \end{matrix} \right] \) directly. Though this change sounds trivial, it is very costly to form \(X\) directly when \(A\) is large and sparse, especially when \(A\) is stored in the “compressed sparse column” format used, for example, in Matlab and the Python scipy package. In our experiments, we observed that this strategy yielded considerable time savings in the iterative Algorithm 2.
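The following sketch illustrates this strategy (a simplified version of Algorithm 2 written for exposition, with a fixed \(\alpha \): the stacked matrices in (16) and (17) are never formed, and a cyclic coordinate-descent NNLS routine working directly on \(C^TC\) and \(C^TX\) stands in for the exact block-pivoting solver; with enough inner sweeps it converges to the unique NLS solution since \(C^TC\) is positive definite here):

```python
import numpy as np

def nnls_from_gram(CtC, CtX, G0, inner_iters=50):
    """Approximately solve min_{G >= 0} ||C G^T - X||_F^2 given only
    CtC = C^T C and CtX = C^T X, by cyclic coordinate descent on G."""
    G = G0.copy()
    k = CtC.shape[0]
    for _ in range(inner_iters):
        for j in range(k):
            # Gradient of the objective with respect to column j of G (up to a factor 2).
            gj = G @ CtC[:, j] - CtX[j, :]
            G[:, j] = np.maximum(G[:, j] - gj / CtC[j, j], 0.0)
    return G

def symnmf_anls(A, k, alpha=1.0, outer_iters=100, seed=0):
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.random((n, k))
    H = rng.random((n, k))
    for _ in range(outer_iters):
        # Subproblem (17): C = [W; sqrt(alpha) I_k], X = [A; sqrt(alpha) W^T].
        CtC = W.T @ W + alpha * np.eye(k)
        CtX = W.T @ A + alpha * W.T      # = (A^T W + alpha W)^T, A symmetric
        H = nnls_from_gram(CtC, CtX, H)
        # Subproblem (16): swap the roles of W and H.
        CtC = H.T @ H + alpha * np.eye(k)
        CtX = H.T @ A + alpha * H.T
        W = nnls_from_gram(CtC, CtX, W)
    return H
```

Note that the only interaction with \(A\) above is through the products \(W^TA\) and \(H^TA\), which remain sparse-dense multiplications when \(A\) is sparse.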

For choosing the parameter \(\alpha \), we can gradually increase \(\alpha \) from 1 to a very large number, for example, by setting \(\alpha \leftarrow 1.01 \alpha \). We can stop increasing \(\alpha \) when \(\Vert W-H\Vert _F / \Vert H\Vert _F\) is negligible (say, \(<\)10\(^{-8}\)).

Theoretically, both the Newton-like algorithm and the ANLS algorithm are valid algorithms for SymNMF for any nonnegative and symmetric matrix \(A\). In practice, however, when the similarity matrix \(A\) is very sparse, the efficiencies of the two algorithms become very different. The Newton-like algorithm does not take into account the structure of the SymNMF formulation (6), and a sparse input matrix \(A\) cannot contribute to speeding up the algorithm because of the formation of the dense matrix \( HH ^T\) in intermediate steps. In contrast, in the ANLS algorithm, many algorithms for the NLS subproblem [28, 31, 32, 43] can often benefit from the sparsity of the similarity matrix \(A\) automatically. This benefit comes from sparse-dense matrix multiplications inside these algorithms, such as \(AH\), as well as the absence of large dense matrices such as \( HH ^T\). Therefore, we recommend using the ANLS algorithm for a sparse input matrix \(A\).

7 Experiments on document and image clustering

In this section, we show the performance of SymNMF on a number of text and image data sets, and compare SymNMF with the standard forms and variations of NMF, spectral clustering, and K-means. The SymNMF formulation is a nonconvex minimization problem. Applying the Newton-like algorithm or the ANLS algorithm described in Sects. 5 and 6 finds a local minimal solution but not necessarily a global one; hence we need a global optimization method. Our global optimization method for the experiments on document and image clustering is based on a multistart global optimization algorithm [44, 53, 54] that combines random sampling with a local search procedure. That is, we choose 20 initial points uniformly at random within the nonnegative orthant, and a local search procedure is applied to every initial point to improve it. We use the Newton-like algorithm and the ANLS algorithm as our local search methods. Throughout the experiments, we use Matlab 7.9 (R2009b) with an Intel Xeon X5550 quad-core processor and 24GB memory.

7.1 Data preparation

We construct a sparse graph for each data set. Using sparse graphs makes large-scale clustering possible in terms of efficiency. We take the following three steps to form the similarity matrix:

  1. 1.

    Construct a complete graph. The edge weights between graph nodes are defined according to the type of data set.

    • For text data, all the document vectors are normalized to have unit 2-norm. The edge weight is the cosine similarity between two document vectors:

      $$\begin{aligned} e_{ij} = x_i^T x_j, \quad (i \ne j). \end{aligned}$$
      (21)
    • For image data, the self-tuning method [70] is used:

      $$\begin{aligned} e_{ij} = \exp \left( -{\Vert x_i-x_j\Vert _2^2 \over \sigma _i \sigma _j}\right) , \quad (i \ne j), \end{aligned}$$
      (22)

      where each data point has a local scale \(\sigma _i\), as opposed to a global scale \(\sigma \) in (9). \(\sigma _i\) is set to be the Euclidean distance between \(x_i\) and its \(\hat{k}\)-th neighbor. We use \(\hat{k}=7\) as suggested in [70].

    Note that we enforce self-edge weights \(e_{ii}=0\) (\(1 \le i \le n\)) in all cases [51].

  2. 2.

    Sparsify the graph. We only keep the edges that connect a node to its \(q\) nearest neighbors. More precisely, let

    $$\begin{aligned} N(i)=\left\{ j|x_j\text { is one of the }q\text { nearest neighbors of }x_i, j \ne i\right\} . \end{aligned}$$
    (23)

    Edge weights in the sparse graph are defined as:

    $$\begin{aligned} \hat{e}_{ij} = \left\{ \begin{array}{lll} e_{ij},&{}\quad \text { if } i \in N(j) \text { or } j \in N(i); \\ 0,&{} \quad \text { otherwise}. \end{array}\right. \end{aligned}$$
    (24)

    We choose \(q = \lfloor \log _2 n\rfloor + 1\) as suggested in [61].

  3. 3.

    Form the similarity matrix. We compute the normalized similarity values as in the normalized cut [51]:

    $$\begin{aligned} A_{ij} = \hat{e}_{ij} d_i^{-1/2} d_j^{-1/2}, \end{aligned}$$
    (25)

    where \(d_i=\sum _{s=1}^n \hat{e}_{is}\) (\(1 \le i \le n\)).

Note that the similarity matrix \(A\) constructed as above is symmetric, nonnegative, and usually indefinite.
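For reference, the three steps above can be implemented compactly. The sketch below (written for dense in-memory computation with the self-tuning weights (22), so it is only meant for moderate \(n\); the cosine weights (21) for text follow the same pattern) returns the sparse, symmetric, normalized similarity matrix \(A\):

```python
import numpy as np
from scipy.sparse import csc_matrix

def similarity_matrix(X, k_hat=7):
    """Steps 1-3 of Sect. 7.1 for image-type data (self-tuning weights).

    X: m-by-n data matrix, one data point per column."""
    n = X.shape[1]
    D2 = np.square(X[:, :, None] - X[:, None, :]).sum(axis=0)   # squared distances
    np.fill_diagonal(D2, np.inf)                                # enforce e_ii = 0
    # Local scales: distance to the k_hat-th nearest neighbor.
    sigma = np.sqrt(np.sort(D2, axis=1)[:, k_hat - 1])
    E = np.exp(-D2 / np.outer(sigma, sigma))                    # Eq. (22)
    # Keep edges to the q nearest neighbors, then symmetrize (Eqs. (23)-(24)).
    q = int(np.floor(np.log2(n))) + 1
    keep = np.zeros_like(E, dtype=bool)
    nn = np.argsort(D2, axis=1)[:, :q]
    keep[np.repeat(np.arange(n), q), nn.ravel()] = True
    keep |= keep.T                                              # i in N(j) or j in N(i)
    E_hat = np.where(keep, E, 0.0)
    # Normalized similarity values (Eq. (25)).
    d_inv_sqrt = 1.0 / np.sqrt(E_hat.sum(axis=1))
    A = E_hat * np.outer(d_inv_sqrt, d_inv_sqrt)
    return csc_matrix(A)
```

For large \(n\), the dense pairwise-distance computation above would be replaced by an exact or approximate nearest-neighbor search.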

7.2 Data sets

Document clustering was conducted on the following labeled corpuses: 1. TDT2 Footnote 1 contains 10,212 news articles from various sources (e.g. NYT, CNN, VOA) in 1998. 2. Reuters Footnote 2 contains 21,578 news articles from the Reuters newswire in 1987. 3. From the newly-released Reuters news collection RCV1 Footnote 3 [40] that contains over 800,000 articles in 1996–1997, we selected the training set containing 23,149 articles. Labels are assigned according to a topic hierarchy, and we only considered leaf topics as valid labels. 4. The research paper collection NIPS14-16 Footnote 4 contains 420 NIPS papers in 2001–2003 [21], which are associated with labels indicating the technical area (algorithms, learning theory, vision science, etc). For all these data sets, documents with multiple labels are discarded in our experiments. In addition, clusters representing different topics are highly unbalanced in size. We selected the largest 20, 20, 40, 9 clusters from these data sets respectively. While TDT2 and the two Reuters data sets were well maintained, the NIPS data set was extracted from PS and PDF files, resulting in very noisy texts, which can be seen from the list of terms available online (see footnote 4). For example, its vocabulary includes many symbols frequently used in formulas which are not semantically meaningful.

Image clustering was conducted on object and face recognition data sets: 1. COIL-20 Footnote 5 contains gray-scale images of 20 objects, rescaled to \(64 \times 64\) size. The viewpoints are equally spaced in the entire \(360^\circ \) range, resulting in 72 images for each object. 2. ORL Footnote 6 contains 400 face images of 40 persons with different facial expressions and slightly-varying poses. 3. From the Extended YaleB Footnote 7 face data set (with the original YaleB data included) [39], we selected 2,414 frontal face images of 38 persons, with different illumination conditions. 4. From the PIE Footnote 8 face data set [57], we selected 232 frontal face images of 68 persons, with different facial expressions. Compared to other variations in the PIE data set such as illumination and lighting conditions, different facial expressions represent more variation in faces, and the images are embedded in multiple manifolds [60]; moreover, only 3–4 images are available for each person, which makes clustering more challenging. Though ORL and the selected subset of PIE are not large-scale, they share the same characteristics: high variation within each class, with a handful of images per class. For all the image data sets, the identity information of the objects or faces is used as ground-truth labels. The statistics of the processed document and image data sets are summarized in Table 4.

Table 4 Data sets used in experiments

7.3 Algorithms for comparison

We experimented with a large variety of clustering algorithms for a comprehensive comparison. The algorithms in our experiment can be divided into four categories:

  1. 1.

    K-means variants (All these K-means variants include a batch-update phase and an additional online-update phase in each run [18]. We use both phases.)

    • Standard K-means (KM) The input matrix is constructed as follows. For text data, each column of the tf-idf matrix \(X\) [47] is scaled to have unit 2-norm; in addition, \(X\) is transformed into its normalized-cut weighted version \(XD^{-1/2}\) [64], where \(D\) is defined in Sect. 3 with \(e_{ij} = x_i^T x_j\). For image data, each column of \(X\) is scaled to the \([0, 1]\) interval.

    • Spherical K-means (SKM) Unlike standard K-means that uses Euclidean distance as the dissimilarity measure, spherical K-means uses \(1-\cos (x_i,x_j)\); therefore any scaling of columns of \(X\) does not take effect. Spherical K-means was proposed for document clustering, where cosine similarity is often a better measure than Euclidean distance [14]. As mentioned in Sect. 1, we believe that spherical K-means has a closer relationship to NMF than standard K-means.

    • Kernel K-means (KKM) Kernel K-means is a graph clustering method based on K-means. We use the weighted kernel K-means algorithm described in [13, 36] that minimizes the normalized cut objective. Because the kernel matrix \(K\) is generally indefinite, the condition for convergence is violated. We terminate the algorithm as soon as the objective function value stops decreasing.

  2. 2.

    NMF variants

    • NMF We use the ANLS algorithm with block pivoting method for NMF [31, 32]. The same input matrix as in standard K-means is used. The hard clustering result is indicated by the largest entry in each row of \(H\).

    • GNMF Cai et al. [8] proposed Graph-regularized NMF (GNMF) by adding a graph-theoretic penalty term to (3) that takes neighboring relationship into account, so that the resulting method is better at clustering on manifolds. We use the algorithm and the parameters suggested in [8]. The input matrix is constructed in the same way as in standard K-means. However, the neighboring relationship based on the sparse graph is generated using the original data matrix, i.e. without the scaling of each \(x_i\). The clustering result is obtained by treating the rows of \(H\) as graph embedding and applying spherical K-means to the embedded points.

  3. 3.

    Spectral clustering variants

    • NJW algorithm (SpNJW) This refers to the algorithm proposed in Ng et al. [51]. The rows of the \(k\) leading eigenvectors of \(A\), where each row is normalized to have unit 2-norm, are used as the graph embedding of data points. Standard K-means is used in the final step to obtain clustering results, which is initialized by randomly choosing \(k\) samples as centroids.

    • YS algorithm (SpYS) This refers to the algorithm proposed in Yu and Shi [67]. The clustering results are obtained by finding the optimal orthogonal transformation of \(\tilde{H}=D^{-1/2}H\) into a partition matrix [67], where columns of \(H\) are the \(k\) leading eigenvectors of \(A\).

  4. 4.

    SymNMF We observed that the Newton-like algorithm for SymNMF gives better clustering quality on image data (more details in Sect. 7.5). On text data, however, the Newton-like algorithm is not efficient enough due to large problem sizes, and only the ANLS algorithm is applicable. When reporting the results, we use the general name “SymNMF” to refer to the algorithm of choice.

For the Newton-like algorithm (Algorithm 1), we use parameters \(\beta = 0.1, \sigma = 0.1\). We also empirically observe that choosing \(\epsilon \) in (12) to be a fixed value \(10^{-16}\) makes the Newton-like algorithm faster while having little influence on the clustering quality. For the ANLS algorithm, we solve the formulation (15), i.e. without sparsity constraints on \(W,H\) (Algorithm 2). We empirically observe that it is sufficient to use a fixed parameter \(\alpha =1\) in (15) to obtain a negligible \(\Vert W-H\Vert _F/\Vert H\Vert _F\). Note that the choice of a large enough value of \(\alpha \) should be aligned with the scale of the similarity values in \(A\). In our experiments, the matrix \(A\) contains normalized similarity values (25), thus the maximum possible value in \(A\) is 1, and most of the entries of \(A\) are smaller than 1. Finally, in both of our algorithms, the tolerance parameter \(\mu \) in the stopping criteria is set to \(10^{-4}\) and the maximum iteration count is set to 10,000 so that the outputs are stationary points.

For each data set, we run each algorithm 20 times with different random initializations, with the known number of clusters \(k\) given as input. Algorithms in the same category share the same initializations. In other words, a multistart global optimization strategy is applied, with the method described in each category serving as the local search procedure. Although the data sets are labeled, the labels are used only when evaluating the clustering quality, not by the clustering algorithms.
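The multistart procedure itself is simple; a minimal sketch is shown below, where `local_search` is a hypothetical callable standing in for any of the methods above and is assumed to return a label vector together with its objective function value.

```python
import numpy as np

def multistart(local_search, X, k, n_starts=20, seed=0):
    """Run `local_search(X, k, rng)` from several random initializations and keep
    the solution with the smallest objective function value.

    `local_search` is a hypothetical callable returning (labels, objective).
    """
    rng = np.random.default_rng(seed)
    best_labels, best_obj = None, np.inf
    all_labels = []
    for _ in range(n_starts):
        labels, obj = local_search(X, k, rng)
        all_labels.append(labels)            # kept so that average quality can also be reported
        if obj < best_obj:
            best_labels, best_obj = labels, obj
    return best_labels, all_labels
```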

7.4 Clustering quality

We use clustering accuracy, the percentage of correctly clustered items given by a maximum bipartite matching between clusters and ground-truth classes, to evaluate the clustering quality (see more details in [64]). The average and maximum clustering accuracy over the 20 runs are shown in Tables 5 and 6, respectively. The maximum clustering accuracy for an algorithm is determined by the solution with the smallest objective function value among the 20 runs. We have the following observations:

  1.

    SpYS and SymNMF achieve the highest clustering quality more frequently than the other methods. Note that SpYS was proposed as a more principled way than SpNJW to obtain hard clusters from the \(k\) leading eigenvectors of \(A\) [67]. Conceptually, both SpYS and SymNMF facilitate the interpretation of the low-rank matrix that approximates the graph similarity matrix, so that hard clustering results can be obtained directly from the low-rank matrix. However, comparing Tables 5 and 6, we observe that by employing the multistart global optimization method and picking the solution with the smallest objective function value, SpYS achieves accuracy higher than its average on four out of the eight data sets, while SymNMF and SpNJW do so on five data sets, implying that the objective functions in SymNMF and SpNJW are slightly better proxies for the clustering problem than that in SpYS.

  2.

    GNMF in our experiments does not show as dramatic an improvement over SpNJW as the results reported in [8], where only the maximum clustering accuracy was reported. One possible reason is that in [8], full graphs with cosine similarity are used, whereas we use sparse graphs and different similarity measures for better scalability and clustering quality (Sect. 7.1).

  3.

    The K-means variants give exceedingly high accuracy on the RCV1 data set. Further study is needed to explain their performance, for example, to determine in which cases cosine dissimilarity is a better choice of distance measure than Euclidean distance. Note that RCV1 is the only data set on which spherical K-means has the highest accuracy, and also the only data set on which NMF performs better than almost all the other low-rank approximation methods (GNMF, SpNJW, SpYS, SymNMF). This consistency corroborates our observation that spherical K-means has a closer relationship to NMF than standard K-means, and seems to explain why spherical K-means is often used as an initialization strategy for NMF [63].

Table 5 Average clustering accuracy for document and image data sets
Table 6 Maximum clustering accuracy for document and image data sets
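For reference, the clustering accuracy used in Tables 5 and 6 can be computed by solving a maximum bipartite matching between predicted clusters and ground-truth classes; the sketch below uses scipy's Hungarian-algorithm solver and is our own illustration of the metric (see [64] for the definition we follow).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_pred, labels_true):
    """Fraction of correctly clustered items under the best one-to-one matching
    between predicted clusters and ground-truth classes."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    clusters = np.unique(labels_pred)
    classes = np.unique(labels_true)
    # Contingency table: rows = predicted clusters, columns = true classes.
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            counts[i, j] = np.sum((labels_pred == c) & (labels_true == t))
    row, col = linear_sum_assignment(-counts)    # maximize the matched counts
    return counts[row, col].sum() / labels_true.size
```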

7.5 Convergence and efficiency of SymNMF algorithms

We mentioned in Sect. 7.3 that the ANLS algorithm for SymNMF handles large data sets more efficiently, while the Newton-like algorithm achieves higher clustering accuracy. Here we discuss this tradeoff between efficiency and quality. The different properties exhibited by the two algorithms can be attributed to their different convergence behaviors, though both algorithms converge to stationary point solutions. In Fig. 7, we use the COIL-20 data set to study their convergence by plotting the objective function \(f(H)\) and the projected gradient norm \(\Vert \nabla ^P f(H)\Vert _F\) throughout the iterations. As expected, \(f(H)\) is non-increasing in both algorithms; by contrast, \(\Vert \nabla ^P f(H)\Vert _F\) is not guaranteed to decrease in every iteration, but it is used to check stationarity.

Fig. 7 Convergence behaviors of the SymNMF algorithms, generated from a single run on the COIL-20 data set with the same initialization

The Newton-like algorithm shows divergent behavior in the initial stage of the iterations, because the formulation (6) is nonconvex and the search step degrades to a steepest descent direction. However, when the intermediate iterate becomes close to a local minimum, the Hessian matrix becomes positive definite and the second-order information begins to guide the search. After this point, the algorithm converges very quickly to an accurate stationary point. In contrast, the ANLS algorithm shows a quick drop in both \(\Vert \nabla ^P f(H)\Vert _F\) and \(f(H)\) when the algorithm starts, but near the final stage it converges slowly to the prescribed stationarity level. Overall, the Newton-like algorithm produces more accurate solutions and better clustering quality, but it is less efficient than the ANLS algorithm due to its heavier computational cost per iteration. We compare their clustering quality and timing performance in Table 7, with \(\mu =10^{-4}\) in the stopping criterion of both algorithms.
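For completeness, the stationarity measure plotted in Fig. 7 can be computed directly from the gradient of \(f(H)=\Vert A-HH^T\Vert _F^2\); the short sketch below is our own illustration of this quantity.

```python
import numpy as np

def projected_gradient_norm(A, H):
    """Frobenius norm of the projected gradient of f(H) = ||A - H H^T||_F^2
    under the constraint H >= 0, used as a stationarity measure.

    In the experiments, the tolerance mu in the stopping criterion is applied
    to a measure of this kind (the precise criterion is defined earlier in the paper).
    """
    grad = 4.0 * (H @ (H.T @ H) - A @ H)                 # gradient of f at H
    proj = np.where(H > 0, grad, np.minimum(grad, 0.0))  # project onto the feasible directions
    return np.linalg.norm(proj)
```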

Table 7 Clustering accuracy and timing of the Newton-like and ANLS algorithms for SymNMF

8 Image segmentation experiments

In this section, we explore the application of SymNMF to image segmentation. Image segmentation methods have relied heavily on spectral clustering [1, 12, 19, 46, 48]. We will demonstrate that SymNMF produces segmentation results that are closer to human-marked boundaries than those of spectral clustering. To the best of our knowledge, this is the first systematic evaluation of SymNMF applied to image segmentation.

8.1 Overview

Image segmentation is an important task in computer vision that partitions an image into a set of non-overlapping closed regions. It can be viewed as a graph clustering problem: the input is a nonnegative and symmetric matrix that contains similarity values between pairs of pixels; the output is a clustering of pixels where each cluster corresponds to a region.

In the graph represented by a pixel similarity matrix \(A\), a pixel is only connected to the pixels within some neighborhood. Thus, the input matrix \(A\) is typically a sparse matrix. The similarity value between two neighboring pixels can be computed based on brightness, color, and texture cues [46, 48]. The similarity value characterizes the discontinuity along the line connecting the two pixels and can be trained by a logistic model using human-marked boundaries as ground-truth [19].

Spectral clustering is one of the most common methods for solving the graph clustering problem in image segmentation. As we explained in Sects. 3 and 4, because eigenvectors in general contain both positive and negative entries, they cannot be used directly as cluster indicators. A variety of methods have been proposed to post-process the graph embedding—the continuous-valued eigenvectors—to obtain closed regions. In contrast, the low-rank matrix \(H\) in the solution of SymNMF can not only be used as a graph embedding, but can also yield graph clustering results directly.

In the current paper, our focus is the gain in segmentation quality obtained by replacing spectral clustering with SymNMF. We follow the steps in an earlier paper [19] to construct the similarity matrix and to post-process the graph embedding when the produced low-rank matrix is viewed as a graph embedding. The post-processing steps are:

  1.

    Run K-means on the embedded points to generate an oversegmentation of the image. The resulting segments are called superpixels and are denoted \(o_1, \ldots , o_K\), where \(K\) is an integer larger than the rank \(k\) of the low-rank matrix.

  2.

    Build a contracted graph on the superpixels and represent it by a \(K \times K\) similarity matrix \(W\). The edge weight between the \(I\)-th and \(J\)-th superpixels (\(1 \le I,J \le K\)) is defined as follows (a minimal sketch of this step appears after the list):

    $$\begin{aligned} W_{IJ} = \sum _{i \in o_I} \sum _{j \in o_J} A_{ij}. \end{aligned}$$
    (26)
  3.

    Recursively split the contracted graph to produce a hierarchy of regions [46].
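As referenced in step 2 above, a minimal sketch of constructing the contracted similarity matrix \(W\) of (26) from a sparse pixel similarity matrix and a superpixel labeling is given below (our own illustration; `superpixel_labels` is a hypothetical array assigning each pixel to one of the \(K\) superpixels).

```python
import numpy as np
import scipy.sparse as sp

def contracted_graph(A, superpixel_labels, K):
    """Build the K x K superpixel similarity matrix of Eq. (26):
    W[I, J] = sum of A[i, j] over pixels i in superpixel I and j in superpixel J.

    A is the sparse n x n pixel similarity matrix; superpixel_labels has length n
    with values in {0, ..., K-1}.
    """
    n = A.shape[0]
    # S is the n x K indicator matrix of superpixel membership.
    S = sp.csr_matrix((np.ones(n), (np.arange(n), superpixel_labels)), shape=(n, K))
    return (S.T @ A @ S).toarray()
```

Writing (26) as \(W = S^T A S\), with \(S\) the \(n \times K\) indicator matrix of superpixel membership, keeps the computation sparse; the resulting \(K \times K\) matrix is small (here \(K=51\)).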

We note that the baseline segmentation algorithm [19] used in our comparison between spectral clustering and SymNMF is not the best algorithm to date (see, for example, [1]). However, we chose this baseline algorithm in order to simplify the experimental setting and make the comparison more transparent. In our current workflow, both spectral clustering and SymNMF take the same similarity matrix as input; the resulting low-rank matrices are interpreted either as a graph embedding, to produce a hierarchy of regions, or as a graph clustering, to produce a flat partitioning of an image into regions. With more recent segmentation algorithms such as [1], the low-rank matrices would be interpreted in a more sophisticated way, and it would be unclear which component of the segmentation algorithm contributes to the gain in segmentation quality. We expect that the comparison results shown in this section will carry over to other segmentation algorithms.

8.2 Data and software

We use the Berkeley Segmentation Data Set 500 (BSDS500) [1] and choose the 200 color images used in [19]. The size of the original images is \(481 \times 321\). We resized the images to \(240 \times 160\) to be consistent with the experiments in [19, 48].

We compute the pixel similarity matrices and post-process the embedded points using the Berkeley Segmentation Engine. We use the default settings: the number of eigenvectors in spectral clustering \(k\) (and also the lower rank in SymNMF) is set to 16, and the number of oversegmentations \(K\) is set to 51. The neighborhood of a pixel is changed from the default to a round disk centered at the pixel with a radius of 20 pixels. The resulting similarity matrix has size \(n \times n\), where \(n=38{,}400\), and has 44 million nonzeros. The same similarity matrix is given as input to both spectral clustering and SymNMF.

8.3 Evaluation methods

The evaluation of segmentation results is based on the evaluation of boundary detection. In the experiments on document and image clustering, solving SymNMF and interpreting the low-rank result matrix as a cluster indicator yields a hard clustering of the items. In order to evaluate SymNMF in the context of image segmentation and compare its performance with that of spectral clustering, we describe how we transform the hard clustering results into soft boundaries. First, we generate a probability-of-boundary (\(P_b\)) image from multiple segmentations of an image. Second, we evaluate the \(P_b\) image against human-marked boundaries.

  • We consider the following four ways to obtain multiple segmentations:

    1.

      Spectral-Embed Compute the eigenvectors associated with the 16 largest eigenvalues and treat them as a graph embedding. Generate a hierarchy of regions following the procedures in Sect. 8.1. Each level of the hierarchy determines a segmentation of the image.

    2.

      SymNMF-Embed Solve SymNMF with \(k=16\) and treat the rows of \(H\) as a graph embedding. Generate a hierarchy of regions following the procedures in Sect. 8.1. Each level of the hierarchy determines a segmentation of the image.

    3.

      Spectral-NJW For each \(k=2,3,\ldots ,16\), compute the eigenvectors associated with the \(k\) largest eigenvalues, denoted as a matrix \(\hat{H} \in \mathbb {R}^{n \times k}\). Apply K-means to the rows of each matrix \(\hat{H}\), and the clustering result corresponds to a segmentation.

    4.

      SymNMF-Clust Solve SymNMF with \(k=2,3,\ldots ,16\) and treat each matrix \(H\) as a cluster indicator. For each \(k\), the clustering result corresponds to a segmentation.

    Spectral-Embed and SymNMF-Embed produce 50 segmentations for each image; Spectral-NJW and SymNMF-Clust produce 15 segmentations for each image. The \(P_b\) value of a pixel is defined as the proportion of these segmentations in which the pixel lies on a boundary determined by the regions of the segmentation. Note that Spectral-NJW and SymNMF-Clust do not enforce hierarchies in their segmentations. Among these four ways of post-processing, only Spectral-Embed has been used for evaluation against human-marked boundaries in existing work.

  • The data set includes several human-marked boundary maps for each image for evaluation. The \(P_b\) image has values in the \([0,1]\) interval. We can produce a binary boundary image using a threshold value \(t\) (\(0 < t < 1\)). The precision \(P\) is then calculated as the fraction of true boundary pixels among all the detected boundary pixels, and the recall \(R\) as the fraction of detected boundary pixels among all the true boundary pixels. The F-measure is defined as \(2PR/(P+R)\). We can draw a precision-recall curve using a series of threshold values (see more details in [49]). The best F-measure on this curve is regarded as a summary performance metric.
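To make the evaluation pipeline concrete, the sketch below computes a \(P_b\) image from a stack of label images and summarizes it with the best F-measure over a sweep of thresholds. It is a simplified, pixel-wise illustration of our own; the benchmark of [49] additionally matches detected and human-marked boundary pixels with a small localization tolerance, which is omitted here.

```python
import numpy as np

def boundary_mask(labels):
    """Mark pixels whose segment label differs from the right or lower neighbor."""
    b = np.zeros_like(labels, dtype=bool)
    b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    b[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return b

def pb_image(segmentations):
    """P_b = fraction of segmentations in which each pixel lies on a boundary."""
    return np.mean([boundary_mask(seg) for seg in segmentations], axis=0)

def best_f_measure(pb, human_boundary, thresholds=np.linspace(0.05, 0.95, 19)):
    """Sweep thresholds on the P_b image and report the best F-measure against a
    binary human-marked boundary map (strict pixel-wise matching for simplicity)."""
    best = 0.0
    for t in thresholds:
        detected = pb > t
        if detected.sum() == 0 or human_boundary.sum() == 0:
            continue
        tp = np.logical_and(detected, human_boundary).sum()
        precision = tp / detected.sum()
        recall = tp / human_boundary.sum()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```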

8.4 Results

We show the precision-recall curves for Spectral-Embed, SymNMF-Embed, Spectral-NJW, and SymNMF-Clust in Fig. 9. Using the best F-measure as the summary metric, both SymNMF versions have better segmentation quality than either of the spectral clustering methods.

Fig. 8 Examples of the original images and \(P_b\) images from BSDS500. Pixels with brighter color in the \(P_b\) images have a higher probability of being on the boundary. a Original, b Spectral-Embed, c Spectral-NJW, d SymNMF-Embed, e SymNMF-Clust

Fig. 9 Precision-recall curves for image segmentation

SymNMF-Embed is much better than Spectral-Embed in the high-recall low-precision area, with the highest recall approaching 0.8.

SymNMF-Clust is much better than Spectral-Embed in the high-precision low-recall area, and consistently better than Spectral-Embed along the curve. When the threshold value \(t\) is close to 1, we can be much more confident about the detected regions using SymNMF-Clust than using Spectral-Embed.

SymNMF-Clust is only marginally better than Spectral-NJW, but is consistently better along the precision-recall curve.

Figure 8 shows several exemplar images from the BSDS500 data set. The segmentation results are consistent with our findings from the precision-recall curves. We notice that Spectral-Embed often subdivides a large flat area with uniform colors (grass, sky, etc.) into multiple regions. This is a well-known problem of image segmentation methods that rely on K-means to post-process the eigenvectors; the reason is that the embedded points for the pixels in those areas vary smoothly [1]. By contrast, SymNMF-Clust often leaves those areas intact, which implies that the low-rank matrix generated by SymNMF is a better cluster indicator. Figure 10 shows the pixels plotted in the lower-dimensional spaces produced by spectral clustering and SymNMF for a single image, which seems to support our reasoning above. We also notice that SymNMF-Clust tends to identify a few very small regions that correspond to noise in an image. This means that setting \(k\) larger than needed will not degrade its segmentation quality. If we remove the regions whose areas are smaller than some threshold, many of the remaining regions correspond to meaningful objects.

Fig. 10 Illustration of the different graph embeddings produced by spectral clustering and SymNMF for the third color image in Fig. 8. a The rows of the matrix \(\hat{H} \in \mathbb {R}^{n \times 3}\) formed by the first three eigenvectors are plotted. b The rows of \(H \in \mathbb {R}_+^{n \times 3}\) in the result of SymNMF with \(k=3\) are plotted. Each dot corresponds to a pixel (a spectral clustering, b SymNMF)

In summary, we can use SymNMF-Clust to detect salient objects and use SymNMF-Embed to discover more detailed segments.

9 Conclusion

In this paper, we studied SymNMF, \(\min _{H \ge 0}\Vert A- HH ^T\Vert _F^2\), as a graph clustering method that is suitable for clustering data points embedded in linear and nonlinear manifolds. Our method extends the applicability of NMF to more general cases, where the data relationship is described not by distances in a vector space but by similarity values in a latent space. Unlike previous work on SymNMF that imposed various additional constraints on the matrix \(H\), we showed that with nonnegativity constraints only, \(H\) can be well interpreted as a cluster indicator matrix. We justified SymNMF as a valid graph clustering method by showing that it originates from the same formulation as spectral clustering but relaxes the constraint on \(H\) differently. While spectral clustering methods require post-processing the eigenvector-based data representation to obtain hard clusters, SymNMF does not depend on the spectrum and finds cluster memberships directly from \(H\). Compared to previous work on the extension of NMF to a positive semi-definite and nonnegative matrix, our approach only assumes that \(A\) is symmetric and nonnegative.

We developed two algorithms for SymNMF, a Newton-like algorithm and an ANLS-based algorithm, each suited to different settings but both guaranteed to converge to stationary point solutions. We discussed the tradeoff between clustering quality and efficiency when choosing an algorithm for SymNMF. On one hand, the Newton-like algorithm often produces more accurate solutions and higher-quality clustering results, but it is more appropriate when the problem size \(n\) is small, e.g. \(n < 3{,}000\). On the other hand, the ANLS algorithm is especially efficient for a sparse input matrix \(A\) and is scalable to very large data sets, e.g. \(n \approx 10^6\). For large-scale clustering, we have to construct a sparse similarity matrix instead of a dense one. For example, with \(n=10^5\) data points, a dense similarity matrix has \(10^{10}\) entries, which at 8 bytes per entry amounts to roughly 75 GB and is difficult to store in the main memory of a contemporary machine.

We have shown the promise of SymNMF in document clustering and image clustering. We also conducted a comprehensive evaluation of SymNMF for image segmentation on 200 natural images. Overall, we developed a general framework in this paper, one with minimal constraints and flexible enough for extension. One limitation of our formulation is that an indefinite matrix \(A\) may be approximated by a positive semi-definite matrix \( HH ^T\). Its effect requires further study; however, we have not seen evidence that clustering performance degrades due to this limitation. The proposed algorithms can be easily parallelized: in the Newton-like algorithm, the evaluation and Cholesky factorization of different diagonal blocks of the Hessian can run in parallel; and in the ANLS algorithm, the nonnegative least squares problems with different right-hand sides can be solved in parallel.