1 Introduction

Optimization problems with orthogonality constraints posed as

$$\begin{aligned} \min _{X\in \mathbb {R}^{n\times k}} \ \ \ F(X) \ \ \ s.t. \ \ \ X^TX=I_k, \end{aligned}$$
(1a)

have wide applications in various areas such as polynomial optimization, combinatorics, eigenvalue problems, and clustering (Jiang and Dai 2015). These problems are difficult to solve since the orthogonality constraints may lead to many local solutions and, in particular, several of these problems are NP-hard. There is no guarantee of obtaining a global solution except in a few simple cases, e.g., finding the extreme eigenvalues. It is not even easy to generate a sequence of feasible points, since preserving the orthogonality constraints can be numerically expensive. Most existing methods preserve feasibility either by re-orthogonalizing the iterates or by generating points along geodesics of \({\mathcal {M}}_n^k=\{ X\in \mathbb {R}^{n\times k} : X^TX=I\}\). The former requires matrix factorizations such as the singular value decomposition (SVD), and the latter requires solving ordinary differential equations (ODEs) or computing matrix exponentials. To avoid these difficulties, we propose an alternative formulation that converts the orthogonality constraints into non-convex equality (or inequality) constraints. To handle these non-convex constraints, a penalty function is applied. The penalized problem is a smooth nonlinear programming problem with quadratic constraints that can be solved by a suitable optimization method.

The nonnegative matrix factorization (NMF) problem proposed by Paatero and Tapper (1994) has a wide range of applications, such as pattern recognition, chemical engineering, fault diagnosis, and outlier detection (Banker et al. 2017; Duan et al. 2009; Tosyali et al. 2020). The orthogonal nonnegative matrix factorization (ONMF) problem can be interpreted as the NMF problem with an additional orthogonality constraint that significantly changes the nature of the problem, making it suitable for clustering (Li et al. 2020; Peng et al. 2020). Ding et al. (2006) were the first to study NMF with an orthogonality constraint and showed its effectiveness in clustering. Following that, several ONMF algorithms have been developed for a wide range of applications (Pompili et al. 2014). Most of these algorithms use a multiplicative updating framework on the Stiefel manifold \({\mathcal {M}}_n^k\), iteratively updating the matrices by taking element-wise products with other computed nonnegative matrices (He et al. 2020; Pan and Ng 2018). Other approaches include hierarchical alternating least squares (HALS) (Kimura et al. 2015) and the use of a penalty function for the orthogonality constraints (Del Buono 2009).

Here, we introduce an approach for solving the isoperimetry and ONMF problems using an efficient optimization algorithm. First, we present alternative formulations of the isoperimetry and ONMF problems in Sections 2.2 and 3, respectively, converting the orthogonality constraints so that each problem becomes a smooth nonlinear programming problem with convex constraints. Then, we solve these reformulated problems, in particular the ONMF, efficiently using a dedicated algorithm. Notably, instead of solving the ONMF problem for matrix solutions directly, we decompose the problem into subproblems and solve for vector solutions (see Section 3.1). Finally, we apply a post-processing technique to extract a solution to the clustering problem. Comparative computational results are provided in Section 4 and our concluding remarks are given in Section 5.

In summary, our contributions are as follows:

  • Reformulation of the matrix orthogonality constraint as a set of non-convex smooth constraints.

  • Application of a penalty function that includes the non-convex smooth constraints as penalty terms, leaving only convex constraints in the reformulated problem.

2 Related works

This section reviews works closely related to ours, namely the k-means and isoperimetry problems. K-means is one of the most popular unsupervised learning approaches for solving the well-known clustering problem. Many recent developments of k-means have been reported in the literature (see Fard et al. 2020; Fränti and Sieranoja 2018; Huang et al. 2021; Moreno et al. 2020; Sinaga and Yang 2020; Xia et al. 2020; Yu et al. 2018 for more details). Isoperimetry is an approach for finding a cluster structure in a data set, characterized by the greatest similarity within a cluster and the greatest dissimilarity between clusters, obtained by minimizing the sum of the weights of the edges connecting each cluster to the other clusters (Dinler et al. 2020; Qin et al. 2017).

2.1 K-means

A fundamental problem of clustering, known as Minimum Sum-of-Squares Clustering (MSSC), is to partition n points into k clusters based on a minimum sum-of-squares model. Given a set X of n points in an m-dimensional Euclidean space, denoted by

$$\begin{aligned} X = \{ x_i = (x_{i1} , \dots , x_{im})^T \in \mathbb {R}^m, \ i=1,\dots , n \} , \end{aligned}$$

the partitional MSSC deals with the assignment of the n points into k disjoint clusters denoted by \(A = (A_{1} , \dots , A_k)\), centered at cluster centers \(c_j \ (j=1,\dots ,k)\), based on the total sum of squared Euclidean distances of the points \(x_i\) from their respective assigned cluster centroids, that is,

$$\begin{aligned} f(X,A) = \sum _{j=1}^{k} \sum _{i=1}^{|A_j|} || x_i^{(j)} - c_j ||^2, \end{aligned}$$
(2)

where \(|A_j|\) is the number of points in \(A_j\), and \(x_i^{(j)}\) is the ith point in \(A_j\). Note that if the cluster centers are known, then the function f(X, A) achieves its minimum when each point is assigned to its closest cluster center. On the other hand, if the points in cluster \(A_j\) are fixed, then the function

$$\begin{aligned} f(X,A_j) = \sum _{i=1}^{|A_j|} || x_i^{(j)} - c_j||^2 \end{aligned}$$
(3)

is minimal when

$$\begin{aligned} c_j = \dfrac{1}{|A_j|} \sum _{i=1}^{|A_j|} x_i^{(j)}. \end{aligned}$$
(4)

The classical k-means algorithm (MacQueen 1967) is described as follows (a minimal code sketch follows the list):

  1. Initialize k cluster centers randomly in a domain containing all the points.

  2. Assign each point to the closest cluster center.

  3. Recalculate each cluster center using the points currently assigned to it.

  4. If the predefined stopping criteria are met, then stop; otherwise, go to step 2.
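The sketch below, in Python/NumPy for illustration (our experiments in Section 4 were carried out in MATLAB; the function and variable names here are ours, not from our implementation), mirrors these four steps:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd-type k-means sketch.
    X: (n, m) array of n points in R^m; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # step 1: random initial centers inside the bounding box of the data
    lo, hi = X.min(axis=0), X.max(axis=0)
    centers = lo + rng.random((k, m)) * (hi - lo)
    labels = np.zeros(n, dtype=int)
    for it in range(max_iter):
        # step 2: assign each point to the closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # step 4: stop when the assignment no longer changes
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 3: recompute each center from its currently assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```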

Another way to model the MSSC problem is based on the assignment problem. Let \(Y=[y_{ij}]\in \mathbb {R}^{n\times k}\) be the assignment matrix defined by

$$\begin{aligned} y_{ij} = \left\{ \begin{array}{ll} 1, & \text {if } x_i \text { is assigned to } A_j, \\ 0, & \text {otherwise}, \end{array}\right. \end{aligned}$$

Consequently, the center of cluster \(A_j\) is given by

$$\begin{aligned} c_j = \dfrac{\sum _{l=1}^n y_{lj}x_l}{\sum _{l=1}^n y_{lj}}, \end{aligned}$$

which is the mean of all the points in the cluster. Using this, Peng and Wei (2007) introduced the following model for the k-means problem:

$$\begin{aligned} \min _{y_{ij}} \ \ \ \&\sum _{j=1}^k \sum _{i=1}^n y_{ij} || x_i - c_j ||^2 \end{aligned}$$
(5a)
$$\begin{aligned} s.t. \ \ \ \ \&\sum _{j=1}^k y_{ij} = 1 \ \ (i=1,\dots , n), \end{aligned}$$
(5b)
$$\begin{aligned}&\sum _{i=1}^n y_{ij} \ge 1 \ \ (j=1,\dots , k), \end{aligned}$$
(5c)
$$\begin{aligned}&y_{ij} \in \{0,1\} \ \ (i=1,\dots , n , j=1, \dots , k). \end{aligned}$$
(5d)

The constraints (5b) ensure that each point \(x_i\) is assigned to exactly one cluster, and the constraints (5c) ensure that each of the k clusters is non-empty. We can write (5a) in matrix form using the Frobenius norm as follows (Bauckhage 2015):

$$\begin{aligned} \sum _{j=1}^k \sum _{i=1}^n y_{ij} ||x_i - c_j||^2 = ||X-CY||_F^2, \end{aligned}$$

where \(X \in \mathbb {R}^{ m \times n}\) is the matrix whose columns are the data vectors \(x_i \in \mathbb {R}^m\), \(C \in \mathbb {R}^{ m \times k }\) is the matrix whose columns are the cluster centroids \(c_j \in \mathbb {R}^m\), and \(Y\in \mathbb {R}^{k \times n}\) is the assignment matrix (the transpose of the matrix Y defined above, so that its ith column selects the centroid assigned to \(x_i\)).
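As a quick sanity check of this identity, the following small script (in Python/NumPy; purely illustrative, with names of our own choosing) evaluates both sides for random data and a random assignment:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 3, 8, 2
X = rng.random((m, n))                  # data points as columns
labels = rng.integers(0, k, size=n)     # assign each point to a random cluster
Y = np.zeros((k, n)); Y[labels, np.arange(n)] = 1.0
C = rng.random((m, k))                  # arbitrary centroids as columns

lhs = sum(np.linalg.norm(X[:, i] - C[:, labels[i]])**2 for i in range(n))
rhs = np.linalg.norm(X - C @ Y, 'fro')**2
assert np.isclose(lhs, rhs)             # both sides agree
```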

2.2 Isoperimetry problem

Given a weighted graph \(G=(X,E)\), for any partition (subpartition) \(A=\{A_1 , \dots , A_k \}\), we define the vector v as follows:

$$\begin{aligned} v = \displaystyle { \left( \frac{w(A_1)}{|A_1|} , \dots , \frac{w(A_k)}{|A_k|} \right) }, \end{aligned}$$

where \(w(A_i)\) is the sum of the weights of the edges between \(A_i\) and \(A_i^c\) (i.e., \(X\setminus A_i\)), and \(|A_i|\) is the number of vertices in the cluster \(A_i\), for \(1\le i \le k\). We are to find a partition (or subpartition) of the vertices so that the norm of v is minimized. Using the 1-norm of v (i.e., \(p=1\)), the mean version of the isoperimetry problem is defined as

$$\begin{aligned}&IPP_k^m(G) = \min _{\{A_i\}_1^k \in D_k(X) } \dfrac{1}{k}\Vert v \Vert _{1} = \min _{\{A_i\}_1^k \in D_k(X) } \dfrac{1}{k} \displaystyle { \left( \sum _{i=1}^k \dfrac{w(A_i)}{|A_i|} \right) }, \end{aligned}$$

where \(D_k(X)\) is the collection of all the k-subpartitions of the set X. In fact, the isoperimetry problem is a relaxed version of the normalized cut problem, in the sense that some points may not be assigned to any cluster.
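The following small sketch (Python/NumPy, with illustrative names) computes this objective for a given weight matrix and subpartition:

```python
import numpy as np

def mean_isoperimetry(W, parts):
    """W: (n, n) symmetric nonnegative weight matrix of the graph G.
    parts: list of k disjoint index arrays (a k-subpartition of the vertices).
    Returns (1/k) * sum_i w(A_i) / |A_i|."""
    n = W.shape[0]
    k = len(parts)
    total = 0.0
    for A in parts:
        comp = np.setdiff1d(np.arange(n), A)   # complement of A_i
        w_Ai = W[np.ix_(A, comp)].sum()        # weight of the cut between A_i and its complement
        total += w_Ai / len(A)
    return total / k

# example: a 4-vertex path graph split into two clusters
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(mean_isoperimetry(W, [np.array([0, 1]), np.array([2, 3])]))  # 0.5
```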

Dehghanpour-Sahron and Mahdavi-Amiri (2020) formulated the isoperimetry problem as follows:

$$\begin{aligned} \min _{Y\in \mathbb {R}^{n\times k}} \ \ \&tr(Y^TLY) \end{aligned}$$
(6a)
$$\begin{aligned} s.t. \ \ \&Y^TY=I_k , \end{aligned}$$
(6b)
$$\begin{aligned}&Y\ge 0, \end{aligned}$$
(6c)

where L is the Laplacian matrix and \(I_k\) is the \(k\times k\) identity matrix. The orthogonality constraint \(Y^TY = I_k\), together with the nonnegativity constraint \(Y\ge 0\), ensures that each row of the matrix Y has at most one non-zero entry. Thus, the matrix Y in (6) plays the role of the assignment matrix in (5) and encodes the cluster assignment of the data set (here, the binary condition is omitted).

As noted, the orthogonality constraint (6b) and the nonnegativity constraint (6c) imply that there is at most one non-zero element in each row of Y. We note that a vector \(x\in \mathbb {R}^n\) has at most one nonzero element if and only if \(||x||_1 = ||x||_2\); indeed, \(||x||_1^2 - ||x||_2^2 = 2\sum_{i<j} |x_i|\,|x_j| \ge 0\), with equality precisely when at most one component of x is nonzero. Thus, under (6c), constraint (6b) for the matrix Y can be written as follows:

$$\begin{aligned}&||{\hat{y}}_i^T||_1^2 = ||{\hat{y}}_i^T||_2^2 \ \ ( i =1 , \dots , n ), \end{aligned}$$
(7a)
$$\begin{aligned}&||y_j||_2^2 = 1 \ \ (j=1,\dots , k), \end{aligned}$$
(7b)

where \({\hat{y}}_i\) and \(y_j\) are the ith row and the jth column of Y, respectively. So, we have a new model for the clustering problem as follows:

$$\begin{aligned} \min _Y \ \ \ \&tr(Y^TLY) \end{aligned}$$
(8a)
$$\begin{aligned} s.t.\ \ \ \&||{\hat{y}}_i^T||_1^2 = ||{\hat{y}}_i^T||_2^2 \ \ ( i =1 , \dots , n ), \end{aligned}$$
(8b)
$$\begin{aligned}&||y_j||_2^2 = 1 \ \ (j=1,\dots , k), \end{aligned}$$
(8c)
$$\begin{aligned}&Y \ge 0 . \end{aligned}$$
(8d)

Problem (8) is still difficult to solve for two reasons: first, constraint (8b) is non-convex, and second, constraints (8b) and (8c) are written across the rows and across the columns of Y, respectively. Therefore, it is difficult to apply decomposition-based methods, especially when the size of the problem is large. To deal with these issues, we propose the following penalized problem:

$$\begin{aligned} \min _{Y} \ \ \ \&tr(Y^TLY) + \dfrac{\rho }{2} \sum _{i=1}^n (||{\hat{y}}_i^T||_1^2 - ||{\hat{y}}_i^T||_2^2 ) \end{aligned}$$
(9a)
$$\begin{aligned} s.t. \ \ \ \&||y_j||_2^2 = 1 \ \ (j=1,\dots , k), \end{aligned}$$
(9b)
$$\begin{aligned}&Y \ge 0 , \end{aligned}$$
(9c)

where the scalar \(\rho >0\) is the penalty parameter.
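To make the penalized objective (9a) concrete, the following sketch evaluates it for a given Laplacian and a candidate Y (Python/NumPy; the names are illustrative and not from our implementation):

```python
import numpy as np

def isoperimetry_penalized_obj(L, Y, rho):
    """Objective (9a): tr(Y^T L Y) + (rho/2) * sum_i (||y_i_row||_1^2 - ||y_i_row||_2^2)."""
    tr_term = np.trace(Y.T @ L @ Y)
    row_l1 = np.abs(Y).sum(axis=1)        # ||\hat{y}_i||_1 for each row i
    row_l2sq = (Y ** 2).sum(axis=1)       # ||\hat{y}_i||_2^2 for each row i
    return tr_term + 0.5 * rho * (row_l1 ** 2 - row_l2sq).sum()

# example: Laplacian of the 4-vertex path graph above and a feasible Y (n=4, k=2)
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]]) / np.sqrt(2)   # unit-norm columns, one nonzero per row
print(isoperimetry_penalized_obj(L, Y, rho=10.0))             # penalty term is zero here
```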

Next, we use the same idea to reformulate the ONMF problem.

3 Orthogonal nonnegative matrix factorization

Orthogonal nonnegative matrix factorization (ONMF), an approximate matrix factorization technique with matrix orthogonality and nonnegativity constraints, has recently been shown to work remarkably well for clustering tasks (Pompili et al. 2014). We consider the following orthogonal nonnegative matrix factorization problem. Given a \(d\times n\) nonnegative matrix M and a factorization rank k (with \(k<n\)), we are to solve

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }} \ \ &\Vert M - UV \Vert _F^2 \\ s.t. \ \ \ \ &VV^T=I, \\ &U\ge 0 , \ V\ge 0. \end{aligned}$$
(ONMF)

Here, we consider each column of the matrix M as a point in \(\mathbb {R}^{d}\) and construct the data-set \(X=\{x_1 , \dots , x_n \}\), with \(x_i\) being the ith column of M.

Similarly, applying to (ONMF) the same procedure used above to convert problem (6) into problem (8), we get

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }}&\Vert M - UV \Vert _F^2 \end{aligned}$$
(10a)
$$\begin{aligned} s.t.&||v_j||_1^2 = ||v_j||_2^2 \ \ ( j =1 , \dots , n ), \end{aligned}$$
(10b)
$$\begin{aligned}&||{\hat{v}}_i^T||_2^2 = 1 \ \ (i=1,\dots , k), \end{aligned}$$
(10c)
$$\begin{aligned}&U \ge 0 , \ V\ge 0. \end{aligned}$$
(10d)

The corresponding penalized version of the ONMF problem is then obtained as follows:

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }} \ \ \ \&\Vert M - UV \Vert _F^2 + \dfrac{\rho }{2} \sum _{j=1}^n (||v_j||_1^2 - ||v_j||_2^2 ) \end{aligned}$$
(11a)
$$\begin{aligned} s.t. \ \ \ \&||{\hat{v}}_i^T||_2^2 = 1 \ \ (i=1,\dots , k), \end{aligned}$$
(11b)
$$\begin{aligned}&U \ge 0 , \ V\ge 0, \end{aligned}$$
(11c)

where \({\hat{v}}_i, i=1,\dots , k,\) and \(v_j, j=1,\dots , n,\) are the ith row and the jth column of V, respectively. Note that V is an assignment matrix, with \(v_{ij}\ne 0\) indicating that \(x_j\) is in cluster \(A_i\). Moreover, the nonzero elements of the matrix HV occur in the same positions as those of V, where H is a diagonal matrix with \(0< h_{ii} \le 1 \ (i=1,\dots , k)\). So, from a clustering point of view, both V and HV provide the same cluster assignment. Also, (U, V) and \((UH^{-1},HV)\) give the same objective value in the ONMF problem, i.e.,

$$\begin{aligned} \Vert M - UV \Vert _F^2 = \Vert M - UH^{-1}HV \Vert _F^2. \end{aligned}$$

Moreover,

$$\begin{aligned} HV = \begin{pmatrix} h_{11} & 0 & \cdots & 0\\ 0 & h_{22} & \cdots & 0\\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & h_{kk} \end{pmatrix} \begin{pmatrix} {\hat{v}}_1^T \\ {\hat{v}}_2^T\\ \vdots \\ {\hat{v}}_k^T \end{pmatrix} = \begin{pmatrix} h_{11}{\hat{v}}_1^T \\ h_{22}{\hat{v}}_2^T\\ \vdots \\ h_{kk}{\hat{v}}_k^T \end{pmatrix}, \end{aligned}$$

where \(||h_{ii}{\hat{v}}_i^T||_2^2 = h_{ii}^2||{\hat{v}}_i^T||_2^2 \le 1 \ (i=1,\dots , k)\). Therefore, the clustering is not affected if we replace \(||{\hat{v}}_i^T||_2^2 = 1 \ (i=1,\dots , k)\) by \(||{\hat{v}}_i^T||_2^2 \le 1 \ (i=1,\dots , k)\) and choose \((UH^{-1},HV)\) instead of (U, V) in (11). As a result, we have a new model for the ONMF problem as follows:

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }} \ \ \ \&\Vert M - UV \Vert _F^2 + \dfrac{\rho }{2} \sum _{j=1}^n (||v_j||_1^2 - ||v_j||_2^2 ) \end{aligned}$$
(12a)
$$\begin{aligned} s.t. \ \ \ \&||{\hat{v}}_i^T||_2^2 \le 1 \ \ (i=1,\dots , k), \end{aligned}$$
(12b)
$$\begin{aligned}&U \ge 0 , \ V\ge 0. \end{aligned}$$
(12c)

Now, problem (12) is a nonlinear programming problem with convex (quadratic) constraints, and an efficient optimization method can be applied to solve it.
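To fix ideas, the penalized objective of (12) and its partial gradients can be written compactly as follows. This is a sketch in Python/NumPy under the assumption \(V\ge 0\) (so that \(||v_j||_1\) is simply the column sum); the original implementation was in MATLAB and the function names here are ours:

```python
import numpy as np

def F_rho(M, U, V, rho):
    """Penalized objective (12a): ||M - UV||_F^2 + (rho/2) * sum_j (||v_j||_1^2 - ||v_j||_2^2)."""
    R = M - U @ V
    col_l1 = V.sum(axis=0)                  # ||v_j||_1, assuming V >= 0
    col_l2sq = (V ** 2).sum(axis=0)         # ||v_j||_2^2
    return (R ** 2).sum() + 0.5 * rho * (col_l1 ** 2 - col_l2sq).sum()

def grad_V(M, U, V, rho):
    """Gradient of F_rho with respect to V (for V >= 0):
    2 U^T (UV - M) + rho * (1 1^T V - V)."""
    colsum = V.sum(axis=0, keepdims=True)   # 1^T V, the (1, n) row of column sums
    return 2 * U.T @ (U @ V - M) + rho * (np.ones((V.shape[0], 1)) @ colsum - V)

def grad_U(M, U, V):
    """Gradient of ||M - UV||_F^2 with respect to U."""
    return 2 * (U @ V - M) @ V.T
```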

The Lagrangian function corresponding to the isoperimetry problem without constraint (6c) can be written as

$$\begin{aligned} {\mathcal {L}} (Y,\Lambda ) = tr(Y^TLY) - \dfrac{1}{2} tr(\Lambda (Y^TY-I)), \end{aligned}$$
(13)

where \(\Lambda \in \mathbb {R}^{k\times k}\) contains the Lagrange multipliers. We define \(\nabla F(Y)={\mathcal {D}}_Y {\mathcal {L}}(Y,\Lambda )\). Dehghanpour-Sahron and Mahdavi-Amiri (2020) showed that if the similarity matrix C is used in problem (ONMF) in place of the matrix M, then the solution of the problem is equivalent to the solution of the isoperimetry problem. As a result, we can solve problem (12) instead of problem (8). Numerical results show that problem (12) has two advantages: first, it is easier to solve than problem (8), and second, instead of solving the main problem, we can solve k sub-problems, which significantly reduces the computing time.

3.1 Solving ONMF problem

We first apply the algorithm proposed by Bolte et al. (2014), known as the PALM algorithm, to obtain a solution of problem (12). This algorithm uses two gradient projection steps, one for V and one for U. Let \(F_\rho (U,V) = \Vert M - UV \Vert _F^2 + \dfrac{\rho }{2} \sum _{j=1}^n (||v_j||_1^2 - ||v_j||_2^2 ) \). At iteration l of the algorithm, the variable V is obtained by solving the following problem:

$$\begin{aligned} V^{l+1} = \arg&\min _{V} ||V-B^l||_F^2 \nonumber \\&s.t. \ \ V\ge 0, \ ||{\hat{v}}_i^T||_2 \le 1, \ i=1, \dots , k, \end{aligned}$$
(14)

where \(B^l=V^l-\dfrac{1}{t^l} \nabla _V F_\rho (U^l,V^l)\) and \(t^l>0\) is a step size. Notably, the objective function and the constraints of (14) are separable with respect to the rows of V, so the update of V can be decomposed into the k subproblems (15). At iteration l, the rows \({\hat{v}}_i \ ( i=1, \dots , k)\) are obtained as follows:

$$\begin{aligned}&{\tilde{v}}_i^{l+1} = \arg \min _{{\hat{v}}_i^T \ge 0, \ ||{\hat{v}}_i||_2 \le 1} ||{\hat{v}}_i - {\hat{b}}_i^l||_2^2, \end{aligned}$$
(15)
$$\begin{aligned}&{\hat{v}}_i^{l+1} = {\tilde{v}}_i^{l+1} + \tau ( {\tilde{v}}_i^{l+1} - {\hat{v}}_i^{l}), \end{aligned}$$
(16)

where \({\hat{b}}_i^l \ (i=1,\dots , k)\) are the rows of the matrix \(B^l\). We note that problem (15) has a simple closed-form solution: partition \({\hat{b}}_i^{l} = [ {\hat{b}}_{i-}^{l} , {\hat{b}}_{i+}^{l}]\), where \( {\hat{b}}_{i-}^{l}= \{ {\hat{b}}_i^{l}(j) \mid {\hat{b}}_i^{l}(j) \le 0\}\) and \( {\hat{b}}_{i+}^{l}= \{ {\hat{b}}_i^{l}(j) \mid {\hat{b}}_i^{l}(j) > 0\}\), for \( j=1,\dots , n\). Then, we get

$$\begin{aligned} {\hat{v}}_i^* (j)= \left\{ \begin{array}{ll} 0, & \hbox {if } {{\hat{b}}_{i}^{l}}(j) \in {\hat{b}}_{i-}^{l}, \\ \dfrac{ {{\hat{b}}^{l}_{i}}(j)}{ \max \{ ||{{\hat{b}}^{l}_{i+}}||_2, 1\}} , & \hbox {if } {{\hat{b}}_{i}^{l}}(j) \in {{\hat{b}}_{i+}^{l}}. \end{array}\right. \end{aligned}$$
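In code, this projection and the subsequent correction step (16) are straightforward; the sketch below is in Python/NumPy with illustrative names (not from our implementation):

```python
import numpy as np

def project_row(b):
    """Closed-form solution of (15): project b onto {v : v >= 0, ||v||_2 <= 1}."""
    v = np.maximum(b, 0.0)                    # entries belonging to b_{i-} are set to zero
    return v / max(np.linalg.norm(v), 1.0)    # the positive part is scaled into the unit ball

def update_V(V_old, B, tau):
    """Row-wise update (15)-(16): projection followed by the correction step."""
    V_tilde = np.vstack([project_row(b) for b in B])   # rows of B are the b_i^l
    return V_tilde + tau * (V_tilde - V_old)
```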

Equation (16), introduced by Pock and Sabach (2016), is a correction step for the rows of the matrix V, where \(\tau \in (0,1)\) is a combination parameter. Numerical results show that if \(\tau \) is chosen carefully, this correction step significantly reduces the number of iterations of the algorithm and, as a result, its running time. Finally, using a post-processing technique proposed by Dehghanpour-Sahron and Mahdavi-Amiri (2020), we construct a partition for the clustering problem. The aim of this technique is to obtain a 0-1 assignment matrix V, in which \(v_{ij} = 1\) indicates that \(x_j\) is in cluster \(A_i\). It constructs the assignment matrix by rounding the elements of the input matrix to 0 and 1 according to predefined criteria. We explain this technique as follows (a small code sketch is given after this paragraph). Suppose that the matrix V is an output of the proposed algorithm (Algorithm 1 below). In each column of the matrix V, the maximal element is preserved and the other elements are set to zero (this ensures that every column of V has at most one nonzero element). Then, in each row i of the matrix V, the maximal element is found and denoted by \(M_i\), and every entry \(V_{ij}\ne M_i \ (1\le j\le n)\) with \(V_{ij} < \dfrac{4M_i}{3n}\) is set to zero. Rounding the remaining nonzero entries to 1 yields the assignment matrix V.
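A minimal sketch of this post-processing under the description above (the final binarization of the surviving entries is our reading of "rounding to 0 and 1"; names are illustrative):

```python
import numpy as np

def postprocess(V):
    """Round a k x n output matrix V to a 0-1 assignment matrix."""
    k, n = V.shape
    W = np.zeros_like(V)
    # keep only the maximal element in each column
    rows = V.argmax(axis=0)
    W[rows, np.arange(n)] = V[rows, np.arange(n)]
    # in each row, zero out entries far below the row maximum M_i
    for i in range(k):
        Mi = W[i].max()
        small = (W[i] != Mi) & (W[i] < 4.0 * Mi / (3.0 * n))
        W[i, small] = 0.0
    return (W > 0).astype(int)   # surviving entries become 1
```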

[Algorithm 1 appears here.]

Table 1 Compared clustering algorithms

Table 2 Misclassification error rates and Rand indices on 4 test problems of [41]

Fig. 1 Results due to ten algorithms on the "2moon" test problem

Table 3 Misclassification error rates and Rand indices on 4 test problems of [42]

Fig. 2 Results due to ten algorithms on the "4donut" test problem

3.2 Convergence analysis of the proposed algorithm

We first notice that any local minimal solution of (12) is feasible for (10). We know that for a non-convex problem, a local minimal solution cannot be computed in general, and we can only obtain a stationary point under suitable conditions (Bertsekas 1999). Suppose that \((U^{\rho },V^{\rho })\) is bounded and is a stationary point of (12), and that \((U^{\rho },V^{\rho }) \rightarrow (U^{*},V^{*})\) as \(\rho \rightarrow \infty \). It is well known that if the Mangasarian-Fromovitz constraint qualification (MFCQ) (Facchinei and Pang 2007) holds for (10) at \((U^{*},V^{*})\), then \((U^{*},V^{*})\) is a stationary point of problem (10). According to the convergence results for the algorithms of Bolte et al. (2014) and Pock and Sabach (2016), with a proper choice of \(t^l\) (obtained from a proper line search to ensure convergence of the iterates (Bertsekas 1999)) and with an increasing penalty parameter \(\rho \), Algorithm 1 can obtain a bounded stationary point of problem (12). Moreover, the convergence rate of Algorithm 1 depends on the algorithm utilized for solving the subproblems (15). Shefi and Teboulle (2016) report that the PALM algorithm is globally convergent with an asymptotic sublinear convergence rate.

Fig. 3 Results due to ten algorithms on the "6moon" test problem

Fig. 4 Results due to ten algorithms on the "spiral" test problem

4 Comparative results

We implemented our proposed algorithm and the other clustering methods in the MATLAB R2012a environment on a Windows 7 machine with a 2.40 GHz CPU and 4.00 GB RAM.

The numerical results are presented in two parts. In Sect. 4.1, we compare our proposed algorithm (PFM) with other related algorithms (listed in Table 1) on some hard artificial benchmark problems. In Sect. 4.2, we also report numerical results of PFM on randomly generated graphs for a more extensive evaluation of its performance. In all the tables, the best performance (having the minimum error and the highest Rand index) is highlighted in bold and the second best is underlined.

Fig. 5 Results due to ten algorithms on the "Flame" test problem

Fig. 6 Results due to ten algorithms on the "R15" test problem

Hyper-parameter settings

Here, we provide the parameters used in our proposed algorithm. We set the tolerance for the stopping criterion to \(\epsilon = 10^{-6}\), the penalty parameter to \(\rho = 10^{-5}\), and \(\alpha = 1.1\). The correction parameter \(\tau \) must be chosen carefully to reduce the number of iterations of the algorithm; we set it to 0.01, 0.12, or 0.3, depending on the experiment (Tables 2, 3, 4, 5 and 6), based on our experimentation. Note that, to exit the inner loop of PFM, one would normally use the relative error test \(||V^l - V^{l-1}||_F \le \epsilon ||V^{l}||_F\). Since \(||V^{l}||_F\) is essentially a fixed number due to the orthogonality of \(V^{l}\) (it is approximately \(\sqrt{k}\), where k is the number of clusters), the absolute and relative errors are almost equivalent, and thus we made use of the absolute error.

Fig. 7 Results due to ten algorithms on the "Aggregation" test problem

Fig. 8 Results due to ten algorithms on the "D31" test problem

Table 4 The average misclassification rates and Rand indices for 100 randomly generated test problems: \(n = 1000\), \(k=5\)

4.1 Artificial data-sets

Here, we report the numerical results obtained by the ten algorithms on two groups of hard benchmark clustering problems (four test problems of [41] and four test problems of [42]), as shown in Tables 2 and 3. For each entry of the tables, we report the misclassification rate (the ratio of incorrect labelings to the total number of objects) for the corresponding algorithm. Also, to evaluate and compare the performance of the clustering methods, we report the Rand index (a measure of similarity between the computed and the true clusterings, with a value between 0 and 1; the higher the value, the better the clustering) in the last row of the tables. In each of Tables 2 and 3, we report this value as the average of the Rand indices obtained by each algorithm over all the test problems. From the obtained results, it is evident that PFM outperforms the other algorithms; PFM has the best or second-best misclassification error rate and also the highest Rand index over all the test problems. Figures 1, 2, 3, 4, 5, 6, 7 and 8 illustrate the performance of the ten algorithms corresponding to Tables 2 and 3. By observing the 2-dimensional shapes of the test problems (depicted in Figs. 1, 2, 3, 4, 5, 6, 7 and 8), it is clear that PFM performs well in constructing the expected clusters.

Fig. 9 The Dolan-Moré performance profiles comparing the misclassification rates by NJW, DJS, KP, PGAG, SKA, KTK, YFS, SY, DMA and PFM

4.2 Random graph generation and testing

Here, we investigate the performance of the clustering algorithms and compare the obtained results on some randomly generated test problems. For a comparable performance evaluation, several sample random graphs were generated using the benchmark generator of Lancichinetti and Fortunato (2009) with parameters \(\mu _t\) and \(\mu _w\). Table 4 shows the misclassification rates of the algorithms on the randomly generated test problems for several values of these parameters. We note that all algorithms were executed once for these benchmarks and that "MR Mean" is the average of the total error for each test problem. The last two rows of the table give, respectively, the Rand indices of the clustering algorithms and their average running times on all the data sets. For further statistical analysis, we utilized the Dolan and Moré (2002) performance profiles to compare the performance of the clustering algorithms. Fig. 9 shows a comparison of the obtained misclassification rates for the considered algorithms. We constructed this profile using all the test problems corresponding to Table 4.

Table 5 The average misclassification rates and Rand indices for 100 randomly generated test problems: \(n = 10000\), \(\mu _w = 0.1\), \(\mu _t = 0.01, 0.02, 0.03, 0.04, 0.05.\)
Table 6 The average misclassification rates and Rand indices for 100 randomly generated test problems: \(n = 20000\), \(\mu _w = 0.1\), \(\mu _t = 0.01, 0.02, 0.03, 0.04, 0.05.\)

Tables 5 and 6 provide the average misclassification rates and Rand indices of the related algorithms on 100 randomly generated test problems with parameters \(\mu _w=0.1\) and \(\mu _t = 0.01, 0.02, 0.03, 0.04, 0.05\), for 10000 and 20000 points, respectively. In these tables, we report the Rand index as the average of the Rand indices over all the test problems.

5 Concluding remarks

We proposed an alternative formulation of the ONMF problem by converting the orthogonality constraints into a set of smooth constraints and penalizing the non-convex ones, leaving only convex constraints. Using this penalty function, we applied a suitable optimization method to solve the resulting problem. We first used a gradient-based optimization algorithm and then applied a post-processing technique to extract a solution to the clustering problem. On various test problems, we compared the performance of our proposed algorithm with other available clustering algorithms, namely, Ng-Jordan-Weiss (NJW), Daneshgar-Javadi-ShariyatRazavi (DJS), standard k-means (SKA), Pompili-Gillis-Absil-Glineur (PGAG), Kim-Park (KP), Kimura-Tanaka-Kudo (KTK), Yang-Fu-Sidiropoulos (YFS), Sinaga-Yang (SY) and Dehghanpour-Mahdavi-Amiri (DMA). Numerical results confirmed the practicality of our formulation and showed the capability of our proposed approach for constructing the expected clustering.

We compared our proposed algorithm with nine related clustering algorithms on hard synthetic data sets and on some randomly generated test problems. For a proper statistical analysis, we utilized the Dolan-Moré performance profiles to compare the obtained misclassification rates. Numerical results confirmed our proposed method to be successful in clustering; PFM had the best or second-best misclassification rate and also the highest Rand index among all the compared methods.