1 Introduction

Optimization problems with orthogonality constraints posed as

$$\begin{aligned} \min _{X\in \mathbb {R}^{n\times k}} \ \ \ F(X) \ \ \ s.t. \ \ \ X^TX=I_k, \end{aligned}$$
(1a)

have wide applications in various areas such as polynomial optimization, combinatorics, eigenvalue problems, and clustering (Jiang and Dai 2015). These problems are difficult to solve since the orthogonality constraints may lead to many local solutions and, in particular, several of these problems are NP-hard. There is no guarantee of obtaining a global solution except in a few simple cases, e.g., finding the extreme eigenvalues. It is not even easy to generate a sequence of feasible points, since preserving the orthogonality constraints can be numerically expensive. Most existing methods preserve feasibility either by re-orthogonalizing the iterates or by generating points along geodesics of \({\mathcal {M}}_n^k=\{ X\in \mathbb {R}^{n\times k} : X^TX=I\}\). The former requires matrix factorizations such as the singular value decomposition (SVD), and the latter requires solving ordinary differential equations (ODEs) or computing matrix exponentials. To avoid these difficulties, we propose an alternative formulation that converts the orthogonality constraints into non-convex equality (or inequality) constraints. To handle these non-convex constraints, a penalty function is applied. The penalized problem is a smooth nonlinear programming problem with quadratic constraints that can be solved by a suitable optimization method.

The nonnegative matrix factorization (NMF) problem proposed by Paatero and Tapper (1994) has a wide range of applications, such as pattern recognition, chemical engineering, fault diagnosis, and outlier detection (Banker et al. 2017; Duan et al. 2009; Tosyali et al. 2020). The orthogonal nonnegative matrix factorization (ONMF) problem can be interpreted as the NMF problem with an additional orthogonality constraint that significantly changes the nature of the problem, making it suitable for clustering (Li et al. 2020; Peng et al. 2020). Ding et al. (2006) were the first to study NMF with an orthogonality constraint and showed its effectiveness in clustering. Following that, several ONMF algorithms have been developed for a wide range of applications (Pompili et al. 2014). Most of these algorithms use a multiplicative updating framework on the Stiefel manifold \({\mathcal {M}}_n^k\), iteratively updating the matrices by taking element-wise products with other computed nonnegative matrices (He et al. 2020; Pan and Ng 2018). Other approaches include hierarchical alternating least squares (HALS) (Kimura et al. 2015) and the use of a penalty function for the orthogonality constraints (Del Buono 2009).

Here, we introduce an approach for solving the isoperimetry and ONMF problems using an efficient optimization algorithm. First, we present alternative formulations of the isoperimetry and ONMF problems in Sections 2.2 and 3, respectively, converting the orthogonality constraints so that each problem becomes a smooth nonlinear programming problem with convex constraints. Then, we solve these reformulated problems, in particular the ONMF, efficiently using a dedicated algorithm. Notably, instead of solving the ONMF problem for matrix solutions directly, we decompose the problem into subproblems and solve for vector solutions (see Section 3.1). Finally, we apply a post-processing technique to extract a solution to the clustering problem. Comparative computational results are provided in Section 4 and our concluding remarks are given in Section 5.

In summary, our contributions are as follows:

  • Reformulation of the matrix orthogonality constraint as a set of non-convex smooth constraints.

  • Application of a penalty function that includes the non-convex smooth constraints as penalty terms, leaving only convex constraints in the reformulated problem.

2 Related works

This section reviews works closely related to ours, namely the k-means and isoperimetry problems. K-means is one of the most popular unsupervised learning approaches for solving the well-known clustering problem. Many recent developments of k-means have been reported in the literature (see Fard et al. 2020; Fränti and Sieranoja 2018; Huang et al. 2021; Moreno et al. 2020; Sinaga and Yang 2020; Xia et al. 2020; Yu et al. 2018 for more details). Isoperimetry is an approach for finding a cluster structure in a data set, characterized by the greatest similarity within a cluster and the greatest dissimilarity between clusters, obtained by minimizing the sum of the weights of the edges connecting each cluster to the other clusters (Dinler et al. 2020; Qin et al. 2017).

2.1 K-means

A fundamental problem of clustering, known as Minimum Sum-of-Squares Clustering (MSSC), is to partition n points into k clusters based on a minimum sum-of-squares model. Given a set X of n points in an m-dimensional Euclidean space, denoted by

$$\begin{aligned} X = \{ x_i = (x_{i1} , \dots , x_{im})^T \in \mathbb {R}^m, \ i=1,\dots , n \} , \end{aligned}$$

the partitional MSSC deals with the assignment of the n points into k disjoint clusters denoted by \(A = (A_{1} , \dots , A_k)\), centered at cluster centers \(c_j \ (j=1,\dots ,k)\), based on the total sum of squared Euclidean distances of the points \(x_i\) from their respective assigned cluster centroids, that is,

$$\begin{aligned} f(X,A) = \sum _{j=1}^{k} \sum _{i=1}^{|A_j|} || x_i^{(j)} - c_j ||^2, \end{aligned}$$
(2)

where \(|A_j|\) is the number of points in \(A_j\), and \(x_i^{(j)}\) is the ith point in \(A_j\). Note that if the cluster centers are known, then the function f(X, A) achieves its minimum when each point is assigned to its closest cluster center. On the other hand, if the points in cluster \(A_j\) are fixed, then the function

$$\begin{aligned} f(X,A_j) = \sum _{i=1}^{|A_j|} || x_i^{(j)} - c_j||^2 \end{aligned}$$
(3)

is minimal when

$$\begin{aligned} c_j = \dfrac{1}{|A_j|} \sum _{i=1}^{|A_j|} x_i^{(j)}. \end{aligned}$$
(4)

The classical k-means algorithm (MacQueen 1967) is described as follows (a minimal code sketch follows the list):

  1. Initialize k cluster centers randomly in a domain containing all the points.

  2. Assign each point to the closest cluster center.

  3. Recalculate each cluster center using the points currently assigned to it.

  4. If the predefined stopping criteria are met, then stop; otherwise, go to step 2.
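The sketch below, in Python/NumPy for illustration (our experiments in Section 4 were carried out in MATLAB; the function and variable names here are ours, not from our implementation), mirrors these four steps:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd-type k-means sketch.
    X: (n, m) array of n points in R^m; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # step 1: random initial centers inside the bounding box of the data
    lo, hi = X.min(axis=0), X.max(axis=0)
    centers = lo + rng.random((k, m)) * (hi - lo)
    labels = np.zeros(n, dtype=int)
    for it in range(max_iter):
        # step 2: assign each point to the closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # step 4: stop when the assignment no longer changes
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 3: recompute each center from its currently assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```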

Another way to model the MSSC problem is based on the assignment problem. Let \(Y=[y_{ij}]\in \mathbb {R}^{n\times k}\) be the assignment matrix defined by

$$\begin{aligned} y_{ij} = \left\{ \begin{array}{ll} 1, & \text {if } x_i \text { is assigned to } A_j, \\ 0, & \text {otherwise}, \end{array}\right. \end{aligned}$$

Consequently, the center of cluster \(A_j\) is given by

$$\begin{aligned} c_j = \dfrac{\sum _{l=1}^n y_{lj}x_l}{\sum _{l=1}^n y_{lj}}, \end{aligned}$$

which is the mean of all the points in the cluster. Using this, Peng and Wei (2007) introduced the following model for the k-means problem:

$$\begin{aligned} \min _{y_{ij}} \ \ \ \&\sum _{j=1}^k \sum _{i=1}^n y_{ij} || x_i - c_j ||^2 \end{aligned}$$
(5a)
$$\begin{aligned} s.t. \ \ \ \ \&\sum _{j=1}^k y_{ij} = 1 \ \ (i=1,\dots , n), \end{aligned}$$
(5b)
$$\begin{aligned}&\sum _{i=1}^n y_{ij} \ge 1 \ \ (j=1,\dots , k), \end{aligned}$$
(5c)
$$\begin{aligned}&y_{ij} \in \{0,1\} \ \ (i=1,\dots , n , j=1, \dots , k). \end{aligned}$$
(5d)

The constraints (5b) ensure that each point \(x_i\) is assigned to exactly one cluster, and the constraints (5c) ensure that each of the k clusters is non-empty. We can write (5a) in matrix form using the Frobenius norm as follows (Bauckhage 2015):

$$\begin{aligned} \sum _{j=1}^k \sum _{i=1}^n y_{ij} ||x_i - c_j||^2 = ||X-CY||_F^2, \end{aligned}$$

where \(X \in \mathbb {R}^{ m \times n}\) is the matrix whose columns are the data vectors \(x_i \in \mathbb {R}^m\), \(C \in \mathbb {R}^{ m \times k }\) is the matrix whose columns are the cluster centroids \(c_j \in \mathbb {R}^m\), and \(Y\in \mathbb {R}^{k \times n}\) is the assignment matrix (the transpose of the matrix Y defined above, so that its ith column selects the centroid assigned to \(x_i\)).
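As a quick sanity check of this identity, the following small script (in Python/NumPy; purely illustrative, with names of our own choosing) evaluates both sides for random data and a random assignment:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 3, 8, 2
X = rng.random((m, n))                  # data points as columns
labels = rng.integers(0, k, size=n)     # assign each point to a random cluster
Y = np.zeros((k, n)); Y[labels, np.arange(n)] = 1.0
C = rng.random((m, k))                  # arbitrary centroids as columns

lhs = sum(np.linalg.norm(X[:, i] - C[:, labels[i]])**2 for i in range(n))
rhs = np.linalg.norm(X - C @ Y, 'fro')**2
assert np.isclose(lhs, rhs)             # both sides agree
```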

2.2 Isoperimetry problem

Given a weighted graph \(G=(X,E)\), for any partition (subpartition) \(A=\{A_1 , \dots , A_k \}\), we define the vector v as follows:

$$\begin{aligned} v = \displaystyle { \left( \frac{w(A_1)}{|A_1|} , \dots , \frac{w(A_k)}{|A_k|} \right) }, \end{aligned}$$

where \(w(A_i)\) is the sum of the weights of the edges between \(A_i\) and \(A_i^c\) (i.e., \(X\setminus A_i\)), and \(|A_i|\) is the number of vertices in the cluster \(A_i\), for \(1\le i \le k\). We are to find a partition (or subpartition) of the vertices so that the norm of v is minimized. Using the 1-norm of v (i.e., \(p=1\)), the mean version of the isoperimetry problem is defined as

$$\begin{aligned}&IPP_k^m(G) = \min _{\{A_i\}_1^k \in D_k(X) } \dfrac{1}{k}\Vert v \Vert _{1} = \min _{\{A_i\}_1^k \in D_k(X) } \dfrac{1}{k} \displaystyle { \left( \sum _{i=1}^k \dfrac{w(A_i)}{|A_i|} \right) }, \end{aligned}$$

where \(D_k(X)\) is the collection of all the k-subpartitions of the set X. In fact, the isoperimetry problem is a relaxed version of the normalized cut problem, in the sense that some points may not be assigned to any cluster.
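The following small sketch (Python/NumPy, with illustrative names) computes this objective for a given weight matrix and subpartition:

```python
import numpy as np

def mean_isoperimetry(W, parts):
    """W: (n, n) symmetric nonnegative weight matrix of the graph G.
    parts: list of k disjoint index arrays (a k-subpartition of the vertices).
    Returns (1/k) * sum_i w(A_i) / |A_i|."""
    n = W.shape[0]
    k = len(parts)
    total = 0.0
    for A in parts:
        comp = np.setdiff1d(np.arange(n), A)   # complement of A_i
        w_Ai = W[np.ix_(A, comp)].sum()        # weight of the cut between A_i and its complement
        total += w_Ai / len(A)
    return total / k

# example: a 4-vertex path graph split into two clusters
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(mean_isoperimetry(W, [np.array([0, 1]), np.array([2, 3])]))  # 0.5
```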

Dehghanpour-Sahron and Mahdavi-Amiri (2020) formulated the isoperimetry problem as follows:

$$\begin{aligned} \min _{Y\in \mathbb {R}^{n\times k}} \ \ \&tr(Y^TLY) \end{aligned}$$
(6a)
$$\begin{aligned} s.t. \ \ \&Y^TY=I_k , \end{aligned}$$
(6b)
$$\begin{aligned}&Y\ge 0, \end{aligned}$$
(6c)

where L is the Laplacian matrix and \(I_k\) is the \(k\times k\) identity matrix. The orthogonality constraint \(Y^TY = I_k\), together with the nonnegativity constraint \(Y\ge 0\), ensures that each row of the matrix Y has at most one non-zero entry. Thus, the matrix Y in (6) plays the role of the assignment matrix in (5) and encodes the cluster assignment of the data set (here, the binary condition is omitted).

As noted, the orthogonality constraint (6b) and the nonnegativity constraint (6c) imply that there is at most one non-zero element in each row of Y. We note that a vector \(x\in \mathbb {R}^n\) has at most one nonzero element if and only if \(||x||_1 = ||x||_2\); indeed, \(||x||_1^2 - ||x||_2^2 = 2\sum_{i<j} |x_i|\,|x_j| \ge 0\), with equality precisely when at most one component of x is nonzero. Thus, under (6c), constraint (6b) for the matrix Y can be written as follows:

$$\begin{aligned}&||{\hat{y}}_i^T||_1^2 = ||{\hat{y}}_i^T||_2^2 \ \ ( i =1 , \dots , n ), \end{aligned}$$
(7a)
$$\begin{aligned}&||y_j||_2^2 = 1 \ \ (j=1,\dots , k), \end{aligned}$$
(7b)

where \({\hat{y}}_i\) and \(y_j\) are the ith row and the jth column of Y, respectively. So, we have a new model for the clustering problem as follows:

$$\begin{aligned} \min _Y \ \ \ \&tr(Y^TLY) \end{aligned}$$
(8a)
$$\begin{aligned} s.t.\ \ \ \&||{\hat{y}}_i^T||_1^2 = ||{\hat{y}}_i^T||_2^2 \ \ ( i =1 , \dots , n ), \end{aligned}$$
(8b)
$$\begin{aligned}&||y_j||_2^2 = 1 \ \ (j=1,\dots , k), \end{aligned}$$
(8c)
$$\begin{aligned}&Y \ge 0 . \end{aligned}$$
(8d)

Problem (8) is still difficult to solve for two reasons: first, constraint (8b) is non-convex, and second, constraints (8b) and (8c) are written across the rows and across the columns of Y, respectively. Therefore, it is difficult to apply decomposition-based methods, especially when the size of the problem is large. To deal with these issues, we propose the following penalized problem:

$$\begin{aligned} \min _{Y} \ \ \ \&tr(Y^TLY) + \dfrac{\rho }{2} \sum _{i=1}^n (||{\hat{y}}_i^T||_1^2 - ||{\hat{y}}_i^T||_2^2 ) \end{aligned}$$
(9a)
$$\begin{aligned} s.t. \ \ \ \&||y_j||_2^2 = 1 \ \ (j=1,\dots , k), \end{aligned}$$
(9b)
$$\begin{aligned}&Y \ge 0 , \end{aligned}$$
(9c)

where the scalar \(\rho >0\) is the penalty parameter.
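To make the penalized objective (9a) concrete, the following sketch evaluates it for a given Laplacian and a candidate Y (Python/NumPy; the names are illustrative and not from our implementation):

```python
import numpy as np

def isoperimetry_penalized_obj(L, Y, rho):
    """Objective (9a): tr(Y^T L Y) + (rho/2) * sum_i (||y_i_row||_1^2 - ||y_i_row||_2^2)."""
    tr_term = np.trace(Y.T @ L @ Y)
    row_l1 = np.abs(Y).sum(axis=1)        # ||\hat{y}_i||_1 for each row i
    row_l2sq = (Y ** 2).sum(axis=1)       # ||\hat{y}_i||_2^2 for each row i
    return tr_term + 0.5 * rho * (row_l1 ** 2 - row_l2sq).sum()

# example: Laplacian of the 4-vertex path graph above and a feasible Y (n=4, k=2)
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]]) / np.sqrt(2)   # unit-norm columns, one nonzero per row
print(isoperimetry_penalized_obj(L, Y, rho=10.0))             # penalty term is zero here
```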

Next, we use the same idea to reformulate the ONMF problem.

3 Orthogonal nonnegative matrix factorization

Orthogonal nonnegative matrix factorization (ONMF), an approximate matrix factorization technique with matrix orthogonality and nonnegativity constraints, has recently been shown to work remarkably well for clustering tasks (Pompili et al. 2014). We consider the following orthogonal nonnegative matrix factorization problem. Given a \(d\times n\) nonnegative matrix M and a factorization rank k (with \(k<n\)), we are to solve

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }} \ \ &\Vert M - UV \Vert _F^2 \\ s.t. \ \ \ \ &VV^T=I, \\ &U\ge 0 , \ V\ge 0. \end{aligned}$$
(ONMF)

Here, we consider each column of the matrix M as a point in \(\mathbb {R}^{d}\) and construct the data-set \(X=\{x_1 , \dots , x_n \}\), with \(x_i\) being the ith column of M.

Similarly, applying to (ONMF) the same procedure used above to convert problem (6) into problem (8), we get

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }}&\Vert M - UV \Vert _F^2 \end{aligned}$$
(10a)
$$\begin{aligned} s.t.&||v_j||_1^2 = ||v_j||_2^2 \ \ ( j =1 , \dots , n ), \end{aligned}$$
(10b)
$$\begin{aligned}&||{\hat{v}}_i^T||_2^2 = 1 \ \ (i=1,\dots , k), \end{aligned}$$
(10c)
$$\begin{aligned}&U \ge 0 , \ V\ge 0. \end{aligned}$$
(10d)

The corresponding penalized version of the ONMF problem is then obtained as follows:

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }} \ \ \ \&\Vert M - UV \Vert _F^2 + \dfrac{\rho }{2} \sum _{j=1}^n (||v_j||_1^2 - ||v_j||_2^2 ) \end{aligned}$$
(11a)
$$\begin{aligned} s.t. \ \ \ \&||{\hat{v}}_i^T||_2^2 = 1 \ \ (i=1,\dots , k), \end{aligned}$$
(11b)
$$\begin{aligned}&U \ge 0 , \ V\ge 0, \end{aligned}$$
(11c)

where \({\hat{v}}_i, i=1,\dots , k,\) and \(v_j, j=1,\dots , n,\) are the ith row and the jth column of V, respectively. Note that V is an assignment matrix, with \(v_{ij}\ne 0\) indicating that \(x_j\) is in cluster \(A_i\). Moreover, the nonzero elements of the matrix HV occur in the same positions as those of V, where H is a diagonal matrix with \(0< h_{ii} \le 1 \ (i=1,\dots , k)\). So, from a clustering point of view, both V and HV provide the same cluster assignment. Also, (U, V) and \((UH^{-1},HV)\) give the same objective value in the ONMF problem, i.e.,

$$\begin{aligned} \Vert M - UV \Vert _F^2 = \Vert M - UH^{-1}HV \Vert _F^2. \end{aligned}$$

Moreover,

$$\begin{aligned} HV = \begin{pmatrix} h_{11} & 0 & \cdots & 0\\ 0 & h_{22} & \cdots & 0\\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & h_{kk} \end{pmatrix} \begin{pmatrix} {\hat{v}}_1^T \\ {\hat{v}}_2^T\\ \vdots \\ {\hat{v}}_k^T \end{pmatrix} = \begin{pmatrix} h_{11}{\hat{v}}_1^T \\ h_{22}{\hat{v}}_2^T\\ \vdots \\ h_{kk}{\hat{v}}_k^T \end{pmatrix}, \end{aligned}$$

where \(||h_{ii}{\hat{v}}_i^T||_2^2 = h_{ii}^2||{\hat{v}}_i^T||_2^2 \le 1 \ (i=1,\dots , k)\). Therefore, the clustering is not affected if we replace \(||{\hat{v}}_i^T||_2^2 = 1 \ (i=1,\dots , k)\) by \(||{\hat{v}}_i^T||_2^2 \le 1 \ (i=1,\dots , k)\) and choose \((UH^{-1},HV)\) instead of (U, V) in (11). As a result, we have a new model for the ONMF problem as follows:

$$\begin{aligned} \min _{U\in \mathbb {R}^{d\times k } , V\in \mathbb {R}^{k\times n }} \ \ \ \&\Vert M - UV \Vert _F^2 + \dfrac{\rho }{2} \sum _{j=1}^n (||v_j||_1^2 - ||v_j||_2^2 ) \end{aligned}$$
(12a)
$$\begin{aligned} s.t. \ \ \ \&||{\hat{v}}_i^T||_2^2 \le 1 \ \ (i=1,\dots , k), \end{aligned}$$
(12b)
$$\begin{aligned}&U \ge 0 , \ V\ge 0. \end{aligned}$$
(12c)

Now, problem (12) is a nonlinear programming problem with convex (quadratic) constraints, and an efficient optimization method can be applied to solve it.
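To fix ideas, the penalized objective of (12) and its partial gradients can be written compactly as follows. This is a sketch in Python/NumPy under the assumption \(V\ge 0\) (so that \(||v_j||_1\) is simply the column sum); the original implementation was in MATLAB and the function names here are ours:

```python
import numpy as np

def F_rho(M, U, V, rho):
    """Penalized objective (12a): ||M - UV||_F^2 + (rho/2) * sum_j (||v_j||_1^2 - ||v_j||_2^2)."""
    R = M - U @ V
    col_l1 = V.sum(axis=0)                  # ||v_j||_1, assuming V >= 0
    col_l2sq = (V ** 2).sum(axis=0)         # ||v_j||_2^2
    return (R ** 2).sum() + 0.5 * rho * (col_l1 ** 2 - col_l2sq).sum()

def grad_V(M, U, V, rho):
    """Gradient of F_rho with respect to V (for V >= 0):
    2 U^T (UV - M) + rho * (1 1^T V - V)."""
    colsum = V.sum(axis=0, keepdims=True)   # 1^T V, the (1, n) row of column sums
    return 2 * U.T @ (U @ V - M) + rho * (np.ones((V.shape[0], 1)) @ colsum - V)

def grad_U(M, U, V):
    """Gradient of ||M - UV||_F^2 with respect to U."""
    return 2 * (U @ V - M) @ V.T
```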

The Lagrangian function corresponding to the isoperimetry problem without constraint (6c) can be written as

$$\begin{aligned} {\mathcal {L}} (Y,\Lambda ) = tr(Y^TLY) - \dfrac{1}{2} tr(\Lambda (Y^TY-I)), \end{aligned}$$
(13)

where \(\Lambda \in \mathbb {R}^{k\times k}\) contains the Lagrange multipliers. We define \(\nabla F(Y)={\mathcal {D}}_Y {\mathcal {L}}(Y,\Lambda )\). Dehghanpour-Sahron and Mahdavi-Amiri (2020) showed that if the similarity matrix C is used in problem (ONMF) in place of the matrix M, then the solution of the problem is equivalent to the solution of the isoperimetry problem. As a result, we can solve problem (12) instead of problem (8). Numerical results show that problem (12) has two advantages: first, it is easier to solve than problem (8), and second, instead of solving the main problem, we can solve k sub-problems, which significantly reduces the computing time.

3.1 Solving ONMF problem

We first apply the algorithm proposed by Bolte et al. (2014), known as the PALM algorithm, to obtain a solution of problem (12). This algorithm uses two gradient projection steps, one for V and one for U. Let \(F_\rho (U,V) = \Vert M - UV \Vert _F^2 + \dfrac{\rho }{2} \sum _{j=1}^n (||v_j||_1^2 - ||v_j||_2^2 ) \). At iteration l of the algorithm, the variable V is obtained by solving the following problem:

$$\begin{aligned} V^{l+1} = \arg&\min _{V} ||V-B^l||_F^2 \nonumber \\&s.t. \ \ V\ge 0, \ ||{\hat{v}}_i^T||_2 \le 1, \ i=1, \dots , k, \end{aligned}$$
(14)

where \(B^l=V^l-\dfrac{1}{t^l} \nabla _V F_\rho (U^l,V^l)\) and \(t^l>0\) is a step size. Notably, the objective function and the constraints of (14) are separable with respect to the rows of V, so the update of V can be decomposed into the k subproblems (15). At iteration l, the rows \({\hat{v}}_i \ ( i=1, \dots , k)\) are obtained as follows:

$$\begin{aligned}&{\tilde{v}}_i^{l+1} = \arg \min _{{\hat{v}}_i^T \ge 0, \ ||{\hat{v}}_i||_2 \le 1} ||{\hat{v}}_i - {\hat{b}}_i^l||_2^2, \end{aligned}$$
(15)
$$\begin{aligned}&{\hat{v}}_i^{l+1} = {\tilde{v}}_i^{l+1} + \tau ( {\tilde{v}}_i^{l+1} - {\hat{v}}_i^{l}), \end{aligned}$$
(16)

where \({\hat{b}}_i^l \ (i=1,\dots , k)\) are the rows of the matrix \(B^l\). We note that problem (15) has a simple closed-form solution: partition \({\hat{b}}_i^{l} = [ {\hat{b}}_{i-}^{l} , {\hat{b}}_{i+}^{l}]\), where \( {\hat{b}}_{i-}^{l}= \{ {\hat{b}}_i^{l}(j) \mid {\hat{b}}_i^{l}(j) \le 0\}\) and \( {\hat{b}}_{i+}^{l}= \{ {\hat{b}}_i^{l}(j) \mid {\hat{b}}_i^{l}(j) > 0\}\), for \( j=1,\dots , n\). Then, we get

$$\begin{aligned} {\hat{v}}_i^* (j)= \left\{ \begin{array}{ll} 0, & \hbox {if } {{\hat{b}}_{i}^{l}}(j) \in {\hat{b}}_{i-}^{l}, \\ \dfrac{ {{\hat{b}}^{l}_{i}}(j)}{ \max \{ ||{{\hat{b}}^{l}_{i+}}||_2, 1\}} , & \hbox {if } {{\hat{b}}_{i}^{l}}(j) \in {{\hat{b}}_{i+}^{l}}. \end{array}\right. \end{aligned}$$
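In code, this projection and the subsequent correction step (16) are straightforward; the sketch below is in Python/NumPy with illustrative names (not from our implementation):

```python
import numpy as np

def project_row(b):
    """Closed-form solution of (15): project b onto {v : v >= 0, ||v||_2 <= 1}."""
    v = np.maximum(b, 0.0)                    # entries belonging to b_{i-} are set to zero
    return v / max(np.linalg.norm(v), 1.0)    # the positive part is scaled into the unit ball

def update_V(V_old, B, tau):
    """Row-wise update (15)-(16): projection followed by the correction step."""
    V_tilde = np.vstack([project_row(b) for b in B])   # rows of B are the b_i^l
    return V_tilde + tau * (V_tilde - V_old)
```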

Equation (16), introduced by Pock and Sabach (2016), is a correction step for the rows of the matrix V, where \(\tau \in (0,1)\) is a combination parameter. Numerical results show that if \(\tau \) is chosen carefully, this correction step significantly reduces the number of iterations of the algorithm and, as a result, its running time. Finally, using a post-processing technique proposed by Dehghanpour-Sahron and Mahdavi-Amiri (2020), we construct a partition for the clustering problem. The aim of this technique is to obtain a 0-1 assignment matrix V, in which \(v_{ij} = 1\) indicates that \(x_j\) is in cluster \(A_i\). It constructs the assignment matrix by rounding the elements of the input matrix to 0 and 1 according to predefined criteria. We explain this technique as follows (a small code sketch is given after this paragraph). Suppose that the matrix V is an output of the proposed algorithm (Algorithm 1 below). In each column of the matrix V, the maximal element is preserved and the other elements are set to zero (this ensures that every column of V has at most one nonzero element). Then, in each row i of the matrix V, the maximal element is found and denoted by \(M_i\), and every entry \(V_{ij}\ne M_i \ (1\le j\le n)\) with \(V_{ij} < \dfrac{4M_i}{3n}\) is set to zero. Rounding the remaining nonzero entries to 1 yields the assignment matrix V.
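A minimal sketch of this post-processing under the description above (the final binarization of the surviving entries is our reading of "rounding to 0 and 1"; names are illustrative):

```python
import numpy as np

def postprocess(V):
    """Round a k x n output matrix V to a 0-1 assignment matrix."""
    k, n = V.shape
    W = np.zeros_like(V)
    # keep only the maximal element in each column
    rows = V.argmax(axis=0)
    W[rows, np.arange(n)] = V[rows, np.arange(n)]
    # in each row, zero out entries far below the row maximum M_i
    for i in range(k):
        Mi = W[i].max()
        small = (W[i] != Mi) & (W[i] < 4.0 * Mi / (3.0 * n))
        W[i, small] = 0.0
    return (W > 0).astype(int)   # surviving entries become 1
```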

[Algorithm 1 appears here.]

Table 1 Compared clustering algorithms

Table 2 Misclassification error rates and Rand indices on 4 test problems of [41]

Fig. 1 Results due to ten algorithms on the "2moon" test problem

Table 3 Misclassification error rates and Rand indices on 4 test problems of [42]

Fig. 2 Results due to ten algorithms on the "4donut" test problem

3.2 Convergence analysis of the proposed algorithm

We first notice that any local minimal solution of (12) is feasible for (10). We know that for a non-convex problem, a local minimal solution cannot be computed in general, and we can only obtain a stationary point under suitable conditions (Bertsekas 1999). Suppose that \((U^{\rho },V^{\rho })\) is bounded and is a stationary point of (12), and that \((U^{\rho },V^{\rho }) \rightarrow (U^{*},V^{*})\) as \(\rho \rightarrow \infty \). It is well known that if the Mangasarian-Fromovitz constraint qualification (MFCQ) (Facchinei and Pang 2007) holds for (10) at \((U^{*},V^{*})\), then \((U^{*},V^{*})\) is a stationary point of problem (10). According to the convergence results for the algorithms of Bolte et al. (2014) and Pock and Sabach (2016), with a proper choice of \(t^l\) (obtained from a proper line search to ensure convergence of the iterates (Bertsekas 1999)) and with an increasing penalty parameter \(\rho \), Algorithm 1 can obtain a bounded stationary point of problem (12). Moreover, the convergence rate of Algorithm 1 depends on the algorithm utilized for solving the subproblems (15). Shefi and Teboulle (2016) report that the PALM algorithm is globally convergent with an asymptotic sublinear convergence rate.

Fig. 3 Results due to ten algorithms on the "6moon" test problem

Fig. 4 Results due to ten algorithms on the "spiral" test problem

4 Comparative results

We implemented our proposed algorithm and the other clustering methods in the MATLAB R2012a environment on a Windows 7 machine with a 2.40 GHz CPU and 4.00 GB RAM.

The numerical results are presented in two parts. In Sect. 4.1, we compare our proposed algorithm (PFM) with other related algorithms (listed in Table 1) on some hard artificial benchmark problems. In Sect. 4.2, we also report numerical results of PFM on randomly generated graphs for a more extensive evaluation of its performance. In all the tables, the best performance (having the minimum error and the highest Rand index) is highlighted in bold and the second best is underlined.

Fig. 5 Results due to ten algorithms on the "Flame" test problem

Fig. 6 Results due to ten algorithms on the "R15" test problem

Hyper-parameter settings

Here, we provide the parameters used in our proposed algorithm. We set the tolerance for the stopping criterion to \(\epsilon = 10^{-6}\), the penalty parameter to \(\rho = 10^{-5}\), and \(\alpha = 1.1\). The correction parameter \(\tau \) must be chosen carefully to reduce the number of iterations of the algorithm; we set it to 0.01, 0.12, or 0.3, depending on the experiment (Tables 2, 3, 4, 5 and 6), based on our experimentation. Note that, to exit the inner loop of PFM, one would normally use the relative error test \(||V^l - V^{l-1}||_F \le \epsilon ||V^{l}||_F\). Since \(||V^{l}||_F\) is essentially a fixed number due to the orthogonality of \(V^{l}\) (it is approximately \(\sqrt{k}\), where k is the number of clusters), the absolute and relative errors are almost equivalent, and thus we made use of the absolute error.

Fig. 7 Results due to ten algorithms on the "Aggregation" test problem

Fig. 8 Results due to ten algorithms on the "D31" test problem

Table 4 The average misclassification rates and Rand indices for 100 randomly generated test problems: \(n = 1000\), \(k=5\)

4.1 Artificial data-sets

Here, we report the numerical results obtained by the ten algorithms on two groups of hard benchmark clustering problems (four test problems of [41] and four test problems of [42]), as shown in Tables 2 and 3. For each entry of the tables, we report the misclassification rate (the ratio of incorrect labelings to the total number of objects) for the corresponding algorithm. Also, to evaluate and compare the performance of the clustering methods, we report the Rand index (a measure of similarity between the computed and the true clusterings, with a value between 0 and 1; the higher the value, the better the clustering) in the last row of the tables. In each of Tables 2 and 3, we report this value as the average of the Rand indices obtained by each algorithm over all the test problems. From the obtained results, it is evident that PFM outperforms the other algorithms; PFM has the best or second-best misclassification error rate and also the highest Rand index over all the test problems. Figures 1, 2, 3, 4, 5, 6, 7 and 8 illustrate the performance of the ten algorithms corresponding to Tables 2 and 3. By observing the 2-dimensional shapes of the test problems (depicted in Figs. 1, 2, 3, 4, 5, 6, 7 and 8), it is clear that PFM performs well in constructing the expected clusters.

Fig. 9 The Dolan-Moré performance profiles comparing the misclassification rates by NJW, DJS, KP, PGAG, SKA, KTK, YFS, SY, DMA and PFM

4.2 Random graph generation and testing

Here, we investigate the performance of the clustering algorithms and compare the obtained results on some randomly generated test problems. For a comparable performance evaluation, several sample random graphs were generated using the benchmark generator of Lancichinetti and Fortunato (2009) with parameters \(\mu _t\) and \(\mu _w\). Table 4 shows the misclassification rates of the algorithms on the randomly generated test problems for several values of these parameters. We note that all algorithms were executed once for these benchmarks and that "MR Mean" is the average of the total error for each test problem. The last two rows of the table give, respectively, the Rand indices of the clustering algorithms and their average running times on all the data sets. For further statistical analysis, we utilized the Dolan and Moré (2002) performance profiles to compare the performance of the clustering algorithms. Fig. 9 shows a comparison of the obtained misclassification rates for the considered algorithms. We constructed this profile using all the test problems corresponding to Table 4.

Table 5 The average misclassification rates and Rand indices for 100 randomly generated test problems: \(n = 10000\), \(\mu _w = 0.1\), \(\mu _t = 0.01, 0.02, 0.03, 0.04, 0.05.\)
Table 6 The average misclassification rates and Rand indices for 100 randomly generated test problems: \(n = 20000\), \(\mu _w = 0.1\), \(\mu _t = 0.01, 0.02, 0.03, 0.04, 0.05.\)

Tables 5 and 6 provide the average misclassification rates and Rand indices of the related algorithms on 100 randomly generated test problems with parameters \(\mu _w=0.1\) and \(\mu _t = 0.01, 0.02, 0.03, 0.04, 0.05\), for 10000 and 20000 points, respectively. In these tables, we report the Rand index as the average of the Rand indices over all the test problems.

5 Concluding remarks

We proposed an alternative formulation of the ONMF problem by converting the orthogonality constraints into a set of smooth constraints and penalizing the non-convex ones, leaving only convex constraints. Using this penalty function, we applied a suitable optimization method to solve the resulting problem. We first used a gradient-based optimization algorithm and then applied a post-processing technique to extract a solution to the clustering problem. On various test problems, we compared the performance of our proposed algorithm with other available clustering algorithms, namely, Ng-Jordan-Weiss (NJW), Daneshgar-Javadi-ShariyatRazavi (DJS), standard k-means (SKA), Pompili-Gillis-Absil-Glineur (PGAG), Kim-Park (KP), Kimura-Tanaka-Kudo (KTK), Yang-Fu-Sidiropoulos (YFS), Sinaga-Yang (SY) and Dehghanpour-Mahdavi-Amiri (DMA). Numerical results confirmed the practicality of our formulation and showed the capability of our proposed approach for constructing the expected clustering.

We compared our proposed algorithm with nine related clustering algorithms on hard synthetic data sets and on some randomly generated test problems. For a proper statistical analysis, we utilized the Dolan-Moré performance profiles to compare the obtained misclassification rates. Numerical results confirmed our proposed method to be successful in clustering; PFM had the best or second-best misclassification rate and also the highest Rand index among all the compared methods.