1 Introduction

Nonnegative matrix factorization (NMF) (Paatero and Tapper 1994; Lee and Seung 1999) is a popular and powerful technique for approximating a nonnegative matrix by the product of two low-rank nonnegative matrices. In particular, given a matrix \(V\in \mathbb {R}^{m\times n}\), in which each element is nonnegative, and an integer \(r<\min \{m,n\}\), NMF seeks two matrices \(W\in \mathbb {R}^{m \times r}\) and \(H\in \mathbb {R}^{r \times n}\) with nonnegative elements such that

$$\begin{aligned} V\approx WH. \end{aligned}$$
(1)

NMF has been studied for more than a decade and successfully applied to signal processing (Cichocki et al. 2006), cancer class discovery (Berry et al. 2007), text mining (Pauca 2004), and computer vision (Li et al. 2001).

A common way to find \(W\) and \(H\) is by formulating (1) as the following minimization problem:

$$\begin{aligned} \min _{W\ge 0,~H\ge 0}F(W,H):=\frac{1}{2}\Vert V-W\!H\Vert _{F}^{2}, \end{aligned}$$
(2)

where \(\Vert \cdot \Vert _{F}\) is the Frobenius norm, and \(W\ge 0\) and \(H\ge 0\) mean that all elements of \(W\) and \(H\) are nonnegative.
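
In code, the objective (2) is straightforward to evaluate; the following minimal NumPy sketch, with names of our own choosing, fixes the notation used in the sketches throughout this paper.

```python
import numpy as np

def frobenius_objective(V, W, H):
    """Evaluate F(W, H) = 0.5 * ||V - W H||_F^2, the NMF objective in (2)."""
    R = V - W @ H
    return 0.5 * np.sum(R * R)
```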

Lee and Seung developed a multiplicative update (MU) algorithm for solving (2) in their seminal work (Lee and Seung 2001). The MU algorithm updates the matrix factors by multiplying each entry by a positive factor at each iteration. More precisely,

$$\begin{aligned} W^{k+1}_{ia}&= W^{k}_{ia}\frac{\left( V(H^{k})^{T}\right) _{ia}}{\left( W^{k}H^{k}(H^{k})^{T}\right) _{ia}}, \quad \forall ~ i,a,\\ H^{k+1}_{bj}&= H^{k}_{bj}\frac{\left( (W^{k+1})^{T}V\right) _{bj}}{\left( (W^{k+1})^{T}W^{k+1}H^{k}\right) _{bj}}, \quad \forall ~ b,j. \end{aligned}$$

MU has been one of the most commonly used approaches for NMF due to its simplicity. However, it has been observed to converge relatively slowly, especially when dealing with dense matrices (Berry et al. 2007; Kim et al. 2007; Lin 2007).
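
As an illustration, one round of these updates can be written in a few lines of NumPy. The small constant eps is our own safeguard against division by zero; it is not part of the original updates.

```python
import numpy as np

def mu_step(V, W, H, eps=1e-12):
    """One round of the Lee-Seung multiplicative updates above."""
    W = W * (V @ H.T) / (W @ H @ H.T + eps)     # update W with H fixed
    H = H * (W.T @ V) / (W.T @ W @ H + eps)     # update H with the new W
    return W, H
```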

Another popular approach is the alternating nonnegative least squares (ANLS) framework:

Algorithm 1

ANLS Framework for NMF

  • Step 1. Initialize \(W^{1}\in \mathbb {R}^{m\times r}_{+}\) and \(H^{1}\in \mathbb {R}^{r\times n}_{+}\). Set \(k=1\).

  • Step 2. Update the matrices in the following way until a convergence criterion is satisfied:

    $$\begin{aligned} W^{k+1}&= \arg \min _{W\ge 0}F\left( W,H^{k}\right) \!,\end{aligned}$$
    (3)
    $$\begin{aligned} H^{k+1}&= \arg \min _{H\ge 0}F\left( W^{k+1},H\right) \!. \end{aligned}$$
    (4)

At each iteration, one of the two factors is fixed and the other is updated so that the objective function is reduced; a minimal sketch of the framework in code is given below. This approach is the “block coordinate descent” method in bound-constrained optimization (Bertsekas 1999). Although the original problem (2) is non-convex and NP-hard (Vavasis 2009) with respect to the variables \(W\) and \(H\), the subproblems (3) and (4) are convex problems for which optimal solutions can be found. However, these subproblems may have multiple optimal solutions because they are not strictly convex. Past convergence results for block coordinate descent methods are based on the assumption that the subproblems have unique solutions (Bertsekas 1999). Fortunately, when there are only two blocks, Grippo and Sciandrone (2000) proved convergence of Algorithm 1 to a stationary point of (2) without requiring uniqueness.
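
The following sketch illustrates Algorithm 1, solving each subproblem column by column with SciPy's nonnegative least squares routine. The fixed iteration budget and the random initialization are our own simplifications; in practice a convergence criterion such as (17) below is used.

```python
import numpy as np
from scipy.optimize import nnls

def nnls_columns(A, B):
    """Solve min_{X >= 0} ||A X - B||_F column by column."""
    X = np.zeros((A.shape[1], B.shape[1]))
    for j in range(B.shape[1]):
        X[:, j], _ = nnls(A, B[:, j])
    return X

def anls(V, r, iters=50, seed=0):
    """Plain ANLS (Algorithm 1), alternating the subproblems (3)-(4)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))             # Step 1: random nonnegative init
    H = rng.random((r, n))
    for _ in range(iters):             # Step 2, with a fixed budget
        W = nnls_columns(H.T, V.T).T   # (3): fix H, solve for W
        H = nnls_columns(W, V)         # (4): fix the new W, solve for H
    return W, H
```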

Based on the ANLS framework, many algorithms have been developed for solving (2), such as the active set method (Kim and Park 2008), the projected gradient method (PG) (Lin 2007), the projected Newton method (Gong and Zhang 2012), and the projected quasi-Newton method (Kim et al. 2007; Zdunek and Cichocki 2006). However, most of these methods spend considerable effort finding a step length by line search. Recently, Guan et al. (2012) proposed the NeNMF algorithm for NMF by applying Nesterov’s optimal gradient method (OGM) (Nesterov 1983), which takes a projected gradient step at a point constructed as a linear combination of the two most recent iterates, within the ANLS framework. OGM does not require a line search. However, its convergence rate depends not only on the distance between the initial guess and a stationary point but also on the step length of the projected gradient step, which is determined by the Lipschitz constant of the gradient of the objective function; see Guan et al. (2012) and Nesterov (1983) for details. If the initial iterate is far away from a stationary point, or the Lipschitz constant of the gradient is very large, then OGM may take thousands of iterations to reach a given tolerance, which degrades the efficiency of NeNMF.

As is well known, the projected gradient method suffers from slow convergence, like steepest descent (Bertsekas 1999). Barzilai and Borwein (1988) introduced the following two stepsizes, which significantly improve the performance of gradient methods:

$$\begin{aligned} a_{k}^{BB1}=\frac{\langle s_{k-1},y_{k-1}\rangle }{\langle s_{k-1},s_{k-1}\rangle }, \end{aligned}$$

and

$$\begin{aligned} a_{k}^{BB2}=\frac{\langle y_{k-1},y_{k-1}\rangle }{\langle s_{k-1},y_{k-1}\rangle }, \end{aligned}$$

where \(s_{k-1}=x^{k}-x^{k-1}\) and \(y_{k-1}=\nabla f(x^{k})-\nabla f(x^{k-1})\). The BB method can be viewed as a quasi-Newton method in which the inverse Hessian in the Newton iteration is replaced by the diagonal matrix \(a_{k}I\), where \(a_{k}\) is the BB stepsize and \(I\) is the identity matrix. Due to its simplicity and numerical efficiency, the BB approach has attracted much attention in the recent literature (Dai and Liao 2002; Dai et al. 2006; Dai and Zhang 2001; Fletcher 2005; Grippo and Sciandrone 2002; Huang et al. 2013; Raydan 1997). By combining the globalized nonmonotone BB scheme (Raydan 1997) with classical projected gradient strategies, Birgin et al. (2000) proposed the so-called spectral projected gradient method for minimizing differentiable functions on closed convex sets. Dai and Fletcher (2005) developed a projected BB method that alternately uses the two BB stepsizes to solve box-constrained quadratic programming. In general, the projected BB method updates the iterate by

$$\begin{aligned} x^{k+1}=P\left( x^{k}-a_{k}\nabla f\left( x^{k}\right) \right) , \end{aligned}$$
(5)

where \(a_{k}\) is either of the two BB stepsizes and the operator \(P(x)\) projects \(x\) onto the feasible set. The projected BB approach has been useful in real applications in different areas, including compressive sensing (Figueiredo et al. 2007), image restoration (Bonettini et al. 2009), and support vector machines (Cores et al. 2009). For more detail about the BB method and its variants, see Fletcher (2005), Birgin et al. (2012), and the references therein. A sketch of a single projected BB iteration is given below.
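
The following is a generic sketch of one projected BB iteration (5) on the nonnegative orthant, using the first BB stepsize with the standard safeguards; the names and the safeguard bounds are our own.

```python
import numpy as np

def projected_bb_step(x, x_prev, grad, grad_prev, a_min=1e-20, a_max=1e20):
    """One projected BB iteration (5) for min f(x) s.t. x >= 0.

    P is the projection onto the nonnegative orthant.
    """
    s = x - x_prev
    y = grad - grad_prev
    sy = np.vdot(s, y)
    if sy <= 0:                                  # safeguard: no positive curvature
        a = a_max
    else:
        a = min(a_max, max(a_min, np.vdot(s, s) / sy))
    return np.maximum(x - a * grad, 0.0)         # x_{k+1} = P(x_k - a_k grad f(x_k))
```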

As mentioned above, the subproblems (3) and (4) minimize a quadratic function subject to nonnegativity constraints. Thus the methods in Birgin et al. (2000) and Dai and Fletcher (2005) can be applied to solve the NMF problem (2) within the ANLS framework. Recently, Han et al. (2009) proposed four projected BB algorithms for nonsmooth NMF. Among the four, APBB2, which applies the SPG2 method of Birgin et al. (2000) to NMF, performs best. However, the projected BB strategy may not generate a satisfactory decrease due to the inexact approximation of the Hessian. In addition, the nonmonotone line search scheme may increase the objective significantly. As a result, APBB2 may require many iterations to solve the subproblems, which can be a serious disadvantage in the large-scale case. These issues motivate us to develop a new algorithm for solving NMF.

In this paper, we present a new efficient algorithm for NMF based on the ANLS framework. Our algorithm adopts a quadratic regularization projected Barzilai–Borwein (QRPBB) method as an essential subroutine. At each iteration, the QRPBB method first solves a strongly convex quadratic minimization problem, which admits a simple closed-form solution that is inexpensive to compute, and then applies the SPG2 method of Birgin et al. (2000) to update the solution of NMF. A global convergence result is established under mild conditions. Numerical experiments on synthetic, image, and text datasets show that the proposed algorithm is efficient.

The rest of this paper is organized as follows. In Sect. 2, our new method for solving the subproblems (3) and (4) is described in detail. In Sect. 3, we prove the global convergence of our method. In Sect. 4, we present experimental comparisons among several NMF methods. We conclude the paper in Sect. 5.

2 Quadratic regularization projected Barzilai–Borwein method

In this section, we present a quadratic regularization projected Barzilai–Borwein (QRPBB) method for solving the subproblems (3) and (4). Notice that the roles of the matrices \(W\) and \(H\) are perfectly symmetric: any formula designed to update \(W\) translates directly into an update for \(H\) in the original problem. So we focus only on the update of the matrix \(W\).

We need to solve the following problem for updating the matrix \(W\),

$$\begin{aligned} \min _{W\ge 0}f^{k}(W):=F\left( W,H^{k}\right) =\frac{1}{2}\left\| V-W\!H^{k}\right\| _{F}^{2}, \end{aligned}$$
(6)

which is the minimization problem in (3). To simplify the notation, we use \(f(W)\) rather than \(f^{k}(W)\) in the following part of this paper.

Clearly, the objective function \(f(W)\) is convex and its gradient is Lipschitz continuous. We summarize these properties in the following lemma, which was shown by Guan et al. (2012) for the problem in (4).

Lemma 1

The following two statements are valid.

  1. (i)

    The objective function \(f(W)\) of (6) is convex.

  2. (ii)

    The gradient

    $$\begin{aligned} \nabla f(W)=\left( W\!H^{k}-V\right) \left( H^{k}\right) ^{T} \end{aligned}$$

    is Lipschitz continuous with constant \(L=\Vert H^{k}(H^{k})^{T}\Vert _{2}\).

For any fixed \(U\ge 0\), define the quadratic regularization function by

$$\begin{aligned} \phi (U,W)=f(U)+\langle \nabla f(U),W-U\rangle +\frac{L}{2}\Vert W-U\Vert _{F}^{2}. \end{aligned}$$

From (ii) of Lemma 1, it follows that \(\phi (U,W)\) is strongly convex in \(W\) for any fixed \(U\ge 0\). At iteration \(t\), our QRPBB method first generates a point by solving the following strongly convex quadratic minimization problem

$$\begin{aligned} \min _{W\ge 0}\phi (W_{t},W). \end{aligned}$$
(7)

Clearly, problem (7) has a unique closed-form solution

$$\begin{aligned} Z_{t}=P\left[ W_{t}-\frac{1}{L}\nabla f\left( W_{t}\right) \right] , \end{aligned}$$
(8)

where the operator \(P(X)\) projects all the negative entries of \(X\) to zero.
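
For concreteness, this step can be written in a few lines of NumPy; the Lipschitz constant and the gradient follow Lemma 1, and the function name is ours.

```python
import numpy as np

def regularization_step(V, H, W):
    """The closed-form minimizer (8) of the subproblem (7)."""
    L = np.linalg.norm(H @ H.T, 2)        # spectral norm = Lipschitz constant
    grad = (W @ H - V) @ H.T              # nabla f(W), cf. Lemma 1
    return np.maximum(W - grad / L, 0.0)  # Z_t = P[W_t - (1/L) nabla f(W_t)]
```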

After calculating the point \(Z_{t}\), QRPBB applies the SPG2 method by Birgin et al. (2000) to update the matrix:

$$\begin{aligned} W_{t+1}=Z_{t}+\lambda _{t}D_{t}, \end{aligned}$$
(9)

where \(\lambda _{t}\in (0,1]\) is the step length obtained by a nonmonotone line search (Grippo and Sciandrone 2002) and

$$\begin{aligned} D_{t}=P\left[ Z_{t}-\alpha _{t}\nabla f(Z_{t})\right] -Z_{t} \end{aligned}$$
(10)

is the search direction, with \(\alpha _{t}\) being the BB stepsize with safeguards

$$\begin{aligned} \alpha _{t}=\min \left\{ \alpha _{\max }, \max \left\{ \alpha _{\min }, \frac{\langle S_{t-1},S_{t-1}\rangle }{\langle S_{t-1},Y_{t-1}\rangle }\right\} \right\} . \end{aligned}$$
(11)

Here, \(\alpha _{\max }>\alpha _{\min }>0, S_{t-1}=W_{t}-W_{t-1}\) and \(Y_{t-1}=\nabla f(W_{t})-\nabla f(W_{t-1})\).

The QRPBB method for (6) is summarized in Algorithm 2; a sketch in code is given after the listing. Both \(W^{k}\) and \(H^{k}\) are obtained from the previous iteration of the ANLS framework. We apply QRPBB to solve the minimization problem in (4) as well.

Algorithm 2

Quadratic regularization projected BB method

  • Step 1. Initialize \(\alpha _{0}=1, M>0, \alpha _{\max }>\alpha _{\min }>0, \rho ,~\gamma \in (0,1), L=\Vert H^{k}(H^{k})^{T}\Vert _{2}\), and \(W_{0}=W^{k}\). Set \(t=0.\)

  • Step 2. If \(\Vert P\left[ W_{t}-\nabla f(W_{t})\right] -W_{t}\Vert =0\), stop.

  • Step 3. Compute \(Z_{t}=P[W_{t}-\displaystyle \frac{1}{L}\nabla f(W_{t})]\).

  • Step 4. Nonmonotone line search. Let \(m_{t}\) be the smallest nonnegative integer \(m\) satisfying

    $$\begin{aligned} f\left( Z_{t}+\rho ^{m} D_{t}\right) \le \max _{0\le j\le \min \{t,M-1\}}f(Z_{t-j})+ \gamma \rho ^{m}\left\langle \nabla f(Z_{t}),D_{t}\right\rangle , \end{aligned}$$
    (12)

    where \(D_{t}=P[Z_{t}-\alpha _{t}\nabla f(Z_{t})]-Z_{t}\). Set \(\lambda _{t}:=\rho ^{m_{t}}, W_{t+1}=Z_{t}+\lambda _{t}D_{t}\).

  • Step 5. Compute

$$\begin{aligned} S_{t}=W_{t+1}-Z_{t} \quad \text {and}\quad Y_{t}=\nabla f(W_{t+1})-\nabla f(Z_{t}). \end{aligned}$$

    If \(\langle S_{t},S_{t}\rangle /\langle S_{t},Y_{t}\rangle \le 0\), set \(\alpha _{t+1}=\alpha _{\max }\); otherwise, set

    $$\begin{aligned} \alpha _{t+1}=\min \left\{ \alpha _{\max },\max \left\{ \alpha _{\min }, \langle S_{t},S_{t}\rangle /\langle S_{t},Y_{t}\rangle \right\} \right\} . \end{aligned}$$
  • Step 6. Set \(t=t+1\) and go to Step 2.
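
The following NumPy sketch mirrors Algorithm 2 for the subproblem (6). The exact zero test of Step 2 is replaced by a small tolerance, and a floor on the backtracking step guards against floating-point stagnation; both are practical safeguards of our own, not part of the algorithm as stated. Parameter values follow Sect. 4.2.

```python
import numpy as np

def qrpbb(V, H, W0, M=5, rho=0.25, gamma=1e-4,
          a_min=1e-20, a_max=1e20, tol=1e-4, max_iter=1000):
    """Sketch of Algorithm 2 (QRPBB) for (6): min_{W >= 0} f(W)."""
    HHt, VHt = H @ H.T, V @ H.T
    L = np.linalg.norm(HHt, 2)                    # Lipschitz constant (Lemma 1)
    f = lambda W: 0.5 * np.linalg.norm(V - W @ H, 'fro') ** 2
    grad = lambda W: W @ HHt - VHt                # nabla f(W) = (W H - V) H^T

    W, a = W0, 1.0                                # Step 1: alpha_0 = 1
    f_hist = []                                   # recent f(Z_j) values for (12)
    for _ in range(max_iter):
        g = grad(W)
        if np.linalg.norm(np.maximum(W - g, 0.0) - W) <= tol:
            break                                 # Step 2 (inexact version)
        Z = np.maximum(W - g / L, 0.0)            # Step 3: closed form (8)
        gZ = grad(Z)
        D = np.maximum(Z - a * gZ, 0.0) - Z       # direction (10)
        f_hist = (f_hist + [f(Z)])[-M:]
        f_ref, gD = max(f_hist), np.vdot(gZ, D)
        lam = 1.0                                 # Step 4: backtrack on (12)
        while f(Z + lam * D) > f_ref + gamma * lam * gD and lam > 1e-15:
            lam *= rho
        W_new = Z + lam * D
        S, Y = W_new - Z, grad(W_new) - gZ        # Step 5
        sy = np.vdot(S, Y)
        a = a_max if sy <= 0 else min(a_max, max(a_min, np.vdot(S, S) / sy))
        W = W_new
    return W
```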

Note that the idea of applying projected BB methods to solve the subproblems is not new. QRPBB is closely related to APBB2 of Han et al. (2009) but differs in the following three ways:

  1. (i)

In the calculation of the new iterate: our QRPBB method computes the point \(Z_{t}\) and updates \(W_{t}\) using \(Z_{t}\), generating two sequences \(\{Z_{t}\}\) and \(\{W_{t}\}\) during the iterative process. The APBB2 method produces only one sequence \(\{W_{t}\}\).

  2. (ii)

In the use of line search: our QRPBB method performs the nonmonotone line search at the new point \(Z_{t}\) rather than at \(W_{t}\). In contrast, APBB2 applies the nonmonotone line search directly to \(W_{t}\).

  3. (iii)

In the choice of initial BB stepsize: we initialize the BB stepsize to 1 at the beginning of the QRPBB method. The APBB2 method uses the maximum absolute value of the elements of the gradient to compute the initial BB stepsize, that is, \(1/\max \limits _{i,j}|\nabla f(W_{0})_{ij}|\).

Now we explain the reason for using these schemes. Our aim is to minimize \(f(W)\) over the nonnegativity constraints. As described in the previous section, the projected BB strategy with a nonmonotone line search may lead to a significant increase in the objective. In contrast, the quadratic approximation of \(f(W)\) at \(W_{t}\) given by \(\phi (W_{t},W)\) produces the point \(Z_{t}\) cheaply by (8). In the next section, we will show that \(f(Z_{t})\le f(W_{t})-\frac{L}{2}\Vert Z_{t}-W_{t}\Vert ^{2}\), which implies that \(Z_{t}\) provides a reduction in the objective. Therefore, even if the nonmonotone line search (12) increases the objective at the \(t\)th iteration, \(f(W_{t+1})\) will be no larger than the maximum of the recent values \(f(Z_{t-j})\), each of which is smaller than the corresponding \(f(W_{t-j})\). As a result, the objective value cannot increase very much, which may reduce the number of iterations needed to solve the subproblems.

It is not difficult to see that, at each iteration, the major operations of QRPBB are checking the condition (12) and computing the gradient. Thus the time complexity of one iteration of QRPBB is the same as that of PG by Lin (2007), i.e., \(O(mnr)+\#\text {sub-iterations}\times O(tmr^{2}+tnr^{2})\), where \(t\) is the number of trials of the nonmonotone line search procedure. See Lin (2007) for details of how to calculate the complexity and Guan et al. (2012) for the complexity of popular methods for NMF.

3 Convergence

In this section, we show that any accumulation point of the sequence \(\{W_{t}\}\) generated by the QRPBB method is a stationary point of (6).

From the Karush–Kuhn–Tucker (KKT) optimality conditions, \(W\) is a stationary point of (6) if and only if

$$\begin{aligned} W\ge 0,~\nabla f(W)\ge 0,~ \nabla f(W)\otimes W=0, \end{aligned}$$
(13)

where \(\otimes \) is the Hadamard product.

Define the scaled projected gradient direction by

$$\begin{aligned} D_{\alpha }(W)=P\left[ W-\alpha \nabla f(W)\right] -W, \end{aligned}$$
(14)

for all \(\alpha >0\) and \(W\ge 0\). The following lemma, which will be used in our proofs, is easy to prove; see Birgin et al. (2000), for example.

Lemma 2

For all \(\alpha \in (0,\alpha _{\max }], W\ge 0\),

  1. (i)

    \(\langle \nabla f(W),D_{\alpha }(W)\rangle \le -\displaystyle \frac{1}{\alpha }\Vert D_{\alpha }(W)\Vert ^{2}\le -\displaystyle \frac{1}{\alpha _{\max }}\Vert D_{\alpha }(W)\Vert ^{2}\),

  2. (ii)

    \(W\) is a stationary point of (6) if and only if \(D_{\alpha }(W)=0\).

By Theorem 2.2 of Birgin et al. (2000), the nonmonotone line search in Step 4 of Algorithm 2 is well defined. Therefore, Algorithm 2 is well defined.

Let \(l(t)\) be an integer such that \(t-\min \{t,M-1\}\le l(t)\le t\) and

$$\begin{aligned} f(Z_{l(t)})=\max \limits _{0\le j\le \min \{t,M-1\}}f(Z_{t-j}). \end{aligned}$$

By Lemma 2, we have the following lemma.

Lemma 3

Assume that the sequences \(\{W_{t}\}\) and \(\{Z_{t}\}\) are generated by Algorithm 2. Then the following statements are valid.

  1. (i)

    \(f(Z_{t})\le f(W_{t})-\displaystyle \frac{L}{2}\Vert Z_{t}-W_{t}\Vert ^{2}.\)

  2. (ii)

    The sequence \(\{f(Z_{l(t)})\}\) is nonincreasing.

Proof

By (8) and the left inequality of (i) in Lemma 2, one has

$$\begin{aligned} \left\langle \nabla f(W_{t}),Z_{t}-W_{t}\right\rangle \le -L\left\| Z_{t}-W_{t}\right\| ^{2}. \end{aligned}$$

It then follows from the Lipschitz continuity of \(\nabla f\) (Lemma 1) that

$$\begin{aligned} f(Z_{t})&\le f(W_{t})+\left\langle \nabla f(W_{t}),Z_{t}-W_{t}\right\rangle +\frac{L}{2}\left\| Z_{t}-W_{t}\right\| ^{2}\\&\le f(W_{t})-\frac{L}{2}\left\| Z_{t}-W_{t}\right\| ^{2}, \end{aligned}$$

which proves the first part.

For the second part, observe that

$$\begin{aligned} f(Z_{l(t+1)})&=\max \limits _{0\le j\le \min \{t+1,M-1\}}f(Z_{t+1-j}) \\&\le \max \left\{ f(Z_{l(t)}),f(Z_{t+1})\right\} \\&\le \max \left\{ f(Z_{l(t)}),f(W_{t+1})\right\} \\&=f(Z_{l(t)}), \end{aligned}$$

where the second inequality follows from part (i) and the last equality follows from (12). This completes the proof. \(\square \)

It follows from Lemma 3 that

$$\begin{aligned} f(W_{t+1})\le f(Z_{l(t)})\le f(Z_{l(t-1)})\le \cdots \le f(Z_{0}). \end{aligned}$$

Moreover, for any initial iterate \(W_{0}\ge 0\) the sequences \(\{W_{t}\}\) and \(\{Z_{t}\}\) generated by Algorithm 2 are contained in the level set

$$\begin{aligned} \mathcal {L}(W_{0})=\left\{ W|f(W)\le f(W_{0}), \quad W\ge 0\right\} . \end{aligned}$$

Now we use an argument similar to the one in Lemma 2.1 of Zhang and Hager (2004) to prove a lower bound for the step length \(\lambda _{t}\).

Lemma 4

If \(Z_{t}\) is not a stationary point of (6), then there exists a constant \(\tilde{\lambda }>0\) such that \(\lambda _{t}\ge \tilde{\lambda }\).

Proof

Since, by Lemma 1, \(\nabla f(W)\) is Lipschitz continuous with constant \(L\), we have

$$\begin{aligned} f(Z_{t}+\lambda D_{t})-f(Z_{t})&=\lambda \left\langle \nabla f(Z_{t}),D_{t}\right\rangle +\int _{0}^{\lambda }\left\langle \nabla f(Z_{t}+tD_{t})- \nabla f(Z_{t}), D_{t}\right\rangle dt \nonumber \\&\le \lambda \left\langle \nabla f(Z_{t}),D_{t}\right\rangle +\int _{0}^{\lambda }Lt\Vert D_{t}\Vert ^{2}dt \nonumber \\&=\lambda \left\langle \nabla f(Z_{t}),D_{t}\right\rangle +\frac{L}{2}\lambda ^{2}\Vert D_{t}\Vert ^{2}. \end{aligned}$$
(15)

Either \(\lambda _{t}=1\) or the inequality (12) fails at least once during the line search. In the latter case,

$$\begin{aligned} f\left( Z_{t}+\frac{\lambda _{t}}{\rho }D_{t}\right)&> f(Z_{l(t)})+ \gamma \frac{\lambda _{t}}{\rho }\left\langle \nabla f(Z_{t}),D_{t}\right\rangle \\&\ge f(Z_{t})+ \gamma \frac{\lambda _{t}}{\rho }\left\langle \nabla f(Z_{t}),D_{t}\right\rangle . \end{aligned}$$

Combining this with (15) and (i) of Lemma 2, we obtain

$$\begin{aligned} \lambda _{t}\ge \min \left\{ 1,\frac{2\rho (1-\gamma )}{L}\frac{|\langle \nabla f(Z_{t}),D_{t}\rangle |}{\Vert D_{t}\Vert ^{2}}\right\} . \end{aligned}$$

From the right inequality of (i) of Lemma 2, one has

$$\begin{aligned} \langle \nabla f(Z_{t}),D_{t}\rangle \le - \frac{1}{\alpha _{\max }}\Vert D_{t}\Vert ^{2}. \end{aligned}$$

Thus,

$$\begin{aligned} \lambda _{t}\ge \min \left\{ 1,\frac{2\rho (1-\gamma )}{L\alpha _{\max }}\right\} :=\tilde{\lambda }. \end{aligned}$$

\(\square \)

The following theorem establishes the global convergence of the QRPBB method.

Theorem 1

Suppose that the level set \(\mathcal {L}(W_{0})\) is bounded. Then any accumulation point of the sequence \(\{Z_{t}\}\) generated by Algorithm 2 is a stationary point of (6).

Proof

Since \(\mathcal {L}(W_{0})\) is bounded, \(\{Z_{t}\}\) admits an accumulation point \(Z_{*}\); passing to a subsequence if necessary, we may assume that \(\{Z_{t}\}\) converges to \(Z_{*}\).

Let us suppose by way of contradiction that \(Z_{*}\) is not a stationary point of (6). Then it follows from (ii) of Lemma 2 that \(\Vert D_{\alpha }(Z_{*})\Vert >0\) for all \(\alpha \in (0,\alpha _{\max }]\). By continuity and compactness, there exists \(\eta >0\) such that \(\Vert D_{\alpha }(Z_{*})\Vert >\eta \) for all \(\alpha \in [\alpha _{\min },\alpha _{\max }]\). Thus, for sufficiently large \(t\), \(\Vert D_{\alpha _{t}}(Z_{t})\Vert >\displaystyle \frac{\eta }{2}\). Hence, using Lemmas 2, 3, and 4, we obtain

$$\begin{aligned} f(Z_{l(t)})&\le f(W_{l(t)})\\&= f\left( Z_{l(t)-1}+\lambda _{l(t)-1}D_{l(t)-1}\right) \\&\le f\left( Z_{l(l(t)-1)}\right) + \gamma \lambda _{l(t)-1}\left\langle \nabla f(Z_{l(t)-1}),D_{l(t)-1}\right\rangle \\&\le f\left( Z_{l(l(t)-1)}\right) -\frac{\gamma \tilde{\lambda }}{\alpha _{\max }}\Vert D_{l(t)-1}\Vert ^{2}\\&\le f\left( Z_{l(l(t)-1)}\right) -\frac{\gamma \tilde{\lambda }\eta ^{2}}{4\alpha _{\max }}, \end{aligned}$$

which implies that

$$\begin{aligned} f(Z_{l(t)})\rightarrow -\infty ,~t\rightarrow \infty . \end{aligned}$$

This yields the desired contradiction, since \(f\ge 0\). Therefore, \(Z_{*}\) is a stationary point of (6). \(\square \)

By the proof of Theorem 1 and Lemma 3, we have the following result.

Corollary 1

Suppose that the level set \(\mathcal {L}(W_{0})\) is bounded. Then any accumulation point of the sequence \(\{W_{t}\}\) generated by Algorithm 2 is a stationary point of (6).

4 Experiments

We have tested the ANLS framework using our QRPBB method and compared it with three other ANLS-based methods: NeNMF (Guan et al. 2012), the projected gradient method (PG) (Lin 2007), and the projected BB method (APBB2) (Han et al. 2009). We compared these methods on both synthetic datasets and real-world problems using the ORL image database, the CBCL face image database, and the Reuters-21578 (Lewis et al. 2004) and TDT-2 (Cieri et al. 1999) text corpora. All the methods were implemented in MATLAB. (Codes of our QRPBB method are available upon request.) The numerical tests were carried out on a 3.10 GHz Core i5 PC with 4 GB of RAM under Windows 7.

4.1 Stopping criterion

The KKT conditions of (2) can be written as

$$\begin{aligned} \nabla _{H}^{P}F(W,H)=0,~\nabla _{W}^{P}F(W,H)=0, \end{aligned}$$
(16)

where \(\nabla _{W}^{P}F(W,H)\) is defined by

$$\begin{aligned} \left( \nabla _{W}^{P}F(W,H)\right) _{ij}=\left\{ \begin{array}{ll} \left( \nabla _{W}F(W,H)\right) _{ij}, &{} (W)_{ij}>0, \\ \min \left\{ 0,\left( \nabla _{W}F(W,H)\right) _{ij}\right\} , &{} (W)_{ij}=0, \end{array}\right. \end{aligned}$$

and \(\nabla _{H}^{P}F(W,H)\) is defined in the same way. We use the same stopping criterion as Lin (2007),

$$\begin{aligned}&\Big \Vert \big [\nabla _{H}^{P}F(W^{k},H^{k}),\nabla _{W}^{P}F(W^{k},H^{k})^{T}\big ]\Big \Vert _{F} \nonumber \\&\quad \le \epsilon \Big \Vert \left[ \nabla _{H}F(W^{1},H^{1}),\nabla _{W}F(W^{1},H^{1})^{T}\right] \Big \Vert _{F}. \end{aligned}$$
(17)

Here, \(\epsilon >0\) is a tolerance.
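
For reference, the quantity on the left-hand side of (17) can be computed as follows; this is a sketch with our own names, where proj_grad implements the definition of the projected gradient given after (16).

```python
import numpy as np

def projected_gradient_norm(V, W, H):
    """The projected-gradient norm appearing on the left of (17)."""
    def proj_grad(G, X):
        # keep the full gradient where X > 0; only negative parts where X = 0
        return np.where(X > 0, G, np.minimum(G, 0.0))

    R = W @ H - V
    pW = proj_grad(R @ H.T, W)       # projected nabla_W F
    pH = proj_grad(W.T @ R, H)       # projected nabla_H F
    return np.sqrt(np.linalg.norm(pW, 'fro') ** 2 +
                   np.linalg.norm(pH, 'fro') ** 2)
```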

For the subproblems (3) and (4), we stop the iterative procedures if

$$\begin{aligned} \left\| \nabla _{W}^{P}F\left( W^{k+1},H^{k}\right) \right\| _{F}\le \epsilon _{W} \quad \text {and}\quad \left\| \nabla _{H}^{P}F\left( W^{k+1},H^{k+1}\right) \right\| _{F}\le \epsilon _{H}, \end{aligned}$$
(18)

where

$$\begin{aligned} \epsilon _{W}&=\epsilon _{H}=\max (10^{-3},\epsilon ) \Big \Vert \left[ \nabla _{H}F\left( W^{1},H^{1}\right) ,\nabla _{W}F\left( W^{1},H^{1}\right) ^{T}\right] \Big \Vert _{F}. \end{aligned}$$

If QRPBB terminates on (3) without performing any iterations, we tighten the stopping tolerance by setting \(\epsilon _{W}=0.1\epsilon _{W}\). The same strategy is adopted for (4). Note that the other three algorithms also use the same stopping criterion (17).

We also used 10,000 as the maximal number of iterations and 1,000 seconds as the maximum CPU time allowed for each method. In the implementation of QRPBB, we allowed at most 1,000 iterations for solving each of the subproblems (3) and (4).

4.2 Synthetic data

We first tested QRPBB and the other three methods on synthetic datasets. The matrix \(V\) was randomly generated by the rand function in MATLAB. All initial points \((W^{1},H^{1})\) were randomly generated in the same way as \(V\).

We use the following parameters for our QRPBB method in all the experiments of the remaining subsections:

$$\begin{aligned} \alpha _{\max }=10^{20},~\alpha _{\min }=10^{-20},~\rho =0.25,~\gamma =10^{-4}. \end{aligned}$$

This setting is popular for most BB-type methods and is the same as that of APBB2. Note that \(M\) stands for the number of recent objective values among which the maximum is chosen for the nonmonotone line search. Since our tests do not show meaningful differences between \(M=5\) and \(M=10\), and our QRPBB method requires few iterations to solve the subproblems, we use \(M=5\). All parameters of the other methods were set to their default values.

For each problem, 10 different initial points were randomly generated, and the average results over these initial values are reported in Table 1. The number iter denotes the number of iterations needed until the termination criterion (17) was met. The numbers niter, pgn, residual, and time denote the total number of sub-iterations for solving (3) and (4), the final value of the projected gradient norm \(\Vert \big [\nabla _{H}^{P}F(W^{k},H^{k}),\nabla _{W}^{P}F(W^{k},H^{k})^{T}\big ]\Vert _{F}\), the final value of \(\Vert V-W^{k}H^{k}\Vert _{F}/\Vert V\Vert _{F}\), and the CPU time used at termination of each algorithm, respectively.

Table 1 Experimental results on synthetic datasets (dense matrices) with \(\epsilon =10^{-7}\)

As shown in Table 1, all of the compared methods satisfied the convergence criterion within a reasonable number of iterations. For most of the test problems, QRPBB showed the shortest execution time and the smallest number of sub-iterations, especially for large problems. The final values of pgn generated by QRPBB are smaller than those of NeNMF and PG, and close to those of APBB2. On the other hand, the final values of residual obtained by QRPBB and APBB2 are slightly larger than those of NeNMF and PG. Two factors can account for this phenomenon. First, since the NMF problem (2) is not convex, these methods may converge to different points. Second, these methods may reach different final \((W^{k},H^{k})\) due to their different update strategies.

Since QRPBB is closely related to APBB2, we further examined the behavior of the two methods. Figure 1 compares QRPBB and APBB2 on two randomly generated problems of the form (3). We terminate the methods if the stopping criterion described by the first inequality in (18) is satisfied with \(\epsilon _{W}=10^{-4}\) or the iteration number exceeds 1,000. Clearly, the objective curve generated by QRPBB oscillates less sharply than that of APBB2. Moreover, QRPBB requires fewer than half the iterations of APBB2 to reach the given tolerance. These results are consistent with our analysis in Sect. 2. The improvement becomes more obvious when solving NMF problems. Taking the last problem in Table 1 as an example, QRPBB needs only 16.4 iterations and 292.5 sub-iterations on average, while APBB2 requires around 6 times as many iterations and 7 times as many sub-iterations. In addition, most of the residuals obtained by QRPBB are the same as those of APBB2. However, QRPBB can sometimes give slightly larger final values of pgn and residual than APBB2. This phenomenon can be explained by the same factors discussed above.

Fig. 1
figure 1

Objective value versus iteration on randomly generated problems \(\min \limits _{W\ge 0}~\frac{1}{2}\Vert V-W\!H\Vert _{F}^{2}\) with \((m,n,r)=(100,50,10)\) (left) and \((m,n,r)=(2{,}000,1{,}000,50)\) (right)

4.3 Image data

The ORL image database contains 400 facial images of 40 different people, with 10 images per person. The size of each image is \(92\times 112\) pixels, with 256 grey levels per pixel. The matrix \(V\) whose columns represent these images has \(92\times 112=10{,}304\) rows and 400 columns.

The CBCL image database consists of 2,429 facial images, each with \(19\times 19\) pixels, yielding a \(361\times 2{,}429\) matrix.

For each database, we tried 10 different randomly generated initial iterates with \(\epsilon =10^{-7}\) in (17) and present the average results in Table 2. For both databases, our QRPBB method converges in fewer iterations and less CPU time. Although the residual of QRPBB is not the smallest for each database, the values of pgn imply that the solutions of QRPBB are closer to stationary points.

Table 2 Experimental results on the CBCL and ORL databases with \(\epsilon =10^{-7}\)

We set \(\epsilon \) in (17) to different values in order to investigate the speed of convergence to a stationary point; the results are shown in Fig. 2. The four methods perform similarly when a loose tolerance is given. However, when a tighter tolerance is used, QRPBB is clearly faster than the other three methods.

Fig. 2
figure 2

CPU time (s) versus tolerance values on the CBCL (left) and ORL (right) datasets. All methods were executed with the same initial values, and the average results using 10 different initial values are presented

4.4 Text data

The Reuters-21578 corpus contains 21,578 documents in 135 categories. Documents with multiple category labels were discarded, leaving 8,293 documents in 65 categories. After preprocessing, this corpus contains 18,933 distinct terms, represented by an \(18{,}933\times 8{,}293\) matrix.

The TDT-2 corpus contains news articles from various sources, such as ABC, CNN, VOA, NYT, PRI, and APW, in 1998. It consists of 11,201 on-topic documents classified into 96 semantic categories. Documents appearing in two or more categories were removed, and only the largest 30 categories were kept, leaving 9,394 documents represented by a \(36{,}771\times 9{,}394\) matrix.

As in the previous subsection, we first set \(\epsilon =10^{-7}\) and then tried different tolerance values. The average results over 10 different initial iterates are presented in Table 3 and Fig. 3. We observe from Table 3 that NeNMF requires fewer iterations than the other methods when the tolerance is set to \(10^{-7}\), especially for the TDT2 dataset, but its residual and pgn values are large. Moreover, Fig. 3 shows that NeNMF is slower than QRPBB, which has the fastest final convergence under a tight tolerance.

Fig. 3
figure 3

CPU time (s) versus tolerance values on the Reuters-21578 (left) and TDT2 (right) datasets. All methods were executed with the same initial values, and the average results using 10 different initial values are presented

Table 3 Experimental results on the Reuters-21578 and TDT2 datasets with \(\epsilon =10^{-7}\)

4.5 Document clustering

Recently, the use of NMF for document clustering has attracted much interest (Cai et al. 2005, 2011; Cichocki et al. 2009; Pauca 2004; Shahnaz et al. 2006). We conducted clustering experiments on the Reuters-21578 and TDT2 datasets. Each document was represented by a normalized term frequency vector. There are 50 randomly selected cases for each cluster number from 2 to 10. These randomly selected data in MATLAB format were downloaded from http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html. We use the randn function in MATLAB to generate initial iterates for each experiment. We normalize the matrix \(W\) during the iterative process in a similar way to Xu et al. (2003). We report the average accuracy and the average normalized mutual information (NMI) (Xu et al. 2003) over the 50 trials for each cluster number, together with the corresponding average CPU time.

Figure 4 shows that QRPBB, PG, and APBB2 obtain fairly good accuracy, better than that of NeNMF. For most cluster numbers, the NMI obtained by PG is slightly larger than those of QRPBB and APBB2, while NeNMF gives the smallest NMI. QRPBB converges faster than PG and APBB2 and slightly slower than NeNMF, especially for a large cluster number. From Fig. 5, conclusions similar to those from Fig. 4 can be drawn. QRPBB, PG, and APBB2 outperform NeNMF in both accuracy and NMI. For some cluster numbers, QRPBB yields lower accuracy and smaller NMI than either PG or APBB2. One possible reason for this observation is that document clusters in TDT2 are generally more compact and focused than those in Reuters. As a result, the accuracy and NMI on TDT2 may be more sensitive to the values of the residual than those on Reuters. Moreover, Table 3 shows that PG and APBB2 can produce smaller residuals than QRPBB and NeNMF with tolerance \(10^{-7}\). Overall, our QRPBB method is competitive with the other three methods.

Fig. 4
figure 4

Accuracy, normalized mutual information, and CPU time (s) versus cluster number obtained by PG, NeNMF, APBB2, and QRPBB on the Reuter-21578 dataset. For each cluster number, average results by 50 trials with tolerance \(10^{-7}\) are presented

Fig. 5
figure 5

Accuracy, normalized mutual information, and CPU time (s) versus cluster number obtained by PG, NeNMF, APBB2, and QRPBB on the TDT2 dataset. For each cluster number, average results by 50 trials with tolerance \(10^{-7}\) are presented

As we know, the clustering performance can degrade if the objective is optimized too much. We set different tolerance values to investigate the performance of QRPBB on clustering the Reuter-21578 and TDT-2 datasets and present the results in Fig. 6. For the Reuter-21578 dataset, we find that when the tolerance is set to \(10^{-9}\), 5 of the 9 clustering accuracy results are higher than those obtained with a looser tolerance. Similar results are obtained for the TDT2 dataset, where the improvement in clustering accuracy is more obvious. However, we cannot expect better performance in all cases with a tight tolerance.

Fig. 6
figure 6

Accuracy versus tolerance values on clustering the Reuter-21578 (left) and TDT-2 (right) datasets by QRPBB. For each cluster number, average results by 50 trials are presented

4.6 Nonnegative sparse coding

Sparse coding aims to find a small set of basis vectors that capture high-level patterns in the input data. By adding a penalty term to the NMF objective, Hoyer (2004) proposed nonnegative sparse coding:

$$\begin{aligned} \min _{W\ge 0,~ H\ge 0}\frac{1}{2}\Vert V-WH\Vert ^{2}_{F}+\beta \Vert H\Vert _{1}, \end{aligned}$$
(19)

where \(\Vert \cdot \Vert _{1}\) is the entrywise \(L_{1}\)-norm and \(\beta \) is a constant controlling the sparsity of \(H\). It was shown by Guan et al. (2012) that the gradient of the objective in (19) is Lipschitz continuous with respect to each of \(W\) and \(H\). Since \(H\ge 0\), (19) is equivalent to

$$\begin{aligned} \min _{W\ge 0,~ H\ge 0}\frac{1}{2}\Vert V-WH\Vert ^{2}_{F}+\beta \sum _{i,j}H_{ij}. \end{aligned}$$
(20)

Thus we can apply QRPBB within the ANLS framework to solve (20); only the gradient of the \(H\)-subproblem changes, as sketched below.
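
On the nonnegative orthant the \(L_{1}\) penalty in (20) is linear, so it only adds the constant \(\beta \) to every entry of the gradient of the \(H\)-subproblem. A sketch with our own names:

```python
import numpy as np

def grad_H_sparse(V, W, H, beta):
    """Gradient of the H-subproblem of (20)."""
    return W.T @ (W @ H - V) + beta   # nabla_H = W^T (W H - V) + beta * ones
```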

We tested the four methods on a set of natural images used in Hoyer (2004), which provides a \(288\times 10{,}000\) matrix. We also compared the results with those obtained by the nonnegative sparse coding algorithm (NNSC) proposed by Hoyer (2004).

Since NNSC uses the MU scheme to update the matrices, the stopping criterion defined in (17) is not suitable for it. We instead use the following stopping criterion suggested by Guan et al. (2012):

$$\begin{aligned} \frac{F(W^{k},H^{k})-F(W^{k+1},H^{k+1})}{F(W^{1},H^{1})-F(W^{k+1},H^{k+1})}\le \epsilon . \end{aligned}$$
(21)

We set \(\beta =10\) and \(\epsilon =10^{-4}\) for all methods and ran them from the same initial point. The stepsize for NNSC was set to 1. We report the results in Table 4, where spH is the sparseness of the final matrix \(H^{k}\) measured by

$$\begin{aligned} \mathrm {spH} =\frac{\text {number of zero entries in }H}{\text {total number of entries in }H}, \end{aligned}$$

where \(H_{ij}\) is considered a zero entry if \(|H_{ij}|<10^{-6}\); a small snippet computing this measure is given below. We can see from Table 4 that all five methods give similar residuals and yield a sparse nonnegative factorization in this example. In particular, NNSC generates the smallest pgn, while NeNMF gives the sparsest solution and a smaller pgn than the remaining three methods. However, QRPBB produces a solution as sparse as that of NeNMF while taking around one quarter of the CPU time of NeNMF.
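
In code, the spH measure amounts to a one-liner; the function name and threshold parameter are our own.

```python
import numpy as np

def sparseness(H, zero_tol=1e-6):
    """The spH measure above: the fraction of near-zero entries of H."""
    return float(np.mean(np.abs(H) < zero_tol))
```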

Table 4 Experimental results on the natural image dataset with \(\epsilon =10^{-4}\), where \(m=288,~n=10{,}000,~r=72\). The final objective value is approximately \(1.34\times 10^{6}\) for all methods

5 Conclusions

We have proposed an efficient method for nonnegative matrix factorization (NMF) based on the alternating nonnegative least squares (ANLS) framework. Our approach adopts a quadratic regularization projected Barzilai–Borwein (QRPBB) method to solve the subproblems, which leads to fast convergence. Experimental comparisons with three other ANLS-based NMF algorithms on both synthetic and real-world datasets show that the QRPBB method is promising.

Our QRPBB method can be applied to other types of NMF, such as nonsmooth NMF (Pascual-Montano et al. 2006; Han et al. 2009) and graph-regularized NMF (Cai et al. 2011). It would be interesting to see whether the QRPBB method performs well on these problems. We have tested the methods on some nonsmooth NMF problems, and the results showed that the QRPBB method converges faster in terms of CPU time. More thorough experiments involving different types of data are needed for a comprehensive comparison.