1 Introduction

Nonnegative matrix factorization (NMF) [16, 17] is the problem of approximating a given large nonnegative matrix \(V\) by the product \(WH\) of two low-rank nonnegative matrices \(W\) and \(H\). If we consider the columns of \(V\) as data vectors, the columns of \(W\) and those of \(H\) are interpreted as a set of nonnegative basis vectors and a set of nonnegative coefficient vectors, respectively. Each data vector is thus reproduced approximately by a linear combination of the basis vectors with coefficients stored in the corresponding column of \(H\). In this sense, NMF can generate a reduced representation of the original data. Moreover, the basis vectors often represent parts of the objects because of the nonnegativity constraints [16]. This is a significant difference between NMF and other factorization methods such as principal component analysis. So far, NMF has been successfully applied to various problems in machine learning, signal processing and so on [2, 5, 7, 15, 16, 20, 22, 25].

Usually, NMF is formulated as a constrained optimization problem in which the approximation error has to be minimized with respect to \(W\) and \(H\) subject to the nonnegativity of these matrices. Lee and Seung [17] considered the cases where the approximation error is measured by the Euclidean distance and the I-divergence, and proposed iterative methods called the multiplicative updates. These updates are widely used as simple and efficient computational methods for NMF because of the following three advantages. First, the updates do not contain parameters like the step size in gradient descent methods, and therefore parameter tuning is not needed. Second, nonnegativity of the matrices \(W^{k}\) and \(H^{k}\), the solution after \(k\) iterations, is automatically satisfied if the initial matrices \(W^{0}\) and \(H^{0}\) are chosen to be positive. Third, implementation is easy because the update formulae are very simple.

However, the multiplicative updates of Lee and Seung have a serious drawback that their global convergence is not guaranteed theoretically. By global convergence, we mean that, for any initial solution, the sequence of solutions contains at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the corresponding optimization problem. The main difficulty in proving global convergence is that the updates, which are expressed in the form of a fraction, are not defined for all pairs of nonnegative matrices. Hence the convergence analysis of the multiplicative updates and their variants is an important research issue in NMF, and many authors have addressed this problem so far [1, 10, 12, 18]. Finesso and Spreij [10] studied convergence properties of the multiplicative update based on the I-divergence minimization and proved, under the assumption that \(W^{k}\) is normalized after each update so that its Frobenius norm becomes one, that the sequences of \(W^{k}\) and \(W^{k}H^{k}\) always converge. However, their result does not guarantee convergence of the sequence of \(H^{k}\). Lin [18] considered the case of the Euclidean distance minimization and showed that some modifications to the original multiplicative update can make it well-defined and globally convergent. However, since Lin’s modified update is not multiplicative but additive in some cases, this result cannot be directly applied to the original update. Recently, Badeau et al. [1] studied local stability of a generalized multiplicative update, which includes the multiplicative updates of Lee and Seung as special cases, using Lyapunov’s stability theory and showed that the local optimal solution of the corresponding optimization problem is asymptotically stable if one of two matrices \(W^{k}\) and \(H^{k}\) is fixed for all \(k\).

The objective of this paper is to show that a slight modification can guarantee global convergence of the multiplicative updates of Lee and Seung [17]. Our attention is focused on the modification proposed by Gillis and Glineur [12]. Their update, which is a modified version of the Euclidean distance-based multiplicative update of Lee and Seung [17], returns a user-specified positive constant if the original update returns a value less than the constant. Note that, unlike the updates of Lin [18] and Finesso and Spreij [10], no normalization procedure is involved. Gillis and Glineur proved that their modified multiplicative update decreases the objective function monotonically and that if a sequence of solutions generated by the update has a limit point then it is necessarily a stationary point of the corresponding optimization problem [12]. However, this does not imply global convergence of the update.

In this paper, we consider not only the Euclidean distance-based multiplicative update but also the I-divergence-based one, and prove that their global convergence is guaranteed if they are modified as described by Gillis and Glineur [12]. Our proof is based on Zangwill’s global convergence theorem [28, p. 91] which is a fundamental result in optimization theory and has played important roles in the convergence analysis of many algorithms in machine learning [21, 23, 26]. We also propose two algorithms based on the modified updates. They always stop within a finite number of iterations after finding an approximate stationary point of the optimization problem.

There are many other approaches that attempt to solve NMF optimization problems. For example, some authors modified the multiplicative updates of Lee and Seung by adding a small positive constant to the denominators so that they are defined for all nonnegative matrices [3, 18]. Also, some authors proposed to apply different optimization techniques to NMF optimization problems [3, 7, 19]. Furthermore, some authors derived a variety of multiplicative updates by considering various types of divergence between \(V\) and \(WH\) [1, 6, 9, 27]. Although these updates are potentially superior in some cases, we will not consider them in this paper.

The rest of this paper is organized as follows. In Sect. 2, we introduce briefly the NMF optimization problems and the multiplicative updates of Lee and Seung. In Sect. 3, the modified multiplicative updates based on the idea of Gillis and Glineur are first introduced and then convergence theorems for these updates are presented. In addition, algorithms based on the modified multiplicative updates are proposed and their finite termination is proved. In Sect. 4, the convergence theorems in Sect. 3 are proved using Zangwill’s global convergence theorem. Finally, in Sect. 5, we conclude the paper with a brief summary.

Part of this paper (Theorem 1 in Sect. 3) was presented in the authors’ conference paper [14]. However, no rigorous proof was given there because of space limitation. In this paper, not only Theorem 1 but also some new results (Theorems 2, 3 and 4) are presented with their complete proofs.

2 Nonnegative matrix factorization and multiplicative updates

Given a nonnegative matrix \(V \in \mathbb{R}_{+}^{n \times m}\) where \(\mathbb{R}_{+}\) denotes the set of nonnegative real numbers, and a positive integer \(r \leq \min(n,m)\), NMF is the problem of finding two nonnegative matrices \(W\in \mathbb{R}_{+}^{n \times r}\) and \(H \in \mathbb{R}_{+}^{r \times m}\) such that \(V\) is approximately equal to \(WH\) (see Fig. 1). Throughout this paper, we assume the following.

Fig. 1

Nonnegative matrix factorization. A given nonnegative matrix \(V\) is approximated by the product of two nonnegative matrices \(W\) and \(H\). The \(j\)-th column of \(V\) is approximated by the linear combination of the columns of \(W\) whose coefficients are the elements of the \(j\)-th column of \(H\)

Assumption 1

Each row and column of \(V\) has at least one nonzero element.

Let us consider each column of \(V\) as a data vector. If the value of \(r\) is sufficiently small, a compact expression for the original data can be obtained through NMF because the total number of elements in the factor matrices \(W\) and \(H\) is less than that of the original matrix \(V\). Moreover, the columns of \(W\) are regarded as a kind of basis for the space spanned by the columns of \(V\) because each data vector can be approximated by a linear combination of the columns of \(W\) (see Fig. 1).

Lee and Seung [17] employed the Euclidean distance and the I-divergence for the approximation error between \(V\) and \(WH\), and formulated NMF as two types of optimization problems. In the former case, the problem is expressed as

$$ \begin{aligned} &\mbox{minimize} \quad f_{\rm E}(W,H)=\|V-WH \|^{2} \\ &\mbox{subject to} \quad W \geq 0,\quad H \geq 0, \end{aligned} $$
(1)

where \(\|\cdot\|\) represents the Frobenius norm, that is,

$$ \|V-WH\|^{2}=\sum_{i=1}^{n} \sum _{j=1}^{m} \bigl(V_{ij}-(WH)_{ij} \bigr)^{2}, $$

and the inequality \(W \geq 0\) (resp. \(H \geq 0\)) means that all elements of the matrix \(W\) (resp. \(H\)) are nonnegative. In the latter case, the problem is expressed as

$$ \begin{aligned} &\mbox{minimize}\quad f_{\rm D}(W,H)=D(V\|WH) \\ &\mbox{subject to}\quad W \geq 0,\quad H \geq 0, \end{aligned} $$
(2)

where \(D(\cdot\|\cdot)\) is defined by

$$ D(V\|WH)=\sum_{i=1}^{n} \sum _{j=1}^{m} \biggl\{V_{ij} \log \frac{V_{ij}}{(WH)_{ij}}-V_{ij}+(WH)_{ij} \biggr\}. $$

It is difficult in both cases to find a global optimal solution because the objective functions \(f_{\rm E}(W,H)\) and \(f_{\rm D}(W,H)\) are not convex. In fact, NP-hardness of NMF was proved by Vavasis [24]. Therefore, we have to settle for the second-best goal, that is, finding a local optimal solution instead of a global one. For this purpose, Lee and Seung [17] proposed the update rule

$$ \begin{aligned} H_{aj}^{k+1} & = H_{aj}^{k} \frac{((W^{k})^{T}V)_{aj}}{((W^{k})^{T}W^{k}H^{k})_{aj}}, \\ W_{ia}^{k+1} & = W_{ia}^{k} \frac{(V(H^{k+1})^{T})_{ia}}{(W^{k}H^{k+1}(H^{k+1})^{T})_{ia}}, \end{aligned} $$
(3)

for the optimization problem (1), and the update rule

$$ \begin{aligned} H_{aj}^{k+1} &= H_{aj}^{k} \frac{\sum_{i=1}^{n} W_{ia}^{k}V_{ij}/(W^{k}H^{k})_{ij}}{\sum_{i=1}^{n} W_{ia}^{k}}, \\ W_{ia}^{k+1} &= W_{ia}^{k} \frac{\sum_{j=1}^{m} H^{k+1}_{aj}V_{ij}/(W^{k}H^{k+1})_{ij}}{\sum_{j=1}^{m} H_{aj}^{k+1}}, \end{aligned} $$
(4)

for the optimization problem (2), where \(k\) represents the iteration count.Footnote 1 Updates of the form (3) and (4) are called multiplicative updates because the new estimate is given by the product of the current estimate and some factor. An advantage of these multiplicative updates is that, unlike conventional gradient descent methods, there are no parameters to tune. Another advantage is that positivity of \(W^{k}\) and \(H^{k}\) is guaranteed for all \(k\) under Assumption 1 if the initial matrices \(W^{0}\) and \(H^{0}\) are chosen to be positive [19]. For these reasons, the multiplicative updates (3) and (4) are widely used as simple and effective methods for finding local optimal solutions of (1) and (2).
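
As an illustration, one iteration of the update rules (3) and (4) can be written in a few lines of NumPy; the function names and broadcasting conventions below are ours and purely illustrative.

```python
import numpy as np

def euclidean_update(V, W, H):
    """One iteration of the multiplicative update (3): H first, then W with the new H."""
    H = H * (W.T @ V) / (W.T @ W @ H)
    W = W * (V @ H.T) / (W @ H @ H.T)
    return W, H

def i_divergence_update(V, W, H):
    """One iteration of the multiplicative update (4)."""
    H = H * (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
    W = W * ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
    return W, H
```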

3 Modified multiplicative updates and their global convergence

The most serious drawback of the multiplicative update rules described by (3) and (4) is that the right-hand sides are not defined for all nonnegative matrices \(W^{k}\) and \(H^{k}\) (or \(H^{k+1}\)). For example, in the case of Euclidean distance, we cannot obtain \(H^{k+1}\) by the update rule (3) when \(H^{k}=0\), because the denominator of the first equation vanishes.

As mentioned in Sect. 2, \(W^{k}\) and \(H^{k}\) are positive for all \(k\) if the initial matrices \(W^{0}\) and \(H^{0}\) are chosen to be positive. Hence the updates can be performed infinitely many times. However, even though the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) converges, it is not guaranteed that both \(\lim_{k \rightarrow \infty} W^{k}\) and \(\lim_{k \rightarrow \infty} H^{k}\) are positive. This means that the update rules may not be defined at \(\lim_{k \rightarrow \infty}(W^{k},H^{k})\), which makes it difficult to prove their global convergence using known results such as Zangwill’s global convergence theorem [28, p. 91].

In this section, we introduce slightly modified versions of the update rules (3) and (4) which are based on the idea of Gillis and Glineur [12], and present convergence theorems. We also propose two algorithms based on the modified updates and prove their finite termination.

3.1 Euclidean distance

In order to prevent elements of matrices \(W^{k}\) and \(H^{k}\) from vanishing, Gillis and Glineur [12] have proposed to modify the update rule (3) as

$$ \begin{aligned} H_{aj}^{k+1} &=\max \biggl(H_{aj}^{k} \frac{((W^{k})^{T}V)_{aj}}{((W^{k})^{T}W^{k}H^{k})_{aj}}, \epsilon \biggr), \\ W_{ia}^{k+1} &=\max \biggl(W_{ia}^{k} \frac{(V(H^{k+1})^{T})_{ia}}{(W^{k}H^{k+1}(H^{k+1})^{T})_{ia}}, \epsilon \biggr), \end{aligned} $$
(5)

where \(\epsilon\) is any positive constant specified by the user. Each update in (5) returns the positive constant \(\epsilon\) if the corresponding original update in (3) returns a value less than \(\epsilon\). This is the only difference between these two update rules. With the modification of the update rule from (3) to (5), we have to modify also the optimization problem (1) as follows:

$$ \begin{aligned} &\mbox{minimize}\quad f_{\rm E}(W,H)=\|V-WH\|^{2} \\ &\mbox{subject to} \quad W_{ia} \geq \epsilon, \quad H_{aj} \geq \epsilon, \quad \forall i,a,j. \end{aligned} $$
(6)

The feasible region of this optimization problem is denoted by \(X\), that is,

$$ X = \bigl\{(W,H)\mid W_{ia} \geq \epsilon, H_{aj} \geq \epsilon, \forall i,a,j\bigr\}. $$

Karush–Kuhn–Tucker (KKT) conditions [4] for the problem (6) are expressed as follows:Footnote 2

$$ W_{ia} \geq \epsilon, \quad \forall i,a, $$
(7)
$$ H_{aj} \geq \epsilon, \quad \forall a,j, $$
(8)
$$ \bigl(\nabla_{W}f_{\rm E}(W,H)\bigr)_{ia} \geq 0, \quad \forall i,a, $$
(9)
$$ \bigl(\nabla_{H}f_{\rm E}(W,H)\bigr)_{aj} \geq 0, \quad \forall a,j, $$
(10)
$$ \bigl(\nabla_{W}f_{\rm E}(W,H)\bigr)_{ia}(W_{ia}-\epsilon) = 0, \quad \forall i,a, $$
(11)
$$ \bigl(\nabla_{H}f_{\rm E}(W,H)\bigr)_{aj}(H_{aj}-\epsilon) = 0, \quad \forall a,j, $$
(12)

where

$$\begin{aligned} \nabla_{W}f_{\rm E}(W,H) =& 2(WH-V)H^{T}, \\ \nabla_{H}f_{\rm E}(W,H) =& 2W^{T}(WH-V). \end{aligned}$$

Therefore, a necessary condition for a point \((W,H)\) to be a local optimal solution of (6) is that the conditions (7)–(12) are satisfied. Hereafter, we call a point \((W,H)\) a stationary point of (6) if it satisfies (7)–(12), and denote the set of all stationary points of (6) by \(S_{\rm E}\).
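
For concreteness, a direct membership test for \(S_{\rm E}\), that is, a check of the conditions (7)–(12) as written above, might look as follows in NumPy; the tolerance `tol` and the function names are illustrative additions, not part of the paper.

```python
import numpy as np

def grad_W_E(V, W, H):
    return 2 * (W @ H - V) @ H.T      # gradient of f_E with respect to W

def grad_H_E(V, W, H):
    return 2 * W.T @ (W @ H - V)      # gradient of f_E with respect to H

def is_stationary_E(V, W, H, eps, tol=1e-9):
    """Check the conditions (7)-(12) up to a numerical tolerance (illustrative sketch)."""
    gW, gH = grad_W_E(V, W, H), grad_H_E(V, W, H)
    feasible      = (W >= eps - tol).all() and (H >= eps - tol).all()   # (7), (8)
    grad_nonneg   = (gW >= -tol).all() and (gH >= -tol).all()           # (9), (10)
    complementary = (np.abs(gW * (W - eps)) <= tol).all() and \
                    (np.abs(gH * (H - eps)) <= tol).all()               # (11), (12)
    return feasible and grad_nonneg and complementary
```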

The global convergence property of the modified update rule (5) is stated as follows.

Theorem 1

Let \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) be any sequence generated by the modified update rule (5) with the initial point \((W^{0},H^{0}) \in X\). Then \((W^{k},H^{k}) \in X\) holds for all positive integers \(k\). Moreover, the sequence has at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the optimization problem (6).

Proof of Theorem 1 will be given in the next section.

By making use of Theorem 1, we can immediately construct an algorithm that terminates within a finite number of iterations. To do so, we relax the conditions (9)–(12) as

$$ \bigl(\nabla_{W}f_{\rm E}(W,H)\bigr)_{ia} \geq -\delta_{1}, \quad \forall i,a, $$
(13)
$$ \bigl(\nabla_{H}f_{\rm E}(W,H)\bigr)_{aj} \geq -\delta_{1}, \quad \forall a,j, $$
(14)
$$ \bigl(\nabla_{W}f_{\rm E}(W,H)\bigr)_{ia} \leq \delta_{1} \quad \mbox{or} \quad W_{ia}-\epsilon \leq \delta_{2}, \quad \forall i,a, $$
(15)
$$ \bigl(\nabla_{H}f_{\rm E}(W,H)\bigr)_{aj} \leq \delta_{1} \quad \mbox{or} \quad H_{aj}-\epsilon \leq \delta_{2}, \quad \forall a,j, $$
(16)

where \(\delta_{1}\) and \(\delta_{2}\) are any positive constants specified by the user, and employ these relaxed conditions as a stopping criterion. Let \(\bar{S}_{\rm E}\) be the set of all \((W,H) \in X\) satisfying (13)–(16). Then the proposed algorithm is described in Algorithm 1.

Algorithm 1

Modified multiplicative update algorithm with a termination criterion for Euclidean distance-based NMF
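
Algorithm 1 is presented as a figure in the published version; a minimal sketch of its structure, assuming the relaxed conditions (13)–(16) above are used as the stopping test and adding a safeguard iteration limit `max_iter` of our own, might look as follows.

```python
import numpy as np

def modified_euclidean_nmf(V, W, H, eps, delta1, delta2, max_iter=100000):
    """Sketch of Algorithm 1: repeat the modified update (5) until (13)-(16) hold."""
    for _ in range(max_iter):
        gW = 2 * (W @ H - V) @ H.T
        gH = 2 * W.T @ (W @ H - V)
        ok_W = (gW >= -delta1).all() and ((gW <= delta1) | (W - eps <= delta2)).all()
        ok_H = (gH >= -delta1).all() and ((gH <= delta1) | (H - eps <= delta2)).all()
        if ok_W and ok_H:                                        # relaxed stopping criterion
            break
        H = np.maximum(H * (W.T @ V) / (W.T @ W @ H), eps)      # update (5) for H
        W = np.maximum(W * (V @ H.T) / (W @ H @ H.T), eps)      # update (5) for W
    return W, H
```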

Theorem 2

For any positive constants \(\epsilon\), \(\delta_{1}\) and \(\delta_{2}\), Algorithm  1 stops within a finite number of iterations.

Proof

Let \(\{(W^{k_{l}},H^{k_{l}})\}_{l=1}^{\infty}\) be any convergent subsequence of the sequence \(\{(W^{k}, H^{k})\}_{k=0}^{\infty}\) generated by the modified update rule (5), and \((\bar{W},\bar{H}) \in X\) be the limit of the subsequence. Then, by Theorem 1, \((\bar{W},\bar{H})\) is a stationary point of (6), that is, it satisfies the conditions (7)–(12).

Recall that \(\nabla_{W} f_{\rm E}(W,H)\) and \(\nabla_{H} f_{\rm E}(W,H)\) are continuous for all \((W,H) \in X\). For all \((i,a)\) such that \((\nabla_{W} f_{\rm E}(\bar{W},\bar{H}))_{ia}=0\), there exists a positive integer \(L_{ia}^{1}\) such that

$$\bigl|\bigl(\nabla_{W} f_{\rm E}\bigl(W^{k_{l}},H^{k_{l}} \bigr)\bigr)_{ia}\bigr| \leq \delta_{1}, \quad \forall l \geq L_{ia}^{1}. $$

For all \((i,a)\) such that \((\nabla_{W} f_{\rm E}(\bar{W},\bar{H}))_{ia}>0\), there exists a positive integer \(L_{ia}^{1}\) such that

$$\bigl(\nabla_{W} f_{\rm E}\bigl(W^{k_{l}},H^{k_{l}} \bigr)\bigr)_{ia} \geq -\delta_{1}\quad \mbox{and}\quad W^{k_{l}}_{ia}-\epsilon \leq \delta_{2}, \quad \forall l \geq L_{ia}^{1}, $$

because \(\bar{W}_{ia}=\epsilon\) holds. For all \((a,j)\) such that \((\nabla_{H} f_{\rm E}(\bar{W},\bar{H}))_{aj}=0\), there exists a positive integer \(L_{aj}^{2}\) such that

$$\bigl|\bigl(\nabla_{H} f_{\rm E}\bigl(W^{k_{l}},H^{k_{l}} \bigr)\bigr)_{aj}\bigr| \leq \delta_{1}, \quad \forall l \geq L_{aj}^{2}. $$

For all \((a,j)\) such that \((\nabla_{H} f_{\rm E}(\bar{W},\bar{H}))_{aj}>0\), there exists a positive integer \(L_{aj}^{2}\) such that

$$\bigl(\nabla_{H} f_{\rm E}\bigl(W^{k_{l}},H^{k_{l}} \bigr)\bigr)_{aj} \geq -\delta_{1} \quad \mbox{and} \quad H^{k_{l}}_{aj}-\epsilon \leq \delta_{2}, \quad \forall l \geq L_{aj}^{2}, $$

because \(\bar{H}_{aj}=\epsilon\) holds. From these considerations, we immediately see that there exists a positive integer \(L\) such that the stopping criterion of Algorithm 1 is satisfied for all \((W^{k_{l}},H^{k_{l}})\) with \(l \geq L\). This means that Algorithm 1 always stops within a finite number of iterations. □

3.2 I-divergence

As in the case of Euclidean distance, we modify the update rule (4) as

$$ \begin{aligned} H_{aj}^{k+1} &=\max \biggl(H_{aj}^{k} \frac{\sum_{i=1}^{n} W_{ia}^{k}V_{ij}/(W^{k}H^{k})_{ij}}{\sum_{\mu=1}^{n} W_{\mu a}^{k}}, \epsilon \biggr), \\ W_{ia}^{k+1} &=\max \biggl(W_{ia}^{k} \frac{\sum_{j=1}^{m} H^{k+1}_{aj}V_{ij}/(W^{k}H^{k+1})_{ij}}{\sum_{\nu=1}^{m} H_{a \nu}^{k+1}}, \epsilon \biggr), \end{aligned} $$
(17)

where \(\epsilon\) is any positive constant specified by the user. The modified update rule corresponds to modifying the optimization problem (2) as follows:

$$ \begin{aligned} &\mbox{minimize} \quad f_{\rm D}(W,H)=D(V\| WH) \\ &\mbox{subject to} \quad W_{ia} \geq \epsilon,\quad H_{aj} \geq \epsilon, \quad \forall i,a,j. \end{aligned} $$
(18)

The feasible region of this optimization problem is \(X\) as in the case of (6). KKT conditions for the problem (18) are expressed as follows:Footnote 3

$$ W_{ia} \geq \epsilon, \quad \forall i,a, $$
(19)
$$ H_{aj} \geq \epsilon, \quad \forall a,j, $$
(20)
$$ \bigl(\nabla_{W}f_{\rm D}(W,H)\bigr)_{ia} \geq 0, \quad \forall i,a, $$
(21)
$$ \bigl(\nabla_{H}f_{\rm D}(W,H)\bigr)_{aj} \geq 0, \quad \forall a,j, $$
(22)
$$ \bigl(\nabla_{W}f_{\rm D}(W,H)\bigr)_{ia}(W_{ia}-\epsilon) = 0, \quad \forall i,a, $$
(23)
$$ \bigl(\nabla_{H}f_{\rm D}(W,H)\bigr)_{aj}(H_{aj}-\epsilon) = 0, \quad \forall a,j, $$
(24)

where

$$\begin{aligned} \bigl(\nabla_{W}f_{\rm D}(W,H)\bigr)_{ia} =& \sum _{j=1}^{m} \biggl\{H_{aj}- \frac{V_{ij}H_{aj}}{(WH)_{ij}} \biggr\}, \\ \bigl(\nabla_{H}f_{\rm D}(W,H)\bigr)_{aj} =& \sum _{i=1}^{n} \biggl\{W_{ia}- \frac{V_{ij}W_{ia}}{(WH)_{ij}} \biggr\}. \end{aligned}$$

Therefore, a necessary condition for a point \((W,H)\) to be a local optimal solution of (18) is that the conditions (19)–(24) are satisfied. Hereafter, we call a point \((W,H)\) a stationary point of (18) if it satisfies (19)–(24), and denote the set of all stationary points of (18) by \(S_{\rm D}\).

The global convergence property of the modified update rule (17) is stated as follows.

Theorem 3

Let \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) be any sequence generated by the modified update rule (17) with the initial point \((W^{0},H^{0}) \in X\). Then \((W^{k},H^{k}) \in X\) holds for all positive integers \(k\). Moreover, the sequence has at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the optimization problem (18).

Proof of Theorem 3 will be given in the next section.

By making use of Theorem 3, we can easily construct an algorithm that terminates within a finite number of iterations. To do so, we relax the conditions (21)–(24) as

$$ \bigl(\nabla_{W}f_{\rm D}(W,H)\bigr)_{ia} \geq -\delta_{1}, \quad \forall i,a, $$
(25)
$$ \bigl(\nabla_{H}f_{\rm D}(W,H)\bigr)_{aj} \geq -\delta_{1}, \quad \forall a,j, $$
(26)
$$ \bigl(\nabla_{W}f_{\rm D}(W,H)\bigr)_{ia} \leq \delta_{1} \quad \mbox{or} \quad W_{ia}-\epsilon \leq \delta_{2}, \quad \forall i,a, $$
(27)
$$ \bigl(\nabla_{H}f_{\rm D}(W,H)\bigr)_{aj} \leq \delta_{1} \quad \mbox{or} \quad H_{aj}-\epsilon \leq \delta_{2}, \quad \forall a,j, $$
(28)

where \(\delta_{1}\) and \(\delta_{2}\) are any positive constants specified by the user, and employ these relaxed conditions as a stopping criterion. Let \(\bar{S}_{\rm D}\) be the set of all \((W,H) \in X\) satisfying (25)–(28). Then the proposed algorithm is described in Algorithm 2.

Algorithm 2

Modified multiplicative update algorithm with a termination criterion for I-divergence-based NMF
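
Similarly, a minimal sketch of Algorithm 2, with the modified update (17) and the relaxed stopping test (25)–(28), is given below; as before, the names and the iteration cap are illustrative additions.

```python
import numpy as np

def grad_W_D(V, W, H):
    return (1.0 - V / (W @ H)) @ H.T          # (nabla_W f_D)_ia = sum_j {H_aj - V_ij H_aj/(WH)_ij}

def grad_H_D(V, W, H):
    return W.T @ (1.0 - V / (W @ H))          # (nabla_H f_D)_aj = sum_i {W_ia - V_ij W_ia/(WH)_ij}

def modified_i_divergence_nmf(V, W, H, eps, delta1, delta2, max_iter=100000):
    """Sketch of Algorithm 2: repeat the modified update (17) until (25)-(28) hold."""
    for _ in range(max_iter):
        gW, gH = grad_W_D(V, W, H), grad_H_D(V, W, H)
        ok_W = (gW >= -delta1).all() and ((gW <= delta1) | (W - eps <= delta2)).all()
        ok_H = (gH >= -delta1).all() and ((gH <= delta1) | (H - eps <= delta2)).all()
        if ok_W and ok_H:
            break
        H = np.maximum(H * (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None], eps)   # (17), H
        W = np.maximum(W * ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :], eps)   # (17), W
    return W, H
```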

Theorem 4

For any positive constants \(\epsilon\), \(\delta_{1}\) and \(\delta_{2}\), Algorithm  2 stops within a finite number of iterations.

We omit the proof of Theorem 4 because it is almost the same as the proof of Theorem 2.

3.3 Related works

The modified update rule (5) was first proposed by Gillis and Glineur [12], as stated above. They proved not only that \(f_{\rm E}(W^{k},H^{k})\) is nonincreasing under (5) but also that if a sequence of solutions generated by (5) has a limit point then it is necessarily a stationary point of the optimization problem (6), but these facts are not sufficient to prove global convergence of (5). As a matter of fact, we cannot rule out, for example, the existence of a sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) such that \(f_{\rm E}(W^{k},H^{k})\) takes the same value for all \(k\) and the sequence visits a finite number of distinct points periodically. On the other hand, in another paper [13], they showed through numerical experiments that (5) works better than the original update rule (3) in some cases. This indicates that (5) is important not only from a theoretical point of view but also in practice.

Lin [18] proposed a modified version of (3) and proved that any sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by the modified rule has at least one convergent subsequence and their limits are stationary points of the optimization problem (1). However, Lin’s update rule considerably differs from the original one because of many extra operations. In particular, in the case where \((\nabla_{H}f_{\rm E}(W^{k},H^{k}))_{aj}\) is negative and \(H_{aj}^{k}\) is less than a user-specified small positive constant, Lin’s update rule is not multiplicative but additive. Also, the matrix \(W^{k}\) must be normalized after each update in order to guarantee that the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) is in a bounded set. In contrast, the normalization is not required in the modified update rule (5). Nevertheless, the boundedness of the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by (5) is guaranteed as shown in the next section.

Finesso and Spreij [10] studied the convergence properties of the multiplicative update (4) by interpreting it as an alternating minimization procedure [8]. Under the assumption that the matrix \(W^{k}\) is normalized after each update, they proved that any sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by (4) satisfies the following properties: (1) \(W^{k}\) converges to a nonnegative matrix. (2) For each triple \((i,a,j)\), \(W_{ia}^{k}H_{aj}^{k}\) converges to a nonnegative number. (3) For each pair \((a,j)\), \(H_{aj}^{k}\) converges to a nonnegative number if \(\lim_{k \rightarrow \infty} \sum_{i=1}^{n} W_{ia}^{k} > 0\) [10, Theorem 6.1]. However, they said nothing about the convergence of \(H_{aj}^{k}\) for the case where \(\lim_{k \rightarrow \infty} \sum_{i=1}^{n} W_{ia}^{k} = 0\).

Badeau et al. [1] studied the local stability of a generalized multiplicative update, which includes (3) and (4) as special cases, using Lyapunov’s stability theory and showed that the local optimal solution of the corresponding optimization problem is asymptotically stable if one of two factor matrices \(W\) and \(H\) is fixed to a nonnegative constant matrix.

4 Proofs of Theorems 1 and 3

We will prove Theorems 1 and 3 in this section. The first parts of these theorems apparently follow from the update rules (5) and (17). In order to prove the second parts, we make use of Zangwill’s global convergence theorem [28, p. 91], which is a fundamental result in optimization theory. Let \(A\) be a point-to-point mappingFootnote 4 from \(X\) into itself and \(S\) be a subset of \(X\). Then Zangwill’s global convergence theorem claims the following: if the mapping \(A\) satisfies the following three conditions then, for any initial point \((W^{0},H^{0}) \in X\), the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by \(A\) contains at least one convergent subsequence and the limit of any convergent subsequence belongs to \(S\).

  1. All points in the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) belong to a compact set in \(X\).

  2. There is a function \(z:X \rightarrow \mathbb{R}\) satisfying the following two conditions.

     (a) If \((W,H) \not\in S\) then \(z(A(W,H))<z(W,H)\).

     (b) If \((W,H) \in S\) then \(z(A(W,H)) \leq z(W,H)\).

  3. The mapping \(A\) is continuous in \(X \setminus S\).

In the following, we will first prove Theorem 1 by showing that these conditions are satisfied when the mapping \(A\) is defined by (5) and \(S\) is set to \(S_{\rm E}\). We will next prove Theorem 3 by showing that these conditions are satisfied when the mapping \(A\) is defined by (17) and \(S\) is set to \(S_{\rm D}\).

4.1 Proof of Theorem 1

Let us rewrite (5) as

$$\begin{aligned} H^{k+1} &= A_{1}\bigl(W^{k},H^{k}\bigr), \\ W^{k+1} &= A_{2}\bigl(W^{k},H^{k+1}\bigr), \end{aligned}$$

or, more simply,

$$\bigl(W^{k+1}, H^{k+1}\bigr) = A\bigl(W^{k},H^{k} \bigr), $$

where the mapping \(A\) is defined by

$$A(W,H) = \bigl(A_{2}\bigl(W,A_{1}(W,H)\bigr),A_{1}(W,H) \bigr). $$

Let us also set \(S=S_{\rm E}\). Since the mapping \(A\) is continuous in \(X\), the third condition of Zangwill’s global convergence theorem is satisfied. We will thus show in the following that \(A\) also satisfies the remaining two conditions.

The following lemma guarantees that the first condition is satisfied.

Lemma 1

For any initial point \((W^{0},H^{0}) \in X\), the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by the mapping \(A\) belongs to a compact set in \(X\).

Proof

Let \((W,H)\) be any point in \(X\). Then we have

$$\begin{aligned} H_{aj}\frac{(W^{T}V)_{aj}}{(W^{T}WH)_{aj}} =& H_{aj}\frac{\sum^{n}_{i=1}W_{ia}V_{ij}}{\sum^{r}_{l=1}(W^{T}W)_{al}H_{lj}} \\ =& H_{aj}\frac{\sum^{n}_{i=1}W_{ia}V_{ij}}{\sum^{r}_{l=1}(\sum^{n}_{i=1}W_{ia}W_{il})H_{lj}} \\ =& H_{aj}\frac{\sum^{n}_{i=1}W_{ia}V_{ij}}{\sum^{n}_{i=1}W_{ia}^{2}H_{aj}+\sum^{r}_{l=1,l \neq a}(\sum^{n}_{i=1}W_{ia}W_{il})H_{lj}} \\ =& \frac{\sum^{n}_{i=1}W_{ia}V_{ij}}{\sum^{n}_{i=1}W_{ia}^{2}+\sum^{r}_{l=1,l \neq a}(\sum^{n}_{i=1}W_{ia}W_{il})(H_{lj}/H_{aj})} \\ <& \frac{\sum^{n}_{i=1}W_{ia}V_{ij}}{\sum^{n}_{i=1}W_{ia}^{2}} \\ =& \frac{\sum^{n}_{i=1} (W_{ia}/\sqrt{{\sum^{n}_{\mu=1}W_{\mu a}^{2}}} )V_{ij}}{\sqrt{\sum^{n}_{i=1}W_{ia}^{2}}} \\ \leq& \frac{\sqrt{\sum_{i=1}^{n} V_{ij}^{2}}}{\epsilon \sqrt{n}}, \end{aligned}$$

from which the inequality

$$ \bigl(A_{1}(W,H)\bigr)_{aj} \leq \max \biggl( \frac{\sqrt{\sum_{i=1}^{n}V_{ij}^{2}}}{\epsilon\sqrt{n}},\epsilon \biggr) $$

holds for any pair \((a,j)\). Note that the right-hand side is a constant which depends on neither \(W\) nor \(H\). Similarly, we have

$$W_{ia} \frac{(VH^{T})_{ia}}{(WHH^{T})_{ia}} < \frac{\sqrt{\sum_{j=1}^{m} V_{ij}^{2}}}{\epsilon \sqrt{m}}, $$

from which the inequality

$$ \bigl(A_{2}(W,H)\bigr)_{ia} \leq \max \biggl( \frac{\sqrt{\sum_{j=1}^{m} V_{ij}^{2}}}{\epsilon \sqrt{m}}, \epsilon \biggr) $$

holds for any pair \((i,a)\). Note that the right-hand side is a constant which depends on neither \(W\) nor \(H\). Hence \(A(W,H)\) belongs to a compact set in \(X\). This means that, for any initial point \((W^{0},H^{0}) \in X\), the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by the mapping \(A\) belongs to a compact set in \(X\). □
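
The elementwise bound obtained in the proof of Lemma 1 is easy to check numerically; the short sketch below draws random data (the sizes and seed are arbitrary) and verifies that one application of \(A_{1}\) respects the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m, eps = 8, 3, 6, 1e-2
V = rng.random((n, m))
W = eps + rng.random((n, r))
H = eps + rng.random((r, m))

A1 = np.maximum(H * (W.T @ V) / (W.T @ W @ H), eps)                           # modified update (5) for H
bound = np.maximum(np.sqrt((V ** 2).sum(axis=0)) / (eps * np.sqrt(n)), eps)   # bound for column j
assert (A1 <= bound[None, :]).all()
```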

The last step is to prove that the second condition of Zangwill’s global convergence theorem is also satisfied. To do this, we first need to introduce two auxiliary functions for \(f_{\rm E}\). Let \((\hat{W},\hat{H})\) be any point in \(X\). Let the function \(g_{\rm E}^{\hat{W}}: [\epsilon,\infty)^{r \times m} \times [\epsilon,\infty)^{r \times m} \rightarrow \mathbb{R}\) be defined by

$$ g_{\rm E}^{\hat{W}}\bigl(H,H'\bigr) = f_{\rm E} \bigl(\hat{W},H'\bigr)+\sum_{a=1}^{r} \sum_{j=1}^{m} g_{{\rm E}aj}^{\hat{W}} \bigl(H_{aj},H'\bigr), $$

where the function \(g_{{\rm E}aj}^{\hat{W}}:[\epsilon,\infty) \times [\epsilon,\infty)^{r \times m} \rightarrow \mathbb{R}\) is defined by

$$\begin{aligned} g_{{\rm E}aj}^{\hat{W}}\bigl(H_{aj},H'\bigr) =& 2\bigl(\hat{W}^{T}\bigl(\hat{W}H'-V\bigr) \bigr)_{aj}\bigl(H_{aj}-H'_{aj}\bigr) \\ &{}+\frac{(\hat{W}^{T}\hat{W}H')_{aj}}{H'_{aj}} \bigl(H_{aj}-H'_{aj} \bigr)^{2}. \end{aligned}$$
(29)

Similarly, let the function \(h_{\rm E}^{\hat{H}}: [\epsilon,\infty)^{n \times r} \times [\epsilon,\infty)^{n \times r} \rightarrow \mathbb{R}\) be defined by

$$ h_{\rm E}^{\hat{H}}\bigl(W,W'\bigr) = f_{\rm E}(W,\hat{H})+\sum_{i=1}^{n} \sum_{a=1}^{r} h_{{\rm E}ia}^{\hat{H}} \bigl(W_{ia},W'\bigr), $$
(30)

where the function \(h_{{\rm E}ia}^{\hat{H}}:[\epsilon,\infty) \times [\epsilon,\infty)^{n \times r} \rightarrow \mathbb{R}\) is defined by

$$\begin{aligned} h_{{\rm E}ia}^{\hat{H}}\bigl(W_{ia},W'\bigr) =& 2\bigl(\bigl(W'\hat{H}-V\bigr)\hat{H}^{T} \bigr)_{ia}\bigl(W_{ia}-W'_{ia}\bigr) \\ &{}+\frac{(W'\hat{H}\hat{H}^{T})_{ia}}{W'_{ia}} \bigl(W_{ia}-W'_{ia} \bigr)^{2}. \end{aligned}$$

The functions \(g_{\rm E}^{\hat{W}}\) and \(h_{\rm E}^{\hat{H}}\) are essentially the same as the auxiliary functions considered by Lee and Seung [17], though mathematical expressions are slightly different. However, note that the domains of \(g_{\rm E}^{\hat{W}}\) and \(h_{\rm E}^{\hat{H}}\) are restricted to \([\epsilon,\infty)^{r \times m} \times [\epsilon,\infty)^{r \times m}\) and \([\epsilon,\infty)^{n \times r} \times [\epsilon,\infty)^{n \times r}\), respectively, in the present paper. This is an important difference between our functions and theirs.

In the following, we give five lemmas which are needed to prove that the second condition of Zangwill’s global convergence theorem is satisfied. Although some of them can be immediately obtained from some of the results given by Lee and Seung [17], we will provide proofs for all lemmas in order to make this paper self-contained.

Lemma 2

For any \(\hat{W} \in [\epsilon,\infty)^{n \times r}\), the function \(g_{\rm E}^{\hat{W}}\) satisfies the following two conditions:

$$\begin{aligned} g_{\rm E}^{\hat{W}}(H,H) =& f_{\rm E}(\hat{W},H), \quad \forall H \in [\epsilon,\infty)^{r \times m}, \end{aligned}$$
(31)
$$\begin{aligned} g_{\rm E}^{\hat{W}}\bigl(H,H'\bigr) \geq& f_{\rm E}(\hat{W},H), \quad \forall H, H' \in [\epsilon, \infty)^{r \times m}. \end{aligned}$$
(32)

Also, for any \(\hat{H} \in [\epsilon,\infty)^{r \times m}\), the function \(h_{\rm E}^{\hat{H}}\) satisfies the following two conditions:

$$\begin{aligned} h_{\rm E}^{\hat{H}}(W,W) =& f_{\rm E}(W,\hat{H}), \quad \forall W \in [\epsilon,\infty)^{n \times r}, \\ h_{\rm E}^{\hat{H}}\bigl(W,W'\bigr) \geq& f_{\rm E}(W,\hat{H}), \quad \forall W, W' \in [\epsilon, \infty)^{n \times r}. \end{aligned}$$

Proof

We prove only the first part because the second one can be proved in the same way. Since \(g_{{\rm E}aj}^{\hat{W}}(H_{aj},H)=0\) holds for all \(H \in [\epsilon,\infty)^{r \times m}\) and indices \(a\) and \(j\), the first condition (31) is satisfied. To see that the second condition (32) is also satisfied, we first rewrite \(f_{\rm E}(\hat{W},H)\) using the Taylor series expansion as

$$\begin{aligned} f_{\rm E}(\hat{W},H) =& f_{\rm E}\bigl(\hat{W},H' \bigr)+\sum_{a=1}^{r} \sum _{j=1}^{m} 2\bigl(\hat{W}^{T}\bigl( \hat{W}H'-V\bigr)\bigr)_{aj} \bigl(H_{aj}-H'_{aj} \bigr) \\ &{} + \sum_{a=1}^{r} \sum _{b=1}^{r} \sum_{j=1}^{m} \bigl(\hat{W}^{T} \hat{W}\bigr)_{ab} \bigl(H_{aj}-H'_{aj} \bigr) \bigl(H_{bj}-H'_{bj}\bigr). \end{aligned}$$

Then we have

$$ g_{\rm E}^{\hat{W}}\bigl(H,H'\bigr)-f_{\rm E}( \hat{W},H) = \sum_{j=1}^{m} \sum _{a=1}^{r} \sum_{b=1}^{r} M^{(j)}_{ab} \biggl(\frac{H_{aj}-H'_{aj}}{H'_{aj}} \biggr) \biggl( \frac{H_{bj}-H'_{bj}}{H'_{bj}} \biggr), $$
(33)

where

$$ M^{(j)}_{ab} = \delta_{ab} \bigl( \hat{W}^{T}\hat{W}H'\bigr)_{aj}H'_{bj}- \bigl(\hat{W}^{T}\hat{W}\bigr)_{ab} H'_{aj}H'_{bj}, $$
(34)

and \(\delta_{ab}\) represents the Kronecker delta. We next show that the matrices \(M^{(j)}=[M^{(j)}_{ab}]\) \((j=1,2,\ldots,m)\) are positive semi-definite for all \(\hat{W} \in [\epsilon,\infty)^{n \times r}\) and \(H' \in [\epsilon,\infty)^{r \times m}\). If this is true, the right-hand side of (33) is nonnegative for all \(H, H' \in [\epsilon,\infty)^{r \times m}\). Since the right-hand side of (34) can be rewritten as

$$\begin{aligned} M^{(j)}_{ab} =& \delta_{ab} \sum _{l=1}^{r} \bigl(\hat{W}^{T}\hat{W} \bigr)_{al} H'_{lj}H'_{bj} -\bigl(\hat{W}^{T}\hat{W}\bigr)_{ab}H'_{aj}H'_{bj} \\ =& \left \{ \begin{array}{l@{\quad}l} \sum_{l=1,l \neq a}^{r} (\hat{W}^{T}\hat{W})_{al} H'_{lj}H'_{aj}, & \mbox{if } a=b \\ -(\hat{W}^{T}\hat{W})_{ab} H'_{aj}H'_{bj}, & \mbox{if } a \neq b , \end{array} \right . \end{aligned}$$

the matrix \(M^{(j)}\) satisfies

$$ M^{(j)}_{aa}=\sum_{l=1,l \neq a}^{r} |M^{(j)}_{al}|, \quad a=1,2,\ldots,r, $$

which means that \(M^{(j)}\) is real, symmetric and diagonally dominant with positive diagonal elements. Therefore, \(M^{(j)}\) \((j=1,2,\ldots,m)\) are positive semi-definite. □
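
The positive semi-definiteness of the matrices \(M^{(j)}\) in (34) can also be confirmed numerically; the sketch below builds \(M^{(j)}\) from random \(\hat{W}\) and \(H'\) with entries at least \(\epsilon\) (sizes and seed are arbitrary) and checks that the smallest eigenvalue is nonnegative up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m, eps = 6, 3, 5, 1e-3
W_hat = eps + rng.random((n, r))
H_prime = eps + rng.random((r, m))

G = W_hat.T @ W_hat                                  # r x r Gram matrix (W_hat^T W_hat)
for j in range(m):
    h = H_prime[:, j]
    M = np.diag((G @ h) * h) - G * np.outer(h, h)    # M^{(j)} as in (34)
    assert np.linalg.eigvalsh(M).min() >= -1e-10     # positive semi-definite
```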

Lemma 3

Let \((\hat{W},\hat{H})\) be any point in \(X\). Then \(g_{\rm E}^{\hat{W}}(H,\hat{H})\), which is considered as a function of \(H\), is strictly convex in \([\epsilon,\infty)^{r \times m}\). Also, \(h_{\rm E}^{\hat{H}}(W,\hat{W})\), which is considered as a function of \(W\), is strictly convex in \([\epsilon,\infty)^{n \times r}\).

Proof

The second-order partial derivatives of \(g_{\rm E}^{\hat{W}}(H,\hat{H})\) are given by

$$\frac{\partial^{2} g_{\rm E}^{\hat{W}}(H,\hat{H})}{\partial H_{aj} \partial H_{a'j'}}= \left \{ \begin{array}{l@{\quad}l} \frac{(\hat{W}^{T} \hat{W} \hat{H})_{aj}}{\hat{H}_{aj}}, & \mbox{if } (a,j)=(a',j') \\ 0, & \mbox{otherwise}, \end{array} \right . $$

where \((a,j),(a',j') \in \{1,2,\ldots,r\} \times \{1,2,\ldots,m\}\). Since \((\hat{W}^{T} \hat{W} \hat{H})_{aj}/\hat{H}_{aj}\) is a positive constant, \(g_{\rm E}^{\hat{W}}(H,\hat{H})\) is strictly convex in \([\epsilon,\infty)^{r \times m}\). The second part can be proved in the same way. □

Lemma 4

Let \((\hat{W},\hat{H})\) be any point in \(X\). The optimization problem

$$ \begin{aligned} &\textit{minimize} \quad g^{\hat{W}}_{\rm E}(H, \hat{H}) \\ &\textit{subject to} \quad H_{aj} \geq \epsilon, \quad \forall a,j \end{aligned} $$
(35)

has a unique optimal solution which is given by \(A_{1}(\hat{W},\hat{H})\). Also, the optimization problem

$$ \begin{aligned} &\textit{minimize} \quad h^{\hat{H}}_{\rm E}(W, \hat{W}) \\ &\textit{subject to} \quad W_{ia} \geq \epsilon, \quad \forall i,a \end{aligned} $$
(36)

has a unique optimal solution which is given by \(A_{2}(\hat{W},\hat{H})\).

Proof

It suffices for us to show that for any pair \((a,j)\), the optimization problem

$$ \begin{aligned} &\mbox{minimize} \quad g_{{\rm E}aj}^{\hat{W}}(H_{aj}, \hat{H}) \\ &\mbox{subject to} \quad H_{aj} \geq \epsilon \end{aligned} $$
(37)

has a unique optimal solution which is given by \((A_{1}(\hat{W},\hat{H}))_{aj}\) and that for any pair \((i,a)\), the optimization problem

$$ \begin{aligned} &\mbox{minimize} \quad h_{{\rm E}ia}^{\hat{H}}(W_{ia}, \hat{W}) \\ &\mbox{subject to} \quad W_{ia} \geq \epsilon \end{aligned} $$
(38)

has a unique optimal solution which is given by \((A_{2}(\hat{W},\hat{H}))_{ia}\). In the following, we consider only the first part because the second part can be proved similarly. Since \(g_{{\rm E}aj}^{\hat{W}}(H_{aj},\hat{H})\) is strictly convex in \([\epsilon,\infty)\), the equation \({\rm d}g_{{\rm E}aj}^{\hat{W}}(H_{aj},\hat{H})/{\rm d}H_{aj}=0\) has at most one solution in \([\epsilon, \infty)\). By solving this equation, we have

$$H_{aj}= \hat{H}_{aj} \frac{(\hat{W}^{T}V)_{aj}}{(\hat{W}^{T} \hat{W} \hat{H})_{aj}}. $$

Let the right-hand side be denoted by \(H_{aj}^{\ast}\), which is a nonnegative number. If \(H_{aj}^{\ast} \geq \epsilon\) then \(H_{aj}^{\ast}\) is apparently the optimal solution of (37). If \(H_{aj}^{\ast}<\epsilon\) then \(\epsilon\) is the optimal solution of (37) because \(g_{{\rm E}aj}^{\hat{W}}(H_{aj},\hat{H})\) is strictly monotone increasing in \([\epsilon,\infty)\). Therefore the optimal solution of (37) is identical with \((A_{1}(\hat{W},\hat{H}))_{aj}\). □
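
The closed-form minimizer derived in this proof can be checked against a brute-force grid search over \([\epsilon,\infty)\) (truncated to a finite interval here); the data, grid and tolerance in the sketch below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m, eps = 5, 3, 4, 0.05
V = rng.random((n, m))
W_hat = eps + rng.random((n, r))
H_hat = eps + rng.random((r, m))
a, j = 1, 2

def g_Eaj(h):
    """The single-variable auxiliary function (29), viewed as a function of H_aj."""
    grad = 2 * (W_hat.T @ (W_hat @ H_hat - V))[a, j]
    curv = (W_hat.T @ W_hat @ H_hat)[a, j] / H_hat[a, j]
    return grad * (h - H_hat[a, j]) + curv * (h - H_hat[a, j]) ** 2

closed_form = max(H_hat[a, j] * (W_hat.T @ V)[a, j]
                  / (W_hat.T @ W_hat @ H_hat)[a, j], eps)       # (A_1(W_hat, H_hat))_{aj}
grid = np.linspace(eps, 10.0, 400001)
assert abs(grid[np.argmin(g_Eaj(grid))] - closed_form) < 1e-3
```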

Lemma 5

The inequality \(f_{\rm E}(A(\hat{W},\hat{H})) \leq f_{\rm E}(\hat{W},\hat{H})\) holds for all \((\hat{W},\hat{H}) \in X\).

Proof

By Lemmas 2 and 4, we have

$$\begin{aligned} f_{\rm E}\bigl(\hat{W},A_{1}(\hat{W},\hat{H})\bigr) \leq& g_{\rm E}^{\hat{W}}\bigl(A_{1}(\hat{W},\hat{H}),\hat{H} \bigr) \leq g_{\rm E}^{\hat{W}}(\hat{H},\hat{H}) \\ =& f_{\rm E}(\hat{W},\hat{H}),\quad \forall (\hat{W},\hat{H}) \in X \end{aligned}$$

and

$$\begin{aligned} f_{\rm E}\bigl(A_{2}(\hat{W},\hat{H}),\hat{H}\bigr) \leq& h_{\rm E}^{\hat{H}}\bigl(A_{2}(\hat{W},\hat{H}),\hat{W} \bigr) \leq h_{\rm E}^{\hat{H}}(\hat{W},\hat{W}) \\ =& f_{\rm E}(\hat{W},\hat{H}),\quad \forall (\hat{W},\hat{H}) \in X. \end{aligned}$$

From these two inequalities, we have

$$\begin{aligned} f_{\rm E}\bigl(A(\hat{W},\hat{H})\bigr) =& f_{\rm E} \bigl(A_{2}\bigl(\hat{W},A_{1}(\hat{W},\hat{H}) \bigr),A_{1}(\hat{W},\hat{H})\bigr) \\ \leq& f_{\rm E}\bigl(\hat{W},A_{1}(\hat{W},\hat{H})\bigr) \leq f_{\rm E}(\hat{W},\hat{H}) \end{aligned}$$
(39)

which completes the proof. □

Lemma 6

\((\hat{W},\hat{H}) \in S_{\rm E}\) if and only if \(\hat{H}\) and \(\hat{W}\) are the optimal solutions of (35) and (36), respectively.

Proof

It suffices for us to show that \((\hat{W},\hat{H}) \in S_{\rm E}\) if and only if \(\hat{H}_{aj}\) is the optimal solution of (37) for any pair \((a,j)\) and \(\hat{W}_{ia}\) is the optimal solution of (38) for any pair \((i,a)\). By the definition (29) of \(g_{{\rm E}aj}^{\hat{W}}(H_{aj},\hat{H})\), we have

$$ \frac{\partial g_{{\rm E}aj}^{\hat{W}}(H_{aj},\hat{H})}{\partial H_{aj}} \bigg\vert_{H_{aj}=\hat{H}_{aj}} =\bigl(\nabla_{H}f_{\rm E}( \hat{W},\hat{H})\bigr)_{aj}. $$

Since \(g_{{\rm E}aj}^{\hat{W}}(H_{aj},\hat{H})\) is strictly convex in \([\epsilon,\infty)\), the necessary and sufficient condition for \(\hat{H}_{aj}\) to be the optimal solution of (37) is given by

$$\begin{aligned} \bigl(\nabla_{H}f_{\rm E}(\hat{W},\hat{H}) \bigr)_{aj} \left\{ \begin{array}{l@{\quad}l} =0, & \mbox{if } \hat{H}_{aj}>\epsilon \\ \geq 0, & \mbox{if } \hat{H}_{aj}=\epsilon, \end{array} \right. \end{aligned}$$

which is equivalent to the set of conditions (8), (10) and (12). By the definition (30) of \(h_{{\rm E}ia}^{\hat{H}}(W_{ia},\hat{W})\), we have

$$ \frac{\partial h_{{\rm E}ia}^{\hat{H}}(W_{ia},\hat{W})}{\partial W_{ia}}\bigg\vert_{W_{ia}=\hat{W}_{ia}} =\bigl(\nabla_{W}f_{\rm E}( \hat{W},\hat{H})\bigr)_{ia}, \quad \forall i,a. $$

Hence the necessary and sufficient condition for \(\hat{W}\) to be the optimal solution of (36) is given by

$$\begin{aligned} \bigl(\nabla_{W}f_{\rm E}(\hat{W},\hat{H}) \bigr)_{ia} \left\{ \begin{array}{l@{\quad}l} =0, & \mbox{if } \hat{W}_{ia}>\epsilon \\ \geq 0, & \mbox{if } \hat{W}_{ia}=\epsilon \end{array} \right. \quad \forall i,a, \end{aligned}$$

which is equivalent to the set of conditions (7), (9) and (11). □

From Lemmas 4–6, we derive the following lemma which claims that the second condition of Zangwill’s global convergence theorem is satisfied by setting \(z=f_{\rm E}\).

Lemma 7

Let \((\hat{W},\hat{H})\) be any point in \(X\). If \((\hat{W},\hat{H}) \in S_{\rm E}\) then \(A(\hat{W},\hat{H})=(\hat{W},\hat{H})\). Otherwise \(f_{\rm E}(A(\hat{W},\hat{H}))<f_{\rm E}(\hat{W},\hat{H})\). That is, \(S_{\rm E}\) is identical with the set of fixed points of the mapping \(A\).

Proof

We first consider the case where \((\hat{W},\hat{H}) \in S_{\rm E}\). By Lemma 6, \(\hat{H}\) and \(\hat{W}\) are unique optimal solutions of (35) and (36), respectively. By Lemma 4, this implies \(A_{1}(\hat{W},\hat{H})=\hat{H}\) and \(A_{2}(\hat{W},\hat{H})=\hat{W}\). Therefore, we have

$$A(\hat{W},\hat{H}) = \bigl(A_{2}\bigl(\hat{W},A_{1}(\hat{W}, \hat{H})\bigr),A_{1}(\hat{W},\hat{H})\bigr) = \bigl(A_{2}( \hat{W},\hat{H}),\hat{H}\bigr) = (\hat{W},\hat{H}). $$

We next consider the case where \((\hat{W},\hat{H}) \not\in S_{\rm E}\). In this case, by Lemma 6, at least one of the following statements must be false: (1) \(\hat{H}\) is the unique optimal solution of (35). (2) \(\hat{W}\) is the unique optimal solution of (36). Suppose that the statement (1) does not hold true. Then, by Lemma 4, we have \(g_{\rm E}^{\hat{W}}(A_{1}(\hat{W},\hat{H}),\hat{H}) < g_{\rm E}^{\hat{W}}(\hat{H},\hat{H})\) which implies that the second inequality of (39) holds as a strict inequality. Therefore, \(f_{\rm E}(A(\hat{W},\hat{H}))\) is strictly less than \(f_{\rm E}(\hat{W},\hat{H})\). Suppose next that the statement (1) holds true but (2) does not. Then, by Lemma 4, we have \(A_{1}(\hat{W},\hat{H})=\hat{H}\) and \(h_{\rm E}^{\hat{H}}(A_{2}(\hat{W},\hat{H}),\hat{W}) < h_{\rm E}^{\hat{H}}(\hat{W},\hat{W})\). From these facts and (39), we have

$$\begin{aligned} f_{\rm E}\bigl(A(\hat{W},\hat{H})\bigr) =& f_{\rm E} \bigl(A_{2}\bigl(\hat{W},A_{1}(\hat{W},\hat{H}) \bigr),A_{1}(\hat{W},\hat{H})\bigr) \\ =& f_{\rm E}\bigl(A_{2}(\hat{W},\hat{H}),\hat{H}\bigr) < f_{\rm E}(\hat{W},\hat{H}). \end{aligned}$$

Therefore, \(f_{\rm E}(A(\hat{W},\hat{H}))\) is strictly less than \(f_{\rm E}(\hat{W},\hat{H})\). □

4.2 Proof of Theorem 3

As in the proof of Theorem 1, let us rewrite (17) as

$$\begin{aligned} H^{k+1} &= A_{1}\bigl(W^{k},H^{k}\bigr), \\ W^{k+1} &= A_{2}\bigl(W^{k},H^{k+1}\bigr), \end{aligned}$$

or, more simply,

$$\bigl(W^{k+1}, H^{k+1}\bigr) = A\bigl(W^{k},H^{k} \bigr), $$

where the mapping \(A\) is defined by

$$A(W,H) = \bigl(A_{2}\bigl(W,A_{1}(W,H)\bigr),A_{1}(W,H) \bigr). $$

Let us also set \(S=S_{\rm D}\). Since the mapping \(A\) is continuous in \(X\), the third condition of Zangwill’s global convergence theorem is satisfied. We will thus show in the following that \(A\) also satisfies the remaining two conditions.

The following lemma guarantees that the first condition is satisfied.

Lemma 8

For any initial point \((W^{0},H^{0}) \in X\), the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by the mapping \(A\) belongs to a compact set in \(X\).

Proof

Let \((W,H)\) be any point in \(X\). Then we have

$$\begin{aligned} H_{aj}\frac{\sum^{n}_{i=1}W_{ia}V_{ij}/(WH)_{ij}}{\sum_{i=1}^{n}W_{ia}} =& H_{aj}\sum ^{n}_{i=1}\frac{W_{ia}V_{ij}}{\sum^{n}_{\mu=1}W_{\mu a} \sum^{r}_{l=1}W_{il}H_{lj}} \\ =& H_{aj}\sum^{n}_{i=1} \frac{W_{ia}V_{ij}}{\sum^{n}_{\mu=1}W_{\mu a}(W_{ia}H_{aj}+\sum^{r}_{l=1,l\neq a}W_{il}H_{lj})} \\ =& \sum^{n}_{i=1}\frac{W_{ia}V_{ij}}{\sum^{n}_{\mu=1}W_{\mu a} \{W_{ia}+\sum^{r}_{l=1, l \neq a}W_{il}(H_{lj}/H_{aj})\}} \\ <& \sum^{n}_{i=1}\frac{W_{ia}V_{ij}}{(\sum^{n}_{\mu=1}W_{\mu a})W_{ia}} \\ =& \sum^{n}_{i=1}\frac{V_{ij}}{\sum^{n}_{\mu=1}W_{\mu a}} \\ \leq& \frac{\sum^{n}_{i=1}V_{ij}}{\epsilon n}, \end{aligned}$$

from which the inequality

$$ \bigl(A_{1}(W,H)\bigr)_{aj} \leq \max \biggl( \frac{\sum^{n}_{i=1}V_{ij}}{\epsilon n}, \epsilon \biggr) $$

holds for any pair \((a,j)\). Note that the right-hand side is a constant which depends on neither \(W\) nor \(H\). Similarly, we have

$$W_{ia} \frac{\sum_{j=1}^{m} H_{aj}V_{ij}/(WH)_{ij}}{\sum_{j=1}^{m} H_{aj}} < \frac{\sum^{m}_{j=1} V_{ij}}{\epsilon m}, $$

from which the inequality

$$\bigl(A_{2}(W,H)\bigr)_{ia} \leq \max \biggl( \frac{\sum^{m}_{j=1}V_{ij}}{\epsilon m},\epsilon \biggr) $$

holds for any pair \((i,a)\). Note that the right-hand side is a constant which depends on neither \(W\) nor \(H\). Hence \(A(W,H)\) belongs to a compact set in \(X\). This means that, for any initial point \((W^{0},H^{0}) \in X\), the sequence \(\{(W^{k},H^{k})\}_{k=0}^{\infty}\) generated by the mapping \(A\) belongs to a compact set in \(X\). □

The last step is to prove that the second condition of Zangwill’s global convergence theorem is also satisfied. To do this, we first need to introduce two auxiliary functions for \(f_{\rm D}\). Let \((\hat{W},\hat{H})\) be any point in \(X\). Let the function \(g_{\rm D}^{\hat{W}}:[\epsilon,\infty)^{r \times m} \times [\epsilon,\infty)^{r \times m} \rightarrow \mathbb{R}\) be defined by

$$\begin{aligned} g_{\rm D}^{\hat{W}}\bigl(H,H'\bigr) =& \sum _{i=1}^{n}\sum_{j=1}^{m} \Biggl\{V_{ij}\log V_{ij}-V_{ij} + \frac{V_{ij}}{(\hat{W}H')_{ij}}\sum^{r}_{a=1} \hat{W}_{ia}H'_{aj} \log\frac{\hat{W}_{ia}H'_{aj}}{(\hat{W}H')_{ij}} \Biggr \} \\ &{} + \sum_{a=1}^{r} \sum _{j=1}^{m} g_{{\rm D}aj}^{\hat{W}} \bigl(H_{aj},H'\bigr), \end{aligned}$$

where the function \(g_{{\rm D}aj}^{\hat{W}}:[\epsilon,\infty) \times [\epsilon,\infty)^{r \times m} \rightarrow \mathbb{R}\) is defined by

$$ g_{{\rm D}aj}^{\hat{W}}\bigl(H_{aj},H'\bigr) = H_{aj}\sum^{n}_{i=1} \hat{W}_{ia} -H'_{aj} \sum ^{n}_{i=1} \biggl\{\frac{\hat{W}_{ia}V_{ij}}{(\hat{W}H')_{ij}} \log ( \hat{W}_{ia}H_{aj}) \biggr\}. $$

Similarly, let the function \(h_{\rm D}^{\hat{H}}:[\epsilon,\infty)^{n \times r} \times [\epsilon,\infty)^{n \times r} \rightarrow \mathbb{R}\) be defined by

$$\begin{aligned} h_{\rm D}^{\hat{H}}\bigl(W,W'\bigr) =& \sum _{i=1}^{n} \sum_{j=1}^{m} \Biggl\{V_{ij}\log V_{ij}-V_{ij} + \frac{V_{ij}}{(W'\hat{H})_{ij}}\sum_{a=1}^{r} W'_{ia}\hat{H}_{aj} \log\frac{W'_{ia}\hat{H}_{aj}}{(W'\hat{H})_{ij}} \Biggr\} \\ &{} +\sum_{i=1}^{n} \sum _{a=1}^{r} h_{{\rm D}ia}^{\hat{H}} \bigl(W_{ia},W'\bigr), \end{aligned}$$

where the function \(h_{{\rm D}ia}^{\hat{H}}:[\epsilon,\infty) \times [\epsilon,\infty)^{n \times r} \rightarrow \mathbb{R}\) is defined by

$$ h_{{\rm D}ia}^{\hat{H}}\bigl(W_{ia},W'\bigr) = \sum_{j=1}^{m} \biggl\{W_{ia} \hat{H}_{aj}-V_{ij}\frac{W'_{ia}\hat{H}_{aj}}{(W'\hat{H})_{ij}} \log (W_{ia} \hat{H}_{aj}) \biggr\}. $$

The functions \(g_{\rm D}^{\hat{W}}\) and \(h_{\rm D}^{\hat{H}}\) are essentially the same as the auxiliary functions considered by Lee and Seung [17], though mathematical expressions are slightly different. In the following, we give five lemmas which are needed to prove that the second condition of Zangwill’s global convergence theorem is satisfied. Although Lemmas 9 and 10 below can be immediately obtained from some of the results given by Lee and Seung [17], we will provide proofs for these lemmas in order to make this paper self-contained. On the other hand, as for Lemmas 11, 12 and 13, we omit proofs because they are similar to those for Lemmas 4, 5 and 6.

Lemma 9

Let \((\hat{W},\hat{H})\) be any point in \(X\). The function \(g_{\rm D}^{\hat{W}}\) satisfies the following two conditions:

$$\begin{aligned} g_{\rm D}^{\hat{W}}(H,H) =& f_{\rm D}(\hat{W},H), \quad \forall H \in [\epsilon,\infty)^{r \times m}, \end{aligned}$$
(40)
$$\begin{aligned} g_{\rm D}^{\hat{W}}\bigl(H,H'\bigr) \geq& f_{\rm D}(\hat{W},H), \quad \forall H, H' \in [\epsilon, \infty)^{r \times m}. \end{aligned}$$
(41)

Also, the function \(h_{\rm D}^{\hat{H}}\) satisfies the following two conditions:

$$\begin{aligned} h_{\rm D}^{\hat{H}}(W,W) =& f_{\rm D}(W,\hat{H}), \quad \forall W \in [\epsilon,\infty)^{n \times r}, \\ h_{\rm D}^{\hat{H}}\bigl(W,W'\bigr) \geq& f_{\rm D}(W,\hat{H}), \quad \forall W, W' \in [\epsilon, \infty)^{n \times r}. \end{aligned}$$

Proof

We prove only the first part because the second part can be proved in the same way. For any \(\hat{W} \in [\epsilon,\infty)^{n \times r}\) and \(H \in [\epsilon,\infty)^{r \times m}\), \(g_{\rm D}^{\hat{W}}(H,H)\) can be transformed as

$$\begin{aligned} g_{\rm D}^{\hat{W}}(H,H) =& \sum_{i=1}^{n} \sum_{j=1}^{m} \Biggl\{ V_{ij}\log V_{ij}-V_{ij}+\frac{V_{ij}}{(\hat{W}H)_{ij}} \sum _{a=1}^{r} \hat{W}_{ia}H_{aj}\log \frac{\hat{W}_{ia}H_{aj}}{(\hat{W}H)_{ij}} \Biggr\} \\ &{} + \sum_{a=1}^{r} \sum _{j=1}^{m} \Biggl\{ H_{aj}\sum _{i=1}^{n} \hat{W}_{ia}-H_{aj} \sum_{i=1}^{n} \frac{\hat{W}_{ia}V_{ij}}{(\hat{W}H)_{ij}}\log ( \hat{W}_{ia} H_{aj}) \Biggr\} \\ =& \sum_{i=1}^{n} \sum _{j=1}^{m} ( V_{ij}\log V_{ij}-V_{ij} ) -\sum_{i=1}^{n} \sum _{j=1}^{m} V_{ij} \log ( \hat{W}H)_{ij} +\sum_{i=1}^{n} \sum _{j=1}^{m} (\hat{W}H)_{ij} \\ =& \sum_{i=1}^{n} \sum _{j=1}^{m} \biggl\{ V_{ij}\log \frac{V_{ij}}{(\hat{W}H)_{ij}}-V_{ij} +(\hat{W}H)_{ij} \biggr\} \\ =& f_{\rm D}(\hat{W},H). \end{aligned}$$

Thus the condition (40) holds true. In order to show (41), we consider

$$\begin{aligned} &g_{\rm D}^{\hat{W}}\bigl(H,H'\bigr)-f_{\rm D}(\hat{W},H) \\ &\quad{}=\sum_{i=1}^{n} \sum_{j=1}^{m} V_{ij} \Biggl\{ \log (\hat{W}H)_{ij} -\sum_{a=1}^{r} \frac{\hat{W}_{ia}H_{aj}'}{(\hat{W}H')_{ij}} \biggl( \log \bigl(\hat{W}H'\bigr)_{ij} + \log \frac{H_{aj}}{H_{aj}'} \biggr) \Biggr\}. \end{aligned}$$
(42)

From the concavity of the log function,

$$ \log (\hat{W}H)_{ij} = \log \Biggl(\sum_{a=1}^{r} \hat{W}_{ia} H_{aj} \Biggr) \geq \sum _{a=1}^{r} \mu_{a} \log \biggl( \frac{\hat{W}_{ia} H_{aj}}{\mu_{a}} \biggr) $$
(43)

for any \(H \in [\epsilon,\infty)^{r \times m}\) and any set of positive numbers \(\mu_{1},\mu_{2},\ldots,\mu_{r}\) such that \(\sum_{a=1}^{r} \mu_{a}=1\). By substituting \(\mu_{a}=(\hat{W}_{ia}H_{aj}')/(\hat{W}H')_{ij}\) for \(a=1,2,\ldots,r\) into (43), we have

$$\begin{aligned} \log (\hat{W}H)_{ij} =& \log \Biggl(\sum _{a=1}^{r} \hat{W}_{ia} H_{aj} \Biggr) \\ \geq& \sum_{a=1}^{r} \frac{\hat{W}_{ia}H_{aj}'}{(\hat{W}H')_{ij}} \log \biggl(\hat{W}_{ia} H_{aj} \cdot \frac{(\hat{W}H')_{ij}}{\hat{W}_{ia}H_{aj}'} \biggr) \\ =& \sum_{a=1}^{r} \frac{\hat{W}_{ia}H_{aj}'}{(\hat{W}H')_{ij}} \biggl\{ \log \bigl(\hat{W}H'\bigr)_{ij} + \log \frac{H_{aj}}{H_{aj}'} \biggr\}, \end{aligned}$$

which implies that the right-hand side of (42) is nonnegative for all \(H, H' \in [\epsilon,\infty)^{r \times m}\). □

Lemma 10

Let \((\hat{W},\hat{H})\) be any point in \(X\). Then \(g_{\rm D}^{\hat{W}}(H,\hat{H})\), which is considered as a function of \(H\), is strictly convex in \([\epsilon,\infty)^{r \times m}\). Also, \(h_{\rm D}^{\hat{H}}(W,\hat{W})\), which is considered as a function of \(W\), is strictly convex in \([\epsilon,\infty)^{n \times r}\).

Proof

The second-order partial derivatives of \(g_{\rm D}^{\hat{W}}(H,\hat{H})\) are given by

$$\frac{\partial^{2} g_{\rm D}^{\hat{W}}(H,\hat{H})}{\partial H_{aj} \partial H_{a'j'}}= \left\{ \begin{array}{l@{\quad}l} \frac{\hat{H}_{aj}}{H_{aj}^{2}} \sum_{i=1}^{n} \frac{\hat{W}_{ia}V_{ij}}{(\hat{W}\hat{H})_{ij}}, & \mbox{if } (a,j)=(a',j') \\ 0, & \mbox{otherwise}, \end{array} \right. $$

where \((a,j), (a',j') \in \{1,2,\ldots,r\} \times \{1,2,\ldots,m\}\). Note here that \((\hat{H}_{aj}/H_{aj}^{2}) \sum_{i=1}^{n} (\hat{W}_{ia}V_{ij}/(\hat{W}\hat{H})_{ij})\) is positive for all \(H_{aj} \in [\epsilon,\infty)\) because of Assumption 1. Therefore, \(g_{\rm D}^{\hat{W}}(H,\hat{H})\) is strictly convex in \([\epsilon,\infty)^{r \times m}\). The second part can be proved in the same way. □

Lemma 11

Let \((\hat{W},\hat{H})\) be any point in \(X\). The optimization problem

$$ \begin{aligned} &\textit{minimize} \quad g^{\hat{W}}_{\rm D}(H, \hat{H}) \\ &\textit{subject to} \quad H_{aj} \geq \epsilon, \quad \forall a,j \end{aligned} $$
(44)

has a unique optimal solution which is given by \(A_{1}(\hat{W},\hat{H})\). Also, the optimization problem

$$ \begin{aligned} &\textit{minimize} \quad h^{\hat{H}}_{\rm D}(W, \hat{W}) \\ &\textit{subject to} \quad W_{ia} \geq \epsilon, \quad \forall i,a \end{aligned} $$
(45)

has a unique optimal solution which is given by \(A_{2}(\hat{W},\hat{H})\).

Lemma 12

The inequality \(f_{\rm D}(A(\hat{W},\hat{H})) \leq f_{\rm D}(\hat{W},\hat{H})\) holds for all \((\hat{W},\hat{H}) \in X\).

Lemma 13

\((\hat{W},\hat{H}) \in S_{\rm D}\) if and only if \(\hat{H}\) and \(\hat{W}\) are the optimal solutions of (44) and (45), respectively.

From Lemmas 11–13, we derive the following lemma which claims that the second condition of Zangwill’s global convergence theorem is satisfied by setting \(z=f_{\rm D}\). The proof is omitted because it is similar to that for Lemma 7.

Lemma 14

Let \((\hat{W},\hat{H})\) be any point in \(X\). If \((\hat{W},\hat{H}) \in S_{\rm D}\) then \(A(\hat{W},\hat{H})=(\hat{W},\hat{H})\). Otherwise \(f_{\rm D}(A(\hat{W},\hat{H}))<f_{\rm D}(\hat{W},\hat{H})\). That is, \(S_{\rm D}\) is identical with the set of fixed points of the mapping \(A\).

5 Conclusion

We have shown that the global convergence of the multiplicative updates proposed by Lee and Seung is established if they are slightly modified as discussed by Gillis and Glineur. Their idea is just to prevent each variable from becoming smaller than a user-specified positive constant \(\epsilon\), but this slight modification guarantees the boundedness of solutions without normalization. Using Zangwill’s global convergence theorem, we have proved that any sequence of solutions generated by the modified updates has at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the corresponding optimization problem. Furthermore, we have developed two algorithms based on the modified updates which always stop within a finite number of iterations after finding an approximate stationary point.

One may be concerned with the fact that matrices obtained by the modified updates are always dense. However, when sparse matrices are preferable, we only have to replace all entries equal to \(\epsilon\) in the obtained matrices with zero. If \(\epsilon\) is set to a small positive number, this replacement will not affect the results significantly. It is in fact proved that setting the entries of \(W\) and \(H\) that are equal to \(\epsilon\) to zero gives a solution which is \(\mathcal{O}(\epsilon)\) close to a stationary point of the original problem, and that the objective function is affected by an additive factor of at most \(\mathcal{O}(\epsilon)\) [11].
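
In NumPy, this post-processing amounts to a one-line thresholding step; the helper below is an illustrative sketch (the entries produced by the modified updates are never below \(\epsilon\), so comparing with \(\epsilon\) selects exactly the entries equal to it).

```python
import numpy as np

def sparsify(W, H, eps):
    """Replace entries equal to eps by zero, as discussed above (illustrative)."""
    return np.where(W <= eps, 0.0, W), np.where(H <= eps, 0.0, H)
```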

The approach presented in this paper may be applied to various multiplicative algorithms for NMF or other optimization problems. Developing a unified framework for the global convergence analysis of multiplicative updates is a topic for future research.