1 Introduction

Nonnegative Matrix Factorization (NMF) [1,2,3] is a mathematical operation that decomposes a given nonnegative matrix \(\varvec{X}\) into two nonnegative matrices \(\varvec{W}\) and \(\varvec{H}\) such that \(\varvec{X} \approx \varvec{W}\varvec{H}\). NMF has found many applications in various fields such as image processing, acoustic signal processing, data analysis and text mining because it can find a nonnegative basis for a given set of nonnegative data.

NMF is formulated as a constrained optimization problem in which the error between \(\varvec{X}\) and \(\varvec{W}\varvec{H}\) is minimized under nonnegativity constraints on \(\varvec{W}\) and \(\varvec{H}\). Multiplicative update rules [3] are widely used as simple and easy-to-implement methods for solving the NMF optimization problems. This approach is easily applied to a wide class of error measures [4,5,6], and the obtained update rules have the global convergence property [7, 8]. However, their convergence is often slow. Hence, many studies have been devoted to developing faster algorithms for solving the NMF optimization problems (see, for example, [9, 10] and references therein).

In this paper, we focus our attention on NMF with the alpha-divergence [11]. The alpha-divergence includes Pearson divergence, Hellinger divergence, and chi-square divergence as its special cases [12], and has been frequently used for NMF (see [13] and references therein). As a simple and fast method for solving the optimization problem for NMF with the alpha-divergence, we propose a novel iterative algorithm based on the coordinate descent and the Newton method. We show that the proposed algorithm has the global convergence property [7, 14,15,16] in the sense that the sequence of solutions has at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the corresponding optimization problem. We also show through numerical experiments that the proposed algorithm is much faster than the multiplicative update rule.

Notation: \(\mathbb {R}\) denotes the set of real numbers. \(\mathbb {R}_{+}\) and \(\mathbb {R}_{++}\) denote the sets of nonnegative and positive real numbers, respectively. For any subset S of \(\mathbb {R}\), \(S^{I \times J}\) denotes the set of all \(I \times J\) matrices whose entries all belong to S. For example, \(\mathbb {R}_{+}^{I \times J}\) is the set of all \(I \times J\) real nonnegative matrices. \(\mathbb {N}\) denotes the set of natural numbers, that is, the set of positive integers. \(\varvec{0}_{I \times J}\) and \(\varvec{1}_{I \times J}\) denote the \(I \times J\) matrices of all zeros and all ones, respectively. For two matrices \(\varvec{A}=(A_{ij})\) and \(\varvec{B}=(B_{ij})\) of the same size, the inequality \(\varvec{A} \ge \varvec{B}\) means that \(A_{ij} \ge B_{ij}\) for all i and j, and \((\varvec{A}\varvec{B})_{ij}\) denotes the (i, j)-th entry of the matrix \(\varvec{A}\varvec{B}\), that is, \((\varvec{A}\varvec{B})_{ij}=\sum _{k}A_{ik}B_{kj}\).

2 Alpha-Divergence Based Nonnegative Matrix Factorization

2.1 Optimization Problem

Suppose we are given an \(M \times N\) nonnegative matrix \(\varvec{X}=(X_{ij})\). The alpha-divergence based NMF is formulated as the constrained optimization problem:

$$\begin{aligned} \begin{array}{ll} \text {minimize} &{} D_\alpha (\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \\ \text {subject to} &{} \varvec{W} \in \mathbb {R}_{+}^{M \times K}, \;\; \varvec{H} \in \mathbb {R}_{+}^{N \times K} \end{array} \end{aligned}$$
(1)

where

$$\begin{aligned} D_\alpha (\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^{\mathrm {T}})&= \frac{1}{\alpha (1-\alpha )} \sum _{i=1}^M \sum _{j=1}^N \biggl [ \alpha X_{ij} + (1-\alpha )(\varvec{W} \varvec{H}^\mathrm {T})_{ij} \nonumber \\&\qquad -X_{ij}^\alpha (\varvec{W}\varvec{H}^\mathrm {T})^{1-\alpha }_{ij} \biggr ] \quad (\alpha \ne 0,1). \end{aligned}$$
(2)

When \(\alpha >1\), the right-hand side of (2) is not defined for every pair of nonnegative matrices \(\varvec{W}\) and \(\varvec{H}\), because \((\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{1-\alpha }\) is undefined when \((\varvec{W}\varvec{H}^\mathrm {T})_{ij}=0\). A simple way to make the problem well-defined is to modify (1) as follows:

$$\begin{aligned} \begin{array}{ll} \text {minimize} &{} D_\alpha (\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \\ \text {subject to} &{} \varvec{W} \in [\epsilon ,\infty )^{M \times K}, \;\; \varvec{H} \in [\epsilon ,\infty )^{N \times K} \end{array} \end{aligned}$$
(3)

where \(\epsilon \) is a positive constant, which is usually set to a small number so that (3) is close to (1). In what follows, we consider (3) instead of (1).
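For concreteness, the following is a minimal NumPy sketch (not the authors' C implementation) of how the objective in (2) can be evaluated at a feasible point of (3); all names (alpha_divergence, eps, and so on) are ours and are chosen only for illustration.

```python
import numpy as np

def alpha_divergence(X, W, H, alpha):
    """D_alpha(X || W H^T) as in (2); alpha must not be 0 or 1."""
    Y = W @ H.T                                          # (W H^T)_{ij}
    terms = alpha * X + (1.0 - alpha) * Y - X**alpha * Y**(1.0 - alpha)
    return terms.sum() / (alpha * (1.0 - alpha))

# A feasible point of (3): both factors are clipped from below by epsilon.
rng = np.random.default_rng(0)
M, N, K, eps, alpha = 40, 20, 5, 1e-6, 0.5
X = rng.uniform(size=(M, N))                             # entries positive with probability one (cf. Assumption 1)
W = np.maximum(eps, rng.uniform(size=(M, K)))
H = np.maximum(eps, rng.uniform(size=(N, K)))
print(alpha_divergence(X, W, H, alpha))
```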

Note that sparse factor matrices can never be obtained from the modified optimization problem (3). However, if we replace all entries equal to \(\epsilon \) in the obtained solution with zero, the resulting factor matrices are expected to be sparse because local optimal solutions of (3) are often located at the boundary of the feasible region. In addition, if \(\epsilon \) is sufficiently small, the pair of resulting factor matrices is expected to be close to the original local optimal solution.

When \(\alpha <0\) and \(\varvec{X}\) has a zero entry, the right-hand side of (2) is not defined because \(X_{ij}^{\alpha }\) is undefined for \(X_{ij}=0\). We therefore impose the following assumption on \(\varvec{X}\) throughout this paper.

Assumption 1

All entries of \(\varvec{X}\) are positive.

2.2 Properties of the Objective Function

The partial derivatives of \(D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T})\) with respect to \(W_{ik}\) and \(H_{jk}\) are given by

$$\begin{aligned} \frac{\partial {D_\alpha }}{\partial {W_{ik}}}&=\frac{1}{\alpha }\left( \sum _j H_{jk} - \sum _j X_{ij}^\alpha H_{jk} (\varvec{W} \varvec{H}^{\mathrm {T}})_{ij}^{-\alpha } \right) , \\ \frac{\partial {D_\alpha }}{\partial {H_{jk}}}&=\frac{1}{\alpha }\left( \sum _i W_{ik} - \sum _i X_{ij}^\alpha W_{ik} (\varvec{W} \varvec{H}^{\mathrm {T}})_{ij}^{-\alpha } \right) , \end{aligned}$$

and the second and third partial derivatives are given by

$$\begin{aligned} \frac{\partial ^2{D_\alpha }}{\partial W_{ik}^2}&=\sum _j X_{ij}^\alpha H_{jk}^2 (\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{-\alpha -1}, \\ \frac{\partial ^2{D_\alpha }}{\partial H_{jk}^2}&=\sum _i X_{ij}^\alpha W_{ik}^2 (\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{-\alpha -1}, \\ \frac{\partial ^3{D_\alpha }}{\partial W_{ik}^3}&=-(\alpha +1) \sum _j X_{ij}^\alpha H_{jk}^3 (\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{-\alpha -2}, \\ \frac{\partial ^3{D_\alpha }}{\partial H_{jk}^3}&=-(\alpha +1)\sum _i X_{ij}^\alpha W_{ik}^3 (\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{-\alpha -2}. \end{aligned}$$

Under Assumption 1, the second partial derivatives are positive for all \(\alpha \,(\ne 0,1)\) and all pairs of positive matrices \(\varvec{W}\) and \(\varvec{H}\). Therefore, if we fix all entries of \(\varvec{W}\) and \(\varvec{H}\) except \(W_{ik}\) (\(H_{jk}\), resp.) to constants not less than \(\epsilon \), we obtain a function of \(W_{ik}\) (\(H_{jk}\), resp.) that is strictly convex on \([\epsilon , \infty )\). In what follows, we denote these functions by \(f_{ik}(W_{ik})\) and \(g_{jk}(H_{jk})\). Then \(f'_{ik}(W_{ik})\) and \(g'_{jk}(H_{jk})\) are monotonically increasing functions, which are convex if \(\alpha \le -1\) and concave otherwise.
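As a companion to these formulas, here is a NumPy sketch of the single-coordinate derivatives \(f'_{ik}\) and \(f''_{ik}\) used by the algorithm in Sect. 3; the function names f_prime and f_double_prime are ours.

```python
import numpy as np

def f_prime(X, W, H, i, k, alpha):
    """f'_{ik}(W_{ik}): first derivative of D_alpha w.r.t. W_{ik}, all other variables fixed."""
    Yi = W[i] @ H.T                                      # i-th row of W H^T
    return (H[:, k].sum() - np.sum(X[i]**alpha * H[:, k] * Yi**(-alpha))) / alpha

def f_double_prime(X, W, H, i, k, alpha):
    """f''_{ik}(W_{ik}): second derivative, positive under Assumption 1."""
    Yi = W[i] @ H.T
    return np.sum(X[i]**alpha * H[:, k]**2 * Yi**(-alpha - 1.0))
```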

It is easy to see that the objective function of (3) has the following property.

Lemma 1

For any \(\alpha \in \mathbb {R} \setminus \{0,1\}\), \(\epsilon \in \mathbb {R}_{++}\), \(\varvec{X} \in \mathbb {R}_{++}^{M \times N}\), \(\varvec{W}^{*} \in [\epsilon ,\infty )^{M \times K}\) and \(\varvec{H}^{*} \in [\epsilon , \infty )^{N \times K}\), the level set

$$\begin{aligned} \{(\varvec{W},\varvec{H}) \in [\epsilon ,\infty )^{M \times K} \times [\epsilon ,\infty )^{N \times K}\,| D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \le D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}^{*}(\varvec{H}^{*})^\mathrm {T})\} \end{aligned}$$

is bounded.

2.3 Optimality Conditions

Let \(\mathcal {F}_{\epsilon }=[\epsilon ,\infty )^{M \times K} \times [\epsilon ,\infty )^{N \times K}\) be the feasible region of the problem (3). If \((\varvec{W},\varvec{H}) \in \mathcal {F}_{\epsilon }\) is a local optimal solution of (3), it must satisfy the following conditions:

$$\begin{aligned} \forall i,k, \quad \frac{\partial D_{\alpha }}{\partial W_{ik}}&{\left\{ \begin{array}{ll} \ge 0, &{} \mathrm{if} \;\; W_{ik}=\epsilon , \\ =0, &{} \mathrm{if} \;\; W_{ik} > \epsilon , \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} \forall j,k, \quad \frac{\partial D_{\alpha }}{\partial H_{jk}}&{\left\{ \begin{array}{ll} \ge 0, &{} \mathrm{if} \;\; H_{jk}=\epsilon , \\ =0, &{} \mathrm{if} \;\; H_{jk} > \epsilon . \end{array}\right. } \end{aligned}$$
(5)

A point \((\varvec{W},\varvec{H}) \in \mathcal {F}_{\epsilon }\) satisfying (4) and (5) is called a stationary point of (3).
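As an illustration, conditions (4) and (5) can be checked numerically as follows; this NumPy sketch uses a single tolerance tol of our own choosing (it is not one of the parameters of Algorithm 3 in Sect. 3).

```python
import numpy as np

def is_stationary(X, W, H, alpha, eps, tol=1e-8):
    """Check conditions (4) and (5) up to the tolerance tol."""
    Y = W @ H.T
    R = X**alpha * Y**(-alpha)                           # X_{ij}^alpha (W H^T)_{ij}^{-alpha}
    dW = (H.sum(axis=0) - R @ H) / alpha                 # partial D_alpha / partial W_{ik}
    dH = (W.sum(axis=0) - R.T @ W) / alpha               # partial D_alpha / partial H_{jk}
    okW = np.where(W > eps, np.abs(dW) <= tol, dW >= -tol)
    okH = np.where(H > eps, np.abs(dH) <= tol, dH >= -tol)
    return bool(okW.all() and okH.all())
```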

3 Proposed Algorithm

The algorithm proposed here for solving the problem (3) is based on the coordinate descent and the Newton method. Let the current values of \(\varvec{W}\) and \(\varvec{H}\) be \(\varvec{W}^\mathrm {c} \in [\epsilon ,\infty )^{M \times K}\) and \(\varvec{H}^\mathrm {c} \in [\epsilon ,\infty )^{N \times K}\), respectively. We want to minimize the value of the objective function by updating only one variable, say \(W_{ik}\). This problem is formulated as

$$\begin{aligned} \begin{array}{ll} \text {minimize} &{} f_{ik}(W_{ik}) \\ \text {subject to} &{} W_{ik} \ge \epsilon \end{array} \end{aligned}$$
(6)

where \(f_{ik}(W_{ik})\) is the function obtained from \(D_\alpha (\varvec{X}\,\Vert \,\varvec{W} \varvec{H}^\mathrm {T})\) by fixing all variables except \(W_{ik}\) to the current values. Because \(f_{ik}(W_{ik})\) is strictly convex as stated in the previous section, (6) is a convex optimization problem. Therefore, if \(f'_{ik}(W_{ik}^\mathrm {c})=0\) then \(W_{ik}^\mathrm {c}\) is the optimal solution of (6). However, we cannot obtain the optimal solution in a closed form in general. So we apply the Newton method to obtain an approximate solution of \(f'_{ik}(W_{ik})=0\), which is given by

$$\begin{aligned} W_{ik}^\mathrm {n}=W_{ik}^\mathrm {c}-\frac{f'_{ik}(W_{ik}^\mathrm {c})}{f''_{ik}(W_{ik}^\mathrm {c})}. \end{aligned}$$

If \(f'_{ik}(W_{ik}^\mathrm {n})=0\) then \(W_{ik}^\mathrm {n}\) is the minimum point of \(f_{ik}(W_{ik})\), and hence \(W_{ik}^\mathrm {new}=\max \{\epsilon , W_{ik}^\mathrm {n}\}\) is the optimal solution of (6). If \(f'_{ik}(W_{ik}^\mathrm {n}) f'_{ik}(W_{ik}^\mathrm {c})>0\) then \(f_{ik}(W_{ik})\) decreases monotonically as the value of \(W_{ik}\) varies from \(W_{ik}^\mathrm {c}\) to \(W_{ik}^\mathrm {n}\). Hence, letting \(W_{ik}^\mathrm {new}=\max \{\epsilon ,W_{ik}^\mathrm {n}\}\), we have \(f_{ik}(W_{ik}^\mathrm {new})<f_{ik}(W_{ik}^\mathrm {c})\). On the other hand, in the case where \(f'_{ik}(W_{ik}^\mathrm {n}) f'_{ik}(W_{ik}^\mathrm {c})<0\), it can occur that \(f_{ik}(W_{ik}^\mathrm {n})>f_{ik}(W_{ik}^\mathrm {c})\) (see Fig. 1). In order to avoid this situation, we use a linear interpolation of the curve \(Y=f'_{ik}(W_{ik})\). We first draw a line

$$\begin{aligned} Y-f'_{ik}(W_{ik}^\mathrm {c})=\frac{f'_{ik}(W_{ik}^\mathrm {c})-f'_{ik}(W_{ik}^\mathrm {n})}{W_{ik}^\mathrm {c}-W_{ik}^\mathrm {n}} (W_{ik}-W_{ik}^\mathrm {c}) \end{aligned}$$

on the \(W_{ik}\)-\(Y\) plane, which passes through the points \((W_{ik}^\mathrm {c},f'_{ik}(W_{ik}^\mathrm {c}))\) and \((W_{ik}^\mathrm {n},f'_{ik}(W_{ik}^\mathrm {n}))\) (see the red line in Fig. 1). We then find the point at which the line intersects the \(W_{ik}\)-axis, and let the \(W_{ik}\)-coordinate of that point be a new approximate solution \(W_{ik}^\mathrm {i}\) of \(f'_{ik}(W_{ik})=0\), that is,

$$\begin{aligned} W_{ik}^\mathrm {i}=W_{ik}^\mathrm {c}-\frac{W_{ik}^\mathrm {n}-W_{ik}^\mathrm {c}}{f'_{ik}(W_{ik}^\mathrm {n})-f'_{ik}(W_{ik}^\mathrm {c})} f'_{ik}(W_{ik}^\mathrm {c}). \end{aligned}$$

Furthermore, if \(W_{ik}^\mathrm {i}\) is less than \(\epsilon \), we replace it with \(\epsilon \). Then we have \(f'_{ik}(W_{ik}^\mathrm {i})f'_{ik}(W_{ik}^\mathrm {c})>0\) and \(f_{ik}(W_{ik}^\mathrm {i})<f_{ik}(W_{ik}^\mathrm {c})\).

Fig. 1. Update rule for \(W_{ik}\) when \(f'_{ik}(W_{ik}^\mathrm {c}) f'_{ik}(W_{ik}^\mathrm {n})<0\). (Color figure online)

Fig. 2. Update rule for \(W_{ik}\).

Figure 2 shows the proposed update rule for \(W_{ik}\), which is based on the idea described above but slightly modified so that a better solution can be obtained.
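To make the idea concrete, the following NumPy sketch implements the basic Newton-plus-interpolation coordinate update described above, reusing f_prime and f_double_prime from the sketch in Sect. 2.2. It is only a simplified illustration of Algorithm 1: in particular, it clips the Newton point at \(\epsilon \) before evaluating \(f'_{ik}\) there, and it omits the additional modifications mentioned above.

```python
def update_W_ik(X, W, H, i, k, alpha, eps):
    """One coordinate update of W_{ik}; W and H are assumed to satisfy W, H >= eps."""
    w_c = W[i, k]
    d1 = f_prime(X, W, H, i, k, alpha)
    if d1 == 0.0:
        return w_c                                       # w_c already solves subproblem (6)
    d2 = f_double_prime(X, W, H, i, k, alpha)
    w_n = max(eps, w_c - d1 / d2)                        # Newton step, clipped at eps
    W_try = W.copy()
    W_try[i, k] = w_n
    d1_n = f_prime(X, W_try, H, i, k, alpha)             # f'_{ik} at the Newton point
    if d1 * d1_n >= 0.0:                                 # same sign: f_{ik} decreases from w_c to w_n
        return w_n
    # Opposite signs: secant (linear interpolation) step through
    # (w_c, f'(w_c)) and (w_n, f'(w_n)), clipped at eps.
    return max(eps, w_c - (w_n - w_c) / (d1_n - d1) * d1)
```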

The problem of minimizing the value of the objective function by updating only \(H_{jk}\) is formulated as

$$\begin{aligned} \begin{array}{ll} \text {minimize} &{} g_{jk}(H_{jk}) \\ \text {subject to} &{} H_{jk} \ge \epsilon \end{array} \end{aligned}$$
(7)

where \(g_{jk}(H_{jk})\) is the function obtained from \(D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T})\) by fixing all variables except \(H_{jk}\) to the current values. Using the same idea as above, we can derive an update rule for \(H_{jk}\) as shown in Fig. 3.
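Since \(D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T})=D_{\alpha }(\varvec{X}^\mathrm {T}\,\Vert \,\varvec{H}\varvec{W}^\mathrm {T})\), subproblem (7) has exactly the same form as (6), and in code the \(H_{jk}\) update can simply reuse the \(W_{ik}\) routine on transposed data. This shortcut is a convenience of our sketch, not a statement about how Algorithm 2 itself is organized.

```python
def update_H_jk(X, W, H, j, k, alpha, eps):
    """One coordinate update of H_{jk}, via the identity D_alpha(X||WH^T) = D_alpha(X^T||HW^T)."""
    return update_W_ik(X.T, H, W, j, k, alpha, eps)
```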

Fig. 3. Update rule for \(H_{jk}\).

It is clear from the derivation of Algorithms 1 and 2 that the following two lemmas hold true.

Lemma 2

If \(W_{ik}\) is not the optimal solution of the subproblem (6) then \(W_{ik}^\mathrm {new}\) obtained by Algorithm 1 satisfies \(f_{ik}(W_{ik}^\mathrm {new}) < f_{ik}(W_{ik})\). Similarly, if \(H_{jk}\) is not the optimal solution of the subproblem (7) then \(H_{jk}^\mathrm {new}\) obtained by Algorithm 2 satisfies \(g_{jk}(H_{jk}^\mathrm {new}) < g_{jk}(H_{jk})\).

Lemma 3

Suppose that \(\varvec{X} \in \mathbb {R}_{++}^{M \times N}\), \(\alpha \in \mathbb {R}\setminus \{0,1\}\) and \(\epsilon \in \mathbb {R}_{++}\) are fixed. Then \(W_{ik}^\mathrm {new}\), the output of Algorithm 1, depends continuously on \(\varvec{W}\) and \(\varvec{H}\) for any i and k. Similarly, \(H_{jk}^\mathrm {new}\), the output of Algorithm 2, depends continuously on \(\varvec{W}\) and \(\varvec{H}\) for any j and k.

Furthermore, using Zangwill’s global convergence theorem [16], we obtain the following theorem.

Theorem 1

Given \(\varvec{X} \in \mathbb {R}_{++}^{M \times N}\), \(K \in \mathbb {N}\), \(\alpha \in \mathbb {R}\setminus \{0,1\}\), \(\epsilon \in \mathbb {R}_{++}\), and an initial solution \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in \mathcal {F}_{\epsilon }\), we apply the update rules described by Algorithms 1 and 2 to the \(MK+NK\) variables in a fixed cyclic order. Let \((\varvec{W}^{(l)},\varvec{H}^{(l)}) \in \mathcal {F}_{\epsilon }\) be the solution after l rounds of updates. Then the sequence \(\{(\varvec{W}^{(l)},\varvec{H}^{(l)})\}_{l=0}^{\infty }\) has at least one convergent subsequence, and the limit of any convergent subsequence is a stationary point of the problem (3).

Proof

Let us express the relation between \((\varvec{W}^{(l)},\varvec{H}^{(l)})\) and \((\varvec{W}^{(l+1)},\varvec{H}^{(l+1)})\) by using a mapping \(A: \mathcal {F}_{\epsilon } \rightarrow \mathcal {F}_{\epsilon }\) as follows:

$$\begin{aligned} (\varvec{W}^{(l+1)},\varvec{H}^{(l+1)}) = A(\varvec{W}^{(l)},\varvec{H}^{(l)}). \end{aligned}$$

In view of Zangwill’s global convergence theorem [16], it suffices to show that the following statements hold true.

  1. (Boundedness) For any initial solution \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in \mathcal {F}_{\epsilon }\), the sequence \(\{(\varvec{W}^{(l)},\varvec{H}^{(l)})\}_{l=0}^{\infty }\) belongs to a compact subset of \(\mathcal {F}_{\epsilon }\).

  2. (Monotoneness) The objective function \(D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T})\) satisfies

     $$\begin{aligned} (\varvec{W},\varvec{H}) \not \in \mathcal {S}_{\epsilon }&\Rightarrow D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}'(\varvec{H}')^\mathrm {T}) < D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \\ (\varvec{W},\varvec{H}) \in \mathcal {S}_{\epsilon }&\Rightarrow D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}'(\varvec{H}')^\mathrm {T}) \le D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \end{aligned}$$

     where \(\mathcal {S}_{\epsilon }\) is the set of stationary points of (3) and \((\varvec{W}',\varvec{H}')=A(\varvec{W},\varvec{H})\).

  3. (Continuity) The mapping A is continuous in \(\mathcal {F}_{\epsilon } \setminus \mathcal {S}_{\epsilon }\).

The monotoneness follows from Lemma 2. The boundedness follows from Lemmas 1 and 2. The continuity follows from Lemma 3.    \(\square \)

By Theorem 1, we can immediately obtain an algorithm that stops within a finite number of rounds by relaxing the optimality conditions given by (4) and (5), as shown in Reference [7]. The resulting algorithm is shown in Fig. 4.
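The following driver sketch combines the earlier routines in the spirit of Algorithm 3: cyclic coordinate updates with a relaxed stationarity test as the stopping condition. The actual parameters of Algorithm 3 (\(\delta _1\), \(\delta _2\), \(\delta _3\) and u) are defined in Fig. 4 and are not reproduced here; the sketch uses the single tolerance tol instead and fixes one particular cyclic order (all of \(\varvec{W}\), then all of \(\varvec{H}\)).

```python
def nmf_alpha_newton(X, W, H, alpha, eps, tol=1e-4, max_rounds=40000):
    """Cyclic coordinate descent for (3); W and H are modified in place."""
    M, K = W.shape
    N = H.shape[0]
    for _ in range(max_rounds):
        for i in range(M):
            for k in range(K):
                W[i, k] = update_W_ik(X, W, H, i, k, alpha, eps)
        for j in range(N):
            for k in range(K):
                H[j, k] = update_H_jk(X, W, H, j, k, alpha, eps)
        if is_stationary(X, W, H, alpha, eps, tol):      # relaxed version of (4)-(5)
            break
    return W, H
```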

Fig. 4. Newton-type algorithm for solving (3).

Theorem 2

For any input, Algorithm 3 stops within a finite number of rounds.

4 Numerical Experiments

In order to evaluate the efficiency of the proposed algorithm, we applied it to a randomly generated matrix \(\varvec{X}\) and compared the results with those obtained using the multiplicative update rule [6, 8] described by

$$\begin{aligned} W_{ik}^\mathrm {new} \leftarrow \max \left( \epsilon , W_{ik}\left( \frac{\sum _{j=1}^N X_{ij}^{\alpha } H_{jk} (\varvec{W}\varvec{H}^{\mathrm {T}})_{ij}^{-\alpha }}{\sum _{j=1}^N H_{jk}}\right) ^{\frac{1}{\alpha }}\right) \end{aligned}$$

and the same stopping condition. Although some other methods have been proposed (see [13] for example), we do not consider them here because their global convergence is not guaranteed.
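For reference, a NumPy sketch of this baseline update for \(\varvec{W}\) is given below; the \(\varvec{H}\) update is the symmetric counterpart obtained by exchanging the roles of \(\varvec{W}\) and \(\varvec{H}\). The vectorized form and the function name are ours, not the authors' C implementation.

```python
import numpy as np

def mu_update_W(X, W, H, alpha, eps):
    """One multiplicative update of W for the alpha-divergence, clipped at eps."""
    Y = W @ H.T
    ratio = ((X / Y)**alpha @ H) / H.sum(axis=0)         # shape (M, K)
    return np.maximum(eps, W * ratio**(1.0 / alpha))
```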

In all experiments, \(\varvec{X}\) was set to the same \(40 \times 20\) matrix, each entry of which was drawn independently from the uniform distribution on the interval [0, 1]. The value of K was set to 5. The values of the parameters in Algorithm 3 were set to \(\epsilon =10^{-6}\), \(\delta _1=10^{-4}\), \(\delta _2=10^{-6}\), \(u=1.0\) and \(\delta _3=0.5\times \delta _1\). The value of the parameter \(\alpha \) in the alpha-divergence was set to \(-1.5\), 0.5 and 2.5. For each value of \(\alpha \), the multiplicative update rule and the proposed algorithm were run 10 times with 10 different initial solutions, which were generated in the same way as \(\varvec{X}\) except that all entries less than \(\epsilon \) were replaced with \(\epsilon \) so that each initial solution belongs to the feasible region of the optimization problem. The maximum number of rounds was set to 40,000; that is, if the solution did not satisfy the stopping condition within 40,000 rounds, the algorithm was forcibly stopped. All algorithms were implemented in C, compiled with gcc 5.3.0, and tested on a PC with an Intel Core i5-4590 CPU and 8 GB of RAM.

The results are shown in Tables 1 and 2. It is easily seen from those tables that the proposed algorithm is much faster than the multiplicative update rule.

Table 1. The number of rounds of the multiplicative update (MU) rule and Algorithm 3 for solving (3).
Table 2. Computation time (in seconds) of the multiplicative update (MU) rule and Algorithm 3 for solving (3).

5 Conclusion

We have proposed a novel iterative algorithm, based on the coordinate descent and the Newton method, for NMF with the alpha-divergence. The proposed algorithm not only has the global convergence property, like the multiplicative update rule, but is also much faster, as the experimental results in the previous section show. Further experiments with various real datasets should be performed in the near future to evaluate the efficiency of the proposed algorithm.