Abstract
We propose a novel iterative algorithm for nonnegative matrix factorization with the alpha-divergence. The proposed algorithm is based on the coordinate descent and the Newton method. We show that the proposed algorithm has the global convergence property in the sense that the sequence of solutions has at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the corresponding optimization problem. We also show through numerical experiments that the proposed algorithm is much faster than the multiplicative update rule.
1 Introduction
Nonnegative Matrix Factorization (NMF) [1,2,3] is a mathematical operation that decomposes a given nonnegative matrix \(\varvec{X}\) into two nonnegative matrices \(\varvec{W}\) and \(\varvec{H}\) such that \(\varvec{X} \approx \varvec{W}\varvec{H}\). NMF has found many applications in fields such as image processing, acoustic signal processing, data analysis and text mining because it can find a nonnegative basis for a given set of nonnegative data.
NMF is formulated as a constrained optimization problem in which an error between \(\varvec{X}\) and \(\varvec{W}\varvec{H}\) has to be minimized under the nonnegativity constraints on \(\varvec{W}\) and \(\varvec{H}\). Multiplicative update rules [3] are widely used as simple and easy-to-implement methods for solving the NMF optimization problems. This approach can be easily applied to a wide class of error measures [4,5,6], and the obtained update rules have the global convergence property [7, 8]. However, they are often slow. Hence many studies have been done to develop faster algorithms for solving the NMF optimization problems (see, for example, [9, 10] and references therein).
In this paper, we focus our attention on NMF with the alpha-divergence [11]. The alpha-divergence includes Pearson divergence, Hellinger divergence, and chi-square divergence as its special cases [12], and has been frequently used for NMF (see [13] and references therein). As a simple and fast method for solving the optimization problem for NMF with the alpha-divergence, we propose a novel iterative algorithm based on the coordinate descent and the Newton method. We show that the proposed algorithm has the global convergence property [7, 14,15,16] in the sense that the sequence of solutions has at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the corresponding optimization problem. We also show through numerical experiments that the proposed algorithm is much faster than the multiplicative update rule.
Notation: \(\mathbb {R}\) denotes the set of real numbers. \(\mathbb {R}_{+}\) and \(\mathbb {R}_{++}\) denote the set of nonnegative and positive real numbers, respectively. For any subset S of \(\mathbb {R}\), \(S^{I \times J}\) denotes the set of all \(I \times J\) matrices such that each entry belongs to S. For example, \(\mathbb {R}_{+}^{I \times J}\) is the set of all \(I \times J\) real nonnegative matrices. \(\mathbb {N}\) denotes the set of positive integers. \(\varvec{0}_{I \times J}\) and \(\varvec{1}_{I \times J}\) denote the \(I \times J\) matrix of all zeros and all ones, respectively. For two matrices \(\varvec{A}=(A_{ij})\) and \(\varvec{B}=(B_{ij}) \) with the same size, the inequality \(\varvec{A} \ge \varvec{B}\) means that \(A_{ij} \ge B_{ij}\) for all i and j, and \((\varvec{A}\varvec{B})_{ij}\) denotes the (i, j)-th entry of the matrix \(\varvec{A}\varvec{B}\), that is, \((\varvec{A}\varvec{B})_{ij}=\sum _{k}A_{ik}B_{kj}\).
2 Alpha-Divergence Based Nonnegative Matrix Factorization
2.1 Optimization Problem
Suppose we are given an \(M \times N\) nonnegative matrix \(\varvec{X}=(X_{ij})\). The alpha-divergence based NMF is formulated as the constrained optimization problem:
$$\begin{aligned} \mathop {\text {minimize}}\limits _{\varvec{W},\varvec{H}}\ D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \quad \text {subject to}\quad \varvec{W} \in \mathbb {R}_{+}^{M \times K},\ \varvec{H} \in \mathbb {R}_{+}^{N \times K} \end{aligned}$$
(1)
where
$$\begin{aligned} D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) = \frac{1}{\alpha (\alpha -1)}\sum _{i=1}^{M}\sum _{j=1}^{N}\left( X_{ij}^{\alpha }\,(\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{1-\alpha } - \alpha X_{ij} + (\alpha -1)(\varvec{W}\varvec{H}^\mathrm {T})_{ij}\right) . \end{aligned}$$
(2)
When \(\alpha >1\), the right-hand side of (2) is not defined for all nonnegative matrices \(\varvec{W}\) and \(\varvec{H}\) because \((\varvec{W}\varvec{H}^\mathrm {T})_{ij}\) may vanish. A simple way to make the problem well-defined is to modify (1) as follows:
$$\begin{aligned} \mathop {\text {minimize}}\limits _{\varvec{W},\varvec{H}}\ D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \quad \text {subject to}\quad \varvec{W} \in [\epsilon ,\infty )^{M \times K},\ \varvec{H} \in [\epsilon ,\infty )^{N \times K} \end{aligned}$$
(3)
where \(\epsilon \) is a positive constant, usually set to a small value so that (3) is close to (1). In what follows, we consider (3) instead of (1).
Note that sparse factor matrices can never be obtained from the modified optimization problem (3). However, if we replace all \(\epsilon \) in the obtained solution with zero, the resulting factor matrices are expected to be sparse because local optimal solutions of (3) are often located at the boundary of the feasible region. In addition, if \(\epsilon \) is sufficiently small, it is expected that the pair of the resulting factor matrices is close to the original local optimal solution.
When \(\alpha <0\) and \(\varvec{X}\) has a zero entry, the right-hand side of (2) is not defined because \(X_{ij}^{\alpha }\) diverges. We thus impose the following assumption on \(\varvec{X}\) throughout this paper.
Assumption 1
All entries of \(\varvec{X}\) are positive.
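As a concrete reference point, the divergence (2) can be evaluated directly. The sketch below assumes the standard normalization of the alpha-divergence used in the NMF literature; the function name and the NumPy formulation are ours, not the authors'.

```python
import numpy as np

def alpha_divergence(X, W, H, alpha):
    """D_alpha(X || W H^T) for alpha not in {0, 1}, all entries positive."""
    Y = W @ H.T
    return np.sum(X**alpha * Y**(1.0 - alpha)
                  - alpha * X + (alpha - 1.0) * Y) / (alpha * (alpha - 1.0))
```

The divergence is nonnegative and vanishes exactly when \(\varvec{X}=\varvec{W}\varvec{H}^\mathrm {T}\), which gives a quick sanity check for any implementation.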
2.2 Properties of the Objective Function
The partial derivatives of \(D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T})\) with respect to \(W_{ik}\) and \(H_{jk}\) are given by
$$\begin{aligned} \frac{\partial D_{\alpha }}{\partial W_{ik}} = \frac{1}{\alpha }\sum _{j=1}^{N} H_{jk}\left( 1 - \left( \frac{X_{ij}}{(\varvec{W}\varvec{H}^\mathrm {T})_{ij}}\right) ^{\alpha }\right) , \quad \frac{\partial D_{\alpha }}{\partial H_{jk}} = \frac{1}{\alpha }\sum _{i=1}^{M} W_{ik}\left( 1 - \left( \frac{X_{ij}}{(\varvec{W}\varvec{H}^\mathrm {T})_{ij}}\right) ^{\alpha }\right) \end{aligned}$$
and the second and third partial derivatives with respect to \(W_{ik}\) are given by
$$\begin{aligned} \frac{\partial ^2 D_{\alpha }}{\partial W_{ik}^2} = \sum _{j=1}^{N} H_{jk}^2\, X_{ij}^{\alpha }\,(\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{-\alpha -1}, \quad \frac{\partial ^3 D_{\alpha }}{\partial W_{ik}^3} = -(\alpha +1)\sum _{j=1}^{N} H_{jk}^3\, X_{ij}^{\alpha }\,(\varvec{W}\varvec{H}^\mathrm {T})_{ij}^{-\alpha -2} \end{aligned}$$
(and symmetrically with respect to \(H_{jk}\)).
Under Assumption 1, the second partial derivatives are positive for all \(\alpha \,(\ne 0,1)\) and all pairs of positive matrices \(\varvec{W}\) and \(\varvec{H}\). Therefore, if we fix all entries of \(\varvec{W}\) and \(\varvec{H}\) except \(W_{ik}\) (\(H_{jk}\), resp.) to constants not less than \(\epsilon \) then we obtain a function of \(W_{ik}\) (\(H_{jk}\), resp.) which is strictly convex on \([\epsilon , \infty )\). In what follows, we express these functions as \(f_{ik}(W_{ik})\) and \(g_{jk}(H_{jk})\). Then \(f'_{ik}(W_{ik})\) and \(g'_{jk}(H_{jk})\) are monotone increasing functions, and convex if \(\alpha \le -1\) and concave otherwise.
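These single-coordinate derivatives can be sketched in NumPy as follows. The formulas assume the standard form of the alpha-divergence; the function names and the coordinate-wise signature are ours.

```python
import numpy as np

def grad_W(X, W, H, alpha, i, k):
    # first partial derivative of D_alpha with respect to W[i, k]
    y = W[i] @ H.T                       # i-th row of W H^T
    return np.sum(H[:, k] * (1.0 - (X[i] / y)**alpha)) / alpha

def hess_W(X, W, H, alpha, i, k):
    # second partial derivative: positive for positive X, W, H and alpha != 0, 1
    y = W[i] @ H.T
    return np.sum(H[:, k]**2 * X[i]**alpha * y**(-alpha - 1.0))
```

A finite-difference check against the divergence itself is a useful test that the signs and exponents are right.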
It is easy to see that the objective function of (3) has the following property.
Lemma 1
For any \(\alpha \in \mathbb {R} \setminus \{0,1\}\), \(\epsilon \in \mathbb {R}_{++}\), \(\varvec{X} \in \mathbb {R}_{++}^{M \times N}\), \(\varvec{W}^{*} \in [\epsilon ,\infty )^{M \times K}\) and \(\varvec{H}^{*} \in [\epsilon , \infty )^{N \times K}\), the level set
$$\begin{aligned} \left\{ (\varvec{W},\varvec{H}) \in [\epsilon ,\infty )^{M \times K} \times [\epsilon ,\infty )^{N \times K} \,\big |\, D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \le D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}^{*}(\varvec{H}^{*})^\mathrm {T}) \right\} \end{aligned}$$
is bounded.
2.3 Optimality Conditions
Let \(\mathcal {F}_{\epsilon }=[\epsilon ,\infty )^{M \times K} \times [\epsilon ,\infty )^{N \times K}\) be the feasible region of the problem (3). If \((\varvec{W},\varvec{H}) \in \mathcal {F}_{\epsilon }\) is a local optimal solution of (3), it must satisfy the following conditions:
$$\begin{aligned} \frac{\partial D_{\alpha }}{\partial W_{ik}} \ge 0 \ \text { and }\ (W_{ik}-\epsilon )\frac{\partial D_{\alpha }}{\partial W_{ik}} = 0, \quad \forall i, k \end{aligned}$$
(4)
$$\begin{aligned} \frac{\partial D_{\alpha }}{\partial H_{jk}} \ge 0 \ \text { and }\ (H_{jk}-\epsilon )\frac{\partial D_{\alpha }}{\partial H_{jk}} = 0, \quad \forall j, k. \end{aligned}$$
(5)
A point \((\varvec{W},\varvec{H}) \in \mathcal {F}_{\epsilon }\) satisfying (4) and (5) is called a stationary point of (3).
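Stationarity can be checked numerically: the partial derivative must vanish at interior entries (those strictly above \(\epsilon \)) and be nonnegative at boundary entries. The sketch below is ours (function name included) and assumes the standard alpha-divergence gradient.

```python
import numpy as np

def kkt_violation(X, W, H, alpha, eps=1e-6):
    """Largest violation of the stationarity conditions (4)-(5)."""
    Y = W @ H.T
    G = (1.0 - (X / Y)**alpha) / alpha      # elementwise dD/dY
    GW = G @ H                              # dD/dW, shape (M, K)
    GH = G.T @ W                            # dD/dH, shape (N, K)
    def v(A, GA):
        interior = np.abs(GA)[A > eps].max(initial=0.0)   # must be zero
        boundary = np.maximum(-GA, 0.0)[A <= eps].max(initial=0.0)  # must be >= 0
        return max(interior, boundary)
    return max(v(W, GW), v(H, GH))
```

A small violation value can serve as a stopping diagnostic when comparing algorithms.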
3 Proposed Algorithm
The algorithm proposed here for solving the problem (3) is based on the coordinate descent and the Newton method. Let the current values of \(\varvec{W}\) and \(\varvec{H}\) be \(\varvec{W}^\mathrm {c} \in [\epsilon ,\infty )^{M \times K}\) and \(\varvec{H}^\mathrm {c} \in [\epsilon ,\infty )^{N \times K}\), respectively. We want to minimize the value of the objective function by updating only one variable, say \(W_{ik}\). This problem is formulated as
$$\begin{aligned} \mathop {\text {minimize}}\limits _{W_{ik}}\ f_{ik}(W_{ik}) \quad \text {subject to}\quad W_{ik} \ge \epsilon \end{aligned}$$
(6)
where \(f_{ik}(W_{ik})\) is the function obtained from \(D_\alpha (\varvec{X}\,\Vert \,\varvec{W} \varvec{H}^\mathrm {T})\) by fixing all variables except \(W_{ik}\) to their current values. Because \(f_{ik}(W_{ik})\) is strictly convex as stated in the previous section, (6) is a convex optimization problem. Therefore, if \(f'_{ik}(W_{ik}^\mathrm {c})=0\) then \(W_{ik}^\mathrm {c}\) is the optimal solution of (6). However, we cannot in general obtain the optimal solution in closed form, so we apply the Newton method to obtain an approximate solution of \(f'_{ik}(W_{ik})=0\), which is given by
$$\begin{aligned} W_{ik}^\mathrm {n} = W_{ik}^\mathrm {c} - \frac{f'_{ik}(W_{ik}^\mathrm {c})}{f''_{ik}(W_{ik}^\mathrm {c})}. \end{aligned}$$
If \(f'_{ik}(W_{ik}^\mathrm {n})=0\) then \(W_{ik}^\mathrm {n}\) is the minimum point of \(f_{ik}(W_{ik})\), and hence \(W_{ik}^\mathrm {new}=\max \{\epsilon , W_{ik}^\mathrm {n}\}\) is the optimal solution of (6). If \(f'_{ik}(W_{ik}^\mathrm {n}) f'_{ik}(W_{ik}^\mathrm {c})>0\) then \(f_{ik}(W_{ik})\) decreases monotonically as the value of \(W_{ik}\) varies from \(W_{ik}^\mathrm {c}\) to \(W_{ik}^\mathrm {n}\). Hence, letting \(W_{ik}^\mathrm {new}=\max \{\epsilon ,W_{ik}^\mathrm {n}\}\), we have \(f_{ik}(W_{ik}^\mathrm {new})<f_{ik}(W_{ik}^\mathrm {c})\). On the other hand, in the case where \(f'_{ik}(W_{ik}^\mathrm {n}) f'_{ik}(W_{ik}^\mathrm {c})<0\), it can occur that \(f_{ik}(W_{ik}^\mathrm {n})>f_{ik}(W_{ik}^\mathrm {c})\) (see Fig. 1). In order to avoid this situation, we use a linear interpolation of the curve \(Y=f'_{ik}(W_{ik})\). We first draw the line
$$\begin{aligned} Y = f'_{ik}(W_{ik}^\mathrm {c}) + \frac{f'_{ik}(W_{ik}^\mathrm {n}) - f'_{ik}(W_{ik}^\mathrm {c})}{W_{ik}^\mathrm {n} - W_{ik}^\mathrm {c}}\,(W_{ik} - W_{ik}^\mathrm {c}) \end{aligned}$$
on the \(W_{ik}\)-\(Y\) plane, which passes through the points \((W_{ik}^\mathrm {c},f'_{ik}(W_{ik}^\mathrm {c}))\) and \((W_{ik}^\mathrm {n},f'_{ik}(W_{ik}^\mathrm {n}))\) (see the red line in Fig. 1). We then find the point at which this line intersects the \(W_{ik}\)-axis, and let the \(W_{ik}\)-coordinate of that point be a new approximate solution \(W_{ik}^\mathrm {i}\) of \(f'_{ik}(W_{ik})=0\), that is,
$$\begin{aligned} W_{ik}^\mathrm {i} = W_{ik}^\mathrm {c} - f'_{ik}(W_{ik}^\mathrm {c})\,\frac{W_{ik}^\mathrm {n} - W_{ik}^\mathrm {c}}{f'_{ik}(W_{ik}^\mathrm {n}) - f'_{ik}(W_{ik}^\mathrm {c})}. \end{aligned}$$
Furthermore, if \(W_{ik}^\mathrm {i}\) is less than \(\epsilon \), we replace it with \(\epsilon \). Then we have \(f'_{ik}(W_{ik}^\mathrm {i})f'_{ik}(W_{ik}^\mathrm {c})>0\) and \(f_{ik}(W_{ik}^\mathrm {i})<f_{ik}(W_{ik}^\mathrm {c})\).
Figure 2 shows the proposed update rule for \(W_{ik}\), which is based on the idea described above but slightly modified so that a better solution can be obtained.
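The single-coordinate update just described can be sketched as follows. This is our simplified reading, not the exact Algorithm 1: in particular, we clip the Newton iterate at \(\epsilon \) before the sign test so that \(f'_{ik}\) is always evaluated at a feasible point.

```python
import numpy as np

def update_Wik(X, W, H, alpha, i, k, eps=1e-6):
    """One safeguarded Newton update of W[i, k], all other variables fixed."""
    def fp(w):   # f'_ik(w)
        y = W[i] @ H.T + (w - W[i, k]) * H[:, k]
        return np.sum(H[:, k] * (1.0 - (X[i] / y)**alpha)) / alpha

    def fpp(w):  # f''_ik(w), positive by strict convexity
        y = W[i] @ H.T + (w - W[i, k]) * H[:, k]
        return np.sum(H[:, k]**2 * X[i]**alpha * y**(-alpha - 1.0))

    wc = W[i, k]
    d = fp(wc)
    if d == 0.0:                         # wc already minimizes f_ik
        return wc
    wn = max(wc - d / fpp(wc), eps)      # Newton step, clipped to the feasible set
    dn = fp(wn)
    if dn * d < 0.0:                     # overshoot: secant (interpolation) step
        wn = max(wc - d * (wn - wc) / (dn - d), eps)
    return wn
```

Because \(f'_{ik}\) is monotone increasing, the interpolated point lies between the current iterate and the root, so the objective cannot increase.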
The problem of minimizing the value of the objective function by updating only \(H_{jk}\) is formulated as
$$\begin{aligned} \mathop {\text {minimize}}\limits _{H_{jk}}\ g_{jk}(H_{jk}) \quad \text {subject to}\quad H_{jk} \ge \epsilon \end{aligned}$$
(7)
where \(g_{jk}(H_{jk})\) is the function obtained from \(D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T})\) by fixing all variables except \(H_{jk}\) to the current values. Using the same idea as above, we can derive an update rule for \(H_{jk}\) as shown in Fig. 3.
It is clear from the derivation of Algorithms 1 and 2 that the following two lemmas hold true.
Lemma 2
If \(W_{ik}\) is not the optimal solution of the subproblem (6) then \(W_{ik}^\mathrm {new}\) obtained by Algorithm 1 satisfies \(f_{ik}(W_{ik}^\mathrm {new}) < f_{ik}(W_{ik})\). Similarly, if \(H_{jk}\) is not the optimal solution of the subproblem (7) then \(H_{jk}^\mathrm {new}\) obtained by Algorithm 2 satisfies \(g_{jk}(H_{jk}^\mathrm {new}) < g_{jk}(H_{jk})\).
Lemma 3
Suppose that \(\varvec{X} \in \mathbb {R}_{++}^{M \times N}\), \(\alpha \in \mathbb {R}\setminus \{0,1\}\) and \(\epsilon \in \mathbb {R}_{++}\) are fixed. Then \(W_{ik}^\mathrm {new}\), the output of Algorithm 1, depends continuously on \(\varvec{W}\) and \(\varvec{H}\) for any i and k. Similarly, \(H_{jk}^\mathrm {new}\), the output of Algorithm 2, depends continuously on \(\varvec{W}\) and \(\varvec{H}\) for any j and k.
Furthermore, using Zangwill’s global convergence theorem [16], we obtain the following theorem.
Theorem 1
Given \(\varvec{X} \in \mathbb {R}_{++}^{M \times N}\), \(K \in \mathbb {N}\), \(\alpha \in \mathbb {R}\setminus \{0,1\}\), \(\epsilon \in \mathbb {R}_{++}\), and an initial solution \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in \mathcal {F}_{\epsilon }\), we apply the update rules described by Algorithms 1 and 2 to \(MK+NK\) variables in a fixed cyclic order. Let \((\varvec{W}^{(l)},\varvec{H}^{(l)}) \in \mathcal {F}_{\epsilon }\) be the solution after l rounds of updates. Then the sequence \(\{\varvec{W}^{(l)},\varvec{H}^{(l)}\}_{l=0}^{\infty }\) has at least one convergent subsequence and the limit of any convergent subsequence is a stationary point of the problem (3).
Proof
Let us express the relation between \((\varvec{W}^{(l)},\varvec{H}^{(l)})\) and \((\varvec{W}^{(l+1)},\varvec{H}^{(l+1)})\) by using a mapping \(A: \mathcal {F}_{\epsilon } \rightarrow \mathcal {F}_{\epsilon }\) as follows:
$$\begin{aligned} (\varvec{W}^{(l+1)},\varvec{H}^{(l+1)}) = A(\varvec{W}^{(l)},\varvec{H}^{(l)}), \quad l = 0, 1, 2, \ldots \end{aligned}$$
In view of Zangwill’s global convergence theorem [16], it suffices to show that the following statements hold true.
1. (Boundedness) For any initial solution \((\varvec{W}^{(0)},\varvec{H}^{(0)}) \in \mathcal {F}_{\epsilon }\), the sequence \(\{(\varvec{W}^{(l)},\varvec{H}^{(l)})\}_{l=0}^{\infty }\) belongs to a compact subset of \(\mathcal {F}_{\epsilon }\).
2. (Monotoneness) The objective function \(D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T})\) satisfies
$$\begin{aligned} (\varvec{W},\varvec{H}) \not \in \mathcal {S}_{\epsilon }&\Rightarrow D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}'(\varvec{H}')^\mathrm {T}) < D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \\ (\varvec{W},\varvec{H}) \in \mathcal {S}_{\epsilon }&\Rightarrow D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}'(\varvec{H}')^\mathrm {T}) \le D_{\alpha }(\varvec{X}\,\Vert \,\varvec{W}\varvec{H}^\mathrm {T}) \end{aligned}$$where \(\mathcal {S}_{\epsilon }\) is the set of stationary points of (3) and \((\varvec{W}',\varvec{H}')=A(\varvec{W},\varvec{H})\).
3. (Continuity) The mapping A is continuous in \(\mathcal {F}_{\epsilon } \setminus \mathcal {S}_{\epsilon }\).
The monotoneness follows from Lemma 2. The boundedness follows from Lemmas 1 and 2. The continuity follows from Lemma 3. \(\square \)
By Theorem 1, we can immediately obtain an algorithm that stops within a finite number of rounds by relaxing the optimality conditions (4) and (5), as shown in [7]. The resulting algorithm is shown in Fig. 4.
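The overall procedure can be sketched end to end as follows. This is a simplified stand-in for Algorithm 3, under stated assumptions: the `tol` threshold on the largest coordinate move replaces the paper's relaxed optimality test with its \(\delta \) parameters, and the initialization is ours.

```python
import numpy as np

def newton_cd_nmf(X, K, alpha, eps=1e-6, tol=1e-4, max_rounds=400, seed=0):
    """Cyclic coordinate descent with safeguarded Newton steps (a sketch)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = np.maximum(rng.uniform(size=(M, K)), eps)
    H = np.maximum(rng.uniform(size=(N, K)), eps)

    def step(wc, x, h, y):
        # safeguarded Newton update of one variable; x, h are the relevant
        # data and factor slices, y the current row/column of W H^T
        d = np.sum(h * (1.0 - (x / y)**alpha)) / alpha          # f'
        if d == 0.0:
            return wc
        d2 = np.sum(h**2 * x**alpha * y**(-alpha - 1.0))        # f''
        wn = max(wc - d / d2, eps)                              # Newton step
        yn = y + (wn - wc) * h
        dn = np.sum(h * (1.0 - (x / yn)**alpha)) / alpha
        if dn * d < 0.0:                                        # secant safeguard
            wn = max(wc - d * (wn - wc) / (dn - d), eps)
        return wn

    for _ in range(max_rounds):
        moved = 0.0
        for i in range(M):
            for k in range(K):
                w = step(W[i, k], X[i], H[:, k], W[i] @ H.T)
                moved = max(moved, abs(w - W[i, k]))
                W[i, k] = w
        for j in range(N):
            for k in range(K):
                h = step(H[j, k], X[:, j], W[:, k], W @ H[j])
                moved = max(moved, abs(h - H[j, k]))
                H[j, k] = h
        if moved < tol:
            break
    return W, H
```

Recomputing the row/column of \(\varvec{W}\varvec{H}^\mathrm {T}\) before each scalar update keeps the sketch simple; an efficient implementation would maintain it incrementally.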
Theorem 2
For any input, Algorithm 3 stops within a finite number of rounds.
4 Numerical Experiments
In order to evaluate the efficiency of the proposed algorithm, we applied it to a randomly generated matrix \(\varvec{X}\) and compared the results with those obtained using the multiplicative update rule [6, 8] described by
$$\begin{aligned} W_{ik}&\leftarrow \max \left\{ \epsilon ,\ W_{ik}\left( \frac{\sum _{j=1}^{N} H_{jk}\,(X_{ij}/(\varvec{W}\varvec{H}^\mathrm {T})_{ij})^{\alpha }}{\sum _{j=1}^{N} H_{jk}}\right) ^{1/\alpha }\right\} \\ H_{jk}&\leftarrow \max \left\{ \epsilon ,\ H_{jk}\left( \frac{\sum _{i=1}^{M} W_{ik}\,(X_{ij}/(\varvec{W}\varvec{H}^\mathrm {T})_{ij})^{\alpha }}{\sum _{i=1}^{M} W_{ik}}\right) ^{1/\alpha }\right\} \end{aligned}$$
and the same stopping condition. Although some other methods have been proposed (see [13] for example), we do not consider them here because their global convergence is not guaranteed.
In all experiments, \(\varvec{X}\) was set to the same \(40 \times 20\) matrix whose entries were drawn independently from the uniform distribution on the interval [0, 1]. The value of K was set to 5. The parameters in Algorithm 3 were set to \(\epsilon =10^{-6}\), \(\delta _1=10^{-4}\), \(\delta _2=10^{-6}\), \(u=1.0\) and \(\delta _3=0.5\times \delta _1\). The value of the parameter \(\alpha \) in the alpha-divergence was set to \(-1.5\), 0.5 and 2.5. For each value of \(\alpha \), the multiplicative update rule and the proposed algorithm were run 10 times with 10 different initial solutions, which were generated in the same way as \(\varvec{X}\) except that all entries less than \(\epsilon \) were replaced with \(\epsilon \) so that each initial solution belongs to the feasible region of the optimization problem. The maximum number of rounds was set to 40,000; that is, if the solution did not satisfy the stopping condition within 40,000 rounds, the algorithm was forcibly stopped. All algorithms were implemented in the C language, compiled with gcc 5.3.0, and tested on a PC with an Intel Core i5-4590 and 8 GB RAM.
The results are shown in Tables 1 and 2. It is easily seen from those tables that the proposed algorithm is much faster than the multiplicative update rule.
5 Conclusion
We have proposed a novel iterative algorithm, which is based on the coordinate descent and the Newton method, for NMF with the alpha-divergence. The proposed algorithm not only has the global convergence property like the multiplicative update rule but also is much faster than the multiplicative update rule, as shown in the experimental results in the previous section. Further experiments with various real data should be performed in the near future to evaluate the efficiency of the proposed algorithm.
References
Paatero, P., Tapper, U.: Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126 (1994)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–792 (1999)
Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 556–562 (2001)
Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009)
Févotte, C., Idier, J.: Algorithms for nonnegative matrix factorization with the \(\beta \)-divergence. Neural Comput. 23(9), 2421–2456 (2011)
Yang, Z., Oja, E.: Unified development of multiplicative algorithm for linear and quadratic nonnegative matrix factorization. IEEE Trans. Neural Networks 22(12), 1878–1891 (2011)
Takahashi, N., Hibi, R.: Global convergence of modified multiplicative updates for nonnegative matrix factorization. Comput. Optim. Appl. 57, 417–440 (2014)
Takahashi, N., Katayama, J., Takeuchi, J.: A generalized sufficient condition for global convergence of modified multiplicative updates for NMF. In: Proceedings of 2014 International Symposium on Nonlinear Theory and Its Applications, pp. 44–47 (2014)
Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorization: a unified view based on block coordinate descent framework. J. Global Optim. 58(2), 285–319 (2014)
Hansen, S., Plantenga, T., Kolda, T.G.: Newton-based optimization for Kullback-Leibler nonnegative tensor factorizations. Optim. Methods Softw. 30(5), 1002–1029 (2015)
Amari, S.I.: Differential-Geometrical Methods in Statistics. Springer, New York (1985)
Cichocki, A., Zdunek, R., Amari, S.I.: Csiszar’s divergences for non-negative matrix factorization: family of new algorithms. In: Proceedings of the 6th International Conference on Independent Component Analysis and Signal Separation, pp. 32–39 (2006)
Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.I.: Nonnegative Matrix and Tensor Factorizations. Wiley, West Sussex (2009)
Kimura, T., Takahashi, N.: Global convergence of a modified HALS algorithm for nonnegative matrix factorization. In: Proceedings of 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, pp. 21–24 (2015)
Takahashi, N., Seki, M.: Multiplicative update for a class of constrained optimization problems related to NMF and its global convergence. In: Proceedings of 2016 European Signal Processing Conference, pp. 438–442 (2016)
Zangwill, W.: Nonlinear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs (1969)
Acknowledgments
This work was partially supported by JSPS KAKENHI Grant Number JP15K00035.
Nakatsu, S., Takahashi, N. (2017). A Novel Newton-Type Algorithm for Nonnegative Matrix Factorization with Alpha-Divergence. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, E.S. (eds.) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol. 10634. Springer, Cham. https://doi.org/10.1007/978-3-319-70087-8_36