1 Introduction

Given a nonnegative data matrix \(V\in \mathbb {R}_{+}^{m\times n}\) and a positive integer \(r\le \min (m,n)\), nonnegative matrix factorization (NMF) is the problem of finding \(W\in \mathbb {R}_{+}^{m\times r}\) and \(H\in \mathbb {R}_{+}^{r\times n}\) such that \(V\approx WH\). The quality of the approximation is measured using an objective function, which typically has the form

$$\begin{aligned} D\left( V|WH\right) \triangleq \sum _{i=1}^{m}\sum _{j=1}^{n}d\left( V_{ij}|(WH)_{ij}\right) , \end{aligned}$$

where d(x|y) is a scalar cost function such that \(d(x|y) \ge 0\) for all \(x,y \ge 0\) and \(d(x|y) = 0\) if and only if \(x=y\). NMF is then written as the following problem

$$\begin{aligned} \min _{W\in \mathbb {R}^{m\times r}_+,H\in \mathbb {R}^{r\times n}_+} D\left( V|WH\right) . \end{aligned}$$

The most widely used class of the scalar cost functions d(x|y) is the \(\beta \)-divergence [15]

$$\begin{aligned} d_{\beta }(x|y)={\left\{ \begin{array}{ll} \frac{1}{\beta (\beta -1)}\big (x^{\beta }+(\beta -1)y^{\beta }-\beta xy^{\beta -1}\big ) &{}\mathrm{if}\,\beta \in \mathbb {R}\setminus \left\{ 0,1\right\} , \\ x\log \frac{x}{y}-x+y &{} \mathrm{if}\,\beta =1,\\ \frac{x}{y}-\log \frac{x}{y}-1 &{}\mathrm{if}\,\beta =0. \end{array}\right. } \end{aligned}$$
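
For concreteness, the scalar \(\beta \)-divergence can be evaluated directly from this definition; the following Matlab sketch (the function name is ours) computes \(d_{\beta }(x|y)\) entrywise for \(x,y>0\).

    % Scalar beta-divergence d_beta(x|y), evaluated entrywise; assumes x, y > 0.
    function d = beta_divergence(x, y, beta)
        if beta == 1        % Kullback-Leibler (I-)divergence
            d = x.*log(x./y) - x + y;
        elseif beta == 0    % Itakura-Saito divergence
            d = x./y - log(x./y) - 1;
        else                % beta in R \ {0,1}
            d = (x.^beta + (beta-1)*y.^beta - beta*x.*y.^(beta-1)) / (beta*(beta-1));
        end
    end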

In this paper, we are interested in the Kullback–Leibler (KL) divergence (also known as the I-divergence), which is the \(\beta \)-divergence with \(\beta =1\). The KL NMF problem can be rewritten as follows:

$$\begin{aligned} \min _{W\in \mathbb {R}^{m\times r}_+,H\in \mathbb {R}^{r\times n}_+} D_\mathrm{KL} (V|WH), \end{aligned}$$
(1)

where

$$\begin{aligned} D_\mathrm{KL} (V|WH) := \sum _{i=1}^m \sum _{j=1}^n\big ((WH)_{ij}- V_{ij}\log (WH)_{ij}\big ) +\sum _{i=1}^m \sum _{j=1}^n \big (V_{ij}\log V_{ij}-V_{ij} \big ). \end{aligned}$$

With the convention that \(0 \times \log 0 =0\) and \(a \times \log 0 = -\infty \) for \(a>0\), the objective function in (1) is well-defined and it is an extended-value function, that is, \(D_\mathrm{KL}\in [0,+\infty ]\). KL NMF (1) is well-posed, that is, a solution always exists (see Sect. 2.1 for more details), but the solution is in general non-unique even when removing the scaling and permutation ambiguities of the low-rank approximation WH; see [17] and the references therein.
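
To make these conventions concrete, the objective of (1) can be evaluated as follows in Matlab (a minimal sketch; the function name is ours). Entries with \(V_{ij}=0\) contribute only \((WH)_{ij}\), in line with \(0\times \log 0=0\), and the function returns \(+\infty \) whenever \(V_{ij}>0\) while \((WH)_{ij}=0\).

    % KL NMF objective D_KL(V|WH), using the convention 0*log(0) = 0.
    function D = kl_objective(V, W, H)
        X = W*H;
        P = V > 0;                        % entries where V_ij > 0
        D = sum(X(:)) ...                 % sum_ij (WH)_ij
            - sum(V(P).*log(X(P))) ...    % - sum_ij V_ij log (WH)_ij
            + sum(V(P).*log(V(P))) ...    % + sum_ij V_ij log V_ij
            - sum(V(P));                  % - sum_ij V_ij
    end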

1.1 Motivation

Since the seminal paper of Lee and Seung [31], NMF has been shown to be a very powerful model to extract perceptually meaningful features from high-dimensional data sets. Applications include facial feature extraction [22, 31], topic recovery and document classification [11, 31, 43], and unmixing hyperspectral images [4, 39]; see also [9, 17, 18] and the references therein. The most widely used objective function for NMF is the Frobenius norm \(\Vert V - WH\Vert _F^2\), which corresponds to the \(\beta \)-divergence with \(\beta = 2\). For nonnegative data sets, however, the Frobenius norm is not the theoretically most reasonable choice. In fact, it corresponds to the maximum likelihood estimator in the presence of additive i.i.d. Gaussian noise. Under this noise distribution, observing negative entries in a data set has a positive probability. Moreover, many nonnegative data sets are sparse, in which case Gaussian noise is clearly not appropriate [8, 14, 31].

Let us assume that \(V_{ij}\) is a sample of the random variable \(\tilde{V}_{ij}\) following the Poisson distribution of parameter \((WH)_{ij}\), that is,

$$\begin{aligned} \mathbb {P}\big (\tilde{V}_{ij}=k\big ) = \frac{(WH)_{ij}^k \, e^{-(WH)_{ij}}}{k!} \text { for } k=0,1,2, \ldots \end{aligned}$$

Then the maximum likelihood estimator of W and H, given V, is the solution of (1). The Poisson distribution is particularly well suited for integer-valued data sets, such as documents represented by vectors of word counts [8, 33], or images which can be interpreted as a photon counting process [23]. KL NMF has also been used successfully in bioinformatics [35], e.g., to cluster samples of RNA sequencing gene expression data [10]. It is worth noting that KL NMF also makes sense when V contains non-integer entries. In that case, \(V_{ij}\) can be interpreted as the average of several samples of the random variable \(\tilde{V}_{ij}\).

In the literature, NMF with the Frobenius norm (Fro NMF) has been thoroughly studied and well documented. Among algorithms for Fro NMF, block coordinate (BC) methods, which update one block of variables at a time, have the best performance in practice; see for example [1, 9, 18, 24, 28, 49] and the references therein. Note that one “block” here can be a full matrix, a column, a row, or even just a scalar component of the matrices W or H. The objective function of Fro NMF has several nice properties that allow us to apply advanced developments of BC methods for composite block-wise convex optimization problems, which subsume Fro NMF as a special case, and to derive very efficient algorithms with rigorous convergence guarantees [24, 49]. One of the most important properties is that the gradients of the objective with respect to W and H, that is, \(W\mapsto \nabla _W D_\mathrm{{Fro}}\) and \(H\mapsto \nabla _H D_\mathrm{{Fro}}\), are Lipschitz continuous over \(W \in \mathbb {R}^{m\times r}_+\) and \(H \in \mathbb {R}^{r\times n}_+\). Although the objective of KL NMF, similarly to \(D_\mathrm{{Fro}}\), is block-wise convex (that is, the functions \(W\mapsto D_\mathrm{{KL}}\) and \(H\mapsto D_\mathrm{{KL}}\) are convex), \(D_\mathrm{KL}\) is not even differentiable at W or H when \((WH)_{ij}=0\) for some ij. Since \(D_\mathrm{{KL}}\) does not possess the nice smoothness properties of Fro NMF, the extension of the analysis of BC methods from Fro NMF to KL NMF is limited. Designing a good algorithm for KL NMF is therefore more difficult than for Fro NMF. In fact, there are far fewer papers studying algorithms for KL NMF in the literature; in particular, algorithms with convergence guarantees are scarce. To the best of our knowledge, the multiplicative updates (MU) applied to a slightly modified version of KL NMF (see Sect. 3.1 for more details) are the only algorithm with a subsequential convergence guarantee. Moreover, the MU are the most widely used algorithm for KL NMF, whereas it is well known that, for Fro NMF, the MU are slow and should not be used; see [18] and the references therein. These observations motivate this work, which analyzes in detail the properties of, and algorithms for, KL NMF (1).

1.2 Contribution and Outline

Our main contribution is threefold.

  1.

    In Sect. 2, we present important properties of KL NMF (1) for the convergence analysis of KL-NMF algorithms. Some are well-known while others have not yet been highlighted in the literature, to the best of our knowledge.

  2.

    While existing algorithms for (1) are briefly reviewed in Sect. 3, we present two new algorithms in Sect. 4. They are (i) a block mirror descent method (BMD), for which we prove the global convergence of its generated sequence to a stationary point of a slightly perturbed version of KL NMF, and (ii) a scalar Newton-type algorithm which monotonically decreases the objective function. To the best of our knowledge, BMD is the first algorithm for KL NMF that has a global convergence guarantee. We also propose a hybridization between the scalar Newton algorithm and the MU for which the objective function is guaranteed to be non-increasing.

  3.

    In Sect. 5, we perform extensive numerical experiments on synthetic as well as real data sets to compare the performance of the algorithms. To the best of our knowledge, this is the first time such a comparison is performed. It provides a good picture on the performance of the algorithms in different scenarios.

We hope that the paper will be a good reference for whoever uses KL NMF, or is interested in KL NMF algorithms.

2 Some Properties of KL NMF

In the remainder of the paper, for simplicity, we denote \(D(V|WH) = D_\mathrm{{KL}}(V|WH)\) since we only consider the KL divergence. The objective function of KL NMF (1) is not finite at every point of its constraint set since \(D(V|WH)=\infty \) when \(V_{ij}>0\) and \((WH)_{ij}=0\). This fact makes the convergence analysis of algorithms for (1) very challenging. Hence, in parallel with considering KL NMF (1), we propose to study the following perturbed version

$$\begin{aligned} \min _{W, H}&\quad D(V|WH) \nonumber \\ \text {such that }&W_{ik}\ge \varepsilon , H_{kj} \ge \varepsilon \text { for } i \in [m], j \in [n], k \in [r], \end{aligned}$$
(2)

where \(\varepsilon \ge 0\), and [n] denotes the set \(\{1,\ldots ,n\}\). Problem (2) is equivalent to KL NMF (1) when \(\varepsilon =0\). When \(\varepsilon >0\), the objective of (2) is finite at every point of its constraint set. Moreover, a solution of (2) has all its entries strictly positive when \(\varepsilon > 0\). However, for \(\varepsilon \) sufficiently small (we recommend using the machine precision), such entries can be considered as zeros, which does not influence the objective function of KL NMF much; see Proposition 1 below.

2.1 Existence of Solutions

The following proposition proves the existence of solutions of Problem (2) and provides a connection between the optimal value of (1) and its perturbed problem (2) with \(\varepsilon >0\). The proof is given in Appendix 1.

Proposition 1

(A) Given \(\varepsilon \ge 0\) and a nonnegative matrix V, Problem (2) attains its minimum, that is, it has at least one optimal solution.

(B) Let \( \nu =\sum _{i=1}^m\sum _{j=1}^n V_{ij}\), and let \(D^*(V,\varepsilon )\) denote the optimal value of Problem (2). We have \(D^*(V,\varepsilon ) \le D^*(V,0) + (\min \{ n+ mr, m+nr\} \sqrt{\nu }+mn\varepsilon )\varepsilon . \)

Proposition 1(A) is known for the case \(\varepsilon =0\); see for example [16, Proposition 2.1], while Proposition 1(B) is new. Given \(\delta \ge 0\), Proposition 1(B) shows that by choosing \(\varepsilon \) such that \( \min \{n+mr,m+nr\}\sqrt{\nu } \varepsilon + mn \varepsilon ^2 =\delta , \) an optimal solution of (2) is a \(\delta \)-optimal solution of KL NMF (1). In other words, for \(\varepsilon \) sufficiently small, the optimal value of (2) is not significantly larger than that of (1). However, this says nothing about the corresponding optimal solutions. To the best of our knowledge, no sensitivity analysis of the optimal solutions exists in the NMF literature. In fact, there could exist several isolated optimal solutions (see, e.g., [19] for some examples), which makes a sensitivity analysis difficult in the general case because the optimal solution can change drastically under an arbitrarily small perturbation (leading to an infinite condition number of the problem). This is an interesting direction for further research.

2.2 KKT and Stationary Points

A pair \((W^*,H^*)\) is a Karush–Kuhn–Tucker (KKT) point of (2) if it satisfies the KKT optimality condition of (2), that is, if the objective function \(f(W,H) := D(V,WH)\) is differentiable at \((W^*,H^*)\), and for all \(i\in [m]\) and \(k\in [r]\): \(W_{ik}^*\ge \varepsilon \) and

$$\begin{aligned} \begin{aligned} \frac{\partial f(W^*,H^*)}{\partial W_{ik}} = \sum _{j=1}^{n}H_{kj}^*-\sum _{\{j \,|\, V_{ij}>0\}}V_{ij}\frac{H_{kj}^*}{(W^*H^*)_{ij}}&\ge 0, \quad \mathrm{(3a)}\\ (W^*_{ik}-\varepsilon )\frac{\partial f(W^*,H^*)}{\partial W_{ik}}&=0, \quad \mathrm{(3b)} \end{aligned} \end{aligned}$$
(3)

and similarly for \(H^*\), by symmetry. A pair \((W^*,H^*)\) is a stationary point of (2) if it is a feasible point of (2) that lies in the domain of f and for all \(i\in [m]\) and \(k\in [r]\), it satisfies

$$\begin{aligned} \frac{\partial f(W^*,H^*)}{\partial W_{ik}}(W_{ik}-W_{ik}^*)\ge 0 \text { for any } W_{ik}\ge \varepsilon , \end{aligned}$$
(4)

and similarly for \(H^*\). For Problem (2), it turns out that KKT points and stationary points coincide. Since \(W_{ik}^*-\varepsilon \ge 0\), (3a) and (3b) hold if and only if (4) holds. Indeed, let \((W^*,H^*)\) satisfy (4) then choosing \(W_{ik}\) in (4) to be \(W_{ik}^*+1\) gives (3a) while choosing \(W_{ik}\) in (4) to be \(\varepsilon \) or \(2W_{ik}^*-\varepsilon \) gives (3b). The reverse direction is obvious. The same reasoning applies to \(H^*\).

The following proposition provides an interesting property of KKT points of KL NMF (1), see [25, Theorem 1] for its proof. Note that this property does not hold for its perturbed variant (2) with \(\varepsilon >0\).

Proposition 2

If \((W^*,H^*)\) is a KKT point of (1), then \(W^*H^*\) preserves the row sums and the column sums of V, that is,

$$\begin{aligned} V e =(W^*H^*)e \quad \text { and } \quad e^\top V = e^\top (W^*H^*), \end{aligned}$$

where e denotes the vector of all ones of appropriate dimension.

Let us define the following notion of a scaled pair (WH).

Definition 1

We say that (W, H) is scaled if the optimal solution of the problem \( \min _{\alpha \in \mathbb {R}} D(V,\alpha WH) \) is equal to 1.

Hence, if (W, H) is scaled, then one cannot multiply WH by a constant \(\alpha \) (that is, one cannot scale WH) to reduce the error \(D(V,WH)\).

By Proposition 2, if \((W^*,H^*)\) is a KKT point of (1), then \(\sum _{i=1}^m \sum _{j=1}^n V_{ij} = \sum _{i=1}^m \sum _{j=1}^n (W^*H^*)_{ij}\), and hence \((W^*,H^*)\) is scaled. Moreover, we have the following result.

Proposition 3

A pair (W, H) is scaled if and only if

$$\begin{aligned} \sum _{i=1}^m \sum _{j=1}^n V_{ij} = \sum _{i=1}^m \sum _{j=1}^n (WH)_{ij}. \end{aligned}$$

Proof

We have

$$\begin{aligned} D(V,\alpha WH)&=\alpha \sum _{i,j}(WH)_{ij}-\sum _{i,j}V_{ij}\log (WH)_{ij}\\&\quad -\sum _{i,j}V_{ij}\log \alpha + \sum _{i,j} (V_{ij} \log V_{ij} - V_{ij}). \end{aligned}$$

The result follows by setting \(\nabla _\alpha D(V,\alpha WH)=\sum _{i,j}(WH)_{ij}-\frac{1}{\alpha }\sum _{i,j}V_{ij}\) to zero: the minimizer is \(\alpha =\frac{\sum _{i,j}V_{ij}}{\sum _{i,j}(WH)_{ij}}\), which equals 1 if and only if the two sums coincide. \(\square \)

Proposition 3, although simple to prove, does not appear explicitly in the literature. However, it is rather interesting from a practical point of view: any feasible solution (W, H) of KL NMF can be improved simply by scaling it. In fact, combining Definition 1 and Proposition 3, we can compute

$$\begin{aligned} \alpha ^* = {{\,\mathrm{argmin}\,}}_{\alpha \in \mathbb {R}} D(V,\alpha WH) = \frac{e^\top V e}{e^\top (WH) e}, \end{aligned}$$

and scale \(W \; \leftarrow \; \alpha ^* \, W \) to improve the feasible solution (W, H). This could be used within any algorithm for KL NMF.
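
In Matlab, this scaling step takes two lines (a sketch, assuming the current iterate satisfies \(\sum _{i,j}(WH)_{ij}>0\)):

    % Optimal scaling of (W,H), cf. Proposition 3: multiply W by alpha* = (e'Ve)/(e'(WH)e).
    alpha = sum(V(:)) / sum(sum(W*H));
    W     = alpha * W;      % after this step, (W,H) is scaled

Equivalently, one can split the factor by multiplying both W and H by \(\sqrt{\alpha ^*}\), which is what we do to scale the random initial points in Sect. 5.1.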

2.3 Relative Smoothness

In Sect. 4.1, we will propose a new algorithm that globally converges to a stationary point of (2) with \(\varepsilon >0\), namely a block mirror descent method. An instrumental property of (2) to prove such a result is relative smoothness. Let us describe this property in detail.

The objective D(V|WH) is convex in each block variable W and H, but it is not jointly convex in (W, H). Furthermore, D(V|WH) does not possess the Lipschitz smoothness property, that is, the gradient of D(V|WH) with respect to W or H is not Lipschitz continuous. Recently, the authors of [3] and [37] introduced the notion of relative smoothness, which generalizes Lipschitz smoothness.

Definition 2

[37, Definition 1.1] Let \(\kappa (\cdot )\) be any given differentiable convex function defined on a convex set Q. The convex function \( g(\cdot )\) is L-smooth relative to \(\kappa (\cdot )\) on Q if for any \(x, y \in \mathrm{int} Q\), there is a scalar L for which

$$\begin{aligned} g(y) \le g(x) + \langle \nabla g(x), y - x\rangle + L \mathcal B_\kappa (y, x), \end{aligned}$$
(5)

where

$$\begin{aligned} \mathcal B_\kappa (y,x):=\kappa (y)-\kappa (x)-\langle \nabla \kappa (x), y-x \rangle , \text { for all } \, x,y\in Q. \end{aligned}$$
(6)

The following proposition shows that, when restricted to a column of H, the KL objective function is a relative smooth function. Since \(D(V|WH)=D(V^\top |H^\top W^\top )\), this result implies that the KL objective function when restricted to a row of W is also relative smooth.

Proposition 4

[3, Lemma 7] Let \(v \in \mathbb {R}^m_+\) and \(W \in \mathbb {R}^{m \times r}_+\), and

$$\begin{aligned} D(v|Wh):=\sum _{i=1}^m \Big ((Wh)_i - v_i \log (Wh)_i + v_i \log v_i - v_i\Big ). \end{aligned}$$
(7)

Then the function \(h\mapsto D(v|Wh)\) is relative smooth to \(\kappa (h)=-\sum _{j=1}^r\log h_{j} \) with the relative smooth constant \(L=\Vert v\Vert _{1}\).
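
To give some intuition for Proposition 4 (this is only a sketch of the argument, not the proof from [3]): for twice continuously differentiable functions, L-relative smoothness on the interior of Q amounts to the Hessian comparison \(\nabla ^2 g(x) \preceq L \nabla ^2 \kappa (x)\). Denoting by \(w_i^\top \) the i-th row of W, we have, for \(h>0\) and any direction \(d\in \mathbb {R}^r\),

$$\begin{aligned} d^\top \nabla ^2_h D(v|Wh)\, d = \sum _{i=1}^m v_i \frac{(w_i^\top d)^2}{(w_i^\top h)^2} \quad \text {with} \quad \frac{(w_i^\top d)^2}{(w_i^\top h)^2} \le \sum _{j=1}^r \frac{w_{ij}h_j}{w_i^\top h}\frac{d_j^2}{h_j^2} \le \sum _{j=1}^r \frac{d_j^2}{h_j^2} = d^\top \nabla ^2 \kappa (h)\, d, \end{aligned}$$

where the first inequality is Cauchy–Schwarz with weights \(w_{ij}h_j\) and the second one holds because the weights \(\frac{w_{ij}h_j}{w_i^\top h}\) sum to one. Summing over i with weights \(v_i\) then gives \(\nabla ^2_h D(v|Wh) \preceq \Vert v\Vert _1 \nabla ^2 \kappa (h)\).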

2.4 Self-concordant Properties

In Sect. 4.2, we will propose a new monotone algorithm to solve (2), namely a scalar Newton-type algorithm. An instrumental property of (2) used to prove the monotonicity is the self-concordance of its objective function. Let us first define a self-concordant function. We adopt the definition in [40, Chapter 5].

Consider a closed convex function \(g: \mathbb E \rightarrow \mathbb R\) with an open domain. Fixing a point \(x\in \mathrm{dom} (g)\) and a direction \(d\in \mathbb E\), let \(\varphi _x(\tau )=g(x+\tau d)\). We define \( \mathbf{D} g(x)[d] = \varphi _x'(0)\), \(\mathbf{D}^2 g(x)[d,d]=\varphi _x''(0)\), and \( \mathbf{D}^3 g(x)[d,d,d]=\varphi _x'''(0). \)

Definition 3

We say the function g belongs to the class \(\mathcal {F}_{M}\) of self-concordant functions with parameter \(M\ge 0\), if

$$\begin{aligned} \big |\mathbf{D}^3 g(x)[d,d,d]\big |\le 2 M \Vert d\Vert ^3_{\nabla ^2 g(x)}, \end{aligned}$$

where \(\Vert d\Vert ^2_{\nabla ^2 g(x)}=\langle \nabla ^2 g(x)d,d\rangle =\varphi _x''(0)\). The function g is called standard self-concordant when \(M=1\).

The scalar function \(g(x)=-\log (x)\) is a standard self-concordant function; see [40, Example 5.1.1]. The following proposition provides some useful properties to determine the self-concordant constant of a function.

Proposition 5

[40, Theorems 5.1.1 and 5.1.2]  

  (i)

    Let \(g_1, g_2\) be self-concordant functions with constants \(M_1, M_2\). Then the function \(g(x)=\alpha g_1(x) + \beta g_2(x)\), where \(\alpha , \beta \) are positive constants, is self-concordant with constant \(M_g = \max \big \{\frac{1}{\sqrt{\alpha }} M_1, \frac{1}{\sqrt{\beta }}M_2 \big \}\).

  (ii)

    If \(g(\cdot )\) is self-concordant with constant \(M_g\), then \(\phi (x)=g(\mathcal A(x))\), where \(\mathcal A(x)=Ax+b\) is a linear operator, is also self-concordant with constant \(M_\phi =M_g\).

Using Proposition 5, we can prove that the objective of the perturbed KL NMF problem (2) is self-concordant with respect to a single entry of W and H.

Proposition 6

Given \(V \in \mathbb {R}^{m \times n}_+\), \(W \in \mathbb {R}^{m \times r}_+\) and \(H \in \mathbb {R}^{r \times n}_+\), the scalar function

$$\begin{aligned} H_{kj} \mapsto D(V|WH) = \sum _{i=1}^m \Big (\sum _{a=1}^r W_{ia}H_{aj}-V_{ij}\log \sum _{a=1}^r W_{ia}H_{aj}\Big ) + \mathrm{constant} \end{aligned}$$

is a self-concordant function with constant \( \mathbf{c}_{H_{kj}}=\max _{\{i | V_{ij}>0\}}\Big \{\frac{1}{\sqrt{V_{ij}}}\Big \}.\)
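
To see where this constant comes from (a sketch of the argument only), note that, as a function of \(t=H_{kj}\), the objective equals, up to an additive linear term that affects neither the second nor the third derivative,

$$\begin{aligned} t \mapsto -\sum _{\{i \,|\, V_{ij}>0\}} V_{ij}\log \big (c_i + W_{ik} t\big ), \quad \text {where } c_i = \sum _{a \ne k} W_{ia}H_{aj}. \end{aligned}$$

Each term \(-\log (c_i+W_{ik}t)\) is standard self-concordant by Proposition 5(ii), and applying Proposition 5(i) to this positive combination with coefficients \(V_{ij}\) yields the constant \(\max _{\{i | V_{ij}>0\}} 1/\sqrt{V_{ij}}\).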

3 Existing Algorithms

In this section, we briefly review the most efficient algorithms for KL NMF.

3.1 Multiplicative Updates

Let us consider the linear regression problem: Given \(v \in \mathbb {R}^m_+\) and \(W \in \mathbb {R}^{m \times r}_+\),

$$\begin{aligned} \mathrm{minimize}_{h \in \mathbb {R}^{r}_+} \; D_{\text {KL}}(v,Wh) . \end{aligned}$$

For W fixed in KL NMF, this is the subproblem to be solved for each column of H. The multiplicative updates (MU) for solving this problem are given by

$$\begin{aligned} h \; \leftarrow \; h \circ \left( \frac{W^\top \frac{[v]}{[Wh]}}{W^\top e} \right) , \end{aligned}$$

where \(\circ \) and \(\frac{[\cdot ]}{[\cdot ]}\) are the component-wise product and division between two matrices, respectively. They were derived by Richardson [42] and Lucy [38], who used them for image restoration, and rectification and deconvolution in statistical astronomy, respectively. This algorithm was later referred to as the Richardson–Lucy algorithm, and is guaranteed to decrease the KL NMF objective function.

In the context of NMF, Lee and Seung [31] derived these updates again, in a matrix form, to update W and H alternatively in KL NMF. The MU can be easily generalized to the perturbed KL NMF problem (2) [45], and are given by

$$\begin{aligned} H \leftarrow \max \left( \varepsilon , H \circ \frac{ \big [ W^\top \frac{[V]}{[WH]} \big ] }{ \big [ W^\top ee^\top \big ] } \right) , \quad W \leftarrow \max \left( \varepsilon , W \circ \frac{ \big [ \frac{[V]}{[WH]} H^\top \big ] }{ \big [ ee^\top H^\top \big ] } \right) . \end{aligned}$$
(8)

Note that, for \(\varepsilon = 0\), the \(\max (\varepsilon ,\cdot )\) can be removed since the entries of V, W and H are nonnegative, which corresponds to the MU used by Lee and Seung.
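
For reference, one iteration of the MU (8) takes only a few lines of Matlab (a sketch; the variable names are ours, and setting epsilon to 0 recovers the updates of Lee and Seung):

    % One iteration of the multiplicative updates (8) for KL NMF.
    [m, n] = size(V);
    E = ones(m, n);                                      % the all-ones matrix e*e'
    H = max(epsilon, H .* ( W'*(V./(W*H)) ) ./ (W'*E));
    W = max(epsilon, W .* ( (V./(W*H))*H' ) ./ (E*H'));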

The MU can be derived using the majorization–minimization (MM) framework, which is a two-step approach:

  1.

    majorization: find a majorizer, that is, a function that is equal to the objective function at the current iterate while being larger everywhere else on the feasible domain, and

  2.

    minimization: minimize the majorizer to obtain the next iterate.

We refer the interested reader to [13] for all the details. Moreover, the MU can be shown to belong to a specific MM framework, namely the block successive upper-bound minimization (BSUM) framework [26, 41]. For completeness, we describe the BSUM framework and its convergence guarantees in Appendix 1. This allows us to provide convergence guarantees for the MU. Let us analyze two cases separately.

3.1.1 Case \(\varepsilon =0\)

In this case, the MU are not well-defined if \((WH)_{ij}=0\) for some (i, j). Furthermore, the MU suffer from a zero-locking phenomenon, that is, the MU cannot modify an entry of W or H once it is equal to 0. This issue can be avoided by choosing an initial pair with strictly positive entries. Moreover, the objective function is not directionally differentiable when \((WH)_{ij}=0\) and \(V_{ij}>0\) for some ij. Hence, the convergence to stationary points obtained in [41, Theorem 2] for BSUM does not apply. In fact, [21] provided numerical evidence that the generated sequence may not converge to a KKT point, and [8, Section 6.2] gave an example where the MU may converge to a point that is not a KKT point. Although the convergence of the sequence generated by the MU is not guaranteed, it is worth noting that the MU in this case possess an interesting scale-invariance property.

Proposition 7

Let \(\varepsilon =0\) and denote by \(H^+\) (resp. \(W^+\)) the update of H (resp. W) after applying one MU (8) to H (resp. to W) at (W, H). Suppose the MU for H (resp. W) is well-defined, that is, \((WH)_{ij}>0\) for all ij and W has no zero column (resp. H has no zero row). Then the MU of H preserve the column sums of V, that is, \(e^\top V = e^\top (WH^+) \), while the MU of W preserve the row sums of V, that is, \( Ve = (W^+ H) e\).

Proof

We prove the result for H; the result for W can be obtained by symmetry. For \(j \in [n]\), we have

$$\begin{aligned} \sum _{i=1}^m(WH^+)_{ij}&=\sum _{i=1}^m\sum _{k=1}^r W_{ik}H^+_{kj}=\sum _{i=1}^m\sum _{k=1}^r W_{ik}H_{kj}\frac{\sum _{l=1}^m W_{lk}V_{lj}/(WH)_{lj}}{\sum _{l=1}^m W_{lk}}\\&=\sum _{k=1}^r \sum _{i=1}^m W_{ik}H_{kj}\frac{\sum _{l=1}^m W_{lk}V_{lj}/(WH)_{lj}}{\sum _{l=1}^m W_{lk}}=\sum _{k=1}^r H_{kj}\sum _{l=1}^m \frac{W_{lk}V_{lj}}{(WH)_{lj}} \\&=\sum _{l=1}^m \sum _{k=1}^r V_{lj} \frac{W_{lk}H_{kj}}{(WH)_{lj}}=\sum _{l=1}^m V_{lj}. \end{aligned}$$

\(\square \)

Proposition 7 shows that any iterate of the MU, except the first one, is scaled (Definition 1). Hence, the MU are sometimes used in the literature to scale the pair (W, H) within KL NMF algorithms.

3.1.2 Case \(\varepsilon >0\)

In this case, [41, Theorem 2] implies that every limit point of the sequence generated by the MU for solving Problem (2) with \(\varepsilon >0\) is a stationary point of Problem (2) (which is also a KKT point, see Sect. 2.2). It is worth noting that the subsequential convergence of the MU in this case can also be proved using Zangwill’s convergence theory, as in [45].

3.2 ADMM

The alternating direction method of multipliers (ADMM) is a standard technique to tackle low-rank matrix approximation problems [29], and it was used to solve KL NMF (1) in [44]. The first step is to reformulate (1) as

$$\begin{aligned} \begin{aligned} \min _{W,H,W_+,H_+}&\quad D(V | \hat{V}) \\ \text {such that }&\hat{V}=WH, W=W_+, H=H_+, W_+ \ge 0, H_+ \ge 0, \end{aligned} \end{aligned}$$
(9)

where \(\hat{V}\), \(W_+\) and \(H_+\) are auxiliary variables. ADMM alternately minimizes the augmented Lagrangian of (9) over the variables \((\hat{V},W,H,W_+,H_+)\), and updates the dual variables at each iteration. We refer the reader to [44] for more details. It is worth noting that the objective is not guaranteed to decrease monotonically under the updates of ADMM. Also, convergence to stationary points is not guaranteed. In fact, we will see in the numerical experiments that ADMM does not converge in some cases.

3.3 Primal-Dual Approach

In [50], Yanez and Bach proposed a first-order primal-dual (PD) algorithm for KL NMF (1). PD employs the primal-dual method proposed in [7] to tackle the convex subproblems in the columns of H and rows of W; see (7). To improve the performance of the primal-dual method, PD uses an automatic heuristic selection for the step sizes; see [50] for the details. PD is a heuristic algorithm, and it does not guarantee the monotonicity of the objective.

3.4 A Cyclic Coordinate Descent Method

In [28], Hsieh and Dhillon propose to solve KL NMF (1) using a cyclic coordinate descent (CCD) method. CCD alternately updates the scalars \(W_{ik}\) and \(H_{kj}\). The subproblems in one variable are solved by a Newton method, in which a full Newton step is used, without line search. Again, there is no convergence guarantee for this algorithm, although the Newton method used to approximately solve the one-variable subproblems may have some convergence guarantee, if properly tuned. In fact, we realized that there is a gap in the proof that the Newton method without line search converges to the global minimum of the scalar subproblems [28, Theorem 1]: Equation (29) therein is not correctly computed, which invalidates the proof. This observation motivated us to introduce a new scalar Newton-type algorithm in Sect. 4.2. Note that a similar algorithm was independently developed in [36], along with an R package. Another R package is available that implements both CCD and MU [6].

3.5 Two Other Algorithmic Approaches

In [34], Li, Lebanon and Park proposed a Taylor series expansion to express Bregman divergences in terms of Euclidean distances, leading to a heuristic scalar coordinate descent algorithm for solving NMF problems with Bregman divergences. For KL NMF, this approach underperforms the CCD method from [28], and hence we do not include it in the numerical experiments.

More recently, Kim, Kim and Klabjan [30] proposed a scale-invariant power iteration (SCI-PI) algorithm to solve a class of scale-invariant problems, and applied it to KL NMF (1). To establish convergence results for SCI-PI (see [30, Theorems 9 and 11]), the objective function of the scale-invariant problem needs to be twice continuously differentiable on an open set containing the set \(\{x: \Vert x\Vert =1\}\). However, the objective function of the corresponding subproblem (see [30, Lemma 14]) obtained when applying SCI-PI to KL NMF violates this condition. Therefore, the theory of SCI-PI in [30] does not apply to KL NMF.

4 New Members of the Algorithm Collection

In this section, we present two new algorithms for KL NMF, a block mirror descent method in Sect. 4.1, and a new scalar Newton-type algorithm in Sect. 4.2.

4.1 Block Mirror Descent Method

Let us first give details on the analysis of the block mirror descent (BMD) method (Sect. 4.1.1), and then apply the result to solve the perturbed KL NMF problem (2) (Sect. 4.1.2).

4.1.1 BMD Method

The standard gradient descent (GD) scheme for solving the smooth optimization problem \(\min _{x\in Q} g(x)\), with g being L-smooth, uses the following classical update

$$\begin{aligned} x^{k+1}=\arg \min _{x\in Q} \left\{ g(x^k)+\langle \nabla g(x^k),x-x^k\rangle + \frac{L}{2}\Vert x-x^k\Vert _2^2 \right\} . \end{aligned}$$
(10)

When \(g(\cdot )\) is not L-smooth but L-relative smooth to \(\kappa (\cdot )\) (see Definition 2), the GD scheme can be generalized by replacing the squared Euclidean distance in (10) with the Bregman divergence \(\mathcal B_\kappa \), leading to the following mirror descent step

$$\begin{aligned} x^{k+1}=\arg \min _{x\in Q} \{g(x^k)+\langle \nabla g(x^k),x-x^k\rangle + L\mathcal B_\kappa (x,x^k)\}. \end{aligned}$$
(11)

Let us now present BMD, which uses the above generalized GD scheme but updates the variables block by block. To this end, consider a problem of the form

$$\begin{aligned} \min _{ x \in \mathcal X} f(x), \end{aligned}$$
(12)

where f is continuously differentiable on \(\mathcal X\), \(x := (x_1,x_2,\ldots ,x_s)\) and \(\mathcal X = \mathcal X_1 \times \mathcal X_2 \times \cdots \times \mathcal X_s\) with \(\mathcal X_i\) (\(i=1,2,\ldots ,s\)) being closed convex sets. Let us also make the following assumption.

Assumption 1

  (i)

    For all \(\bar{x} \in \mathcal X\), the function \( x_i \mapsto f(\bar{ x}_1,\ldots ,\bar{ x}_{i-1},x_i,\bar{ x}_{i+1},\ldots ,\bar{ x}_s) \) is convex and relative smooth to \(\kappa _i(x_i)\) with constant \(L_i(\bar{ x}_1,\ldots ,\bar{ x}_{i-1},\bar{ x}_{i+1},\ldots ,\bar{ x}_s)\).

  (ii)

    There exist positive constants \(\underline{L}_i\) and \(\overline{L}_i\) such that

    $$\begin{aligned} \underline{L}_i \le L_i(\bar{ x}_1,\ldots ,\bar{ x}_{i-1},\bar{ x}_{i+1},\ldots ,\bar{ x}_s) \le \overline{L}_i. \end{aligned}$$

BMD for solving Problem (12) is described in Algorithm 1.

[Algorithm 1: Block mirror descent (BMD) method for solving Problem (12); its block update is referred to as (13) in the text.]

BMD described in Algorithm 1 is structurally identical to the block Bregman proximal gradient (BBPG) method presented in [46] with cyclic updates. However, BMD has an improvement over BBPG: it uses the step size \(1/L_i^{k}\), which is larger than the step size \(\rho /L_i^{k}\) used in BBPG, since \(\rho \in (0,1)\). Our convergence analysis of Algorithm 1 (Theorem 1 below; see its proof in Appendix 2) is an extension of the analysis of the primal gradient scheme (which is BMD for \(s=1\)) in [37]. It is worth noting that our result can be extended to composite optimization with an essentially cyclic update rule by using the techniques in [37, Section A.2] and [46, Section 2]. In this paper, we only present the result for BMD with cyclic updates for (12) to simplify the presentation.

Theorem 1

Suppose Assumption 1 is satisfied. Let \(\{x^k\}\) be the sequence generated by Algorithm 1. Let also \(\varPhi (x)= f (x)+ \sum _i I_{\mathcal X_i}(x_i)\), where \(I_{\mathcal X_i}\) is the indicator function of \(\mathcal X_i\). We have

  (i)

    \(\varPhi \big (x^k\big )\) is non-increasing;

  (ii)

    Suppose \(\mathcal X_i \subset \mathrm{int\, dom} \, \kappa _i\). If \(\{x^k\}\) is bounded and \(\kappa _i(x_i)\) is strongly convex on bounded subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then every limit point of \(\{x^k\}\) is a critical point of \(\varPhi \);

  (iii)

    If together with the conditions in (ii) we assume that \(\nabla \kappa _i\) and \(\nabla _i f\) are Lipschitz continuous on bounded subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then the whole sequence \(\{x^k\}\) converges to a critical point of \(\varPhi \).

Let us now apply this new algorithm and convergence result to KL NMF.

4.1.2 BMD for KL NMF

Problem (2) has the form of Problem (12). Proposition 4 allows us to apply BMD to solve (2), where the blocks of variables are the columns of H and the rows of W. Using the notation of Proposition 4, we have \( \nabla _h D(v|Wh)=\sum _{i=1}^m \big (1-\frac{v_i}{(Wh)_i}\big ) W_{i,:}, \) where \(W_{i,:}\) is the i-th row of W, and \(\mathcal B_\kappa (h,h^k)=\sum _{j=1}^r \Big (\frac{h_j}{(h^k)_j}-\log \frac{h_j}{(h^k)_j} -1\Big ). \) The following proposition provides the closed-form solution of the mirror descent step; see its proof in [3, Section 5.2].

Proposition 8

Using the notation of Proposition 4, the problem

$$\begin{aligned} \min _{h\ge \varepsilon } \Big \{ D(v|Wh^k) + \big \langle \nabla _h D(v|Wh)[h^k],h-h^k \big \rangle + L\mathcal {B}_\kappa (h,h^k)\Big \} \end{aligned}$$
(14)

has the following unique closed-form solution \(h^{k+1}\): for \(l \in [r]\),

$$\begin{aligned} \begin{aligned} \displaystyle (h^{k+1})_l&=\max \left\{ \frac{(h^k)_l}{1+\frac{1}{L} (h^k)_l\big (\sum \limits _{i=1}^mW_{il}-\sum \limits _{i=1}^m \frac{v_i W_{il}}{(Wh^k)_i}\big )},\varepsilon \right\} . \end{aligned} \end{aligned}$$
(15)

Recall that \(h\mapsto D(v|Wh)\) is \(\Vert v\Vert _{1}\)-relative smooth to \(\kappa (h)=-\sum _{j=1}^r\log h_{j} \). Together with Proposition 8, which provides a closed-form solution of the update (13), we can therefore easily apply BMD to KL NMF. Assumption 1 is satisfied, and Theorem 1 (i) implies that \(\varPhi \big (x^{k}\big )\) is non-increasing. Moreover, using a similar argument as in the proof of Proposition 1 and noting that \(\varPhi \big (x^{k}\big )\le \varPhi \big (x^{0}\big ) \), we can prove that BMD applied to KL NMF (2) with \(\varepsilon >0\) generates a bounded sequence. Together with the constraint \(x\ge \varepsilon \), we see that all conditions of Theorem 1 (iii) are satisfied. Hence, we obtain the following convergence guarantee for BMD applied to (2).

Theorem 2

Algorithm 1 applied to the KL NMF problem (2) monotonically decreases the objective function for any \(\varepsilon \ge 0\). Moreover, for \(\varepsilon >0\), the sequence generated by BMD is bounded and converges globally to a stationary point of (2).

It is worth noting that \(\nabla \kappa (h)\) is not Lipschitz continuous on bounded sets containing 0. Hence, we cannot apply Theorem 1 (iii) to the KL NMF problem (1).
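
To make the method concrete, here is a minimal Matlab sketch of one cyclic pass of BMD over the columns of H, using the closed-form update (15) with \(L=\Vert v\Vert _1\) (the rows of W are updated analogously on the transposed problem; the variable names are ours, and every column of V is assumed to be nonzero):

    % One pass of BMD over the columns of H; assumes W, H >= epsilon > 0.
    n = size(V, 2);
    for j = 1:n
        v = V(:, j);  h = H(:, j);
        L = sum(v);                                      % relative smoothness constant ||v||_1
        g = sum(W, 1)' - W' * (v ./ (W*h));              % gradient of h -> D(v|Wh)
        H(:, j) = max(h ./ (1 + (h/L) .* g), epsilon);   % mirror descent step (15)
    end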

4.2 A Scalar Newton-Type Algorithm

In this section, we propose a new scalar Newton-type (SN) algorithm that makes use of the self-concordance of the objective function (see Sect. 2.4) to guarantee that the sequence of objective function values is non-increasing, which in turn guarantees its convergence since it is bounded from below. The motivation for this new method comes from our observation that CCD does not come with any convergence guarantee, as explained in Sect. 3.4, although it performs well in many cases.

Let us first briefly review Newton methods. The unconstrained minimization of a self-concordant function can be solved efficiently by Newton methods; see [40, Section 5.2]. Tran-Dinh et al. [47] brought the spirit of Newton methods for unconstrained optimization to composite optimization problems of the form

$$\begin{aligned} \min _{x\in \mathbb R^n } \varPsi (x):=\psi (x) + \phi (x), \end{aligned}$$
(16)

where \(\psi (x)\) is a standard self-concordant function and \(\phi (x)\) is a proper, closed, convex but possibly non-smooth function. In particular, the authors propose a proximal Newton method (PNM) with the following update

$$\begin{aligned} x^{k+1}=x^k + \alpha _k d^{k}, \end{aligned}$$
(17)

where \(\alpha _k \in (0,1]\) is a step size, \(d^k=s^k-x^k\) and

$$\begin{aligned} s^k=\arg \min _x\big \{ \psi (x^k) + \langle \nabla \psi (x^k), x-x^k \rangle + \frac{1}{2} (x-x^k)^\top \nabla ^2 \psi (x^k) (x-x^k) + \phi (x)\big \}. \end{aligned}$$

Denoting \(\lambda _k=\Vert d^k\Vert _{x^k}=\big ((d^k)^\top \nabla ^2 \psi (x^k) d^k \big )^{1/2}\), it follows from [47, Theorem 6] that, when the stepsize \(\alpha _k=(1+\lambda _k)^{-1}\) is used, the PNM generates a sequence \(x^k\) satisfying \(\varPsi (x^{k+1}) \le \varPsi (x^k)-\omega (\lambda _k) \), where \(\omega (t)=t-\ln (1+t)\).

Recall that the KL objective function with respect to a scalar component of W or H is a self-concordant function, see Proposition 6. We can hence make use of the update in (17) to propose the SN method; see Algorithm 2.

[Algorithm 2: Scalar Newton-type (SN) algorithm for KL NMF.]
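
Since Algorithm 2 appears as a figure in the published layout, we illustrate the idea with a minimal sketch of the damped proximal Newton update (17), with \(\phi \) the indicator of \([\varepsilon ,\infty )\), applied to a single entry \(t=H_{kj}\). This is our own instantiation for illustration and may differ from Algorithm 2 in its details; the variable names are ours.

    % Damped proximal Newton update (17) for the scalar t = H(k,j), with phi the
    % indicator of [epsilon, Inf).  Assumes W, H >= epsilon > 0 and V(:,j) nonzero.
    v   = V(:, j);   w = W(:, k);
    x   = W * H(:, j);                      % current column approximation (WH)(:,j)
    g1  = sum(w) - sum(v .* w ./ x);        % first derivative  psi'(t0)
    g2  = sum(v .* (w.^2) ./ (x.^2));       % second derivative psi''(t0) > 0
    t0  = H(k, j);
    s   = max(epsilon, t0 - g1/g2);         % minimizer of the quadratic model over [epsilon, Inf)
    d   = s - t0;                           % direction d^k
    lam = abs(d) * sqrt(g2);                % local norm lambda_k
    H(k, j) = t0 + d / (1 + lam);           % damped step alpha_k = 1/(1 + lambda_k)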

The following proposition proves that SN monotonically decreases the objective function. The proof is provided in Appendix 3.

Proposition 9

The objective function of the perturbed KL NMF problem (2) is non-increasing under the updates of Algorithm 2.

4.3 A Hybrid SN-MU Algorithm

We recall that the MU possess an important property, namely that (W, H) is scaled after any MU update; see Proposition 7. On the other hand, KKT points of Problem (1) are also scaled. However, SN does not possess this scale-invariance property. Hence, we propose to combine SN with MU, resulting in a hybrid SN-MU algorithm; see the sketch below. Specifically, we alternately run several SN steps before scaling the iterate by running one or several MU updates. In Sect. 5, we use 10 steps of SN followed by one step of MU for all numerical experiments. As we will see, this combination sometimes significantly improves the performance of SN.
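
Schematically, the hybrid scheme used in Sect. 5 alternates as follows (a sketch; sn_pass and mu_update are placeholders of our own for one full SN pass over all entries of W and H and one MU iteration (8), respectively):

    % Hybrid SN-MU: 10 SN passes followed by one MU iteration (which rescales (W,H)),
    % repeated until the allotted time is exhausted.  Requires a prior call to tic.
    while toc < maxtime
        for it = 1:10
            [W, H] = sn_pass(V, W, H, epsilon);    % placeholder: one scalar Newton pass
        end
        [W, H] = mu_update(V, W, H, epsilon);      % placeholder: one MU iteration (8)
    end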

5 Experiments

In this section, we report comparisons of the KL NMF algorithms listed in Table 1.

Table 1 Properties of KL NMF algorithms presented in this paper, applied on an m-by-n matrix V to be approximated with a rank-r approximation WH. The parameter d is the number of inner iterations

The second column of Table 1 provides the complexity of one iteration, that is, of updating all entries of (W, H). The parameter d in the second column for PD, CCD, SN and SN-MU is the number of inner iterations within one main iteration of updating (W, H). The third column indicates whether the corresponding algorithm has some convergence guarantee for its generated sequence. We note that, for Problem (2) with \(\varepsilon >0\), MU guarantees subsequential convergence while BMD guarantees global convergence. The fourth column indicates whether the sequence of objective function values is non-increasing.

5.1 Implementation

We have implemented MU and BMD in Matlab, and SN in C++ using a mex file to run it from Matlab, as for the CCD code provided by the authors (for which we fixed an issue in the maintenance of \((WH)_{ij}\); otherwise it sometimes runs into numerical issues, generating NaN objective function values, because \((WH)_{ij}\) could take negative values). We use the Matlab code provided by the authors for ADMM and PD. We used the programming language best suited to each algorithm. For example, if CCD were implemented in Matlab, it would be extremely slow as it loops over each variable (and Matlab is very inefficient at handling loops). On the other hand, the MU run faster in Matlab because the main computational cost resides in matrix–matrix multiplications, for which Matlab is more efficient than C++. All tests are performed using Matlab R2018a on a laptop with an Intel CORE i7-8550U CPU @1.8GHz and 16GB RAM. The code is available at https://github.com/LeThiKhanhHien/KLNMF. We set the penalty parameter of ADMM equal to 1 in all experiments. In each run for a data set, we use the same random initialization and the same maximal running time for all algorithms.

We use the Matlab commands \(W=rand(m,r)\) and \(H=rand(r,n)\) to generate a random initial point; to avoid initial points with a large value of D(V, WH), we then scale W and H via \(W=\sqrt{\alpha } W\), \( H=\sqrt{\alpha } H\), where \(\alpha =\frac{\sum _{i,j} V_{ij}}{\sum _{i,j}(WH)_{ij}}\); see Definition 1 and Proposition 3. We define the relative error \(\mathrm{rel} \, D(V,WH)\) as the objective D(V, WH) divided by \(\sum _{i,j} V_{ij}\log \frac{V_{ij}}{(\sum _j V_{ij})/n}\), and denote by E(t) the value of \(\mathrm{rel} \, D(V,WH) - e_{\min }\) at time t, where \(e_{\min }\) is the smallest value of \(\mathrm{rel} \, D(V,WH)\) produced by all algorithms and all initializations within the allotted time. Hence E(t) goes to zero for the best run among all algorithms and initializations.

5.2 Experiments with Synthetic Data Sets

For each type of synthetic data sets, we generate 10 random matrices V, then for each random V, we generate 10 random initial points.

5.2.1 Low-Rank Synthetic Data Sets

We consider several types of low-rank synthetic data sets depending on the parameter \(\ell \in (0,1]\), which is the density of the underlying factors, denoted \(W^*\) and \(H^*\). We use \(\ell = 1,0.9,0.3\). More precisely, to generate a low-rank synthetic data set V, we use the Matlab commands \(W^* = sprand(m,r,\ell )\) and \(H^* = sprand(r,n,\ell )\), where \(\ell \) is the density of nonzero elements (that is, \(1-\ell \) is the proportion of zero elements), and let \(V = W^*H^*\). We either keep \(V = W^*H^*\) as is, which is the noiseless case, or generate each entry of V following a Poisson distribution of parameter \((W^*H^*)_{ij}\), as described in Sect. 1.1, which is a noisy case and is achieved with the Matlab command \(V = poissrnd(W^*H^*)\).

The results of applying the different algorithms to such matrices are reported in Fig. 1 for 200-by-200 matrices with \(r=10\), and in Fig. 2 for 500-by-500 matrices with \(r=20\). We report the evolution of the median value of E(t) over time. Although this is not an ideal choice, comparing the performance in terms of iterations would be worse since the cost of one iteration can be rather different for each algorithm; for example, CCD has inner iterations, which is not the case of MU.

We also report the average and standard deviation (std) of the relative errors over 200 runs for the 6 types of synthetic data sets (100 runs for each size, \(200\times 200\) or \(500\times 500\)) in Table 2, and provide a ranking of the different algorithms over the 1200 runs in Table 3: the ith entry of the ranking vector indicates how many times the corresponding algorithm obtained the ith best solution (that is, the ith lowest objective function value). This table allows us to see which algorithms perform best on average on these data sets.

Fig. 1

Median value of the error measure E(t) on \(200\times 200\) low-rank matrices for various KL NMF algorithms (see Table 1). The left column corresponds to noiseless input matrices, the right column to noisy matrices using the Poisson distribution. The top row corresponds to dense factors \((\ell = 1)\), the middle row to slightly sparse factors \((\ell = 0.9)\), and the bottom row to sparse factors \((\ell = 0.3)\)

Fig. 2

Median value of the error measure E(t) on \(500\times 500\) low-rank matrices for various KL NMF algorithms (see Table 1). The left column corresponds to noiseless input matrices, and the right column to noisy matrices using the Poisson distribution. The top row corresponds to dense factors \((\ell = 1)\), the middle row to slightly sparse factors \((\ell = 0.9)\), and the bottom row to sparse factors \((\ell = 0.3)\)

Table 2 Average error and standard deviation over 200 different runs for each type of low-rank synthetic data sets (100 runs for each size, \(200\times 200\) or \(500\times 500\)). The middle column corresponds to noiseless input matrices, the right column to noisy matrices using the Poisson distribution. The first subtable corresponds to dense factors \((\ell = 1)\), the second subtable to slightly sparse factors \((\ell = 0.9)\), and the third subtable to sparse factors \((\ell = 0.3)\)
Table 3 Ranking among 1200 different runs for low-rank synthetic data sets

For these low-rank synthetic data sets, let us discuss the behaviour of the various algorithms:

  • ADMM is not stable: it diverges in many cases. Although the reported results are for ADMM with the penalty parameter \(\varrho =1\), we also tried other values of \(\varrho \) and the algorithm still did not converge. ADMM has the largest number of worst solutions (503 out of 1200). Note however that it also has a high number of best solutions (301 out of 1200), because it performs well in the simple scenario where the factors W and H are dense and there is no noise. In summary, ADMM is unstable but, when it converges, it provides good solutions.

  • CCD performs very well, and is among the best in most cases. When looking at Table 3, we observe that CCD has average rankings, having most of its solutions ranked third to fifth (out of 7). However, it has the second lowest average error, right after SN-MU.

  • SN monotonically decreases the objective function, as proved in Proposition 9. However, in some cases, it may converge rather slowly. Its rankings are well distributed, hence it performs close to the average.

  • SN-MU improves on SN and performs well, being among the best algorithms in all cases. In fact, it obtained the lowest objective function value among all algorithms in 343 of the 1200 experiments, it produced the worst solution in only 8 of the 1200 runs, and it has the lowest relative error on average.

  • MU performs well, on average better than the other algorithms (it produces the worst solution in only 7 of the 1200 tests). It has a low relative error on average, ranked third, right behind SN-MU and CCD.

  • BMD converges very slowly (it provides a solution among the three best ones in only 3 cases out of 1200). Although it is the only algorithm with a global convergence guarantee (Theorem 2), this comes at the expense of slow convergence.

  • PD performs well, although it provides the second worst solution in many cases (293 out of 1200).

In summary, SN-MU performs on average the best, followed by the MU, CCD, PD, and SN. ADMM does not always converge but can produce good solutions. BMD has strong convergence guarantees but converges very slowly. However, there is no clear winner, and, depending on the types of data sets, some algorithms might perform better than others.

5.2.2 Full-Rank Synthetic Data Sets

We generate a full-rank synthetic data set V by the Matlab command \(V=rand(m,n)\). Results for full-rank synthetic \(200 \times 200\) (with \(r=10\)) and \(500 \times 500\) (with \(r=20\)) data sets are reported in Fig. 3. We also report the average, the standard deviation of the relative errors and the ranking vector over 200 runs (100 runs for each size) in Table 4.

Fig. 3

Median value of E(t) on Full-rank synthetic data sets: left—\(200 \times 200\), right—\(500 \times 500\)

Table 4 Average error, standard deviation and ranking among 200 different runs for full-rank synthetic data sets

We observe that the behavior can be quite different than in the low-rank cases. In particular,

  • CCD now clearly performs best in terms of convergence speed and average relative error.

  • The MU, CCD and SN-MU perform well, while BMD and SN perform relatively poorly (they never produce the best solution).

  • ADMM never converges, and produces the worst solution in all cases.

5.3 Experiments with Real Data Sets

We report in this section experiments on various widely used real data sets that are summarized in Table 5. We use \(r=10\) for all real data sets.

Table 5 Real data sets

5.3.1 Audio Data Sets

For each audio data set, we generate 30 random initial points. We report the evolution of the median of E(t) in Fig. 4, and report the average, the standard deviation of the relative errors and the ranking vector over 120 runs (30 runs for each audio data set) in Table 6.

Fig. 4

Median value of E(t) on audio data sets: top left—mary, top right—prelude JSB, bottom left—ShanHur sunrise, bottom right—voice cell

Table 6 Average error, standard deviation and ranking among 120 different runs for audio data sets

As for the full-rank synthetic data sets, ADMM diverges while SN and BMD converge slowly. However, for these data sets, MU outperforms the other algorithms, followed by CCD, SN-MU, and PD.

5.3.2 Image Data Sets

As for the audio data sets, we generate 30 random initial points. We report the results in Fig. 5 and Table 7.

Fig. 5

Median value of E(t) on image data sets: left—cbclim, right—ORLfaces

Table 7 Average error, standard deviation and ranking among 60 different runs for the 2 image data sets

As for the full-rank and audio data sets, ADMM diverges, and SN and BMD converge slowly. CCD outperforms the other algorithms, followed by MU, SN-MU and PD (in that order).

5.3.3 Document Data Sets

For each document data set, we generate 10 random initial points and record the final relative errors (the reason for using only 10 initializations is that these data sets are rather large, and the computational time is high: we used 500 seconds per run, as shown in Table 5). We report the average, the standard deviation of the final relative errors and the ranking vector over 60 runs (10 runs for each document data set) in Table 8.

Table 8 Average error, standard deviation and ranking among 60 different runs on the document data sets

We observe that CCD performs best, followed by SN-MU, in terms of the average relative error.

Performance profiles. Fig. 6 reports the performance profiles of the experiments on synthetic (left) and real (right) data sets. It displays the performance of each algorithm as a function of \(\rho \ge 0\). For a given value of \(\rho \), we define the performance of an algorithm as follows

$$\begin{aligned} \mathrm{performance} (\rho ) = \frac{\sharp \big \{ \text {solution } (W,H) \mid \mathrm{rel} \, D(V,WH) - \mathrm{rel} \, D(V,W^*H^*) \le \rho \big \}}{\sharp \mathrm{runs}}, \end{aligned}$$
(18)

where a solution (W, H) is the final pair obtained by the algorithm within the total allotted time, and \((W^*,H^*)\) is the best solution obtained among all algorithms using the same initialization. Hence, for example, performance(0) aggregates the values of the rankings at the first position provided in Tables 3 and 4 for the synthetic data sets, and in Tables 6, 7 and 8 for the real data sets.
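
Given the final relative errors of all runs, the performance profile (18) is straightforward to compute; a Matlab sketch is given below (names ours), where relerr is a matrix whose entry (q, a) contains \(\mathrm{rel} \, D(V,WH)\) obtained by algorithm a on run q.

    % Performance profiles (18): one curve per algorithm.
    best = min(relerr, [], 2);                 % best solution per run, i.e., rel D(V,W*H*)
    rho  = linspace(0, 1, 200);                % range of rho
    perf = zeros(numel(rho), size(relerr, 2));
    for a = 1:size(relerr, 2)
        for t = 1:numel(rho)
            perf(t, a) = mean(relerr(:, a) - best <= rho(t));
        end
    end
    plot(rho, perf);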

Performance profiles allow us to compare the algorithms meaningfully over different instances [12], that is, different matrices and initializations in our case. Looking at the curves at \(\rho \) equal to zero, we observe the percentage of times each algorithm was able to obtain the best solution. The right part of the curve, as \(\rho \) increases, reflects the robustness of an algorithm, that is, the percentage of times it was able to obtain a solution close to the best solution found among all algorithms. In all cases, the higher the curve, the better.

Fig. 6

Performance profile for all experiments; see (18). On the left: synthetic data set, on the right: real data sets

Figure 6 confirms our observations: CCD, MU and SN-MU perform the best. On synthetic data sets, SN-MU provides the best solutions in most cases (left part of the left figure), while on real data sets, MU and CCD perform better (left part of the right figure). In terms of robustness, the three algorithms are comparable: their curves get closer together as \(\rho \) increases. Note that, as we have observed as well, PD performs on par with CCD, MU and SN-MU on synthetic data sets, while it is less effective for real data sets as it does not reach a performance of 1 even for \(\rho = 1\).

5.4 Conclusions of the Numerical Experiments

Surprisingly, for KL NMF, the behaviour of the algorithms can be highly dependent on the input data. For example, CCD performs best for images and documents, while MU performs best for audio data sets. To the best of our knowledge, this has not been reported in the literature. As far as we know, most papers focus only on a few numerical examples, introducing an undesirable bias towards certain algorithms. It is interesting to note that for Fro NMF, such different behaviors depending on the input data have not been reported in the literature, despite numerous studies. The reason is most likely that KL NMF is a more difficult optimization problem, for which the subproblems in W and H, although convex, do not have Lipschitz-continuous gradients.

The main take-home message of our experiments is that CCD, SN-MU and MU appear to be the most reliable algorithms for KL NMF, performing among the best in most scenarios.

6 Conclusion

In this paper, we have presented important properties of KL NMF that are useful to analyze algorithms (Sect. 2). Then, we reviewed existing algorithms and proposed three new algorithms: a block mirror descent (BMD) method with a global convergence guarantee, a scalar Newton-type (SN) algorithm that monotonically decreases the objective function, and a hybridization between SN and MU. Finally, in Sect. 5, we performed extensive numerical experiments on synthetic and real data sets. Although no KL NMF algorithm clearly outperforms the others, it appears that CCD, MU and SN-MU provide the best results on average.