
2.1 Introduction

There are many kinds of real-world data given by non-negative values, such as power spectra, pixel values and count data. In a way similar to multivariate analysis techniques such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA), decomposing non-negative data into the sum of the underlying components can be useful in many situations: For example, if we can extract the power spectra of the underlying sources in a mixture signal, they may be useful for noise reduction and source separation. If we can decompose face images into components corresponding to facial features such as the eyes, nose and mouth, they may be useful for face recognition, identification and synthesis. If we can decompose the word histograms of text documents into components associated with latent topics such as politics, sport and economy, they may be useful for document indexing and retrieval. Similarly, if we can extract patterns reflecting users’ preferences from purchase logs, they may be useful for making recommendations. A multivariate analysis technique enabling the decomposition of non-negative data into non-negative components is called Non-negative Matrix Factorization (NMF) [1]. In this chapter, I will mention some basic properties of NMF, how to derive an iterative algorithm for NMF, and some attempts that have been made to apply NMF and its variants to audio processing problems.

2.2 What Is NMF?

In the following, we will represent data by vectors. For image data, each pixel value will correspond to a single element of the data vector. For power spectrum data, the power at each frequency point will correspond to a single element of the data vector. Let us assume that we are given a set of N non-negative data vectors \({\varvec{y}}_1,\ldots ,{\varvec{y}}_N\in \mathbb {R}^{\ge 0,K}\). We refer to each of them as an observed vector. Here, \(\mathbb {R}^{\ge 0,K}\) denotes the set of all K-dimensional non-negative vectors. The aim of NMF is to decompose each of \({\varvec{y}}_1,\ldots ,{\varvec{y}}_N\) into the sum of M non-negative components, that is, to find linear combinations of M basis vectors \({\varvec{h}}_1,\ldots ,{\varvec{h}}_M\in \mathbb {R}^{\ge 0,K}\) that best approximate \({\varvec{y}}_1,\ldots ,{\varvec{y}}_N\):

$$\begin{aligned} {\varvec{y}}_n \simeq \sum _{m=1}^{M} {\varvec{h}}_{m}u_{m,n} ~~(n=1,\ldots ,N), \end{aligned}$$
(2.1)

subject to the non-negativity constraints on both the basis vectors \({\varvec{h}}_m\) and the coefficients \(u_{m,n}\). Here, it is important to note that the observed data are assumed to be quantities that are additive in nature. Although neither a pixel value nor a power spectrum is strictly an additive quantity, we must be aware of the fact that when applying NMF, the additivity of the data of interest will be implicitly assumed to hold, regardless of whether this assumption is true or only approximately true. The non-additivity of power spectra will be discussed in detail in Sect. 2.7. In addition to the additivity assumption as regards the data, the non-negativity constraint is one of the most important features of NMF. As explained later, the non-negativity constraint contributes to inducing sparsity of both the basis vectors and the coefficients.

Now, if we let \({\varvec{Y}} = [{\varvec{y}}_1, \ldots , {\varvec{y}}_N]=(y_{k,n})_{K\times N}\), \({\varvec{H}} = [{\varvec{h}}_{1},\ldots ,{\varvec{h}}_{M}]=(h_{k,m})_{K\times M}\) and \({\varvec{U}} = (u_{m,n})_{M\times N}\), Eq. (2.1) can be rewritten as \({\varvec{Y}}\simeq {\varvec{HU}}\). NMF can thus be seen as a problem of factorizing an observed matrix into the product of two non-negative matrices, which gives NMF its name. To understand NMF intuitively, see Fig. 2.1 for an example of NMF applied to the spectrogram of an audio signal, interpreted as a non-negative matrix.
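To make the matrix notation concrete, here is a minimal sketch using NumPy and scikit-learn (the matrix sizes, the random data and the choice of \(M=3\) are arbitrary placeholder assumptions, not values from this chapter). Note that scikit-learn names the two factors W and components_, which play the roles of \({\varvec{H}}\) and \({\varvec{U}}\) here.

```python
import numpy as np
from sklearn.decomposition import NMF

K, N, M = 64, 200, 3                       # data dimension, number of data vectors, number of bases
rng = np.random.default_rng(0)
Y = rng.random((K, N))                     # non-negative data matrix Y (K x N)

model = NMF(n_components=M, init="random", random_state=0, max_iter=500)
H = model.fit_transform(Y)                 # basis matrix H, shape (K, M)
U = model.components_                      # coefficient matrix U, shape (M, N)
print(np.linalg.norm(Y - H @ U))           # reconstruction error ||Y - HU||_F
```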

Fig. 2.1 NMF applied to the spectrogram of an audio signal. Each column of \({\varvec{H}}\) and each row of \({\varvec{U}}\) can be interpreted as a spectral template and the corresponding temporal activation, respectively. \({\varvec{HU}}\) can thus be viewed as the sum of the spectral templates scaled by time-varying amplitudes

2.3 Basic Properties of NMF

The number M of basis vectors is usually set smaller than the dimension K and the number N of data vectors. This is because when \(M\ge K\) or \(M \ge N\), there are trivial solutions to the factorization \({\varvec{Y}}={\varvec{HU}}\). For example, when \(M=K\), we have \({\varvec{Y}}={\varvec{IU}}\) and when \(M=N\), we have \({\varvec{Y}}={\varvec{HI}}\), where \({\varvec{I}}\) denotes an identity matrix. Obviously, neither of these decompositions provides information about the latent components underlying the data. When \(M<\min (K,N)\), the factorization amounts to approximating the data matrix using a lower rank matrix, which provides meaningful information about the latent components. Geometrically, while PCA (singular value decomposition) tries to find a linear subspace to which observed vectors belong, NMF can be interpreted as finding a convex cone (see Fig. 2.2) that is closest to the entire set of observed vectors. The number M of basis vectors corresponds to the dimension of the convex cone, which depends on the data and is usually unknown. Thus, determining M is an important issue in NMF. Recent techniques for determining M will be mentioned in Sect. 2.8.

Fig. 2.2 Geometric understanding of NMF. Because of the non-negativity of \({\varvec{H}}\), all basis vectors lie in the first quadrant. Because of the non-negativity of \({\varvec{U}}\), \({\varvec{H}}{\varvec{u}}_n\) can only cover the area enclosed by the extended lines of all the basis vectors. Thus, NMF can be interpreted as finding a convex cone that is closest to the entire set of observed vectors

With NMF, the elements of the coefficient matrix \({\varvec{U}}\) tend to become sparse as a side effect of the non-negativity constraint. The intuitive reason for this can be explained as follows. First, let us consider an unconstrained optimization problem \(\hat{{\varvec{u}}} = \mathop {\mathrm{argmin}}\limits _{{\varvec{u}}} \mathcal {D}({\varvec{y}}|{\varvec{H}}{\varvec{u}})\), where \(\mathcal {D}(\cdot |\cdot )\) is a measure of the difference between two vectors. \({\varvec{H}}\hat{{\varvec{u}}}\) corresponds to the point closest to \({\varvec{y}}\) in the subspace spanned by \({\varvec{h}}_1,\ldots ,{\varvec{h}}_M\). If \(\mathcal {D}\) is defined as the \(\ell _2\) norm, for example, this point simply corresponds to the orthogonal projection of \({\varvec{y}}\) onto the subspace. Now, let us denote the solution to this optimization problem under the non-negativity constraint by \(\tilde{{\varvec{u}}}\). Except in the coincidental case where the unconstrained optimal solution \(\hat{{\varvec{u}}}\) satisfies the non-negativity constraint, \({\varvec{H}}\tilde{{\varvec{u}}}\) will be the point closest to \({\varvec{y}}\) within the convex cone shown in Fig. 2.2, namely a point on the boundary of the cone. This means that at least one of the elements of the coefficient vector becomes 0. Therefore, the constrained optimal solution \(\tilde{{\varvec{u}}}\) becomes relatively sparser (with a larger number of zero entries) than the unconstrained optimal solution \(\hat{{\varvec{u}}}\). This explains why NMF tends to produce sparse representations. It is important to note that sparsity is related to statistical independence (non-Gaussianity). Thus, roughly speaking, producing sparse representations implies that the rows of the coefficient matrix tend to become mutually uncorrelated. The above property also applies to the transposition of \({\varvec{Y}}\simeq {\varvec{HU}}\), i.e., \({\varvec{Y}}^{\mathsf T}\simeq {\varvec{U}}^{\mathsf T}{\varvec{H}}^{\mathsf T}\), meaning that \({\varvec{H}}\) also tends to become sparse owing to the non-negativity constraint on \({\varvec{H}}\).
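The sparsity-inducing effect described above can be observed numerically. The following small experiment is a sketch assuming SciPy is available; the random matrix sizes are arbitrary. It compares the unconstrained least-squares solution with the non-negativity-constrained one computed by scipy.optimize.nnls, which typically contains exact zeros.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
K, M = 20, 10
H = rng.random((K, M))                        # non-negative basis vectors (columns)
y = rng.random(K)                             # observed vector

u_ls, *_ = np.linalg.lstsq(H, y, rcond=None)  # unconstrained least-squares solution
u_nn, _ = nnls(H, y)                          # non-negativity-constrained solution

print("zero entries (unconstrained):", int(np.sum(np.isclose(u_ls, 0.0))))
print("zero entries (non-negative) :", int(np.sum(np.isclose(u_nn, 0.0))))
```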

2.4 NMF Algorithms

2.4.1 Positive Matrix Factorization and NMF

The original concept of NMF was first introduced by Paatero and Tapper in 1994 [2]. Within their formulation, they used the Frobenius norm of \({\varvec{Y}}-{\varvec{HU}}\) as a measure of the difference between \({\varvec{Y}}\) and \({\varvec{HU}}\) and a logarithmic barrier function

$$\begin{aligned} {B}({\varvec{H}},{\varvec{U}})&= -\sum _{k,m} \log h_{k,m} -\sum _{m,n} \log u_{m,n} \end{aligned}$$
(2.2)

as a penalizing term for violations of the non-negativity constraint, which approaches infinity as \( h_{k,m}\) or \(u_{m,n}\) approaches zero. They proposed a gradient-based optimization algorithm for minimizing the cost function defined as a weighted sum of these two terms.

Because of the property of the logarithmic barrier function, the elements of the matrices given by this method must always be positive (they never become zero). Thus, it is usually called “Positive Matrix Factorization (PMF)”, which distinguishes it from NMF. Several years later, Lee and Seung proposed an iterative scheme called the multiplicative update algorithm, which ensures the non-negativity of \({\varvec{H}}\) and \({\varvec{U}}\) without using barrier functions [1]. Owing to the simplicity of its implementation, NMF has subsequently gained considerable momentum in a wide range of research areas.

2.4.2 Divergence Measures

NMF leads to different optimization problems according to the definition of the measure of the difference between \({\varvec{Y}}\) and \({\varvec{HU}}\). Lee and Seung have proposed deriving NMF algorithms using the Frobenius norm and the generalized Kullback-Leibler (KL) divergence (also known as the I divergence) [3] as the goodness-of-fit criteria. Of course, the optimal values of \({\varvec{H}}\) and \({\varvec{U}}\) depend on the choice of these criteria. It is desirable that the goodness-of-fit criterion be set according to the underlying generative process of the data \({\varvec{Y}}\). For example, the Itakura-Saito (IS) divergence is often used as the model-fitting criterion for NMF when it is applied to power spectrum data [4, 5]. This is actually based on an assumption about the generative process of time-domain signals (as explained in Sect. 2.7.3).

For \(y,x\in \mathbb {R}\), the Euclidean distance (squared error), the generalized KL divergence and the IS divergence of x from y are defined as

$$\begin{aligned} \mathcal {D}_\mathrm{EU}(y|x)&= (y-x)^2, \end{aligned}$$
(2.3)
$$\begin{aligned} \mathcal {D}_\mathrm{KL}(y|x)&= y\log \frac{y}{x}-y+x, \end{aligned}$$
(2.4)
$$\begin{aligned} \mathcal {D}_\mathrm{IS}(y|x)&= \frac{y}{x}-\log \frac{y}{x}-1, \end{aligned}$$
(2.5)

respectively. All of these measures become 0 only when \(x=y\) and increase monotonically as x moves away from y. Figure 2.3 shows the graph of each of these measures seen as a function of x. While the Euclidean distance is symmetric about \(x=y\), the generalized KL divergence and the IS divergence are asymmetric and impose larger penalties when x is below y than when x is above y. It is also important to note that the IS divergence is invariant under the scaling of x and y since it is represented using only the ratio of x to y. By using these measures, we can measure the difference between \({\varvec{HU}}\) and \({\varvec{Y}}\) with

$$\begin{aligned} D_{\cdot }({\varvec{H}},{\varvec{U}}) = \sum _{k,n}\mathcal {D}_{\cdot }\Big (y_{k,n}\Big |\sum _{m}h_{k,m}u_{m,n}\Big ), \nonumber \end{aligned}$$

where \(\cdot \) indicates \(\text {EU}\), \(\text {KL}\) or \(\text {IS}\).
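For reference, the three element-wise measures of Eqs. (2.3)–(2.5) can be written directly in code as follows (a sketch assuming strictly positive y and x so that the logarithms are defined).

```python
import numpy as np

def d_eu(y, x):
    return (y - x) ** 2                   # Eq. (2.3): squared error

def d_kl(y, x):
    return y * np.log(y / x) - y + x      # Eq. (2.4): generalized KL (I) divergence

def d_is(y, x):
    return y / x - np.log(y / x) - 1.0    # Eq. (2.5): Itakura-Saito divergence

y, x = 2.0, 0.5
print(d_eu(y, x), d_kl(y, x), d_is(y, x))  # each becomes 0 only when x == y
```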

Fig. 2.3 Graph of \(\mathcal {D}_\mathrm{EU/KL/IS}(y|x)\) as a function of x

2.4.3 Auxiliary Function Approach

The goal of NMF is to find optimal values for \({\varvec{H}}\) and \({\varvec{U}}\) that minimize one of these measures subject to the non-negativity constraint. Although it is usually difficult to obtain an analytical expression of the global optimum solution, a locally optimal solution can be searched for numerically using the “auxiliary function approach” (also known as the “Majorization-Minimization algorithm”) [7, 25]. As explained later, the auxiliary function approach makes it possible to locally minimize an objective function by iteratively minimizing an auxiliary function whose minimum with respect to the auxiliary variables is exactly equal to the objective function value. It should be noted that the Expectation-Maximization (EM) algorithm [8], a popular technique for maximum likelihood estimation from incomplete data, is a special case of this approach.

In NMF, the non-negativity constraint must be considered. If the objective function were given as the sum of individual terms, each relating to one matrix element, solving the constrained optimization problem would be relatively simple. But of course this is not the case. If we can use such a function as an auxiliary function, the constrained optimization problem of NMF can be solved in an iterative manner using the auxiliary function approach.

The definition of the auxiliary function and the principle of the auxiliary function approach are as follows.

Fig. 2.4 Sketch of the process of the auxiliary function method

Definition 2.1

Given an objective function \(D(\theta )\) with the parameter \(\theta =\{\theta _i\}_{1\le i\le I}\), \(G(\theta ,\alpha )\) is defined as an auxiliary function of \(D(\theta )\) if it satisfies

$$\begin{aligned} D(\theta )=\min _{\alpha } G(\theta ,\alpha ), \end{aligned}$$
(2.6)

where we refer to \(\alpha \) as auxiliary variables.

Theorem 2.1

\(D(\theta )\) is non-increasing under the updates:

$$\begin{aligned} \alpha&\leftarrow \mathop {\mathrm{argmin}}\limits _{\alpha }G(\theta ,\alpha ),\end{aligned}$$
(2.7)
$$\begin{aligned} \theta _i&\leftarrow \mathop {\mathrm{argmin}}\limits _{\theta _i}G(\theta ,\alpha )~~(i=1,\ldots ,I). \end{aligned}$$
(2.8)

Proof

Let us set \(\theta \) at an arbitrary value \(\theta ^{(\ell )}\) and define

$$\begin{aligned} \alpha ^{(\ell +1)}= \mathop {\mathrm{argmin}}\limits _{\alpha } G(\theta ^{(\ell )},\alpha ),~~ \theta ^{(\ell +1)}= \big \{ \mathop {\mathrm{argmin}}\limits _{{\theta }_i} G(\theta ,\alpha ^{(\ell +1)}) \big \}_{1\le i\le I}. \end{aligned}$$
(2.9)

First, it is obvious that \(D(\theta ^{(\ell )})=G(\theta ^{(\ell )},\alpha ^{(\ell +1)})\). Next, we can confirm that \(G(\theta ^{(\ell )},\alpha ^{(\ell +1)}) \ge G(\theta ^{(\ell +1)},\alpha ^{(\ell +1)})\). By definition, it is clear that \(G(\theta ^{(\ell +1)},\alpha ^{(\ell +1)})\ge D(\theta ^{(\ell +1)})\) and so we can finally show that \(D(\theta ^{(\ell )})\ge D(\theta ^{(\ell +1)})\). A sketch of this proof can be found in Fig. 2.4.

2.4.4 NMF Algorithm with Euclidean Distance

By employing the principle of the auxiliary function approach, we first derive an NMF algorithm using \(D_\mathrm{EU}({\varvec{H}},{\varvec{U}})\) as the goodness-of-fit criterion. By using \(\displaystyle \mathop {=}^{z}\) to denote equality up to a term independent of z, we can write \(D_\mathrm{EU}({\varvec{H}},{\varvec{U}})\) as

$$\begin{aligned}&D_\mathrm{EU}({\varvec{H}},{\varvec{U}}) \mathop {=}^{H,U} \sum _{k,n} ( -2y_{k,n}x_{k,n}+ x_{k,n}^2), \end{aligned}$$
(2.10)

where

$$\begin{aligned} x_{k,n}=\sum _m h_{k,m}u_{m,n}. \end{aligned}$$
(2.11)

We want to design an auxiliary function such that the matrix elements are separated into individual terms. Note that \(x_{k,n}^2\) is a term involving \(h_{k,1},\ldots ,h_{k,M}\) and \(u_{1,n},\ldots ,u_{M,n}\). Since a quadratic function is convex, we can employ Jensen’s inequality to construct a desired auxiliary function.

Theorem 2.2

(Jensen’s inequality for convex functions with non-negative arguments (Fig. 2.5)) For an arbitrary convex function g with I non-negative arguments \(z_1,\ldots ,z_I\), we have

$$\begin{aligned} g\Big (\sum _i z_i\Big ) \le \sum _i \lambda _i g\left( \frac{z_i}{\lambda _i}\right) , \end{aligned}$$
(2.12)

where \(\lambda _1,\ldots ,\lambda _I\) are non-negative weights satisfying \(\sum _i \lambda _i =1\). Equality in this inequality holds when

$$\begin{aligned} \lambda _i = \frac{z_i}{\sum _j z_j}. \end{aligned}$$
(2.13)
Fig. 2.5 Jensen's inequality for functions with non-negative arguments for the \(I = 2\) case

Since \(h_{k,m} u_{m,n} \ge 0\), we can apply this to \(x_{k,n}^2\)

$$\begin{aligned} x_{k,n}^2&\le \sum _m \lambda _{k,m,n}\bigg (\frac{h_{k,m}u_{m,n}}{\lambda _{k,m,n}}\bigg )^2, \end{aligned}$$
(2.14)

where \(\lambda _{k,m,n}\ge 0\), \(\sum _m \lambda _{k,m,n} = 1\). Here, we notice that the right-hand side of this inequality is given as the sum of terms each relating to \(h_{k,m}\) and \(u_{m,n}\). It is also important to note that the equality holds when \(\frac{h_{k,1} u_{1,n}}{\lambda _{k,1,n}} =\cdots =\frac{h_{k,M} u_{M,n}}{\lambda _{k,M,n}}\), namely

$$\begin{aligned} \lambda _{k,m,n} = \frac{h_{k,m}u_{m,n}}{x_{k,n}}. \end{aligned}$$
(2.15)

Hence, the function obtained by replacing the term \(x_{k,n}^2\) in \(D_\mathrm{EU}({\varvec{H}},{\varvec{U}})\) with the right-hand side of Eq. (2.14)

$$\begin{aligned}&G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }}) = \sum _{k,n} \bigg (y_{k,n}^2 -2y_{k,n} \sum _m h_{k,m}u_{m,n} + \sum _m \frac{h_{k,m}^2u_{m,n}^2}{\lambda _{k,m,n}} \bigg ) \end{aligned}$$
(2.16)

satisfies the requirement of an auxiliary function for \(D_\mathrm{EU}({\varvec{H}},{\varvec{U}})\). Here, \({\varvec{\lambda }}=\{\lambda _{k,m,n}\}_{K\times M\times N}\). By using \(G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }})\), we can develop an iterative algorithm for locally minimizing \(D_\mathrm{EU}({\varvec{H}},{\varvec{U}})\), that consists of performing

$$\begin{aligned} {\varvec{\lambda }}&\leftarrow \mathop {\mathrm{argmin}}\limits _{{\varvec{\lambda }}}G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }}), \end{aligned}$$
(2.17)
$$\begin{aligned} {\varvec{H}}&\leftarrow \mathop {\mathrm{argmin}}\limits _{{\varvec{H}}}G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }}),~~ {\varvec{U}}\leftarrow \mathop {\mathrm{argmin}}\limits _{{\varvec{U}}}G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }}). \end{aligned}$$
(2.18)

First, Eq. (2.17) is given as Eq. (2.15) as mentioned above. Next, Eq. (2.18) must be solved subject to non-negativity. \(G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }})\) is a quadratic function of each matrix element \(h_{k,m}\), which can be minimized when

$$\begin{aligned} \hat{h}_{k,m} = \frac{ \displaystyle \sum _n y_{k,n} u_{m,n} }{ \displaystyle \sum _n {u_{m,n}^2}/{\lambda _{k,m,n}} }. \end{aligned}$$
(2.19)

In the same way, \(G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }})\) can be minimized with respect to \({u}_{m,n}\) when

$$\begin{aligned} \hat{u}_{m,n} = \frac{ \displaystyle \sum _k y_{k,n} h_{k,m} }{ \displaystyle \sum _k {h_{k,m}^2}/{\lambda _{k,m,n}} }. \end{aligned}$$
(2.20)

If these values become negative, the minimizers of \(G_\mathrm{EU}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }})\) within the non-negativity constraint will obviously be \(h_{k,m}=0\) and \({u}_{m,n}=0\). Thus, Eq. (2.18) is given as \(h_{k,m} = \max \{\hat{h}_{k,m},0\}\) and \(u_{m,n} = \max \{\hat{u}_{m,n},0\}\). Note, however, that when all the elements of \({\varvec{H}}\), \({\varvec{U}}\) and \({\varvec{\lambda }}\) are non-negative, both (2.19) and (2.20) necessarily become non-negative. Hence, if the initial values of \({\varvec{H}}\) and \({\varvec{U}}\) are set at non-negative values, \(h_{k,m}\) and \(u_{m,n}\) will always be updated to non-negative values. In such a situation, the update equations for \(h_{k,m}\) and \(u_{m,n}\) can be written simply as \(h_{k,m}=\hat{h}_{k,m}\) and \(u_{m,n}=\hat{u}_{m,n}\). By substituting Eq. (2.17) into Eq. (2.18), we obtain the following algorithm:

[Figure: multiplicative update equations for \(h_{k,m}\) and \(u_{m,n}\) under the Euclidean distance]

Since each variable is updated by multiplying the value at the previous iteration by a non-negative factor, this kind of algorithm is often referred to as a “multiplicative update algorithm” [1].
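In matrix notation, substituting Eq. (2.15) into Eqs. (2.19) and (2.20) gives \(h_{k,m}\leftarrow h_{k,m}\frac{\sum _n y_{k,n}u_{m,n}}{\sum _n x_{k,n}u_{m,n}}\) and the analogous update for \(u_{m,n}\). The following NumPy sketch implements these updates; the small constant eps is an added assumption that guards against division by zero and is not part of the derivation above.

```python
import numpy as np

def nmf_euclidean(Y, M, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for NMF with the Euclidean distance."""
    rng = np.random.default_rng(seed)
    K, N = Y.shape
    H = rng.random((K, M)) + eps
    U = rng.random((M, N)) + eps
    for _ in range(n_iter):
        X = H @ U
        H *= (Y @ U.T) / (X @ U.T + eps)   # h_km <- h_km * sum_n y_kn u_mn / sum_n x_kn u_mn
        X = H @ U
        U *= (H.T @ Y) / (H.T @ X + eps)   # u_mn <- u_mn * sum_k y_kn h_km / sum_k x_kn h_km
    return H, U
```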

2.4.5 NMF Algorithm with I Divergence

An NMF algorithm using \(D_\mathrm{KL}({\varvec{H}},{\varvec{U}})\) as a goodness-of-fit criterion can be derived in a similar way. \(D_\mathrm{KL}({\varvec{H}},{\varvec{U}})\) is equal up to a constant term to

$$\begin{aligned}&D_\mathrm{KL}({\varvec{H}},{\varvec{U}}) \mathop {=}^{H,U} \sum _{k,n} ( - y_{k,n}\log x_{k,n} + x_{k,n}). \end{aligned}$$
(2.21)

Here, \(-y_{k,n}\log x_{k,n}\) is a nonlinear term involving \(h_{k,1},\ldots ,h_{k,M}\) and \(u_{1,n},\ldots ,u_{M,n}\). By using the fact that a negative logarithmic function is convex and \(h_{k,m}u_{m,n}\ge 0\), we can apply Theorem 2.2

$$\begin{aligned} {-\log x_{k,n}}&\le -\sum _m \lambda _{k,m,n}\log \bigg ( \frac{h_{k,m}u_{m,n}}{\lambda _{k,m,n}} \bigg ) \nonumber \end{aligned}$$

to construct a desired auxiliary function, from which we obtain the following algorithm:

[Figure: multiplicative update equations for \(h_{k,m}\) and \(u_{m,n}\) under the I divergence]
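For reference, a NumPy sketch of the resulting multiplicative updates for the I divergence is given below (eps is again an added safeguard against division by zero).

```python
import numpy as np

def nmf_i_divergence(Y, M, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for NMF with the I (generalized KL) divergence."""
    rng = np.random.default_rng(seed)
    K, N = Y.shape
    H = rng.random((K, M)) + eps
    U = rng.random((M, N)) + eps
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        X = H @ U
        H *= ((Y / (X + eps)) @ U.T) / (ones @ U.T + eps)   # sum_n (y/x) u  /  sum_n u
        X = H @ U
        U *= (H.T @ (Y / (X + eps))) / (H.T @ ones + eps)   # sum_k (y/x) h  /  sum_k h
    return H, U
```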

2.4.6 NMF Algorithm with IS Divergence

Here, we show an NMF algorithm using the IS divergence as a goodness-of-fit criterion developed by the author in 2006 [9]. By omitting the terms that do not depend on \({\varvec{H}}\) and \({\varvec{U}}\), \(D_\mathrm{IS}({\varvec{H}},{\varvec{U}})\) is written as

$$\begin{aligned} D_\mathrm{IS}({\varvec{H}},{\varvec{U}}) \mathop {=}^{H,U} \sum _{k,n} \bigg ( \frac{y_{k,n}}{x_{k,n}} + \log x_{k,n} \bigg ). \end{aligned}$$
(2.22)

In a way similar to that described in the previous subsection, we want to design an auxiliary function such that the matrix elements are separated into individual terms. First, by using the fact that the reciprocal function is convex on a positive half-axis, \(h_{k,m}u_{m,n}\ge 0\) and \(y_{k,n}\ge 0\), we can apply Theorem 2.2 to the term \(1/x_{k,n}\)

$$\begin{aligned} \frac{1}{x_{k,n}}&\le \sum _m \lambda _{k,m,n} \bigg (1\bigg /\frac{h_{k,m}u_{m,n}}{\lambda _{k,m,n}}\bigg ), \end{aligned}$$
(2.23)

where \(\lambda _{k,m,n}\) is a positive weight satisfying \(\lambda _{k,m,n}>0\) and \(\sum _m \lambda _{k,m,n} = 1\). Next, let us focus on the term \(\log x_{k,n}\). Since the positive logarithmic function is concave (not convex), the strategy using Jensen’s inequality cannot be used. However, we can apply a different inequality as described below. Given a differentiable concave function g, we can show that a tangent line to g at an arbitrary tangent point \(\alpha \in \mathbb {R}\) lies entirely above the graph of g, namely for all \(x\in \mathbb {R}\),

$$\begin{aligned} g(x)\le g(\alpha ) + (x-\alpha )g'(\alpha ). \end{aligned}$$
(2.24)

Obviously, the equality of this inequality holds if and only if \(\alpha =x\). By applying this to \(\log x_{k,n}\), we obtain

$$\begin{aligned} \log x_{k,n} \le \log \alpha _{k,n} + \frac{1}{\alpha _{k,n}} (x_{k,n}-\alpha _{k,n}), \end{aligned}$$
(2.25)

where \(\alpha _{k,n}\) is an arbitrary real number. It is important to note that the right-hand side of this inequality is given as a first order function of the matrix elements. Hence, the function obtained by replacing the terms \(1/x_{k,n}\) and \(\log x_{k,n}\) in \(D_\mathrm{IS}({\varvec{H}},{\varvec{U}})\) with the right-hand sides of Eqs. (2.23) and (2.25), such that

$$\begin{aligned} G_\mathrm{IS}({\varvec{H}},{\varvec{U}},{\varvec{\lambda }},{\varvec{\alpha }}) \!=&\sum _{k,n} \bigg ( \sum _m \! \frac{y_{k,n}\lambda _{k,m,n}^2}{h_{k,m} u_{m,n}} \!+\! \sum _m \! \frac{h_{k,m} u_{m,n}}{\alpha _{k,n}} -\log y_{k,n} +\log \alpha _{k,n}-2 \bigg ), \end{aligned}$$
(2.26)

satisfies the requirement of an auxiliary function for \(D_\mathrm{IS}({\varvec{H}},{\varvec{U}})\) [9]. Note that the equalities of Eqs. (2.23) and (2.25) hold if and only if

$$\begin{aligned} \lambda _{k,m,n} = \frac{h_{k,m}u_{m,n}}{x_{k,n}},~~ \alpha _{k,n} =x_{k,n}. \end{aligned}$$
(2.27)

By applying Theorem 2.1 and deriving each update equation, we obtain the following algorithm:

[Figure: multiplicative update equations for \(h_{k,m}\) and \(u_{m,n}\) under the IS divergence]
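Minimizing \(G_\mathrm{IS}\) with respect to \(h_{k,m}\), with \({\varvec{\lambda }}\) and \({\varvec{\alpha }}\) set as in Eq. (2.27), yields an update of the form \(h_{k,m}\leftarrow h_{k,m}\Big (\frac{\sum _n y_{k,n}u_{m,n}/x_{k,n}^2}{\sum _n u_{m,n}/x_{k,n}}\Big )^{1/2}\), and similarly for \(u_{m,n}\). A NumPy sketch of these updates is shown below; the algorithm box itself is not reproduced here, and eps is an added safeguard.

```python
import numpy as np

def nmf_is_divergence(Y, M, n_iter=200, eps=1e-12, seed=0):
    """MM-derived multiplicative updates for NMF with the IS divergence (note the 1/2 exponent)."""
    rng = np.random.default_rng(seed)
    K, N = Y.shape
    H = rng.random((K, M)) + eps
    U = rng.random((M, N)) + eps
    for _ in range(n_iter):
        X = H @ U + eps
        H *= np.sqrt(((Y / X**2) @ U.T) / ((1.0 / X) @ U.T + eps))
        X = H @ U + eps
        U *= np.sqrt((H.T @ (Y / X**2)) / (H.T @ (1.0 / X) + eps))
    return H, U
```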

2.4.7 NMF Algorithm with \(\beta \) Divergence

The three divergence measures given in Eqs. (2.3)–(2.5) can be described in a unified manner using a criterion called the \(\beta \) divergence [10]

$$\begin{aligned} \mathcal {D}_{\beta }(y|x)= y \frac{y^{\beta -1}-x^{\beta -1}}{\beta -1} - \frac{y^{\beta }-x^{\beta }}{\beta }, \end{aligned}$$
(2.28)

where \(\beta \) is a real number such that \(\beta \ne 0\) and \(\beta \ne 1\). By using the fact that \(\textstyle \lim _{\beta \rightarrow 0}({x^\beta -y^\beta })/{\beta }=\log ({x}/{y})\), it can be confirmed that Eq. (2.28) reduces to the IS divergence when \(\beta \rightarrow 0\), the I divergence when \(\beta \rightarrow 1\) and the Euclidean distance when \(\beta = 2\). Here, we show a generalized NMF algorithm using the \(\beta \) divergence as a goodness-of-fit criterion, which we previously developed [11]. The first term \(y(y^{\beta -1}-x^{\beta -1})/(\beta -1)\) of Eq. (2.28) is convex in x when \(\beta \le 2\) and is concave otherwise. On the other hand, the second term \(-(y^{\beta }-x^{\beta })/\beta \) is concave in x when \(\beta \le 1\) and is convex otherwise. In a way similar to the idea of [9], we can construct an auxiliary function by applying Eq. (2.12) to the convex term and Eq. (2.24) to the concave term. By using this auxiliary function, we can derive update equations given in closed form in the same way as in the previous subsections. The NMF algorithm derived using this idea is summarized as follows:

[Figure: multiplicative update equations for \(h_{k,m}\) and \(u_{m,n}\) under the \(\beta \) divergence]

It can be readily verified that the above algorithm reduces to the multiplicative update algorithms with the IS divergence, the I divergence and the Euclidean distance presented in Sects. 2.4.6, 2.4.5 and 2.4.4 when \(\beta =0,1,2\), respectively.
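For reference, the following NumPy sketch implements a \(\beta \)-divergence multiplicative update of the MM type. The exponent used below (1/(2−β) for β<1, 1 for 1≤β≤2 and 1/(β−1) for β>2) is an assumption based on the standard MM derivation and is consistent with the reductions stated above (1/2 for the IS divergence, 1 for the I divergence and the Euclidean distance); the exact algorithm box of the chapter is not reproduced here.

```python
import numpy as np

def nmf_beta_divergence(Y, M, beta=0.5, n_iter=200, eps=1e-12, seed=0):
    """MM-type multiplicative updates for NMF with the beta divergence."""
    if beta < 1.0:
        e = 1.0 / (2.0 - beta)
    elif beta <= 2.0:
        e = 1.0
    else:
        e = 1.0 / (beta - 1.0)
    rng = np.random.default_rng(seed)
    K, N = Y.shape
    H = rng.random((K, M)) + eps
    U = rng.random((M, N)) + eps
    for _ in range(n_iter):
        X = H @ U + eps
        H *= (((Y * X**(beta - 2)) @ U.T) / (X**(beta - 1) @ U.T + eps)) ** e
        X = H @ U + eps
        U *= ((H.T @ (Y * X**(beta - 2))) / (H.T @ X**(beta - 1) + eps)) ** e
    return H, U
```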

2.5 Interpretation of NMF as Generative Model

2.5.1 \(\beta \) Divergence Versus Tweedie Distribution

The optimization problems of NMF with the Euclidean distance, I divergence, IS divergence and \(\beta \) divergence are equivalent to the problems of the maximum likelihood estimation of \({\varvec{H}}\) and \({\varvec{U}}\), where each element \(y_{k,n}\) of \({\varvec{Y}}\) is assumed to have been generated independently from the normal distribution, Poisson distribution, exponential distribution and Tweedie distribution with mean \(x_{k,n}\)

$$\begin{aligned} y_{k,n}&\sim \mathcal {N}(y_{k,n};x_{k,n},\sigma ^2), \end{aligned}$$
(2.29)
$$\begin{aligned} y_{k,n}&\sim \mathrm{Poisson}(y_{k,n};x_{k,n}),\end{aligned}$$
(2.30)
$$\begin{aligned} y_{k,n}&\sim \mathrm{Exponential}(y_{k,n};x_{k,n}),\end{aligned}$$
(2.31)
$$\begin{aligned} y_{k,n}&\sim \mathrm{Tweedie}(y_{k,n};x_{k,n},\phi ), \end{aligned}$$
(2.32)

respectively, where

$$\begin{aligned}&\mathcal {N}(z;\mu ,\sigma ^2) = \textstyle \frac{1}{\sqrt{2\pi \sigma ^2}}e^{-(z-\mu )^2/2\sigma ^2},\end{aligned}$$
(2.33)
$$\begin{aligned}&\mathrm{Poisson}(z;\mu )= \mu ^{z}e^{-\mu }/z!~~(z\ge 0),\end{aligned}$$
(2.34)
$$\begin{aligned}&\mathrm{Exponential}(z;\mu ) =\textstyle \frac{1}{\mu } e^{-z/\mu }~~(z\ge 0),\end{aligned}$$
(2.35)
$$\begin{aligned}&\mathrm{Tweedie}(z;\mu ,\phi ) = a(z,\phi )e^{\frac{1}{\phi }(z\rho (\mu )-\kappa (\mu ))},\\&~~~~~~\rho (\mu )= {\left\{ \begin{array}{ll} \frac{\mu ^{\beta -1}-1}{\beta -1}&{}\!\!\!(\beta \ne 1)\\ \log \mu &{}\!\!\!(\beta = 1) \end{array}\right. }, ~ \kappa (\mu )= {\left\{ \begin{array}{ll} \frac{\mu ^{\beta }-1}{\beta }&{}\!\!\!(\beta \ne 0)\\ \log \mu &{}\!\!\!(\beta =0). \end{array}\right. } \nonumber \end{aligned}$$
(2.36)

This can be confirmed as follows. All the log-likelihoods \(L(x_{k,n})=\log p(y_{k,n}|x_{k,n})\) defined by Eqs. (2.29)–(2.32) are maximized when \(x_{k,n}=y_{k,n}\). Thus, \(L(y_{k,n})\ge L(x_{k,n})\). Hence, the log-likelihood differences \(L(y_{k,n}) - L(x_{k,n})\) can be regarded as non-negative measures of the dissimilarity between \(y_{k,n}\) and \(x_{k,n}\) that become 0 only when \(x_{k,n}=y_{k,n}\). We can see that the log-likelihood differences \(L(y_{k,n}) - L(x_{k,n})\) for Eqs. (2.29)–(2.32) are equal, up to positive factors that do not depend on \(x_{k,n}\) (namely \(1/2\sigma ^2\) and \(1/\phi \)), to Eqs. (2.3)–(2.5) and (2.28), respectively.
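The correspondence can be checked numerically. The sketch below evaluates only the x-dependent part of each log-likelihood and compares the gap \(L(y)-L(x)\) with the divergences, up to the scale factors \(1/2\sigma ^2\) and \(1/\phi \) noted above; the values of y, x, \(\sigma ^2\), \(\beta \) and \(\phi \) are arbitrary.

```python
import numpy as np

y, x, sigma2, beta, phi = 3.0, 1.5, 0.7, 0.5, 1.3

ll_normal      = lambda m: -(y - m) ** 2 / (2 * sigma2)          # Eq. (2.29), x-dependent part
ll_poisson     = lambda m: y * np.log(m) - m                     # Eq. (2.30), x-dependent part
ll_exponential = lambda m: -np.log(m) - y / m                    # Eq. (2.31), x-dependent part
rho   = lambda m: (m ** (beta - 1) - 1) / (beta - 1)
kappa = lambda m: (m ** beta - 1) / beta
ll_tweedie = lambda m: (y * rho(m) - kappa(m)) / phi             # Eq. (2.32), x-dependent part

d_eu   = (y - x) ** 2
d_kl   = y * np.log(y / x) - y + x
d_is   = y / x - np.log(y / x) - 1
d_beta = y * (y ** (beta - 1) - x ** (beta - 1)) / (beta - 1) - (y ** beta - x ** beta) / beta

print(ll_normal(y) - ll_normal(x),           d_eu / (2 * sigma2))
print(ll_poisson(y) - ll_poisson(x),         d_kl)
print(ll_exponential(y) - ll_exponential(x), d_is)
print(ll_tweedie(y) - ll_tweedie(x),         d_beta / phi)
```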

2.5.2 Bregman Divergence Versus Natural Exponential Family

As we have seen in the four examples above, an assumption regarding the divergence measure for a certain model-fitting problem is associated with a probability density function assumption regarding the observed data. In this subsection, I show that the class of probabilistic distributions belonging to the natural exponential family is associated with the class of goodness-of-fit criteria called the Bregman divergence and that the \(\beta \) divergence is a special case of the Bregman divergence. In the following, I will omit the subscripts kn for simplicity and assume that an element y of the observed matrix follows a probability distribution belonging to the exponential family

$$\begin{aligned} y \sim \exp \big \{\eta T(y) - \psi (\eta ) + c(y)\big \}, \end{aligned}$$
(2.37)

where \(\psi \) is an infinitely differentiable, strictly convex function. \(\eta \) is called a natural parameter and is a function of the parameters characterizing the distribution. Here, we consider the case \(T(y) = y\), whose distribution class is called the natural exponential family.

First, we introduce the Legendre transform of \(\psi \)

$$\begin{aligned} \phi (z) = \max _{v} (vz - \psi (v)). \end{aligned}$$
(2.38)

Since \(\psi \) is a convex function, \(\phi \) is also convex owing to the property of the Legendre transform. By using \(v^*\) to denote the value of v that maximizes \(vz - \psi (v)\), \(v^*\) satisfies \((vz - \psi (v))^\prime = 0\), namely

$$\begin{aligned} \psi ^\prime (v^*) = z. \end{aligned}$$
(2.39)

Next, by using the fact that the cumulant generating function of \(y \sim \exp \big \{\eta y - \psi (\eta ) + c(y)\big \}\) is given as \(K(t) = \log \mathbb {E}[e^{yt}]=\psi (t + \eta )-\psi (\eta )\), we can write \(x:=\mathbb {E}[y] = K'(0)\) as

$$\begin{aligned} x = \psi ^\prime (\eta ). \end{aligned}$$
(2.40)

Since \(\psi \) is a convex function, \(\psi ^{\prime }\) is a one-to-one function. Thus, there is a one-to-one correspondence between \(\eta \) and x. By comparing Eq. (2.40) with Eq. (2.39), we can show that \(\eta = \mathop {\mathrm{argmax}}\limits _{v}(vx - \psi (v))\). \(\phi (x)\) can thus be written as

$$\begin{aligned} \phi (x) = \eta (x)x - \psi (\eta (x)). \end{aligned}$$
(2.41)

Note that here \(\eta \) is written as \(\eta (x)\) to emphasize that it is a function of x. Here, by differentiating both sides of Eq. (2.41) with respect to x, we have

$$\begin{aligned} \phi ^{\prime }(x)&= \eta (x) + \eta ^{\prime }(x)x - \psi ^{\prime }(\eta (x))\eta ^{\prime }(x). \end{aligned}$$
(2.42)

By plugging Eq. (2.40) into Eq. (2.42), the second and third terms cancel each other out, thus resulting in \(\phi ^{\prime }(x) = \eta (x)\).

By substituting the two relationships \(\phi (x) = \eta x - \psi (\eta )\) and \(\phi ^{\prime }(x) = \eta (x)\) given above into the probability density function of the natural exponential family \(\exp \{ \eta y - \psi (\eta ) + c(y) \} = \exp \{ \eta x - \psi (\eta ) + \eta (y - x) + c(y)\}\), we obtain

$$\begin{aligned} p(y|x) = \exp \{ \phi (x) + \phi ^{\prime }(x) (y - x) + c(y)\}. \end{aligned}$$
(2.43)

Here, it is important to note that the log-likelihood of x, \(L(x)=\log p(y|x)=\phi (x) + \phi ^{\prime }(x) (y - x) + c(y)\), is maximized when \(x=y\), since \((\phi (x) + \phi ^{\prime }(x) (y - x))^{\prime } = \phi ^{\prime \prime }(x) (y - x)\). Thus, \(L(y)\ge L(x)\). Hence, the log-likelihood difference \(L(y) - L(x)\)

$$\begin{aligned} \mathcal {D}_{\phi }(y|x) = \phi (y) - \phi (x) - \phi ^{\prime }(x) (y - x), \end{aligned}$$
(2.44)

can be regarded as a non-negative measure of the dissimilarity between x and y that becomes 0 only when \(x=y\). This measure is called the Bregman divergence [12]. As shown in Fig. 2.6, \(D_{\phi }(y|x)\) corresponds to the difference, evaluated at y, between the convex function \(\phi \) and its tangent line at point x. We can see from this figure that \(D_{\phi }(y|x)\) is always non-negative and that \(D_{\phi }(y|x)\) becomes 0 only when x and y are equal.

Fig. 2.6 Bregman divergence

The \(\beta \) divergence introduced in Sect. 2.4.7 is a special case of the Bregman divergence with [13]

$$\begin{aligned} \phi (x) = {\left\{ \begin{array}{ll} {-{\log } x} + x -1 &{}(\beta = 0)\\ x\log x - x + 1&{}(\beta = 1)\\ \frac{x^{\beta }}{\beta (\beta -1)} - \frac{x}{\beta - 1} + \frac{1}{\beta }&{}(\mathrm{otherwise}). \end{array}\right. } \end{aligned}$$
(2.45)

Thus, the Euclidean distance, I divergence, and IS divergence, which are special cases of the \(\beta \) divergence, are also special cases of the Bregman divergence.
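A minimal sketch of Eq. (2.44) is given below, assuming \(\phi \) and its derivative are supplied as plain Python callables; with the \(\beta =0\) case of Eq. (2.45) it reproduces the IS divergence.

```python
import numpy as np

def bregman(y, x, phi, dphi):
    """Bregman divergence of Eq. (2.44) for a convex function phi with derivative dphi."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

phi_is  = lambda x: -np.log(x) + x - 1.0     # beta = 0 case of Eq. (2.45)
dphi_is = lambda x: -1.0 / x + 1.0

y, x = 2.0, 0.5
print(bregman(y, x, phi_is, dphi_is))        # equals y/x - log(y/x) - 1
```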

An attempt was made by Dhillon and Sra to derive a multiplicative update algorithm for NMF with the Bregman divergence under a limited class of \(\phi \) [14]. However, its generalization to an arbitrary \(\phi \) has yet to be proposed.

2.6 Relation to Probabilistic Latent Semantic Analysis (pLSA)

The concept of probabilistic Latent Semantic Analysis (pLSA) [15], which is a technique that was originally developed for document clustering and indexing, is closely related to NMF. This section describes the relationship between these two techniques.

Let \(y_{k,n}\) be the number of times word k occurs in document n. The histogram of all possible words \({\varvec{y}}_n =(y_{1,n},\ldots ,y_{K,n})^{\mathsf T}\) in document n is referred to as the document data. The number of times a particular word occurs may depend heavily on the topic of the document, such as politics, economy, sports, entertainment, or culture. The aim of pLSA is to estimate topics from document data based on this dependence.

Let p(k|m) be the probability that word k occurs when the topic is m and p(m|n) be the probability that the topic of document n is m. Then, the probability p(k|n) that word k occurs in document n can be written as

$$\begin{aligned} p(k|n) = \sum _{m} p(k|m)p(m|n). \end{aligned}$$
(2.46)

By putting \(x_{k,n} = p(k|n)\), \(h_{k,m} = p(k|m)\) and \(u_{m,n}=p(m|n)\), and by arranging them in matrices \({\varvec{X}} = (x_{k,n})_{K\times N}\), \({\varvec{H}} = (h_{k,m})_{K\times M}\) and \({\varvec{U}} = (u_{m,n})_{M\times N}\), Eq. (2.46) can be written in matrix notation as \({\varvec{X}} = {\varvec{HU}}\). If each word in a set of document data is assumed to be generated independently according to the distribution p(k|n), the probability that all the document data are generated becomes \(\prod _{k,n} {p(k|n)}^{y_{k,n}}\). Since both \(h_{k,m} = p(k|m)\) and \(u_{m,n}=p(m|n)\) are unknown, the maximum likelihood estimation of \({\varvec{H}}\) and \({\varvec{U}}\) can be formulated as an optimization problem of maximizing

$$\begin{aligned} \log p({\varvec{Y}}|{\varvec{H}},{\varvec{U}}) = \sum _{k,n} y_{k,n}\log x_{k,n}, \end{aligned}$$
(2.47)

with respect to \({\varvec{H}}\) and \({\varvec{U}}\) subject to non-negativity and sum-to-one constraints: \(h_{k,m}\ge 0\), \(\sum _{k}h_{k,m} = 1\), \(u_{m,n}\ge 0\), \(\sum _{m}u_{m,n} = 1\). By comparing Eqs. (2.47) and (2.21), we notice that the above log-likelihood is exactly the negative of the first term of Eq. (2.21). Furthermore, as the second term of Eq. (2.21) can be seen as corresponding to a Lagrange multiplier term for \(x_{k,n}\), the pLSA optimization problem has the same form as that of NMF with the I divergence criterion. Indeed, it turns out that the optimization algorithm described in Sect. 2.4.5 is, up to the normalization of \({\varvec{H}}\) and \({\varvec{U}}\), equivalent to the expectation-maximization (EM) algorithm obtained by treating the topic index m as a latent variable.

As described above, the way in which the likelihood function of pLSA is defined is different from that of NMF described in Sect. 2.5. While pLSA treats \(h_{k,m}\) and \(u_{m,n}\) as probability distributions over k and m, NMF treats them as random variables. Namely, pLSA is categorized as a mixture model (a model defined as the sum of probability distributions) whereas NMF is categorized as a factor model (a model defined as the distribution of the sum of random variables). The Bayesian extension of pLSA is called latent Dirichlet allocation (LDA) [16] and the Bayesian extension of NMF with the I divergence criterion is discussed, for example, in [17].

2.7 Applications to Audio Signal Processing Problems

2.7.1 Audio Source Separation and Music Transcription

Smaragdis and Brown proposed an automatic music transcription method that uses NMF to decompose the magnitude (or power) spectrograms of music signals into spectrograms associated with individual pitches [18]. With this approach, the magnitude (or power) spectrogram of a mixture signal, interpreted as a non-negative matrix \({\varvec{Y}}\), is factorized into the product of two non-negative matrices \({\varvec{H}}\) and \({\varvec{U}}\) (See Fig. 2.1). This can in turn be interpreted as approximating the observed spectra at each time frame as a linear sum of basis spectra scaled by time-varying amplitudes, and amounts to decomposing the observed spectrogram into the sum of rank-1 spectrograms. As described in Sect. 2.3, an important feature of NMF is that its non-negativity constraint usually induces sparse representations, i.e., \({\varvec{U}}\) with a relatively large number of zero entries. This means that each observed spectrum is parsimoniously represented using only a few active basis spectra. In such situations, the sequence of observed spectra can be approximated reasonably well when each basis spectrum expresses the spectrum of an underlying audio event that occurs frequently over the entire observed range. Thus, with music signals, each basis spectrum usually becomes the spectrum of a frequently used pitch in the music piece under analysis.
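A sketch of this kind of decomposition is given below, assuming librosa is installed and that "music.wav" is a placeholder path; scikit-learn's multiplicative-update solver with beta_loss="kullback-leibler" plays the role of the I-divergence algorithm in Sect. 2.4.5, and the choice of 8 bases is arbitrary.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

sig, sr = librosa.load("music.wav", sr=None)
Y = np.abs(librosa.stft(sig, n_fft=2048, hop_length=512))   # K x N magnitude spectrogram

model = NMF(n_components=8, solver="mu", beta_loss="kullback-leibler",
            init="nndsvda", max_iter=500)
H = model.fit_transform(Y)      # basis spectra (spectral templates), shape (K, 8)
U = model.components_           # temporal activations, shape (8, N)
```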

This approach is based on two assumptions: one is that magnitude (or power) spectra are additive, and the other is that the magnitude spectrum of each sound source is constant up to scale over time (i.e., only the scale of the spectrum is time-variant). However, these assumptions do not hold in reality. This section introduces some of the attempts that have been made to develop variants of NMF that aim to relax these assumptions.

2.7.2 Complex NMF

Audio signals in the time domain (sound waves) are additive. Since typical methods for time-frequency decomposition, such as the short-time Fourier transform (STFT) and the wavelet transform, are linear, complex spectrograms of audio signals are also additive. However, since the transformation of complex spectrograms into magnitude (or power) spectrograms is nonlinear, magnitude spectrograms are non-additive. Namely, the magnitude spectrum of the sum of two waveforms is not equal to the sum of the magnitude spectra of the two waveforms. This implies that decomposing a magnitude spectrogram into the sum of additive components does not necessarily lead to an appropriate decomposition of the audio signal.

To address this shortcoming of the NMF approach, I previously proposed a framework called the “Complex NMF” [19], which makes it possible to realize NMF-like signal decompositions in the complex spectrogram domain. The key idea behind the NMF approach was to model the magnitude spectrogram of a mixture signal as the sum of rank-1 magnitude spectrograms. By contrast, the key idea behind the proposed approach is to model the complex spectrogram of a mixture signal as the sum of complex spectrograms each having a rank-1 structure in the magnitude domain. This idea can be formulated as follows.

Let \({a}_{m,k,n}\in \mathbb {C}\) denote the complex spectrogram of source m. The complex spectrogram of a mixture signal consisting of M sources is given as

$$\begin{aligned} f_{k,n}= \sum _{m=1}^{M} {a}_{m,k,n} =\sum _{m}|{a}_{m,k,n}| e^{j\phi _{m,k,n}}, \end{aligned}$$
(2.48)

where \(\phi _{m,k,n}\) denotes the phase spectrogram of source m. Here, if we assume that the magnitude spectrogram of each source has a rank-1 structure, we can write \(|{a}_{m,k,n}|=h_{k,m}u_{m,n}\). This leads to a complex spectrogram model of the form:

$$\begin{aligned} f_{k,n}&=\sum _m h_{k,m} u_{m,n} e^{j\phi _{m,k,n}}. \end{aligned}$$
(2.49)

It is important to emphasize that \(\phi _{m,k,n}\) is indexed by n, meaning that this model allows the phase spectrum of each source to vary freely over time. The aim of Complex NMF is to fit this model to an observed complex spectrogram through the estimation of H, U and \(\phi \). It should be noted that unlike NMF, this model allows the components to cancel each other out (since the real and imaginary parts of the complex spectrum of each source can take either positive or negative values), and so when there are no constraints, it does not naturally produce sparse representations. Thus, to obtain sparse representations similar to NMF, some constraint is needed to induce the sparsity of \({\varvec{U}}\). In [19], I formulated an optimization problem of minimizing

$$\begin{aligned} I(H,U,\phi ):= \sum _{k,n} \big |y_{k,n}-f_{k,n}\big |^2 + 2\gamma \sum _{m,n}|u_{m,n}|^{p}, \end{aligned}$$
(2.50)

with respect to H, U and \(\phi \) where the second term is a sparse regularization term, and derived an iterative algorithm based on the auxiliary function approach. Here, \(0<p<2\) and \(\gamma \ge 0\) are constants. The main difficulty with this optimization problem lies in the nonlinear interdependence of \(\phi _{1,k,n},\ldots ,\phi _{M,k,n}\) and the discontinuity of the gradients with respect to \(u_{m,n}\). The nonlinear interdependence of \(\phi _{1,k,n},\ldots ,\phi _{M,k,n}\) arises from the “square-of-sum” form in the first term of Eq. (2.50). To derive closed-form update equations using the auxiliary function approach in a similar way to Sect. 2.4.4, it is desirable to design an upper bound function that has a “sum-of-squares” form for this term. However, unlike Sect. 2.4.4, Theorem 2.2 cannot be applied in this case, since \(h_{k,m} u_{m,n} e^{j\phi _{m,k,n}}\) is a complex number. Instead, in [19] I proposed invoking the following inequality:

Theorem 2.3

(Jensen’s inequality for convex functions with complex arguments) For an arbitrary convex function g with complex arguments y and \(z_1,\ldots ,z_I\), we have

$$\begin{aligned} g\Big (y-\sum _i z_i\Big ) \le \sum _i \beta _i g\left( \frac{\alpha _i - z_i}{\beta _i}\right) , \end{aligned}$$
(2.51)

where \(\alpha _1,\ldots ,\alpha _I\) are complex variables satisfying \(\sum _i \alpha _i =y\) and \(\beta _1,\ldots ,\beta _I\) are positive weights satisfying \(\sum _i \beta _i = 1\). Equality in this inequality holds when

$$\begin{aligned} \alpha _i = z_i + \beta _i\Big ( y - \sum _i z_i\Big ). \end{aligned}$$
(2.52)

Proof

Since \(\sum _i \alpha _i = y\), we can write \(g(y- \sum _i z_i) = g(\sum _i (\alpha _i - z_i))\). By using arbitrary positive weights \(\beta _1,\ldots ,\beta _I\) that sum to one, we obtain

$$\begin{aligned} g\Big (\sum _i (\alpha _i - z_i)\Big )&= g\left( \sum _i \beta _i \frac{\alpha _i - z_i}{\beta _i}\right) \nonumber \\&\le \sum _i \beta _i g\left( \frac{\alpha _i - z_i}{\beta _i}\right) , \end{aligned}$$
(2.53)

where the second line follows from Jensen’s inequality. Note that equality in this inequality holds when

$$\begin{aligned} \frac{\alpha _1-z_1}{\beta _1} =\cdots = \frac{\alpha _I-z_I}{\beta _I}. \end{aligned}$$
(2.54)

Letting Z denote the value of Eq. (2.54), \(\alpha _i\) is given as \(\alpha _i = z_i + \beta _i Z\). Since \(\alpha _i\) must sum to y, i.e., \(\sum _i\alpha _i = \sum _i z_i + Z = y\), Z is given by \(Z = y- \sum _i z_i\). By substituting this into \(\alpha _i = z_i + \beta _i Z\), we finally obtain Eq. (2.52).

As for the second term of Eq. (2.50), which is non-differentiable with respect to \(u_{m,n}\), we can use the fact that, when \(0<p\le 2\),

$$\begin{aligned} |u_{m,n}|^p \le \frac{p|v_{m,n}|^{p-2}}{2}u_{m,n}^2 + \frac{2-p}{2}|v_{m,n}|^p, \end{aligned}$$
(2.55)

to construct an upper bound function. Altogether, we obtain an auxiliary function

$$\begin{aligned} I^{+}(H,U,\phi ,\alpha ,V) :=&\sum _{k,n,m} \frac{1}{\beta _{m,k,n}} \Big | \alpha _{m,k,n}-h_{k,m}u_{m,n} e^{\mathrm{j}\phi _{m,k,n}} \Big |^2 \nonumber \\&\,\,+ \gamma \sum _{m,n} \Big \{ {p|v_{m,n}|^{p-2}}{u_{m,n}}^2 +(2-p)|v_{m,n}|^p \Big \}, \end{aligned}$$
(2.56)

which has a “sum-of-squares” form. Here, \(\beta _{m,k,n}\) is a positive weight that can be set arbitrarily subject to \(\sum _m \beta _{m,k,n}=1\). \(\alpha _{m,k,n}\) and \(v_{m,n}\) are auxiliary variables satisfying \(\sum _m \alpha _{m,k,n} = y_{k,n}\). By using this, we can develop a convergence-guaranteed iterative algorithm with closed-form update equations.
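To make the model concrete, the following sketch evaluates the Complex NMF model of Eq. (2.49) and the objective of Eq. (2.50) for given \({\varvec{H}}\), \({\varvec{U}}\) and \(\phi \); only the forward model and the cost are shown, not the iterative updates, and the array shapes are assumptions for illustration.

```python
import numpy as np

def complex_nmf_objective(Ycplx, H, U, Phi, gamma=0.1, p=1.0):
    """Evaluate Eq. (2.50).

    Ycplx : observed complex spectrogram, shape (K, N)
    H     : magnitude basis, shape (K, M);  U : activations, shape (M, N)
    Phi   : source-wise phase spectrograms, shape (M, K, N)
    """
    # Eq. (2.49): f_kn = sum_m h_km u_mn exp(j phi_mkn)
    F = np.einsum("km,mn,mkn->kn", H, U, np.exp(1j * Phi))
    fit = np.sum(np.abs(Ycplx - F) ** 2)
    penalty = 2.0 * gamma * np.sum(np.abs(U) ** p)
    return fit + penalty
```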

2.7.3 Itakura-Saito NMF

Although the additivity of power spectra does not generally hold as mentioned above, it holds in the expectation sense if the signals are assumed to be samples drawn independently from stochastic processes.

If each underlying source signal in a mixture signal within a short-term segment is assumed to have been generated from a zero-mean circularly-stationary Gaussian process, each frequency component of the discrete Fourier transform of that segment independently follows a zero-mean complex normal distribution. Namely, if we let \(s_{m,k,n}\) be a component of frequency k of source signal m within segment n (i.e., the complex spectrogram of source m), \(s_{m,k,n}\) follows a zero-mean complex normal distribution

$$\begin{aligned} s_{m,k,n}\sim \mathcal {N}_{\mathbb {C}}\big (s_{m,k,n};0,\nu _{m,k,n}\big ), \end{aligned}$$
(2.57)

with variance \(\nu _{m,k,n}\), where \(\mathcal {N}_{\mathbb {C}}(z;\mu ,\nu )=\frac{1}{\pi \nu } e^{-|z-\mu |^2/\nu }\). Note that \(\nu _{m,k,n}\) corresponds to the expectation of the power spectrogram of source m, i.e., \(\nu _{m,k,n} = \mathbb {E}[|s_{m,k,n}|^2]\). Now, if we assume that the complex spectrogram \(y_{k,n}\) of an observed signal is given as \(y_{k,n} = \sum _m s_{m,k,n}\), and that \(s_{m,k,n}\) and \(s_{m',k,n}\) \((m\ne m')\) are statistically independent, \(y_{k,n}\) also follows a zero-mean complex normal distribution

$$\begin{aligned} y_{k,n}\sim \mathcal {N}_{\mathbb {C}} \Big (y_{k,n}; 0, \sum _{m}\nu _{m,k,n}\Big ), \end{aligned}$$
(2.58)

with variance \(\sum _{m}\nu _{m,k,n}\). By putting \(x_{k,n} = \sum _m \nu _{m,k,n}\), the log-likelihood of \(x_{k,n}\) given an observation \(y_{k,n}\) can be written as

$$\begin{aligned} L(x_{k,n}) = {-\log }\pi x_{k,n} - \frac{|y_{k,n}|^2}{x_{k,n}}. \end{aligned}$$
(2.59)

Since this log-likelihood reaches its maximum only when \(x_{k,n} = |y_{k,n}|^2\), we have \(L(|y_{k,n}|^2)\ge L(x_{k,n})\). We notice that the log-likelihood difference \(L(|y_{k,n}|^2)-L(x_{k,n})\ge 0\) is actually equal to the IS divergence between \(|y_{k,n}|^2\) and \(x_{k,n}\), i.e., \(\mathcal {D}_\mathrm{IS}(|y_{k,n}|^2|x_{k,n})\). Thus, if we assume the expectation of the power spectrogram of each source to have a rank-1 structure, i.e., \(\nu _{m,k,n} = h_{k,m}u_{m,n}\), the maximum likelihood estimation of \({\varvec{H}} = (h_{k,m})_{K\times M}\) and \({\varvec{U}} =(u_{m,n})_{M\times N}\) is equivalent to the problem of NMF with the IS divergence criterion applied to the observed power spectrogram \(|y_{k,n}|^2\) [4, 5].

2.7.4 NMF with Time-Varying Bases

When applying NMF to music spectrograms, we may expect the magnitude spectrum of a single musical note produced by an instrument to be represented using a single basis spectrum scaled by time-varying amplitudes. However, its variation in time is actually much richer. For example, a piano note would be more accurately characterized by a succession of several basis spectra corresponding to, for example, “attack,” “decay,” “sustain” and “release” segments. As another example, singing voices and string instruments have a particular musical effect, vibrato, which can be characterized by its “depth” (the range of pitch variation), and its “speed” (the rate at which the pitch varies). Several variants of NMF have been proposed to represent time-varying spectra by introducing the concept of time-varying bases [20–22].

One approach involves extending NMF to a convolutional version, which finds a decomposition of \({\varvec{Y}}\) as

$$\begin{aligned} y_{k,n}\simeq x_{k,n} = \sum _m \sum _l h_{k,m,l}u_{m,n-l}, \end{aligned}$$
(2.60)

where \(h_{k,m,l}\) can be interpreted as the local time-frequency pattern of the mth audio event and \(u_{m,n}\) represents its temporal activation. Since the problem at hand is to decompose the convolutive mixture, this approach is called “non-negative matrix factor deconvolution (NMFD)” [20].
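A minimal sketch of the convolutional model of Eq. (2.60) is given below, assuming basis patterns of shape (K, M, L) and activations of shape (M, N); only the forward model is computed, not the NMFD update rules.

```python
import numpy as np

def nmfd_model(H, U):
    """x_kn = sum_m sum_l H[k, m, l] * U[m, n - l] (terms with n - l < 0 are dropped)."""
    K, M, L = H.shape
    _, N = U.shape
    X = np.zeros((K, N))
    for l in range(L):
        U_shift = np.zeros_like(U)
        U_shift[:, l:] = U[:, :N - l]      # delayed activations U[m, n - l]
        X += H[:, :, l] @ U_shift
    return X
```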

Fig. 2.7 Illustration of the factorial HMM approach [22]. The spectrogram of a mixture signal is modeled as the sum of the outputs emitted from multiple HMMs, each representing the spectrogram of an underlying audio event

NMFD assumes that the spectrum of each audio event evolves in time in exactly the same way every time it occurs. However, the speeds of the temporal variations are unlikely to be the same all the time. To cope with the varying speeds of the temporal variations of spectra, we proposed modeling the magnitude spectrogram of a mixture signal based on a factorial hidden Markov model (FHMM) formulation [22]. The idea is to model the spectrogram of a mixture signal as the sum of the outputs emitted from multiple HMMs, each representing the spectrogram of an underlying audio event (see Fig. 2.7). Thus, the problem is to find a decomposition of \({\varvec{Y}}\) as

$$\begin{aligned} y_{k,n} \simeq x_{k,n} = \sum _{m} h_{k,m}^{(z_{m,n})} u_{m,n}, \end{aligned}$$
(2.61)

where \(h_{k,m}^{(i)}\) denotes the basis spectrum at state i and \(z_{m,n}\in \{1,\ldots ,I_m\}\) denotes a state variable indicating which basis spectrum is activated at time n. The path of the state variables \(z_{m,1},\ldots ,z_{m,N}\) is governed by a state transition probability \(p(z_{m,n}=a|z_{m,n-1}=b)= \pi _{m,a,b}\).

2.7.5 Other NMF Variants

A number of constrained and regularized variants of NMF have been proposed specifically with the aim of solving audio source separation problems. Some examples are as follows. Virtanen proposed incorporating a temporal continuity constraint in the factorization process [23]. Raczyński proposed constraining each basis spectrum so that it had a harmonic structure [24]. Several groups (including mine) independently proposed combining the source-filter model with the NMF model [25–28]. I proposed incorporating a subprocess that involved clustering timbrally similar basis spectra in the factorization process [29].

2.7.6 Other Applications

NMF has found several interesting audio-related applications including speech enhancement [30], bandwidth extension [31], singing voice separation [32], drum sound extraction [33], formant tracking [35], echo cancellation [36], and the temporal decomposition of speech [37]. I proposed a blind dereverberation method inspired by the NMF algorithm in [27]. Multichannel extensions of NMF have been proposed independently by several groups (including mine) with the expectation that the modeling concept of NMF can also be useful for multichannel audio source separation problems [39–43].

2.8 Bayesian Nonparametric NMF

2.8.1 Determination of Basis Number

The determination of the number of bases is an important issue in NMF. Cemgil and Schmidt proposed formulating the problem of the basis number estimation for NMF as a model selection problem [17, 44]. By using \({\varvec{H}}^{(M)}\) and \({\varvec{U}}^{(M)}\) to denote the basis and coefficient matrices with M bases, the marginal likelihood can be thought of as the likelihood function of M, since

$$\begin{aligned} p({\varvec{Y}}|M) = \iint p({\varvec{Y}}|{\varvec{H}}^{(M)},{\varvec{U}}^{(M)})p({\varvec{H}}^{(M)}|M)p({\varvec{U}}^{(M)}|M) \text{ d }{\varvec{H}}^{(M)}\text{ d }{\varvec{U}}^{(M)}. \end{aligned}$$
(2.62)

As the exact marginal likelihood involves intractable integrals, some approximation of the log marginal likelihood is usually used as a criterion for model selection. The Bayesian Information Criterion (BIC) [45] and the variational Bayesian lower bound [46] are examples of such approximations. To determine the number of bases with model selection, we need to perform NMF with all possible M settings and find the best model structure by comparing the values of model selection criteria. Although this approach is indeed principled and well founded, such procedures can be time-consuming. By contrast, a framework called the Bayesian nonparametric approach allows us to avoid performing an explicit model selection procedure and instead reduce this task to a single run of a parameter inference algorithm. In the following, we briefly show some examples of attempts that have been made to apply the Bayesian nonparametric approach to NMF.

2.8.2 Beta Process NMF and Gamma Process NMF

A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional parameter space. The Bayesian nonparametric approach refers to a parameter inference framework for Bayesian nonparametric models, which makes it possible to infer the model complexity along with the model parameters by finding a minimal subset of parameters that can explain given observed data.

Bayesian nonparametric models (also known as infinite models) are typically described using stochastic processes. Up to now, many types of infinite models including infinite mixture models and infinite factor models have been proposed in the literature. For instance, infinite counterparts of mixture models, such as the Gaussian mixture model (GMM), hidden Markov model (HMM), probabilistic context-free grammar (PCFG), and probabilistic Latent Semantic Analysis (pLSA), can be constructed using a stochastic process called the Dirichlet process (DP) or its variants. While a Dirichlet distribution is a probabilistic distribution over a finite set of non-negative numbers that sum to 1, the Dirichlet process can be thought of as an extension of it to an infinite set. Thus, the Dirichlet process is a generative model of a categorical distribution (probabilities of discrete random variables) with an infinite dimension, i.e., \(\pi _1,\pi _2,\ldots \pi _{\infty }\in [0, 1]\) satisfying \(\sum _{i=1}^{\infty }\pi _i = 1\), which can be used, for example, as a prior distribution over the mixture weights of mixture models. An important property of the Dirichlet process is its sparsity-inducing effect. The categorical distributions generated from a Dirichlet process tend to become sparse. Owing to this property, we can find a minimal subset of mixture components with non-zero weights that explains given observed data through parameter inference. This is why the use of an infinite mixture model allows us to infer the adequate model complexity (the number of mixture components) from observed data. As mentioned in Sect. 2.6, pLSA can be understood as a particular case of NMF, where the number of mixture components (i.e., the latent topics) corresponds to the basis number. Thus, in a way similar to the NMF approach, it is technically possible to apply pLSA to a magnitude spectrogram by regarding it as document data, where the frequency and time indices are interpreted as the word and document indices, respectively [47], and an infinite counterpart of this approach can be constructed using a Dirichlet process [48]. On the other hand, infinite counterparts of factor models, such as NMF and Independent Component Analysis (ICA), can be constructed using stochastic processes called the beta process (BP) or gamma process (GP). Simply put, the beta process is a generative model of infinitely many variables within the range [0, 1], \(\pi _1,\pi _2,\ldots ,\pi _{\infty }\in [0, 1]\), and the gamma process is a generative model of infinitely many non-negative variables, \(\theta _1,\theta _2,\ldots ,\theta _{\infty }\in [0,\infty )\). An infinite extension of NMF can be constructed using these stochastic processes. When using the beta process, we introduce a binary variable \(z_{m,n}\in \{0,1\}\) indicating whether the m-th basis exists in data n, with \(z_{m,n}=1\) if data n has a basis m and 0 otherwise. By using \(z_{m,n}\), we define \(x_{k,n}\) as \(x_{k,n} = \sum _{m=1}^{\infty } z_{m,n} h_{k,m}u_{m,n}\) and place a beta process prior \(\pi _{m,n} = p(z_{m,n}=1)\) over \(z_{1,n},\ldots ,z_{\infty ,n}\) [49, 50]. An important feature of the beta process is its sparsity-inducing effect. The variables generated from a beta process tend to become sparse (most of the variables become almost 0). Owing to this property, we can find a minimal subset of bases that explains given observed data through parameter inference. 
When using the gamma process, we introduce a non-negative variable \(\theta _m \in \mathbb {R}^{\ge 0}\) indicating the contribution made by basis m. By using \(\theta _m\), we define \(x_{k,n}\) as \(x_{k,n} = \sum _{m=1}^{\infty } \theta _m h_{k,m}u_{m,n}\), put some constraint on the scales of \(h_{k,m}\) and \(u_{m,n}\) (e.g., \(\mathbb {E}[h_{k,m}]=1\) and \(\mathbb {E}[u_{m,n}]=1\)), and place a gamma process prior over \(\theta _1,\ldots ,\theta _\infty \) [22, 51]. As with the beta process, the gamma process has a sparsity-inducing effect: most of the generated variables become almost 0, so a minimal subset of bases that explains the given observed data can again be found through parameter inference.

2.9 Summary

This chapter described some basic properties of NMF, the effects induced by the non-negativity constraints, how to derive iterative algorithms for NMF, some attempts that have been made to apply NMF to audio processing problems, and extensions to the Bayesian nonparametric framework. Readers are referred to other review articles such as [52–55] for further details.