Abstract
This chapter introduces multichannel nonnegative matrix factorization (NMF) methods for audio source separation. All the methods and some of their extensions are introduced within a more general local Gaussian modeling (LGM) framework. These methods are very attractive since they allow combining spatial and spectral cues in a joint and principled way, and they are also natural extensions and generalizations of many single-channel NMF-based methods to the multichannel case. The chapter introduces the spectral (NMF-based) and spatial models, as well as the way to combine them within the LGM framework. Model estimation criteria and algorithms are described as well, with some of them treated in greater detail.
4.1 Introduction
Nonnegative matrix factorization (NMF) [1] is a dimensionality reduction technique that consists in approximating a nonnegative data matrix (a matrix with nonnegative entries) as a product of two nonnegative matrices of lower rank than the initial data matrix. This can also be viewed as an approximation of the data matrix by a sum of a few rank-1 nonnegative matrices. It was first successfully applied to single-channel source separation [2], where the nonnegative matrix of the magnitude or power spectrogram is decomposed, and it became a state-of-the-art reference. The success of this method is mainly due to the universality of this quite simple model (it is applicable to various types of audio sources including speech [3, 4], music [2, 5], environmental sounds [6], etc.) and due to its flexibility, which allows adding various constraints, such as harmonicity of the spectral patterns [7], smoothness of their activation coefficients [2, 5], pre-trained spectral patterns [8, 9], etc.
Given the success of NMF for single-channel source separation, there were several attempts to extend it to the case of multichannel source separation. Earlier ideas relied on stacking the magnitude or power spectrograms of all channels into a 3-valence nonnegative tensor and decomposing it with nonnegative tensor factorization (NTF) methods [10] or other NTF-like nonnegative structured approximations [11, 12]. This gave some interesting results. However, since only nonnegative power spectrograms are involved, such approaches rely only on the amplitude information, while completely discarding the phases of the short time Fourier transforms (STFTs). In other words, these approaches do not allow exploiting the interchannel phase differences (IPDs), but only the interchannel level differences (ILDs). However, the IPDs may be very important for multichannel source separation, and they are indeed exploited by several clustering-based methods [13, 14]. Using IPDs becomes even more critical in the far-field case (i.e., when the distances between the microphones are much smaller than the distances between the sources and the microphones), where the information carried by the ILDs becomes almost non-discriminating.
It is clear that a fully nonnegative (e.g., NTF-like) modeling is unable to model jointly source power spectrograms, ILDs and IPDs, since the phase information is discarded in the nonnegative tensor of multichannel mixture power spectrograms. As such, it was proposed to resort to a semi-nonnegative modeling [8, 12, 15,16,17], where the latent source power spectrograms are modeled with NMF [8, 12] or NTF [15,16,17], while the mixing system is modeled differently, not with a nonnegative model. This modeling, often referred to as multichannel NMF [12] or multichannel NTF [15]Footnote 1 depending on the model of the source power spectrograms, is usually achieved via a Gaussian probabilistic modeling applied directly to the complex-valued STFTs of all channels.
The multichannel NMF modeling treats the complex-valued STFT coefficients as realizations of zero-mean circular complex-valued Gaussian random variables with structured variances (via NMF) and covariances. As a consequence, this modeling reduces to Itakura-Saito (IS) NMF in the single-channel case (see Chap. 1), thus being its natural extension to the multichannel case. Moreover, it allows integrating many other NMF-like models (see Chap. 1 and [8]) in an easy and flexible manner. Finally, it combines both spectral and spatial (including ILDs and IPDs) cues within a unified framework. When one of these two cues does not allow separating the sources efficiently, the algorithm can rely on the other cue, and vice versa. In our opinion the multichannel NMF is one of the first attempts to combine these two cues in a systematic and principled way.
4.2 Local Gaussian Model
Multichannel NMF can be formulated within a so-called local Gaussian model (LGM), which is itself more general than multichannel NMF and allows modeling and combining spatial and spectral cues in a systematic way. In its most general form the LGM may be formulated as follows. Let us first assume that we deal with a multichannel (I-channel) mixture of J sources to be separated. Assuming all the signals are converted into the STFT domain, the mixture can be written as
$$\begin{aligned} \mathbf{x}_{fn} = \sum _{j=1}^J \mathbf{y}_{jfn}, \end{aligned}$$(4.1)
where \(\mathbf{x}_{fn} = \left[ x_{1,fn}, \ldots , x_{I,fn}\right] ^T \in \mathbb C^I\) and \(\mathbf{y}_{jfn} = \left[ y_{1,jfn}, \ldots , y_{I,jfn}\right] ^T \in \mathbb C^I\) (\(j = 1, \ldots , J\)) are the channel-wise vectors of STFT coefficients of the mixture and of the j-th source spatial image,Footnote 2 respectively; and \(f = 1, \ldots , F\) and \(n = 1, \ldots , N\) are the frequency and time indices, respectively. Given the above-introduced notations, the LGM modeling [18] assumes that each source image (I-length complex-valued vector \(\mathbf{y}_{jfn}\)) is modeled as a zero-mean circular complex Gaussian random vector as follows
$$\begin{aligned} \mathbf{y}_{jfn} \sim \mathscr {N}_c \left( \mathbf{0}, v_{jfn} \mathbf{R}_{jfn} \right) , \end{aligned}$$(4.2)
where the complex-valued covariance matrix is positive definite Hermitian, and it is composed of two factors:
-
a spatial covariance \(\mathbf{R}_{jfn} \in \mathbb C^{I \times I}\) representing the spatial characteristics of the j-th source image at the time-frequency (TF) point (f, n), and
-
a spectral variance \(v_{jfn} \in \mathbb R_+\) representing the spectral characteristics of the j-th source image at the TF point (f, n).
Given the model parameters, i.e., the spatial covariances \(\mathbf{R}_{jfn}\) and the spectral variances \(v_{jfn}\), the random vectors \(\mathbf{y}_{jfn}\) in (4.2) are also assumed mutually independent in time, in frequency and between sources. Note that the LGM modeling was not first proposed in [18]; indeed, its variants were already considered in [19, 20]. However, the formulation from [18] is general enough to cover all these cases, which is why we have chosen it here.
Given the multichannel mixing equation and the above independence assumptions, the mixture STFT coefficients may be shown distributed as
$$\begin{aligned} \mathbf{x}_{fn} \sim \mathscr {N}_c \left( \mathbf{0}, \varvec{\Sigma }_{\mathbf{x},fn} \right) , \qquad \varvec{\Sigma }_{\mathbf{x},fn} = \sum _{j=1}^J v_{jfn} \mathbf{R}_{jfn}. \end{aligned}$$(4.3)
The model parameters are usually estimated in the maximum likelihood (ML) sense from the observed mixture \(\mathbf{X} = \left\{ x_{ifn} \right\} _{i,f,n}\). However, a direct ML estimation of parameters under the modeling (4.3) would lead to data overfitting, since the number of scalar parameters exceeds the number of mixture STFT coefficients. As such, various constraints are applied to both the spectral variances and the spatial covariances, as presented in detail in Sects. 4.3 and 4.4, respectively. In the case of the multichannel NMF we address in this chapter, the spectral variances are usually represented by low-rank nonnegative matrices or tensors. However, other approaches consider different models (e.g., composite autoregressive models [21], source-excitation models [8] or hidden Markov models [22]) to structure the spectral variances, which is why the LGM modeling is more general than the multichannel NMF. As discussed in Sect. 4.4 below, the spatial covariances are usually not modeled with fully nonnegative structures. This is the reason why we speak about semi-nonnegative modeling in the introduction.
For the sake of better understanding, we now give an interpretation of the spatial covariance matrix \(\mathbf{R}_{jfn}\), and relate it to the methods used for multichannel audio compression. For the sake of simplicity, and also since most audio recordings are stereo (i.e., two-channel mixtures), we consider the case of \(I = 2\). The spatial covariance matrix \(\mathbf{R}_{jfn}\) is in general a full-rank positive definite Hermitian complex-valued matrix. An example of a spatial covariance matrix is represented in Fig. 4.1. Note that this is a rather incomplete representation, since it is difficult to represent a 2-dimensional complex-valued covariance matrix on a 2-dimensional real plane.
Since the spatial covariance matrix \(\mathbf{R}_{jfn}\) is complex-valued Hermitian, it can be easily shown that in the 2-dimensional case we consider here it is uniquely encoded by only four real scalars. Indeed, its 2 diagonal entries are real and the 2 complex-valued off-diagonal entries are conjugate. These four real-valued parameters may be uniquely converted into the following, in a sense more meaningful, real-valued parameters:
-
Loudness,Footnote 3
-
ILD,
-
IPD,
-
Diffuseness, which can also be replaced by the interchannel coherence (IC) [23].
It is worth noting that the last three spatial parameters (ILD, IPD and IC) are also used for parametric coding of stereo audio [23]. This is somewhat expected: models that are suitable for compression should also be suitable for source separation, since in both cases the model tends to reduce the redundancy in the signal.
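As an illustration of this parametrization, the following sketch converts a 2 × 2 Hermitian spatial covariance into the four real parameters listed above. The definitions used (total power for loudness, a log-ratio ILD in dB, the phase of the off-diagonal entry for IPD, and the normalized off-diagonal magnitude for IC) follow common parametric-stereo conventions; the exact conventions of [23] may differ, and all numerical values are purely illustrative.

```python
import numpy as np

def stereo_spatial_params(R):
    """Map a 2x2 Hermitian positive-semidefinite spatial covariance to
    (loudness, ILD, IPD, IC). A sketch using common parametric-stereo
    conventions; exact definitions vary across references."""
    r11, r22 = R[0, 0].real, R[1, 1].real
    r12 = R[0, 1]
    loudness = r11 + r22                       # total power in both channels
    ild = 10.0 * np.log10(r11 / r22)           # interchannel level difference (dB)
    ipd = np.angle(r12)                        # interchannel phase difference (rad)
    ic = np.abs(r12) / np.sqrt(r11 * r22)      # interchannel coherence in [0, 1]
    return loudness, ild, ipd, ic

# Example: a fully coherent (rank-1) source with a 6 dB level difference
# and a 90 degree phase shift between the two channels
a = np.array([1.0, 0.5 * np.exp(1j * np.pi / 2)])
R = np.outer(a, a.conj())
loudness, ild, ipd, ic = stereo_spatial_params(R)
```

For a rank-1 covariance the coherence is exactly 1; diffuse sources yield an off-diagonal entry smaller in magnitude than \(\sqrt{r_{11} r_{22}}\) and hence IC below 1.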
Finally, let us also stress that the LGM modeling is more general (and, thanks to its Gaussian formulation, more principled) than blind source separation (BSS) approaches based on ILD/IPD clustering [13, 24]. Indeed, the diffuseness or IC is not taken into account at all within the latter approaches.
4.3 Spectral Models
In this section we present and discuss spectral models used within various multichannel NMF approaches. These models include NMF models, NTF models and their extensions.
4.3.1 NMF Modeling of Each Source
NMF modeling of each source, which is usually referred to as multichannel NMF, consists in structuring the source variances \(v_{jfn}\) in (4.2) with an NMF structure, as in the single-channel NMF case (see Chap. 1):
$$\begin{aligned} v_{jfn} = \sum _{k=1}^{K_j} w_{jfk} h_{jkn}, \end{aligned}$$(4.4)
where the source-dependent \(K_j\) is usually smaller than both F and N, and \(w_{jfk}\) and \(h_{jkn}\) are all nonnegative. By introducing nonnegative matrices (i.e., matrices with nonnegative entries) \(\mathbf{V}_j = [v_{jfn}]_{f,n} \in \mathbb R_+^{F \times N}\), \(\mathbf{W}_j = [w_{jfk}]_{f,k} \in \mathbb R_+^{F \times K_j}\), and \(\mathbf{H}_j = [h_{jkn}]_{k,n} \in \mathbb R_+^{K_j \times N}\), (4.4) may be rewritten in a matrix form as:
$$\begin{aligned} \mathbf{V}_j = \mathbf{W}_j \mathbf{H}_j. \end{aligned}$$(4.5)
A visualization of these NMF spectral models is shown in Fig. 4.2.
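In code, the NMF structure (4.4)-(4.5) for one source amounts to a product of two small nonnegative matrices. The sketch below, with arbitrary toy dimensions, checks that the matrix form and the entry-wise sum coincide and that the resulting variance matrix is nonnegative and low-rank:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K_j = 6, 8, 3   # toy dimensions (hypothetical values)

# Nonnegative factors for one source j, as in (4.5): V_j = W_j H_j
W_j = rng.random((F, K_j))
H_j = rng.random((K_j, N))
V_j = W_j @ H_j       # spectral variances v_{jfn}, nonnegative by construction

# Entry-wise form (4.4): v_{jfn} = sum_k w_{jfk} h_{jkn}
v_entry = sum(W_j[:, k:k+1] * H_j[k:k+1, :] for k in range(K_j))
assert np.allclose(V_j, v_entry)
assert np.all(V_j >= 0) and np.linalg.matrix_rank(V_j) <= K_j
```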
This kind of spectral model in the case of multichannel source separation was first introduced in [25, 26], though with more sophisticated NMF-like structures suitable for harmonic music instruments and with different optimization criteria than those we discuss in this chapter. Spectral models based on the usual NMF, exactly as in (4.5), were proposed in [12], and then extended/re-considered in many other works [8, 15,16,17, 27].
A very attractive property of this modeling is that any NMF or NMF-like structure based on the IS divergence, such as harmonic NMF [7], smooth NMF [2, 5] or excitation-filter NMF [28] (see also Chap. 1), may be incorporated easily and in a systematic manner within the framework. This was remarked and addressed in [8], where a general source separation framework allowing the specification of various spectral and spatial models for each individual source is proposed. The latter work is accompanied by software called the Flexible Audio Source Separation Toolbox (FASST) that implements all these possible model variants in a flexible way. Finally, let us note that many informed or user-assisted/guided audio source separation approaches were extended to the multichannel case within the same paradigm [15, 29].
4.3.2 Joint NTF Modeling of All Sources
One of the shortcomings of the multichannel NMF modeling presented in Sect. 4.3.1 is the following. While for single-channel NMF one needs to fix an appropriate number of components K or to determine it automatically, which is not always easy (see, e.g., [30]), in the multichannel NMF of Sect. 4.3.1 one needs to determine not only the total number of components \(K = \sum _{j=1}^J K_j\), but also the number of components \(K_j\) for each source, which may vary from one source to another. To overcome this problem the following idea was introduced in [15], and then extended in other works [16, 17]. Instead of representing each source with an individual NMF \(\{ \mathbf{W}_j, \mathbf{H}_j \}\), all the sources now share the components of the same NMF \(\{ \mathbf{W}, \mathbf{H} \}\), where \(\mathbf{W} = [w_{fk}]_{f,k} \in \mathbb R_+^{F \times K}\) and \(\mathbf{H} = [h_{kn}]_{k,n} \in \mathbb R_+^{K \times N}\). Moreover, in order to specify associations between the K NMF components and the J sources, a new \((J \times K)\) nonnegative matrix \(\mathbf{Q} = [q_{jk}]_{j,k} \in \mathbb R_+^{J \times K}\) is introduced, and the source variances \(v_{jfn}\) are now structured as:
$$\begin{aligned} v_{jfn} = \sum _{k=1}^{K} q_{jk} w_{fk} h_{kn}. \end{aligned}$$(4.6)
Assuming the columns of \(\mathbf{Q}\) are normalized to sum to one (i.e., \(\sum _{j=1}^J q_{jk} = 1\)), which is always possible to achieve thanks to the scale ambiguity between the columns of \(\mathbf{Q}\) and those of, say, \(\mathbf{W}\) in (4.6), each \(q_{jk}\) represents the proportion of association of the component k with the source j.
By denoting with \(\mathbf{V} = \{v_{jfn}\}_{j,f,n}\) a 3-valence tensor of source variances, (4.6) may be also rewritten in a tensor/vector form as a sum of K rank-1 tensors:
$$\begin{aligned} \mathbf{V} = \sum _{k=1}^{K} \mathbf{q}_k \circ \mathbf{w}_k \circ \mathbf{h}_k, \end{aligned}$$(4.7)
where “\(\circ \)” denotes the tensor outer product, \(\mathbf{w}_k\) and \(\mathbf{q}_k\) are the k-th columns of matrices \(\mathbf{W}\) and \(\mathbf{Q}\) respectively, and \(\mathbf{h}_k\) is the k-th row of matrix \(\mathbf{H}\). The tensor decomposition in (4.6) and (4.7) is called parallel factor (PARAFAC) or canonical decomposition (CANDECOMP) [31]. A visualization of these NTF spectral models is shown in Fig. 4.3.
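The shared-dictionary model can be sketched as follows (toy dimensions, random nonnegative factors). With the columns of \(\mathbf{Q}\) normalized to sum to one, summing the source variances over j recovers a plain NMF model \(\mathbf{W} \mathbf{H}\) of the mixture power, which the sketch verifies:

```python
import numpy as np

rng = np.random.default_rng(1)
J, F, N, K = 2, 5, 7, 4            # toy dimensions (hypothetical values)

W = rng.random((F, K))
H = rng.random((K, N))
Q = rng.random((J, K))
Q /= Q.sum(axis=0, keepdims=True)  # normalize each column of Q to sum to one

# (4.6): v_{jfn} = sum_k q_{jk} w_{fk} h_{kn}, i.e. a sum of K rank-1 tensors (4.7)
V = np.einsum('jk,fk,kn->jfn', Q, W, H)

# Marginalizing over sources recovers the plain NMF model of the mixture power
assert np.allclose(V.sum(axis=0), W @ H)
```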
We here call this model multichannel NTF, as introduced in [15], though some authors [16, 17] continue calling it multichannel NMF. Note also that a fully nonnegative NTF modeling [10,11,12] was applied for multichannel audio source separation as well. Those approaches apply an NTF decomposition directly to the nonnegative tensor of power spectrograms of the multichannel mixture, while here it is applied to the latent nonnegative tensor of power spectrograms of the sources, and the overall modeling is not fully nonnegative, as mentioned in the introduction.
One can easily note that the NTF decomposition (4.6) generalizes that of (4.4). Indeed, (4.6) reduces to (4.4) by setting, for each column of \(\mathbf{Q}\), all the values to 0 except one that is set to 1, and by keeping \(\mathbf{Q}\) fixed. Finally, the multichannel NTF modeling has the following potential advantages over the multichannel NMF modeling:
-
One does not need to specify in advance the number of components \(K_j\) for each source, but only the total number of components K. The components are then allocated automatically via the matrix \(\mathbf{Q}\), which may also be better than a manual user-specified allocation.
-
Some components may be shared between different sources, which makes the modeling more compact. This happens when a column of matrix \(\mathbf{Q}\) has more than one non-zero entry.
It should be noted, however, that it is desirable for the matrix \(\mathbf{Q}\) to be quite sparse, i.e., there should be few components whose corresponding column of \(\mathbf{Q}\) has more than one non-zero entry. Otherwise, the components are not well allocated between the sources, and this may lead to poor separation results. Thus, it may be desirable to add a sparsity-inducing penalty on \(\mathbf{Q}\) to the corresponding optimization criterion.
4.4 Spatial Models and Constraints
The spatial covariances \(\mathbf{R}_{jfn}\) might be assumed fully unconstrained, though in that case, as already mentioned in Sect. 4.2, the parameter estimation would certainly lead to data overfitting, since there are more parameters than observations, i.e., than STFT coefficients in the multichannel mixture. To cope with this it is necessary to introduce some constraints on the spatial covariances.
First of all, when the sources are static, it is reasonable to assume that the spatial covariances are time-invariant, i.e., that \(\mathbf{R}_{jfn} = \mathbf{R}_{jf}\) is independent of n. This assumption is made in many approaches [8, 12, 16,17,18] and greatly reduces the number of free parameters to be estimated. We assume the time-invariant case within this section; the time-varying case will be briefly discussed at the end.
On top of the time-invariance, additional constraints may be introduced as well, and most often it is achieved either by imposing some particular structure or via probabilistic priors.
The early works [12, 19, 20] constrain the spatial covariance \(\mathbf{R}_{jf}\) further and assume that the rank of the matrix is one, which is referred to as a rank-1 spatial covariance. This was introduced based on the following reasoning. Let us assume that the mixture (4.1) is a convolutive mixture of J point sources. In that case the spatial images \(\mathbf{y}_{jfn}\) in (4.1) may be approximated as [32]
$$\begin{aligned} \mathbf{y}_{jfn} \approx \mathbf{a}_{jf} s_{jfn}, \end{aligned}$$(4.8)
where \(s_{jfn} \in \mathbb C\) are the STFT coefficients of the point sources and \(\mathbf{a}_{jf} = \left[ a_{1jf}, \ldots ,\right. \left. a_{Ijf} \right] ^T \in {\mathbb C}^I\) are the channel-wise vectors of discrete Fourier transforms (DFTs) of the impulse responses of the convolutive mixing filters. The equality in (4.8) indeed holds only approximately and becomes more and more accurate as the mixing filter impulse responses become comparable to or shorter than the STFT analysis window [32]. This approximation is referred to as the narrowband approximation. Assuming now that each source STFT coefficient \(s_{jfn}\) follows a zero-mean Gaussian distribution with variance \(v_{jfn}\), one can easily show that the source images \(\mathbf{y}_{jfn}\) are distributed as in (4.2) with
$$\begin{aligned} \mathbf{R}_{jf} = \mathbf{a}_{jf} \mathbf{a}_{jf}^H. \end{aligned}$$(4.9)
We see that the spatial covariance \(\mathbf{R}_{jf}\) in (4.9) is indeed a rank-1 matrix.
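The following sketch illustrates this numerically: simulating a zero-mean circular complex Gaussian point source through a hypothetical mixing vector, the empirical covariance of the source image converges to the rank-1 matrix \(v \, \mathbf{a} \mathbf{a}^H\). All numerical values are toy choices.

```python
import numpy as np

rng = np.random.default_rng(2)
I, n_frames, v = 2, 20000, 1.5               # channels, frames, source variance (toy)
a = np.array([1.0, 0.6 * np.exp(1j * 0.8)])  # hypothetical mixing vector a_{jf}

# Zero-mean circular complex Gaussian source STFT coefficients with variance v
s = np.sqrt(v / 2) * (rng.standard_normal(n_frames)
                      + 1j * rng.standard_normal(n_frames))
Y = np.outer(a, s)                           # narrowband model (4.8): y = a s

# Empirical covariance converges to v * a a^H, a rank-1 matrix as in (4.9)
Sigma = Y @ Y.conj().T / n_frames
R_theory = v * np.outer(a, a.conj())
assert np.linalg.matrix_rank(R_theory) == 1
assert np.allclose(Sigma, R_theory, atol=0.1)
```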
It was proposed in [18] either not to constrain the spatial covariance \(\mathbf{R}_{jf}\) at all or to parametrize it in a different way (see [18] for details), but in both cases so that the matrix remains full rank. This modeling, referred to as a full-rank spatial covariance, allows going beyond the limits of the narrowband approximation (4.8); it is thus more suitable than the rank-1 model in the case of long reverberation times. It may also be more suitable when the point-source assumption is not fully verified. Indeed, as explained in Sect. 4.7.2 below, modeling a source image with a full-rank model can be recast as modeling a sum of I point sources with different rank-1 spatial covariances and a shared spectral variance.
Another approach [17] consists in assuming that the spatial covariance is a weighted sum of so-called direction of arrival (DOA) kernels, which are rank-1 spatial covariances modeling plane waves coming from several predefined directions. These directions may be specified in the 2D plane or in 3D space (see Fig. 4.4 for a 2D example). Rank-1 DOA kernels corresponding to these directions \(\theta _l\) (\(l = 1, \ldots , L\)) are then defined as
$$\begin{aligned} \mathbf{K}_{fl} = \mathbf{d}(f,\theta _l) \, \mathbf{d}^H(f,\theta _l), \end{aligned}$$(4.10)
with \(\mathbf{d}(f,\theta _l)\) being a relative steering vector for the direction \(\theta _l\) defined as
$$\begin{aligned} \mathbf{d}(f,\theta _l) = \left[ e^{- 2 \pi \jmath \nu _f \tau _{i,1}(\theta _l)} \right] _{i=1}^I, \end{aligned}$$(4.11)
where c is the speed of sound (343 m/s), \(\nu _f\) is the frequency (in Hz) corresponding to the frequency bin f, and \(\tau _{i,i'}(\theta _l)\) is the time difference of arrival (TDOA) (in seconds) between microphones i and \(i'\) from the direction \(\theta _l\). Note that this relative steering vector is defined without taking into account the ILDs, but only the IPDs (see [33] for a definition taking the ILDs into account as well). Finally, the spatial covariance is defined as a weighted sum of the DOA kernels \(\mathbf{K}_{fl}\) from (4.10) as
$$\begin{aligned} \mathbf{R}_{jf} = \sum _{l=1}^L z_{jl} \mathbf{K}_{fl}, \end{aligned}$$(4.12)
with \(z_{jl}\) being nonnegative weights.
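A minimal numerical sketch of this construction is given below for a hypothetical two-microphone linear array. The TDOA model \(\tau = d \cos \theta / c\) for a plane wave impinging on a linear array, the microphone spacing, the DOA grid and the weights are all illustrative assumptions; the sketch checks that each kernel is rank-1 and that the weighted sum is Hermitian positive semidefinite, as a spatial covariance must be.

```python
import numpy as np

c = 343.0                     # speed of sound (m/s)
mics = np.array([0.0, 0.05])  # two mics 5 cm apart on a line (hypothetical array)
nu_f = 1000.0                 # frequency of bin f in Hz (toy value)
thetas = np.deg2rad([0.0, 45.0, 90.0])  # predefined 2D DOA grid (toy values)

def steering(nu, theta):
    # TDOA of a plane wave from direction theta relative to microphone 1,
    # for a linear array: tau_i = (x_i - x_1) cos(theta) / c
    tau = (mics - mics[0]) * np.cos(theta) / c
    return np.exp(-2j * np.pi * nu * tau)   # relative steering vector (IPDs only)

# Rank-1 DOA kernels and their weighted sum giving the spatial covariance
K_fl = [np.outer(d, d.conj()) for d in (steering(nu_f, t) for t in thetas)]
z = np.array([0.7, 0.2, 0.1])               # nonnegative weights z_{jl} (toy values)
R_jf = sum(w * K for w, K in zip(z, K_fl))

assert all(np.linalg.matrix_rank(K) == 1 for K in K_fl)
assert np.allclose(R_jf, R_jf.conj().T)             # Hermitian
assert np.all(np.linalg.eigvalsh(R_jf) >= -1e-12)   # positive semidefinite
```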
If the DOAs of all or some of the sources are known to some extent, this information can be introduced, for example, via prior distributions on the spatial covariances. In [34] those priors are defined via inverse Wishart distributions as follows
$$\begin{aligned} p \left( \mathbf{R}_{jf} | \varvec{\varPsi }_{jf}, m \right) = \mathscr {IW} \left( \mathbf{R}_{jf} \, | \, \varvec{\varPsi }_{jf}, m \right) , \end{aligned}$$(4.13)
with
$$\begin{aligned} \varvec{\varPsi }_{jf} = \mathbf{d}(f,\theta _j) \, \mathbf{d}^H(f,\theta _j) + \sigma ^2_\mathrm{rev} \varvec{\varOmega }_f, \end{aligned}$$(4.14)
where \(\mathbf{d}(f,\theta _j)\) is a steering vector for the known source DOA \(\theta _j\), which may be defined as in (4.11), \(\varvec{\varOmega }_f = \left[ \sin (2 \pi \nu _f q_{ii'} / c) / (2 \pi \nu _f q_{ii'} / c) \right] _{ii'}\) (with \(q_{ii'}\) the distance between microphones i and \(i'\)) is a matrix modeling the reverberant (i.e., non-direct) part of the impulse response, and \(\sigma ^2_\mathrm{rev}\) is a positive constant depending on the amount of reverberation as compared to the direct part of the impulse response.
There are also other models that do not fall into the LGM framework as formulated here. These models include for example multichannel high-resolution NMF (HR-NMF) [35] or a method where the source variance prior parametrization is factorized by NMF [36].
Finally, several approaches [37,38,39] address the time-varying case, where \(\mathbf{R}_{jfn}\) is no longer independent of n, though still constrained in different ways.
4.5 Main Steps and Sources Estimation
Let us denote by \(\varvec{\theta } = \{ \mathbf{R}_{jfn}, v_{jfn} \}_{j,f,n}\) the whole set of model parameters, assuming that some of the constraints overviewed in Sects. 4.3 and 4.4 hold. Given a specified model \(\varvec{\theta }\) and a chosen estimation criterion (see Sect. 4.6 below), most LGM-based approaches consist of the following main steps:
-
1.
The STFT \(\mathbf{X}\) of the multichannel mixture signal is computed.
-
2.
The model is estimated with an algorithm (see Sect. 4.7 below) optimizing the chosen criterion.
-
3.
The source images are estimated in the STFT domain via Wiener filtering as:
$$\begin{aligned} \hat{\mathbf{y}}_{jfn} = \mathbf{R}_{jfn} v_{jfn} \left[ \sum _{j'=1}^J \mathbf{R}_{j'fn} v_{j'fn} \right] ^{-1} \mathbf{x}_{fn}, \end{aligned}$$(4.15)where \(\mathbf{R}_{jfn}\) and \(v_{jfn}\) are the spatial covariances and spectral variances as specified in (4.2).
-
4.
The source images in time domain are then reconstructed by applying the inverse STFT to \(\widehat{\mathbf{Y}} = \{ \hat{\mathbf{y}}_{jfn} \}_{j,f,n}\).
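Step 3 can be sketched for a single TF point as follows (random toy parameters, purely illustrative). The per-source Wiener estimates necessarily sum back to the mixture, a well-known conservativity property of this filter, which the sketch verifies:

```python
import numpy as np

rng = np.random.default_rng(3)
I, J = 2, 3   # channels and sources (toy values)

def random_psd(dim):
    # random Hermitian positive-definite matrix standing in for R_{jfn}
    A = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    return A @ A.conj().T + dim * np.eye(dim)

R = [random_psd(I) for _ in range(J)]                      # spatial covariances
v = rng.random(J) + 0.1                                    # spectral variances
x = rng.standard_normal(I) + 1j * rng.standard_normal(I)   # mixture STFT vector

Sigma_x = sum(v[j] * R[j] for j in range(J))               # mixture covariance
# Multichannel Wiener filter (4.15) for each source image
y_hat = [v[j] * R[j] @ np.linalg.solve(Sigma_x, x) for j in range(J)]

# Conservativity: the source image estimates sum back to the mixture exactly
assert np.allclose(sum(y_hat), x)
```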
In the online approaches [40, 41], where the separation must be performed for every new frame, the same steps are repeated for each frame and the model estimation algorithm is modified so as to update the model parameters in an incremental and causal (i.e., only the past and current frames are used) manner.
4.6 Model Estimation Criteria
In order to estimate the model parameters \(\varvec{\theta }\) from the observed data, i.e., from the STFT of the multichannel mixture signal \(\mathbf{X}\), one needs to specify a model estimation criterion.
4.6.1 Maximum Likelihood
One of the most popular choices for model estimation is the maximum likelihood (ML) criterion that writes
$$\begin{aligned} \widehat{\varvec{\theta }} = \mathrm{arg} \max _{\varvec{\theta }} \log p (\mathbf{X} | \varvec{\theta }). \end{aligned}$$(4.16)
In the case of LGM modeling (4.2) this criterion can be shown [16] equivalent to minimizing the following cost function:
$$\begin{aligned} C_\mathrm{IS}(\varvec{\theta }) = \sum _{f,n} \left[ \mathrm{tr} \left( \varvec{\Sigma }_{\mathbf{x},fn}^{-1} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \right) - \log \mathrm{det} \left( \varvec{\Sigma }_{\mathbf{x},fn}^{-1} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \right) - I \right] , \end{aligned}$$(4.17)
where
$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} = \mathbf{x}_{fn} \mathbf{x}_{fn}^H \end{aligned}$$(4.18)is the empirical mixture covariance at the TF point (f, n).
Note that the cost (4.17) is not well defined (i.e., its value is infinite) when \(I > 1\) and the matrices \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\) are not full rank, which is the case with the definition (4.18). However, this is not a problem per se. Indeed, the infinite term \(- \log \mathrm{det} \left( \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \right) \) is independent of \(\varvec{\theta }\) and can simply be removed from the cost (4.17), since it has no influence on the optimization over \(\varvec{\theta }\). Otherwise, a small regularization term may be added to \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\), which would make it full rank. Also, there exist alternative definitions of \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\) [8, 42] that may be full rank by construction.
The formulation with the cost (4.17) is interesting since, as one can note, it generalizes the IS-NMF cost of the single-channel case (see Chap. 1). Indeed, \(C_\mathrm{IS}(\varvec{\theta })\) reduces to the single-channel IS divergence when \(I = 1\).
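The following sketch spells out this reduction, assuming the standard trace/log-det form of the per-TF-point multichannel IS divergence that the cost (4.17) sums over TF points; for \(I = 1\) the expression collapses to the scalar IS divergence of Chap. 1:

```python
import numpy as np

def multichannel_is(Sigma_hat, Sigma):
    """Per-TF-point multichannel IS divergence between the empirical
    covariance Sigma_hat and the model covariance Sigma (a sketch of the
    summand of the cost (4.17), assuming its standard trace/log-det form)."""
    I = Sigma.shape[0]
    P = np.linalg.solve(Sigma, Sigma_hat)            # Sigma^{-1} Sigma_hat
    return np.real(np.trace(P) - np.log(np.linalg.det(P)) - I)

def scalar_is(x, y):
    # scalar IS divergence d_IS(x | y) from Chap. 1
    return x / y - np.log(x / y) - 1.0

# For I = 1 the multichannel cost collapses to the scalar IS divergence
x, y = 2.7, 0.9
assert np.isclose(multichannel_is(np.array([[x]]), np.array([[y]])),
                  scalar_is(x, y))
```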
4.6.2 Maximum a Posteriori
When a prior distribution \(p (\varvec{\theta })\) on the model parameters is specified, like for example the spatial covariance prior in (4.13), the maximum a posteriori (MAP) criterion is usually used instead of the ML criterion. It writes
$$\begin{aligned} \widehat{\varvec{\theta }} = \mathrm{arg} \max _{\varvec{\theta }} \left[ \log p (\mathbf{X} | \varvec{\theta }) + \log p (\varvec{\theta }) \right] . \end{aligned}$$(4.19)
Note that in the case of the prior in (4.13) we have \(p (\varvec{\theta }) = \prod _{j=1}^J \prod _{f=1}^F p \left( \mathbf{R}_{jf} | \varvec{\varPsi }_{jf}, m \right) ^N\), since the prior is applied at each time-frequency bin.
If one rewrites (4.19) in a form similar to (4.17), this results in simply adding a \(- \log p (\varvec{\theta })\) term to (4.17).
4.6.3 Other Criteria
Several other criteria were proposed as well. For example, we have seen that the ML criterion formulated as in (4.17) generalizes single-channel IS NMF to the multichannel case; similarly, it was proposed in [16] to generalize single-channel NMF with the Euclidean distance (EUC NMF) to the multichannel case. This is achieved by replacing the cost function (4.17) with the following one
$$\begin{aligned} C_\mathrm{EUC}(\varvec{\theta }) = \sum _{f,n} \left\| \varvec{\Sigma }_{\mathbf{x},fn} - \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \right\| _F^2, \end{aligned}$$(4.20)
where \(\left\| \mathbf{A} \right\| _F\) denotes the Frobenius norm of a matrix \(\mathbf{A}\), and the data covariance matrix \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\) is defined slightly differently than in (4.18). Notably, it is defined as [16, 17]
$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} = \mathrm{sign} \left( \mathbf{x}_{fn} \mathbf{x}_{fn}^H \right) \times \sqrt{\left| \mathbf{x}_{fn} \mathbf{x}_{fn}^H \right| }, \end{aligned}$$(4.21)
where all the operations, i.e., the absolute value \(\left| \cdot \right| \), the square root \(\sqrt{\cdot }\), the multiplication \(\times \) and the sign (\(\mathrm{sign} \left( a \right) = a / |a|\)), are applied element-wise to the corresponding matrices.
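A sketch of this element-wise construction follows (the sign function is implemented by hand, since NumPy's complex `sign` convention has varied across versions). Note that the resulting matrix is Hermitian and its diagonal holds channel magnitudes rather than powers, which the sketch checks; see [16, 17] for the exact definition.

```python
import numpy as np

def csign(P):
    # element-wise sign(a) = a / |a|, with sign(0) = 0
    out = np.zeros_like(P)
    nz = np.abs(P) > 0
    out[nz] = P[nz] / np.abs(P[nz])
    return out

def data_cov_euc(x):
    """Element-wise sign/sqrt data covariance used with the EUC cost,
    following the description above (a sketch)."""
    P = np.outer(x, x.conj())
    return csign(P) * np.sqrt(np.abs(P))

# Toy two-channel STFT vector
x = np.array([3.0 * np.exp(1j * 0.4), 1.0 * np.exp(-1j * 1.1)])
S = data_cov_euc(x)
assert np.allclose(S, S.conj().T)               # Hermitian by construction
assert np.allclose(np.diag(S).real, np.abs(x))  # diagonal: magnitudes, not powers
```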
There is also the variational Bayes (VB) criterion [43], which consists in computing directly the posterior distribution of the source STFT coefficients while marginalizing over all possible model parameters.
4.7 Model Estimation Algorithms
There exist several model parameter estimation algorithms [8, 16]. However, due to the probabilistic formulation of the LGM model (4.2), the expectation-maximization (EM) algorithm [44] is one of the most popular choices. As we will see below, the use of the EM algorithm results not in just one algorithm but in a whole family of algorithms: each particular implementation of EM depends on several choices, as explained below. Because of this popularity we will mostly concentrate here on the different variants of EM and only briefly mention other algorithms.
To present the variants of the EM algorithm we consider the LGM model (4.2) with time-invariant unconstrained full-rank spatial covariances \(\mathbf{R}_{jf}\) and spectral variances \(v_{jfn}\) structured with the NTF model (4.6). This is in fact a variant of multichannel NTF similar to the one described in [15], but with full-rank covariances instead of the rank-1 covariances of [15]. Since no probabilistic priors on the parameters are assumed, the variants of the EM algorithm presented below optimize the ML criterion (4.16).
4.7.1 Variants of EM Algorithm
In one of its general formulations the EM algorithm [44] to optimize the ML criterion (4.16) consists first in specifying
-
so-called observed data \(\mathbf{X}\) that are usually the multichannel mixture STFT coefficients in the case of multichannel source separation, as considered here, and
-
so-called latent data \(\mathbf{Z}\). The choice of latent data may be quite different and different choices would lead to different EM variants.
Assuming that a probabilistic model parametrized by \(\varvec{\theta }\) is specified, the EM algorithm is usually applied in the following case: it is difficult to optimize in closed form the ML criterion (4.16) maximizing \(\log p (\mathbf{X} | \varvec{\theta })\), while it is easy to maximize, in closed form or via some simplified iterative procedure, the log-likelihood \(\log p (\mathbf{X}, \mathbf{Z} | \varvec{\theta })\) of the so-called complete data \(\{ \mathbf{X}, \mathbf{Z} \}\). The choice of the latent data \(\mathbf{Z}\) is usually made accordingly.
The EM algorithm consists then in iterating the following two steps:
-
E-step: Compute an auxiliary function as follows:
$$\begin{aligned} Q(\varvec{\theta }, \varvec{\theta }^{(\ell )}) = \mathbb E_{\mathbf{Z}| \mathbf{X}, \varvec{\theta }^{(\ell )}} \log p (\mathbf{X}, \mathbf{Z} | \varvec{\theta }). \end{aligned}$$(4.22) -
M-step: Optimize the auxiliary function to update model parameters according to the following criterion:
$$\begin{aligned} \varvec{\theta }^{(\ell +1)} = \mathrm{arg} \max _{\varvec{\theta }} Q(\varvec{\theta }, \varvec{\theta }^{(\ell )}), \end{aligned}$$(4.23)where \(\varvec{\theta }^{(\ell )}\) denotes the model parameters estimated at the \(\ell \)-th iteration.
It is often possible to optimize the criterion (4.23) in closed form. However, sometimes, depending on the choice of the latent data \(\mathbf{Z}\), it is not. In that case either another iterative optimization algorithm may be applied, or any algorithm can be used provided that it ensures at each EM iteration the following non-decrease of the auxiliary function:
$$\begin{aligned} Q(\varvec{\theta }^{(\ell +1)}, \varvec{\theta }^{(\ell )}) \ge Q(\varvec{\theta }^{(\ell )}, \varvec{\theta }^{(\ell )}). \end{aligned}$$(4.24)
In the latter case the algorithm is called generalized EM (GEM) [44], and the ways the condition (4.24) is enforced again lead to different variants of the algorithm.
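The GEM acceptance condition (4.24) can be sketched generically as follows; `Q` and `propose` are hypothetical callables standing in for a concrete model's auxiliary function and update rule:

```python
def gem_update(theta, Q, propose):
    """One generalized-EM M-step: accept a proposed parameter update only
    if it does not decrease the auxiliary function Q(., theta), i.e. the
    condition (4.24). A generic sketch, not tied to a specific model."""
    candidate = propose(theta)
    return candidate if Q(candidate, theta) >= Q(theta, theta) else theta

# Toy check with a scalar parameter and Q(t', t) = -(t' - 3)^2 (maximum at 3)
Q = lambda t_new, t_old: -(t_new - 3.0) ** 2
assert gem_update(0.0, Q, lambda t: t + 1.0) == 1.0   # improving step accepted
assert gem_update(0.0, Q, lambda t: t - 1.0) == 0.0   # worsening step rejected
```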
To summarize, let us list the various choices that lead to different variants of the EM algorithm and thus to different model parameter estimates. These choices include:
-
1.
Choice of latent data \(\mathbf{Z}\), for example:
-
Latent data consist of NMF/NTF components [12] defined as
$$\begin{aligned} c_{kjfn} \sim \mathscr {N}_c(0, w_{jfk} h_{jkn}), \qquad k = 1, \ldots , K_j \end{aligned}$$(4.25)in case of NMF spectral model (4.4), or as
$$\begin{aligned} c_{kjfn} \sim \mathscr {N}_c(0, w_{fk} h_{kn} q_{jk}), \qquad k = 1, \ldots , K \end{aligned}$$(4.26)in case of NTF spectral model (4.6).
-
Latent data consist of so-called sub-sources [8] (see Sect. 4.7.2 below).
-
Latent data consist of point sources [15] \(s_{jfn}\) as in the narrowband approximation (4.8).
-
Latent data consist of spatial source images [27] \(\mathbf{y}_{jfn}\) as in (4.2).
-
Latent data consist of binary TF activations of the predominant source (see, e.g., [45] for details).
-
-
2.
Choice of maximization step updates in the case of a GEM algorithm, for example closed-form updates where possible, combined with multiplicative update (MU) rules for the NMF/NTF spectral model parameters, as in the SSEM/MU algorithm described below.
-
3.
Choice of initial parameters \(\varvec{\theta }^{(0)}\), for example a random nonnegative initialization.
-
4.
Choice of number of EM algorithm iterations, for example:
-
Fixed number of iterations (the most common choice).
-
Iterating till some stopping criterion depending on the likelihood value is satisfied.
-
A so-called spatial image EM (SIEM) algorithm, where the latent data are the spatial source images, is given in detail in Chap. 7. In the following section we present in detail a so-called sub-source EM algorithm based on MU rules (SSEM/MU) [8], where the latent data are the sub-sources and MU rules are used for the NTF spectral model parameter updates within the M-step. Other variants of the EM and GEM algorithms may be found in the corresponding papers.
4.7.2 Detailed Presentation of SSEM/MU Algorithm
Recall that our model consists of time-invariant unconstrained full-rank spatial covariances \(\mathbf{R}_{jf}\) and spectral variances \(v_{jfn}\) structured with the NTF model (4.6). Thus, it can be parametrized as
$$\begin{aligned} \varvec{\theta } = \left\{ \mathbf{Q}, \mathbf{W}, \mathbf{H}, \{ \mathbf{R}_{jf} \}_{j,f} \right\} , \end{aligned}$$(4.27)
with nonnegative matrices \(\mathbf{Q}\), \(\mathbf{W}\) and \(\mathbf{H}\) specified in Sect. 4.3.2.
The SSEM/MU algorithm presented below is a special case of a more general algorithm from [8], applied here to a slightly different model (here the spectral variances are structured with the NTF model, while in [8] they are structured with an NMF model).
Each spatial \(I \times I\) covariance \(\mathbf{R}_{jf}\) being full rank, its rank equals I. For each source j we introduce I so-called point sub-sources \(s_{ji,fn} \in \mathbb C\) (\(i = 1, \ldots , I\)) that share the same spectral variance \(v_{jfn}\); in other words, they are distributed as
$$\begin{aligned} s_{ji,fn} \sim \mathscr {N}_c(0, v_{jfn}). \end{aligned}$$(4.28)
Moreover, each spatial covariance \(\mathbf{R}_{jf}\) can be non-uniquely represented as
$$\begin{aligned} \mathbf{R}_{jf} = \mathbf{A}_{jf} \mathbf{A}_{jf}^H, \end{aligned}$$(4.29)
where \(\mathbf{A}_{jf}\) is an \(I \times I\) complex-valued matrix. By introducing the \(J \, I\)-length vector
$$\begin{aligned} \mathbf{s}_{fn} = [s_{11,fn}, \ldots , s_{1I,fn}, \ldots , s_{J1,fn}, \ldots , s_{JI,fn}]^T \end{aligned}$$
and the \(I \times J \, I\) matrix
$$\begin{aligned} \mathbf{A}_{f} = [\mathbf{A}_{1f}, \ldots , \mathbf{A}_{Jf}], \end{aligned}$$
one can show [8] that the LGM modeling (4.3) is equivalent (up to the noise term \(\mathbf{b}_{fn}\)) to
$$\begin{aligned} \mathbf{x}_{fn} = \mathbf{A}_{f} \mathbf{s}_{fn} + \mathbf{b}_{fn}, \end{aligned}$$
with \(s_{ji,fn}\) (the components of \(\mathbf{s}_{fn}\)) being mutually independent and distributed as in (4.28), and the noise term \(\mathbf{b}_{fn}\) being distributed as
$$\begin{aligned} \mathbf{b}_{fn} \sim \mathscr {N}_c(\mathbf{0}, \varvec{\Sigma }_{\mathbf{b},fn}) \end{aligned}$$
with an isotropic covariance matrix \(\varvec{\Sigma }_{\mathbf{b},fn} = \sigma ^2_{\mathbf{b},f} \mathbf{I}_I\). The noise term \(\mathbf{b}_{fn}\) is needed for a so-called simulated annealing procedure that is necessary in this case (see [12] for details), where the noise variance \(\sigma ^2_{\mathbf{b},f}\) is usually decreased over the algorithm iterations.
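As a numerical sanity check of this equivalence, one can factor each \(\mathbf{R}_{jf}\) (e.g., by a Cholesky factorization, one of the non-unique choices for \(\mathbf{A}_{jf}\)), draw the sub-sources and the isotropic noise, and verify that the empirical mixture covariance matches \(\sum_j v_{jfn} \mathbf{R}_{jf} + \sigma^2_{\mathbf{b},f} \mathbf{I}_I\). The sketch below uses arbitrary illustrative sizes and assumes nothing beyond the definitions above:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 2, 3                                    # channels, sources (illustrative)

# Full-rank Hermitian PSD spatial covariances R_jf for one frequency bin f
R = []
for j in range(J):
    M = rng.normal(size=(I, I)) + 1j * rng.normal(size=(I, I))
    R.append(M @ M.conj().T + 2.0 * np.eye(I))  # well-conditioned PSD

# One (non-unique) factorization R_jf = A_jf A_jf^H, here via Cholesky
A_blocks = [np.linalg.cholesky(Rj) for Rj in R]
A_f = np.hstack(A_blocks)                      # I x (J*I) mixing matrix

v = rng.uniform(0.5, 2.0, size=J)              # spectral variances at this bin
sigma2_b = 1e-2                                # (annealed) noise variance

# Draw the J*I mutually independent sub-sources and the isotropic noise
N = 100000
def cplx_normal(var, size):
    return np.sqrt(var / 2) * (rng.normal(size=size) + 1j * rng.normal(size=size))

s = np.vstack([cplx_normal(v[j], (I, N)) for j in range(J)])  # (J*I) x N
b = cplx_normal(sigma2_b, (I, N))
x = A_f @ s + b                                # equivalent LGM mixture

# Empirical covariance of x should match sum_j v_j R_j + sigma2_b * I
Sigma_x = x @ x.conj().T / N
Sigma_model = sum(v[j] * R[j] for j in range(J)) + sigma2_b * np.eye(I)
```

The agreement of `Sigma_x` with `Sigma_model` (up to sampling error) illustrates that the sub-source parametrization reproduces the original LGM covariance structure.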
Let us now compute the auxiliary function \(Q(\varvec{\theta }, \varvec{\theta }^{(\ell )})\) defined in (4.22). Below we will sometimes omit the indexing of parameters with \((\ell )\); it will be clear from the context which parameters were estimated at the previous step and which are to be updated at the current step. The log-likelihood of the complete data \(\{ \mathbf{X}, \mathbf{Z} \}\) writes as (see Footnote 4)
where
is computed as in (4.18),
and \(d_{IS} (x | y) = \frac{x}{y} - \log \frac{x}{y} - 1\) is the scalar IS divergence (see Chap. 1).
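As a quick numerical aside (not from the chapter), the scalar IS divergence vanishes only at \(x = y\) and is scale-invariant, a property that makes it well suited to audio power spectra with large dynamic range:

```python
import numpy as np

def d_is(x, y):
    # scalar Itakura-Saito divergence d_IS(x | y) = x/y - log(x/y) - 1
    return x / y - np.log(x / y) - 1

# scale invariance: d_IS(lambda*x | lambda*y) = d_IS(x | y)
print(d_is(3.0, 2.0), d_is(30.0, 20.0))  # equal values
print(d_is(2.0, 2.0))                    # zero at x = y
```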
By applying the conditional expectation operator \(\mathbb E_{\mathbf{S}| \mathbf{X}, \varvec{\theta }^{(\ell )}} \left[ \cdot \right] \), the auxiliary function \(Q(\varvec{\theta }, \varvec{\theta }^{(\ell )})\) then writes as
with \(\widehat{\varvec{\Sigma }}_{\mathbf{xs},fn}\), \(\widehat{\varvec{\Sigma }}_{\mathbf{s},fn}\) and \(\hat{\xi }_{jfn}\) defined as
and computed as follows:
where
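Whatever the exact expressions above, these posterior statistics follow from standard Gaussian conditioning of \(\mathbf{s}_{fn}\) on \(\mathbf{x}_{fn}\). A minimal per-bin sketch (the function name, tensor shapes and return convention are our own, not the chapter's notation) could look as follows:

```python
import numpy as np

def e_step_bin(x, A, v, sigma2_b):
    """Posterior statistics of the sub-sources for one TF bin (f, n).

    x: (I,) observed mixture, A: (I, J*I) mixing matrix,
    v: (J,) spectral variances, sigma2_b: noise variance.
    Returns conditional moments usable by an M-step; this is a sketch
    based on standard Gaussian conditioning, not a verbatim transcription
    of the chapter's update equations."""
    I = x.shape[0]
    J = v.shape[0]
    Sigma_s = np.diag(np.repeat(v, I)).astype(complex)   # prior cov of s
    Sigma_x = A @ Sigma_s @ A.conj().T + sigma2_b * np.eye(I)
    G = Sigma_s @ A.conj().T @ np.linalg.inv(Sigma_x)    # Wiener-like gain
    s_hat = G @ x                                        # posterior mean
    Sigma_post = Sigma_s - G @ A @ Sigma_s               # posterior cov
    Sigma_s_hat = np.outer(s_hat, s_hat.conj()) + Sigma_post
    Sigma_xs_hat = np.outer(x, s_hat.conj())
    # xi_hat[j]: average posterior power of source j's sub-sources
    xi_hat = np.array([np.real(np.trace(
        Sigma_s_hat[j * I:(j + 1) * I, j * I:(j + 1) * I])) / I
        for j in range(J)])
    return Sigma_xs_hat, Sigma_s_hat, xi_hat
```

In practice these per-bin statistics are accumulated over n before the M-step updates.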
We now proceed with the M-step (4.23). Maximizing the auxiliary function (4.39) over \(\mathbf{A}_{f}\) leads to the following closed-form solution (see Footnote 5):
Maximization of the auxiliary function (4.39) over \(\mathbf{Q}\), \(\mathbf{W}\) and \(\mathbf{H}\), i.e., the minimization of \(\sum _{j,f,n} d_{IS} (\hat{\xi }_{jfn} | v_{jfn})\) with \(v_{jfn}\) computed as in (4.6), does not admit a closed-form solution. As such, to update \(\mathbf{Q}\), \(\mathbf{W}\) and \(\mathbf{H}\), several iterations of the following MU rules [15] are applied:
Applying these MU rules does not guarantee maximization of the auxiliary function as in (4.23), but only its non-decrease as in (4.24). As such, this is in fact a GEM algorithm.
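These are the standard IS-divergence multiplicative updates applied to the three-factor NTF model \(v_{jfn} = \sum_k q_{jk} w_{fk} h_{kn}\). A sketch is given below; the tensor layout (\(\hat{\xi}\) as a \(J \times F \times N\) array) and the inner-iteration count are our own conventions:

```python
import numpy as np

def mu_update(xi, Q, W, H, n_inner=3, eps=1e-12):
    """A few multiplicative updates decreasing sum_jfn d_IS(xi_jfn | v_jfn)
    with v_jfn = sum_k q_jk w_fk h_kn (the NTF model (4.6)).
    Shapes: xi (J,F,N), Q (J,K), W (F,K), H (K,N), all nonnegative."""
    for _ in range(n_inner):
        V = np.einsum('jk,fk,kn->jfn', Q, W, H) + eps
        Q *= (np.einsum('jfn,fk,kn->jk', xi / V**2, W, H)
              / (np.einsum('jfn,fk,kn->jk', 1.0 / V, W, H) + eps))
        V = np.einsum('jk,fk,kn->jfn', Q, W, H) + eps
        W *= (np.einsum('jfn,jk,kn->fk', xi / V**2, Q, H)
              / (np.einsum('jfn,jk,kn->fk', 1.0 / V, Q, H) + eps))
        V = np.einsum('jk,fk,kn->jfn', Q, W, H) + eps
        H *= (np.einsum('jfn,jk,fk->kn', xi / V**2, Q, W)
              / (np.einsum('jfn,jk,fk->kn', 1.0 / V, Q, W) + eps))
    return Q, W, H
```

Each factor is scaled by the ratio of the negative to positive parts of the IS-cost gradient, which keeps the factors nonnegative by construction.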
Algorithm 1 summarizes one iteration of the SSEM/MU algorithm derived above.
4.7.3 Other Algorithms
Another very popular choice for multichannel NMF model parameter estimation is the majorization-minimization (MM) algorithm [46], which is used for example in [16, 17]. Note that the EM algorithm can be interpreted as a special case of the MM algorithm.
4.8 Conclusion
In this chapter we have introduced multichannel NMF methods for audio source separation and discussed their potential advantages and disadvantages. Despite the quickly growing popularity of deep learning, which is now of great interest for audio source separation, multichannel NMF methods remain an important area of research and, in our opinion, cannot be completely replaced by deep learning-based methods in all situations. Indeed, in fully blind settings, where no training data are available, deep learning is no longer a suitable path, while multichannel NMF is still applicable.
As for further research on multichannel NMF, we would like to highlight the following possible paths, which have already begun to be explored. One research direction consists in proposing more sophisticated spatial and spectral models adapted to the mixing conditions and to the sources of interest, as well as new models going beyond the limitations of LGM modeling. Another direction consists in combining aspects of multichannel NMF with deep learning.
Notes
- 1.
Throughout the chapter we generally refer to all these methods as multichannel NMF, specifying explicitly when we speak about multichannel NTF.
- 2.
The spatial image of a source is not the source signal itself, but its contribution to the I-channel mixture.
- 3.
Due to the scale ambiguity between \(\mathbf{R}_{jfn}\) and \(v_{jfn}\) in (4.2), the loudness can be fully attributed to \(v_{jfn}\).
- 4.
When we write \(\overset{\mathrm{c}}{=}\), the equality holds up to a constant that is independent of the model parameters \(\varvec{\theta }\) and thus has no influence on the optimization over the parameters in (4.23).
- 5.
Note that if the spatial covariances \(\mathbf{R}_{jf}\) are needed, they can always be computed with (4.29).
References
D.D. Lee, H.S. Seung, Learning the parts of objects with nonnegative matrix factorization. Nature 401, 788–791 (1999)
T. Virtanen, Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 15(3), 1066–1074 (2007)
M.N. Schmidt, R.K. Olsson, Single-channel speech separation using sparse non-negative matrix factorization, in ISCA International Conference on Spoken Language Processing (INTERSPEECH) (2006)
L. Le Magoarou, A. Ozerov, N.Q. Duong, Text-informed audio source separation: example-based approach using non-negative matrix partial co-factorization. J. Signal Process. Syst. 79(2), 117–131 (2015)
C. Févotte, N. Bertin, J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Comput. 21(3), 793–830 (2009)
D. El Badawy, N.Q. Duong, A. Ozerov, On-the-fly audio source separation—a novel user-friendly framework. IEEE/ACM Trans. Audio Speech Lang. Process. 25(2), 261–272 (2017)
E. Vincent, N. Bertin, R. Badeau, Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio Speech Lang. Process. 18, 528–537 (2010)
A. Ozerov, E. Vincent, F. Bimbot, A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 20(4), 1118–1133 (2012)
N. Mohammadiha, P. Smaragdis, A. Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Trans. Audio Speech Lang. Process. 21(10), 2140–2151 (2013)
D. FitzGerald, M. Cranitch, E. Coyle, Non-negative tensor factorisation for sound source separation, in Proceedings of the Irish Signals and Systems Conference, Dublin, Ireland, Sept 2005
D. FitzGerald, M. Cranitch, E. Coyle, Extended nonnegative tensor factorisation models for musical sound source separation. Comput. Intell. Neurosci. 2008(872425), 15 (2008)
A. Ozerov, C. Févotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 18(3), 550–563 (2010)
H. Sawada, R. Mukai, S. Araki, S. Makino, A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Process. 12(5), 530–538 (2004)
M.I. Mandel, D.P. Ellis, T. Jebara, An EM algorithm for localizing multiple sound sources in reverberant environments, in Advances in Neural Information Processing Systems (NIPS), vol. 19 (2006)
A. Ozerov, C. Févotte, R. Blouet, J.-L. Durrieu, Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, (May 2011), pp. 257–260
H. Sawada, H. Kameoka, S. Araki, N. Ueda, Multichannel extensions of non-negative matrix factorization with complex-valued data. IEEE Trans. Audio Speech Lang. Process. 21(5), 971–982 (2013)
J. Nikunen, T. Virtanen, Direction of arrival based spatial covariance model for blind sound source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(3), 727–739 (2014)
N.Q. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)
C. Févotte, J.-F. Cardoso, Maximum likelihood approach for blind audio source separation using time-frequency Gaussian source models, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (IEEE, 2005), pp. 78–81
E. Vincent, S. Arberet, R. Gribonval, Underdetermined instantaneous audio source separation via local Gaussian modeling, in International Conference on Independent Component Analysis and Signal Separation, (Springer, 2009), pp. 775–782
H. Kameoka, T. Yoshioka, M. Hamamura, J. Le Roux, K. Kashino, Statistical model of speech signals based on composite autoregressive system with application to blind source separation, in International Conference on Latent Variable Analysis and Signal Separation, (Springer, 2010), pp. 245–253
T. Higuchi, H. Takeda, T. Nakamura, H. Kameoka, A unified approach for underdetermined blind signal separation and source activity detection by multichannel factorial hidden Markov models, in INTERSPEECH, (2014), pp. 850–854
J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, Parametric coding of stereo audio. EURASIP J. Appl. Signal Process. 2005, 1305–1322 (2005)
M.I. Mandel, R.J. Weiss, D.P. Ellis, Model-based expectation-maximization source separation and localization. IEEE Trans. Audio Speech Lang. Process. 18(2), 382–394 (2010)
E. Vincent, X. Rodet, Underdetermined source separation with structured source priors, in International Conference on Independent Component Analysis and Signal Separation, (Springer, 2004), pp. 327–334
E. Vincent, Musical source separation using time-frequency source priors. IEEE Trans. Audio Speech Lang. Process. 14(1), 91–98 (2006)
S. Arberet, A. Ozerov, N.Q. Duong, E. Vincent, R. Gribonval, F. Bimbot, P. Vandergheynst, Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation, in 10th International Conference on Information Sciences, Signal Processing and their Applications (ISSPA), 2010, (IEEE, 2010), pp. 1–4
T. Virtanen, A. Klapuri, Analysis of polyphonic audio using source-filter model and non-negative matrix factorization, in Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop, (Citeseer, 2006)
N. Souviraà-Labastie, A. Olivero, E. Vincent, F. Bimbot, Multi-channel audio source separation using multiple deformed references. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(11), 1775–1787 (2015)
V.Y.F. Tan, C. Févotte, Automatic relevance determination in nonnegative matrix factorization with the beta-divergence. IEEE Trans. Pattern Anal. Mach. Intell. 35(7), 1592–1605 (2013)
R. Bro, PARAFAC: tutorial and applications. Chemom. Intell. Lab. Syst. 38(2), 149–171 (1997)
L. Parra, C. Spence, Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Process. 8(3), 320–327 (2000)
S. Gannot, E. Vincent, S. Markovich-Golan, A. Ozerov, A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 25(4), 692–730 (2017)
N.Q. Duong, E. Vincent, R. Gribonval, Spatial location priors for Gaussian model based reverberant audio source separation. EURASIP J. Adv. Signal Process. 2013(1), 149 (2013)
R. Badeau, M.D. Plumbley, Multichannel high-resolution NMF for modeling convolutive mixtures of non-stationary signals in the time-frequency domain. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 22(11), 1670–1680 (2014)
D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud, An inverse-gamma source variance prior with factorized parameterization for audio source separation, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2016), pp. 136–140
N.Q. Duong, H. Tachibana, E. Vincent, N. Ono, R. Gribonval, S. Sagayama, Multichannel harmonic and percussive component separation by joint modeling of spatial and spectral continuity, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2011), pp. 205–208
T. Higuchi, N. Takamune, T. Nakamura, H. Kameoka, Underdetermined blind separation and tracking of moving sources based on DOA-HMM, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2014), pp. 3191–3195
D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud, A variational EM algorithm for the separation of time-varying convolutive audio mixtures. IEEE/ACM Trans. Audio Speech Lang. Process. 24(8), 1408–1423 (2016)
M. Togami, Online speech source separation based on maximum likelihood of local gaussian modeling, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (IEEE, 2011), pp. 213–216
L.S. Simon, E. Vincent, A general framework for online audio source separation, in International conference on Latent Variable Analysis and Signal Separation, (Springer, 2012), pp. 397–404
N.Q. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using local observed covariance and auditory-motivated time-frequency representation, in International Conference on Latent Variable Analysis and Signal Separation, (Springer, 2010), pp. 73–80
K. Adiloğlu, E. Vincent, Variational bayesian inference for source separation and robust feature extraction. IEEE/ACM Trans. Audio Speech Lang. Process. 24(10), 1746–1758 (2016)
A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Statistical Methodology) 39, 1–38 (1977)
J. Thiemann, E. Vincent, A fast EM algorithm for Gaussian model-based source separation, in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), (IEEE, 2013), pp. 1–5
D.R. Hunter, K. Lange, A tutorial on MM algorithms. Am. Stat. 58(1), 30–37 (2004)
Acknowledgements
Cédric Févotte acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 681839 (project FACTORY).
© 2018 Springer International Publishing AG
Ozerov, A., Févotte, C., Vincent, E. (2018). An Introduction to Multichannel NMF for Audio Source Separation. In: Makino, S. (eds) Audio Source Separation. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-73031-8_4
Print ISBN: 978-3-319-73030-1
Online ISBN: 978-3-319-73031-8