4.1 Introduction

Nonnegative matrix factorisation (NMF) [1] is a dimensionality reduction technique that approximates a nonnegative data matrix (a matrix with nonnegative entries) by a product of two nonnegative matrices of lower rank than the data matrix. Equivalently, it approximates the data matrix by a sum of a few rank-1 nonnegative matrices. It was first successfully applied to single-channel source separation [2], where a nonnegative matrix of magnitude or power spectrogram is decomposed, and it became a state-of-the-art reference. The success of this method is mainly due to the universality of this quite simple model (it is applicable to various types of audio sources, including speech [3, 4], music [2, 5], environmental sounds [6], etc.) and to its flexibility, which allows adding various constraints, such as harmonicity of the spectral patterns [7], smoothness of their activation coefficients [2, 5], pre-trained spectral patterns [8, 9], etc.

Given the success of NMF for single-channel source separation, there were several attempts to extend it to multichannel source separation. Early ideas relied on stacking the magnitude or power spectrograms of all channels into a 3-valence nonnegative tensor and decomposing it with nonnegative tensor factorisation (NTF) methods [10] or other NTF-like nonnegative structured approximations [11, 12]. This gave some interesting results. However, since only nonnegative power spectrograms are involved, such approaches rely only on amplitude information, while completely discarding the phases of the short-time Fourier transforms (STFTs). In other words, these approaches do not allow exploiting the interchannel phase differences (IPDs), but only the interchannel level differences (ILDs). However, the IPDs may be very important for multichannel source separation, and they are indeed exploited by several clustering-based methods [13, 14]. Using the IPDs becomes even more critical in the far-field case (i.e., when the distances between the microphones are much smaller than the distances between the sources and the microphones), where the information carried by the ILDs becomes almost non-discriminative.

It is clear that a fully nonnegative (e.g., NTF-like) modeling is unable to model jointly the source power spectrograms, the ILDs and the IPDs, since the phase information is discarded in the nonnegative tensor of multichannel mixture power spectrograms. As such, it was proposed to resort to semi-nonnegative modeling [8, 12, 15,16,17], where the latent source power spectrograms are modeled with NMF [8, 12] or NTF [15,16,17], while the mixing system is modeled differently, not with a nonnegative model. This modeling, often referred to as multichannel NMF [12] or multichannel NTF [15], depending on the model of the source power spectrograms, is usually achieved via Gaussian probabilistic modeling applied directly to the complex-valued STFTs of all channels.

The multichannel NMF modeling treats the complex-valued STFT coefficients as realizations of zero-mean circular complex-valued Gaussian random variables with structured variances (via NMF) and covariances. As a consequence, this modeling reduces to Itakura-Saito (IS) NMF in the single-channel case (see Chap. 1), thus being its natural extension to the multichannel case. Moreover, it allows integrating many other NMF-like models (see Chap. 1 and [8]) in an easy and flexible manner. Finally, it combines both spectral and spatial (including ILD and IPD) cues within a unified framework. When one of these two cues does not allow separating the sources efficiently, the algorithm relies on the other cue, and vice versa. In our opinion, multichannel NMF is one of the first attempts at combining these two cues in a systematic and principled way.

4.2 Local Gaussian Model

Multichannel NMF can be formulated as being based on a so-called local Gaussian model (LGM), which is itself more general than multichannel NMF and allows modeling and combining spatial and spectral cues in a systematic way. In its most general form the LGM may be formulated as follows. Let us first assume that we deal with a multichannel (I-channel) mixture of J sources to be separated. Assuming all the signals are converted into the STFT domain, this can be written as

$$\begin{aligned} \mathbf{x}_{fn} = \sum _{j=1}^J \mathbf{y}_{jfn}, \end{aligned}$$
(4.1)

where \(\mathbf{x}_{fn} = \left[ x_{1,fn}, \ldots , x_{I,fn}\right] ^T \in \mathbb C^I\) and \(\mathbf{y}_{jfn} = \left[ y_{1,jfn}, \ldots , y_{I,jfn}\right] ^T \in \mathbb C^I\) (\(j = 1, \ldots , J\)) are the channel-wise vectors of STFT coefficients of the mixture and of the j-th source spatial image, respectively; and \(f = 1, \ldots , F\) and \(n = 1, \ldots , N\) are the frequency and time indices, respectively. Given the above notations, the LGM [18] assumes that each source image (the I-length complex-valued vector \(\mathbf{y}_{jfn}\)) is modeled as a zero-mean circular complex Gaussian random vector as follows

$$\begin{aligned} \mathbf{y}_{jfn} \sim \mathscr {N}_c \left( 0, \mathbf{R}_{jfn} v_{jfn} \right) , \end{aligned}$$
(4.2)

where the complex-valued covariance matrix is positive definite Hermitian, and it is composed of two factors:

  • a spatial covariance \(\mathbf{R}_{jfn} \in \mathbb C^{I \times I}\) representing the spatial characteristics of the j-th source image at the time-frequency (TF) point (fn), and

  • a spectral variance \(v_{jfn} \in \mathbb R_+\) representing the spectral characteristics of the j-th source image at the TF point (fn).

Given the model parameters, i.e., the spatial covariances \(\mathbf{R}_{jfn}\) and the spectral variances \(v_{jfn}\), the random vectors \(\mathbf{y}_{jfn}\) in (4.2) are also assumed mutually independent across time, frequency and sources. Note that the LGM was not first proposed in [18]; its variants were already considered in [19, 20]. However, the formulation from [18] is general enough to cover all these cases, which is why we have chosen it here.

Given the multichannel mixing equation and the above independence assumptions, the mixture STFT coefficients may be shown to be distributed as

$$\begin{aligned} \mathbf{x}_{fn} \sim \mathscr {N}_c \left( 0, \sum _{j=1}^J \mathbf{R}_{jfn} v_{jfn} \right) . \end{aligned}$$
(4.3)

The model parameters are usually estimated in the maximum likelihood (ML) sense from the observed mixture \(\mathbf{X} = \left\{ x_{ifn} \right\} _{i,f,n}\). However, a direct ML estimation of the parameters under the model (4.3) would lead to data overfitting, since the number of scalar parameters exceeds the number of mixture STFT coefficients. As such, various constraints are applied to both the spectral variances and the spatial covariances, as presented in detail in Sects. 4.3 and 4.4, respectively. In the multichannel NMF case we address in this chapter, the spectral variances are usually represented by low-rank nonnegative matrices or tensors. However, other approaches consider different models to structure the spectral variances (e.g., composite autoregressive models [21], source-excitation models [8] or hidden Markov models [22]), which is why the LGM is more general than multichannel NMF. As discussed in Sect. 4.4 below, the spatial covariances are usually not modeled with fully nonnegative structures. This is the reason why we speak about semi-nonnegative modeling in the introduction.
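Before turning to these constraints, a minimal NumPy sketch may help fix ideas (all names, shapes and numbers below are ours, not from the chapter): it samples the source images and the mixture at a single TF point according to (4.1)-(4.3).

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 2, 3  # channels, sources

def sample_circular_gaussian(cov, rng):
    # One draw from a zero-mean circular complex Gaussian N_c(0, cov);
    # cov must be Hermitian positive definite.
    L = np.linalg.cholesky(cov)
    w = (rng.standard_normal(len(cov)) + 1j * rng.standard_normal(len(cov))) / np.sqrt(2)
    return L @ w

# Toy parameters at one TF point (f, n): spatial covariances R_j and variances v_j.
R, v = [], rng.random(J)
for j in range(J):
    B = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
    R.append(B @ B.conj().T + 1e-6 * np.eye(I))  # full-rank Hermitian PSD

# Source images y_j ~ N_c(0, R_j v_j), Eq. (4.2); the mixture is their sum, Eq. (4.1).
y = [sample_circular_gaussian(R[j] * v[j], rng) for j in range(J)]
x = sum(y)

# Equivalently, x ~ N_c(0, sum_j R_j v_j), Eq. (4.3):
Sigma_x = sum(R[j] * v[j] for j in range(J))
```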

For the sake of better understanding, we now give an interpretation of the spatial covariance matrix \(\mathbf{R}_{jfn}\) and relate it to the methods used for multichannel audio compression. For simplicity, and also since most audio recordings are stereo (i.e., two-channel mixtures), we consider the case \(I = 2\). The spatial covariance matrix \(\mathbf{R}_{jfn}\) is in general a full-rank positive definite Hermitian complex-valued matrix. An example of a spatial covariance matrix is represented in Fig. 4.1. Note that this is a rather "fake" (or incomplete) representation, since it is difficult to represent a 2-dimensional complex-valued covariance matrix on a 2-dimensional real plane.

Fig. 4.1

An illustration of a spatial covariance matrix \(\mathbf{R}_{jfn}\) in the 2-channel case (\(I = 2\)). Dropping the indices j, f and n, the eigendecomposition of the covariance matrix may be written as \(\mathbf{R} = \mathbf{U} \varvec{\Lambda } \mathbf{U}^H\), with \(\mathbf{U} = \left[ \mathbf{u}_1, \mathbf{u}_2 \right] \), \(\mathbf{u}_1, \mathbf{u}_2 \in {\mathbb C}^2\) being the eigenvectors and \(\varvec{\Lambda } = \mathrm{diag} \left( \left[ \lambda _1, \lambda _2 \right] \right) \), \(\lambda _1, \lambda _2 \in {\mathbb R}_+\) being the eigenvalues. This illustration is not fully complete, since a 2D complex-valued covariance matrix is represented on a 2D real plane

Since the spatial covariance matrix \(\mathbf{R}_{jfn}\) is complex-valued Hermitian, it can easily be shown that in the 2-dimensional case considered here it is uniquely encoded by only four real scalars. Indeed, its two diagonal entries are real and the two complex-valued off-diagonal entries are conjugate. These four real-valued parameters may be uniquely converted into the following, in a sense more meaningful, real-valued parameters:

  • Loudness,

  • ILD,

  • IPD,

  • Diffuseness, which can also be replaced by the interchannel coherence (IC) [23].

It is worth noting that the last three spatial parameters (ILD, IPD and IC) are also used for parametric coding of stereo audio [23]. This is somehow expected: models that are suitable for compression should also be suitable for source separation, since in both cases the models tend to reduce the redundancy in the signal.
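As an illustration, the following sketch converts a 2×2 Hermitian covariance into these four parameters using one common parametric-stereo convention (the exact formulas are our choice; definitions vary across the literature):

```python
import numpy as np

def stereo_spatial_params(R):
    """Convert a 2x2 Hermitian spatial covariance into a loudness-like energy,
    ILD (dB), IPD (radians) and interchannel coherence IC in [0, 1].
    One common convention; not taken verbatim from the chapter."""
    r11, r22 = R[0, 0].real, R[1, 1].real
    r12 = R[0, 1]
    loudness = r11 + r22                   # total energy across the two channels
    ild = 10.0 * np.log10(r11 / r22)       # interchannel level difference
    ipd = np.angle(r12)                    # interchannel phase difference
    ic = np.abs(r12) / np.sqrt(r11 * r22)  # coherence: 1 = point source, <1 = diffuse
    return loudness, ild, ipd, ic

# Example: a rank-1 (fully coherent) covariance yields IC = 1.
a = np.array([1.0, 0.5 * np.exp(1j * 0.3)])
print(stereo_spatial_params(np.outer(a, a.conj())))
```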

Finally, let us also stress that the LGM seems more general (and, thanks to its Gaussian formulation, more principled) than blind source separation (BSS) approaches based on ILD/IPD clustering [13, 24]. Indeed, the diffuseness or IC is not taken into account at all within the latter approaches.

4.3 Spectral Models

In this section we present and discuss spectral models used within various multichannel NMF approaches. These models include NMF models, NTF models and their extensions.

4.3.1 NMF Modeling of Each Source

NMF modeling of each source, which is usually referred to as multichannel NMF, consists in structuring the source variances \(v_{jfn}\) in (4.2) with an NMF structure, as in the single-channel NMF case (see Chap. 1):

$$\begin{aligned} v_{jfn} = \sum _{k=1}^{K_j} w_{jfk} h_{jkn}, \end{aligned}$$
(4.4)

where the source-dependent number of components \(K_j\) is usually smaller than both F and N, and the \(w_{jfk}\) and \(h_{jkn}\) are all nonnegative. By introducing the nonnegative matrices (i.e., matrices with nonnegative entries) \(\mathbf{V}_j = [v_{jfn}]_{f,n} \in \mathbb R_+^{F \times N}\), \(\mathbf{W}_j = [w_{jfk}]_{f,k} \in \mathbb R_+^{F \times K_j}\), and \(\mathbf{H}_j = [h_{jkn}]_{k,n} \in \mathbb R_+^{K_j \times N}\), (4.4) may be rewritten in matrix form as:

$$\begin{aligned} \mathbf{V}_j = \mathbf{W}_j \mathbf{H}_j. \end{aligned}$$
(4.5)

A visualization of these NMF spectral models is shown in Fig. 4.2.
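As a minimal sketch of the matrix form (4.5), with random placeholders standing in for learned factors and shapes of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K_j = 513, 200, 4  # frequency bins, time frames, components of source j

# Random placeholders standing in for the learned factors of one source j, Eq. (4.5).
W_j = rng.random((F, K_j))   # nonnegative spectral patterns (columns of W_j)
H_j = rng.random((K_j, N))   # nonnegative activations (rows of H_j)

V_j = W_j @ H_j              # F x N nonnegative variance matrix of rank at most K_j
assert V_j.shape == (F, N) and V_j.min() >= 0
```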

Fig. 4.2

A visualization of the spectral models of multichannel NMF. The source variances \(\mathbf{V}_j\) of each of the J (here \(J=3\)) sources are modeled with an NMF with \(K_j\) (here \(K_j = 2\)) components, i.e., decomposed as a sum of \(K_j\) rank-1 matrices (\(\mathbf{w}_{j,k}\) and \(\mathbf{h}_{j,k}\) are the columns and the rows of matrices \(\mathbf{W}_j\) and \(\mathbf{H}_j\), respectively)

This kind of spectral model was first introduced for multichannel source separation in [25, 26], though with more sophisticated NMF-like structures suitable for harmonic music instruments and with optimization criteria different from those we discuss in this chapter. Spectral models based on the usual NMF, exactly as in (4.5), were proposed in [12], and then extended/re-considered in many other works [8, 15,16,17, 27].

A very attractive property of this modeling is that any NMF or NMF-like structure based on the IS divergence, such as harmonic NMF [7], smooth NMF [2, 5] or excitation-filter NMF [28] (see also Chap. 1), may be incorporated easily and in a systematic manner within the framework. This was remarked and addressed in [8], where a general source separation framework allowing the specification of various spectral and spatial models for each individual source is proposed. The latter work is accompanied by a software package called the Flexible Audio Source Separation Toolbox (FASST) that implements all these possible model variants in a flexible way. Finally, let us note that many informed or user-assisted/guided audio source separation approaches were extended to the multichannel case within the same paradigm [15, 29].

4.3.2 Joint NTF Modeling of All Sources

One of the shortcomings of the multichannel NMF modeling presented in Sect. 4.3.1 is the following. While for single-channel NMF one needs to fix an appropriate number of components K or to determine this number automatically, which is not always easy (see, e.g., [30]), in multichannel NMF, as presented in Sect. 4.3.1, one needs to determine not only the total number of components \(K = \sum _{j=1}^J K_j\), but also the number of components \(K_j\) for each source, which may vary from one source to another. To overcome this problem, the following idea was introduced in [15] and then extended in other works [16, 17]. It is now assumed that, instead of representing each source with an individual NMF \(\{ \mathbf{W}_j, \mathbf{H}_j \}\), all the sources share the components of the same NMF \(\{ \mathbf{W}, \mathbf{H} \}\), where \(\mathbf{W} = [w_{fk}]_{f,k} \in \mathbb R_+^{F \times K}\) and \(\mathbf{H} = [h_{kn}]_{k,n} \in \mathbb R_+^{K \times N}\). Moreover, in order to specify the associations between the K NMF components and the J sources, a new \((J \times K)\) nonnegative matrix \(\mathbf{Q} = [q_{jk}]_{j,k} \in \mathbb R_+^{J \times K}\) is introduced, and the source variances \(v_{jfn}\) are now structured as:

$$\begin{aligned} v_{jfn} = \sum _{k=1}^{K} w_{fk} h_{kn} q_{jk}. \end{aligned}$$
(4.6)

Assuming the columns of \(\mathbf{Q}\) are normalized to sum to one (i.e., \(\sum _{j=1}^J q_{jk} = 1\)), which is always possible thanks to the scale ambiguity between the columns of \(\mathbf{Q}\) and those of, say, \(\mathbf{W}\) in (4.6), each \(q_{jk}\) represents the proportion in which component k is associated with source j.

By denoting by \(\mathbf{V} = \{v_{jfn}\}_{j,f,n}\) the 3-valence tensor of source variances, (4.6) may also be rewritten in tensor form as a sum of K rank-1 tensors:

$$\begin{aligned} \mathbf{V} = \sum _{k=1}^{K} \mathbf{w}_k \circ \mathbf{h}^T_k \circ \mathbf{q}_k, \end{aligned}$$
(4.7)

where "\(\circ \)" denotes the tensor outer product, \(\mathbf{w}_k\) and \(\mathbf{q}_k\) are the k-th columns of matrices \(\mathbf{W}\) and \(\mathbf{Q}\), respectively, and \(\mathbf{h}_k\) is the k-th row of matrix \(\mathbf{H}\). The tensor decomposition in (4.6) and (4.7) is called parallel factor analysis (PARAFAC) or canonical decomposition (CANDECOMP) [31]. A visualization of these NTF spectral models is shown in Fig. 4.3.
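A sketch of the shared-dictionary model (4.6)/(4.7) in NumPy (shapes and numbers are ours); the last two lines illustrate how hard-assigning the columns of \(\mathbf{Q}\) recovers the per-source NMF model (4.4):

```python
import numpy as np

rng = np.random.default_rng(0)
J, F, N, K = 3, 513, 200, 6  # sources, frequency bins, frames, shared components

W = rng.random((F, K))
H = rng.random((K, N))
Q = rng.random((J, K))
Q /= Q.sum(axis=0, keepdims=True)   # normalize columns so that sum_j q_jk = 1

# Eq. (4.6)/(4.7): v_jfn = sum_k w_fk h_kn q_jk, a sum of K rank-1 3-valence tensors.
V = np.einsum('fk,kn,jk->jfn', W, H, Q)
assert V.shape == (J, F, N)

# Hard-assigning each column of Q to a single source recovers the NMF model (4.4).
Q_hard = np.zeros_like(Q)
Q_hard[rng.integers(0, J, size=K), np.arange(K)] = 1.0
```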

We here call this model multichannel NTF, as introduced in [15], though some authors [16, 17] continue to call it multichannel NMF. Note also that fully nonnegative NTF modeling [10,11,12] was applied to multichannel audio source separation as well. Those approaches apply an NTF decomposition directly to the nonnegative tensor of power spectrograms of the multichannel mixture, while here it is applied to the latent nonnegative tensor of power spectrograms of the sources, and the overall modeling is not fully nonnegative, as mentioned in the introduction.

One can easily note that the NTF decomposition (4.6) generalizes that of (4.4). Indeed, (4.6) reduces to (4.4) by setting, in each column of \(\mathbf{Q}\), all entries to 0 except one that is set to 1, and by keeping \(\mathbf{Q}\) fixed. Finally, multichannel NTF modeling has the following potential advantages over multichannel NMF modeling:

  • One does not need to specify in advance the number of components \(K_j\) for each source, but only the total number of components K. The components are then allocated to sources automatically via the matrix \(\mathbf{Q}\), which may also be more optimal than a manual user-specified allocation.

  • Some components may be shared between different sources, which makes the modeling more compact. This happens when a column of matrix \(\mathbf{Q}\) has more than one non-zero entry.

It should be noted, however, that the matrix \(\mathbf{Q}\) should desirably be quite sparse, i.e., only few columns of \(\mathbf{Q}\) should have more than one non-zero entry. Otherwise, the components are not well allocated between the sources, which may harm the separation result. Thus, it may be desirable to add a sparsity-inducing penalty on \(\mathbf{Q}\) to the corresponding optimization criterion.

Fig. 4.3

A visualization of the spectral models of multichannel NTF. The source variances \(\mathbf{V}_j\) are stacked into a common 3-valence tensor \(\mathbf{V}\) modeled with the PARAFAC model [31] with K (here \(K = 6\)) components, i.e., decomposed as a sum of K rank-1 3-valence tensors

4.4 Spatial Models and Constraints

The spatial covariance \(\mathbf{R}_{jfn}\) might be assumed fully unconstrained, though in that case, as already mentioned in Sect. 4.2, the parameter estimation would certainly lead to data overfitting, since there are more parameters than observations, i.e., than STFT coefficients of the multichannel mixture. To cope with that, it is necessary to introduce some constraints on the spatial covariances.

First of all, when the sources are static, it is reasonable to assume that the spatial covariances are time-invariant, i.e., that \(\mathbf{R}_{jfn} = \mathbf{R}_{jf}\) is independent of n. This assumption is made in many approaches [8, 12, 16,17,18] and it greatly reduces the number of free parameters to be estimated. We assume the time-invariant case within this section; the time-varying case is briefly discussed at the end.

On top of the time-invariance, additional constraints may be introduced as well; most often this is achieved either by imposing some particular structure or via probabilistic priors.

The early works [12, 19, 20] constrain the spatial covariance \(\mathbf{R}_{jf}\) further and assume that the rank of this matrix is one, which is referred to as a rank-1 spatial covariance. This was introduced based on the following reasoning. Let us assume that the mixture (4.1) is a convolutive mixture of J point sources. In that case the spatial images \(\mathbf{y}_{jfn}\) in (4.1) may be approximated as [32]

$$\begin{aligned} \mathbf{y}_{jfn} = \mathbf{a}_{jf} s_{jfn}, \end{aligned}$$
(4.8)

where \(s_{jfn} \in \mathbb C\) are the STFT coefficients of the point sources and \(\mathbf{a}_{jf} = \left[ a_{1jf}, \ldots ,\right. \left. a_{Ijf} \right] ^T \in {\mathbb C}^I\) are the channel-wise vectors of discrete Fourier transforms (DFTs) of the impulse responses of the convolutive mixing filters. The equality in (4.8) in fact holds only approximately, and the approximation becomes more accurate as the mixing filter impulse responses become comparable in length to, or shorter than, the STFT analysis window [32]. This approximation is referred to as the narrowband approximation. Assuming now that each source STFT coefficient \(s_{jfn}\) follows a zero-mean Gaussian distribution with variance \(v_{jfn}\), one can easily show that the source images \(\mathbf{y}_{jfn}\) are distributed as in (4.2) with

$$\begin{aligned} \mathbf{R}_{jf} = \mathbf{a}_{jf} \mathbf{a}_{jf}^H. \end{aligned}$$
(4.9)

We see that the spatial covariance \(\mathbf{R}_{jf}\) in (4.9) is indeed a rank-1 matrix.

It was proposed in [18] not to constrain the spatial covariance \(\mathbf{R}_{jf}\), or to parametrize it in a different way (see [18] for details), but in both cases so that the matrix remains full-rank. This modeling, referred to as a full-rank spatial covariance, allows going beyond the limits of the narrowband approximation (4.8); thus it is more suitable than the rank-1 model in the case of long reverberation times. It may also be more suitable when the point-source assumption is not fully verified. Indeed, as explained in Sect. 4.7.2 below, modeling a source image with a full-rank model can be recast as a sum of I point sources with different rank-1 spatial covariances and a shared spectral variance.
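The difference between the two structures can be checked numerically; here is a short sketch with toy values of our choosing (the factorization at the end anticipates Eq. (4.29) used by the SSEM/MU algorithm of Sect. 4.7.2):

```python
import numpy as np

rng = np.random.default_rng(0)
I = 2

# Rank-1 model, Eq. (4.9): under the narrowband approximation R_jf = a a^H.
a = rng.standard_normal(I) + 1j * rng.standard_normal(I)
R_rank1 = np.outer(a, a.conj())
assert np.linalg.matrix_rank(R_rank1) == 1

# Full-rank model [18]: any Hermitian positive definite R_jf is admissible, and it
# can always be (non-uniquely) factored as R = A A^H, e.g. via Cholesky.
B = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
R_full = B @ B.conj().T + 1e-9 * np.eye(I)
A_factor = np.linalg.cholesky(R_full)
assert np.allclose(A_factor @ A_factor.conj().T, R_full)
```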

Another approach [17] consists in assuming that the spatial covariance is a weighted sum of so-called direction of arrival (DOA) kernels, which are rank-1 spatial covariances modeling plane waves coming from several predefined directions. These directions may be specified in the 2D plane or in 3D space (see Fig. 4.4 for a 2D example). The rank-1 DOA kernels corresponding to these directions \(\theta _l\) (\(l = 1, \ldots , L\)) are then defined as

$$\begin{aligned} \mathbf{K}_{fl} = \mathbf{d}(f,\theta _l) \mathbf{d}(f,\theta _l)^H \end{aligned}$$
(4.10)

with \(\mathbf{d}(f,\theta _l)\) being a relative steering vector for the direction \(\theta _l\) defined as

$$\begin{aligned} \mathbf{d}(f,\theta _l) = \left[ 1, e^{-\imath 2 \pi \nu _f \tau _{2,1}(\theta _l)}, \ldots , e^{-\imath 2 \pi \nu _f \tau _{I,1}(\theta _l)} \right] ^T, \end{aligned}$$
(4.11)

where \(\nu _f\) is the frequency (in Hz) corresponding to the frequency bin f, and \(\tau _{i,i'}(\theta _l)\) is the time difference of arrival (TDOA) (in seconds) between microphones i and \(i'\) for a plane wave coming from the direction \(\theta _l\). Note that this relative steering vector is defined taking into account only the IPDs, not the ILDs (see [33] for a definition taking the ILDs into account as well). Finally, the spatial covariance is defined as a weighted sum of the DOA kernels \(\mathbf{K}_{fl}\) from (4.10) as

$$\begin{aligned} \mathbf{R}_{jf} = \sum _{l=1}^L z_{jl} \mathbf{K}_{fl}, \end{aligned}$$
(4.12)

with \(z_{jl}\) being nonnegative weights.
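A sketch of Eqs. (4.10)-(4.12) for a toy two-microphone array (the geometry, numbers and function names are ours):

```python
import numpy as np

def steering_vector(nu_f, tdoa):
    """Relative steering vector d(f, theta) of Eq. (4.11): unit-modulus phase
    terms built from the TDOAs tdoa[i] (in seconds) of each microphone
    relative to microphone 1, so tdoa[0] = 0."""
    return np.exp(-1j * 2 * np.pi * nu_f * np.asarray(tdoa))

def doa_spatial_cov(nu_f, tdoas_per_direction, z_j):
    """Eqs. (4.10) and (4.12): R_jf = sum_l z_jl K_fl with K_fl = d d^H."""
    R = 0.0
    for tdoa_l, z_jl in zip(tdoas_per_direction, z_j):
        d = steering_vector(nu_f, tdoa_l)
        R = R + z_jl * np.outer(d, d.conj())   # rank-1 DOA kernel K_fl
    return R

# Toy far-field stereo setup (our numbers): two microphones 10 cm apart,
# c = 343 m/s, L = 3 candidate directions; the weights z_jl would be estimated.
c, spacing = 343.0, 0.10
tdoas = [[0.0, spacing * np.cos(theta) / c] for theta in np.deg2rad([0, 45, 90])]
R_jf = doa_spatial_cov(nu_f=1000.0, tdoas_per_direction=tdoas, z_j=[0.1, 0.8, 0.1])
```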

Fig. 4.4

Example of a set of predefined directions in the 2D plane for a given microphone array

If the DOAs of all or some of the sources are known to some extent, it is possible to introduce this information, for example, via prior distributions on the spatial covariances. In [34] those priors are defined via inverse Wishart distributions as follows

$$\begin{aligned} p \left( \mathbf{R}_{jf} | \varvec{\varPsi }_{jf}, m \right) = \frac{|\varvec{\varPsi }_{jf}|^m |\mathbf{R}_{jf}|^{-(m+I)} e^{-\mathrm{tr} \left[ \varvec{\varPsi }_{jf}{} \mathbf{R}^{-1}_{jf} \right] }}{\pi ^{I(I-1)/2} \prod _{i=1}^I \Gamma (m-i+1)}, \end{aligned}$$
(4.13)

with

$$\begin{aligned} \varvec{\varPsi }_{jf} = (m - I) \left( \mathbf{d}(f,\theta _j) \mathbf{d}(f,\theta _j)^H + \sigma ^2_\mathrm{rev} \varvec{\varOmega }_f \right) , \end{aligned}$$
(4.14)

where \(\mathbf{d}(f,\theta _j)\) is a steering vector for the (approximately known) DOA \(\theta _j\) of the j-th source, which may be defined as in (4.11); \(\varvec{\varOmega }_f = \left[ \sin (2 \pi \nu _f q_{ii'} / c) / (2 \pi \nu _f q_{ii'} / c) \right] _{ii'}\) is a matrix modeling the reverberant (i.e., non-direct) part of the impulse response, with \(q_{ii'}\) being the distance between microphones i and \(i'\) and c the speed of sound (343 m/s); and \(\sigma ^2_\mathrm{rev}\) is a positive constant depending on the amount of reverberation relative to the direct part of the impulse response.
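A sketch of the scale matrix (4.14) (our transcription; the comment about the prior mean assumes the standard inverse-Wishart mean formula \(\varvec{\varPsi }/(m-I)\), valid for \(m > I\)):

```python
import numpy as np

def diffuse_coherence(nu_f, mic_dists, c=343.0):
    """Omega_f of Eq. (4.14): entries sin(2 pi nu_f q_ii' / c) / (2 pi nu_f q_ii' / c),
    a diffuse-field coherence model; diagonal entries (q_ii = 0) equal 1."""
    x = 2 * np.pi * nu_f * np.asarray(mic_dists) / c
    return np.sinc(x / np.pi)          # np.sinc(t) = sin(pi t) / (pi t)

def wishart_scale(d_f, nu_f, mic_dists, m, sigma2_rev):
    """Psi_jf of Eq. (4.14). For m > I the inverse-Wishart prior mean is
    Psi_jf / (m - I), i.e. exactly the direct-plus-diffuse term in parentheses."""
    I = len(d_f)
    assert m > I
    return (m - I) * (np.outer(d_f, d_f.conj())
                      + sigma2_rev * diffuse_coherence(nu_f, mic_dists))

# Toy stereo example (our numbers): mics 5 cm apart, a DOA giving a 0.1 ms TDOA.
q = np.array([[0.0, 0.05], [0.05, 0.0]])
d = np.array([1.0, np.exp(-1j * 2 * np.pi * 1000.0 * 1e-4)])
Psi = wishart_scale(d, nu_f=1000.0, mic_dists=q, m=4, sigma2_rev=0.3)
```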

There are also other models that do not fall into the LGM framework as formulated here. These include, for example, the multichannel high-resolution NMF (HR-NMF) [35] or a method where the source variance prior parametrization is factorized by NMF [36].

Finally, several approaches [37,38,39] address the time-varying case, where \(\mathbf{R}_{jfn}\) is no longer independent of n, though still constrained in various ways.

4.5 Main Steps and Sources Estimation

Let us denote by \(\varvec{\theta } = \{ \mathbf{R}_{jfn}, v_{jfn} \}_{j,f,n}\) the whole set of model parameters, assuming that some of the constraints overviewed in Sects. 4.3 and 4.4 hold. Once a model \(\varvec{\theta }\) is specified and an estimation criterion (see Sect. 4.6 below) is chosen, most LGM-based approaches proceed via the following main steps:

  1.

    The STFT \(\mathbf{X}\) of the multichannel mixture signal is computed.

  2.

    The model is estimated with an algorithm (see Sect. 4.7 below) optimizing the chosen criterion.

  3.

    The source images are estimated in the STFT domain via Wiener filtering as:

    $$\begin{aligned} \hat{\mathbf{y}}_{jfn} = \mathbf{R}_{jfn} v_{jfn} \left[ \sum _{j=1}^J \mathbf{R}_{jfn} v_{jfn} \right] ^{-1} \mathbf{x}_{fn}, \end{aligned}$$
    (4.15)

    where \(\mathbf{R}_{jfn}\) and \(v_{jfn}\) are the spatial covariances and spectral variances as specified in (4.2) (a code sketch of this filtering step is given after this list).

  4.

    The time-domain source images are then reconstructed by applying the inverse STFT to \(\widehat{\mathbf{Y}} = \{ \hat{\mathbf{y}}_{jfn} \}_{j,f,n}\).
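A NumPy sketch of the Wiener filtering step 3 at a single TF point (shapes and names are ours; in practice one loops, or vectorizes, over all (f, n)):

```python
import numpy as np

def wiener_source_images(x_fn, R, v):
    """Multichannel Wiener filter of Eq. (4.15) at one TF point (f, n).
    x_fn: (I,) mixture STFT vector; R: (J, I, I) spatial covariances R_jfn;
    v: (J,) spectral variances v_jfn. Returns the (J, I) source image estimates."""
    Sigma_j = R * v[:, None, None]        # per-source covariances R_jfn v_jfn
    Sigma_x = Sigma_j.sum(axis=0)         # mixture covariance, as in (4.3)
    z = np.linalg.solve(Sigma_x, x_fn)    # Sigma_x^{-1} x, without an explicit inverse
    y_hat = Sigma_j @ z                   # row j is R_jfn v_jfn Sigma_x^{-1} x
    # The filter is conservative: the estimated images sum back to the mixture.
    assert np.allclose(y_hat.sum(axis=0), x_fn)
    return y_hat
```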

In online approaches [40, 41], where the separation must be performed for every new frame, the same steps are repeated for each frame and the model estimation algorithm is modified so as to update the model parameters in an incremental and causal manner (i.e., using only the past and current frames).

4.6 Model Estimation Criteria

In order to estimate the model parameters \(\varvec{\theta }\) from the observed data, i.e., from the STFT \(\mathbf{X}\) of the multichannel mixture signal, one needs to specify a model estimation criterion.

4.6.1 Maximum Likelihood

One of the most popular choices for model estimation is the maximum likelihood (ML) criterion, which writes

$$\begin{aligned} \varvec{\theta } = \mathrm{arg} \max _{\varvec{\theta }'} p (\mathbf{X} | \varvec{\theta }'). \end{aligned}$$
(4.16)

In the case of the LGM modeling (4.2), this criterion can be shown [16] to be equivalent to minimizing the following cost function:

$$\begin{aligned} C_\mathrm{IS}(\varvec{\theta }) = \sum _{f,n=1}^{F,N} \mathrm{tr} \left( \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \varvec{\Sigma }^{-1}_{\mathbf{x},fn} \right) - \log \mathrm{det} \left( \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \varvec{\Sigma }^{-1}_{\mathbf{x},fn} \right) - I, \end{aligned}$$
(4.17)

where

$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} = \mathbf{x}_{fn} \mathbf{x}^H_{fn} \qquad \text{ and } \qquad \varvec{\Sigma }_{\mathbf{x},fn} = \sum _{j=1}^J \mathbf{R}_{jfn} v_{jfn}. \end{aligned}$$
(4.18)

Note that the cost (4.17) is not well defined (i.e., its value is infinite) when \(I > 1\) and the matrices \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\) are not full-rank, which is the case with the definition (4.18). However, this is not a problem per se. Indeed, the infinite term \(- \log \mathrm{det} \left( \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \right) \) is independent of \(\varvec{\theta }\) and can simply be removed from the cost (4.17), since it has no influence on the optimization over \(\varvec{\theta }\). Alternatively, a small regularization term may be added to \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\) to make it full-rank. There also exist alternative definitions of \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\) [8, 42] that may be full-rank by construction.

The formulation with the cost (4.17) is interesting since, as one can note, it generalizes the IS-NMF cost of the single-channel case (see Chap. 1). Indeed, \(C_\mathrm{IS}(\varvec{\theta })\) reduces to the single-channel IS divergence when \(I = 1\).
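A direct NumPy transcription of (4.17), batched over all TF points (shapes are ours; the eps regularization implements the remark above about the rank-1 \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\)):

```python
import numpy as np

def lgm_ml_cost(Sigma_hat, Sigma, eps=1e-9):
    """Cost C_IS of Eq. (4.17), summed over all TF points.
    Sigma_hat, Sigma: (F, N, I, I) observed and model covariances.
    A small eps * identity makes the log-det term finite for the rank-1
    definition (4.18); as noted above, this term is constant in theta."""
    n_chan = Sigma.shape[-1]
    Sigma_hat = Sigma_hat + eps * np.eye(n_chan)
    M = np.linalg.solve(Sigma, Sigma_hat)         # Sigma^{-1} Sigma_hat per TF point
    trace = np.trace(M, axis1=-2, axis2=-1).real  # = tr(Sigma_hat Sigma^{-1})
    logdet = np.linalg.slogdet(M)[1]              # = log det(Sigma_hat Sigma^{-1})
    return float(np.sum(trace - logdet - n_chan))
```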

4.6.2 Maximum a Posteriori

When a prior distribution \(p (\varvec{\theta })\) on the model parameters is specified, as for example the spatial covariance prior in (4.13), the maximum a posteriori (MAP) criterion is usually used instead of the ML criterion. It writes

$$\begin{aligned} \varvec{\theta } = \mathrm{arg} \max _{\varvec{\theta }'} p (\varvec{\theta }' | \mathbf{X}) = \mathrm{arg} \max _{\varvec{\theta }'} p (\mathbf{X} | \varvec{\theta }') p (\varvec{\theta }'). \end{aligned}$$
(4.19)

Note that in the case of the prior in (4.13) we have \(p (\varvec{\theta }) = \prod _{f=1}^F p \left( \mathbf{R}_{jf} | \varvec{\varPsi }_{jf}, m \right) ^N\): the prior on the time-invariant \(\mathbf{R}_{jf}\) is applied at each time-frequency bin, hence the power N.

Rewriting (4.19) in a form similar to (4.17) simply results in adding a \(- \log p (\varvec{\theta })\) term to (4.17).

4.6.3 Other Criteria

Several other criteria were proposed as well. For example, we have seen that the ML criterion formulated as in (4.17) generalizes single-channel IS NMF to the multichannel case; in the same spirit, it was proposed in [16] to generalize single-channel NMF with the Euclidean distance (EUC NMF) to the multichannel case. This is achieved by replacing the cost function (4.17) with the following one

$$\begin{aligned} C_\mathrm{FRB}(\varvec{\theta }) = \sum _{f,n=1}^{F,N} \left\| \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} - \varvec{\Sigma }_{\mathbf{x},fn} \right\| _F^2, \end{aligned}$$
(4.20)

where \(\left\| \mathbf{A} \right\| _F\) denotes the Frobenius norm of a matrix \(\mathbf{A}\), and the data covariance matrix \(\widehat{\varvec{\Sigma }}_{\mathbf{x},fn}\) is defined slightly differently from (4.18). Notably, it is defined as [16, 17]

$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} = \sqrt{ \left| \mathbf{x}_{fn} \mathbf{x}^H_{fn} \right| } \times \mathrm{sign} \left( \mathbf{x}_{fn} \mathbf{x}^H_{fn} \right) , \end{aligned}$$
(4.21)

where all the operations, i.e., the absolute value \(\left| \cdot \right| \), the square root \(\sqrt{\cdot }\), the multiplication \(\times \) and the sign (\(\mathrm{sign} \left( a \right) = a / |a|\)), are applied element-wise to the corresponding matrices.
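In NumPy, (4.20)-(4.21) can be sketched as follows (the guard for zero-magnitude entries is our addition):

```python
import numpy as np

def signed_sqrt_cov(x_fn):
    """Data covariance of Eq. (4.21): element-wise square root of the magnitudes
    of x x^H, each entry keeping its original phase (the element-wise 'sign')."""
    C = np.outer(x_fn, x_fn.conj())
    mag = np.abs(C)
    sign = np.divide(C, mag, out=np.ones_like(C), where=mag > 0)
    return np.sqrt(mag) * sign

def euc_cost(Sigma_hat, Sigma):
    """Frobenius cost of Eq. (4.20) at one TF point."""
    return float(np.sum(np.abs(Sigma_hat - Sigma) ** 2))
```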

There is also the variational Bayes (VB) criterion [43], which consists in computing directly the posterior distribution of the source STFT coefficients while marginalizing over all possible model parameters.

4.7 Model Estimation Algorithms

There exist several model parameter estimation algorithms [8, 16]. However, due to the probabilistic formulation of the LGM model (4.2), the expectation-maximization (EM) algorithm [44] is one of the most popular choices. As we will see below, the use of the EM algorithm results not in just one algorithm but in a whole family of algorithms: each particular implementation of the EM algorithm depends on several choices, as explained below. Because of the popularity of EM, we mostly concentrate here on its different variants and only briefly mention other algorithms.

To present the variants of the EM algorithm, we consider the LGM model (4.2) with time-invariant unconstrained full-rank spatial covariances \(\mathbf{R}_{jf}\) and spectral variances \(v_{jfn}\) structured with the NTF model (4.6). This is in fact a variant of multichannel NTF similar to the one described in [15], but with full-rank covariances instead of the rank-1 covariances of [15]. Since no probabilistic priors on the parameters are assumed, the variants of the EM algorithm presented below optimize the ML criterion (4.16).

4.7.1 Variants of EM Algorithm

In one of its general formulations, the EM algorithm [44] for optimizing the ML criterion (4.16) consists first in specifying

  • the so-called observed data \(\mathbf{X}\), which in the multichannel source separation case considered here are usually the multichannel mixture STFT coefficients, and

  • the so-called latent data \(\mathbf{Z}\). The choice of the latent data may vary, and different choices lead to different EM variants.

Assuming that a probabilistic model parametrized by \(\varvec{\theta }\) is specified, the EM algorithm is usually applied in the following situation: it is difficult to maximize in closed form the ML criterion (4.16), i.e., \(\log p (\mathbf{X} | \varvec{\theta })\), while it is easy to maximize, in closed form or via some simplified iterative procedure, the log-likelihood \(\log p (\mathbf{X}, \mathbf{Z} | \varvec{\theta })\) of the so-called complete data \(\{ \mathbf{X}, \mathbf{Z} \}\). The choice of the latent data \(\mathbf{Z}\) is usually made accordingly.

The EM algorithm consists then in iterating the following two steps:

  • E-step: Compute an auxiliary function as follows:

    $$\begin{aligned} Q(\varvec{\theta }, \varvec{\theta }^{(\ell )}) = \mathbb E_{\mathbf{Z}| \mathbf{X}, \varvec{\theta }^{(\ell )}} \log p (\mathbf{X}, \mathbf{Z} | \varvec{\theta }). \end{aligned}$$
    (4.22)
  • M-step: Optimize the auxiliary function to update model parameters according to the following criterion:

    $$\begin{aligned} \varvec{\theta }^{(\ell +1)} = \mathrm{arg} \max _{\varvec{\theta }} Q(\varvec{\theta }, \varvec{\theta }^{(\ell )}), \end{aligned}$$
    (4.23)

    where \(\varvec{\theta }^{(\ell )}\) denotes the model parameters estimated at the \(\ell \)-th iteration.

It is often possible to optimize the criterion (4.23) in closed form. However, sometimes, depending on the choice of the latent data \(\mathbf{Z}\), it is not. In that case either another iterative optimization algorithm may be applied, or any algorithm can be used, provided that at each EM iteration it ensures the following non-decrease of the auxiliary function:

$$\begin{aligned} Q(\varvec{\theta }^{(\ell +1)}, \varvec{\theta }^{(\ell )}) \ge Q(\varvec{\theta }^{(\ell )}, \varvec{\theta }^{(\ell )}). \end{aligned}$$
(4.24)

In the latter case the algorithm is called generalized EM (GEM) [44], and the ways the optimization (4.24) is performed lead again to different variants of the algorithm.

To summarize, let us list the various choices that lead to different variants of the EM algorithm and thus to different model parameter estimates. These choices include:

  1.

    Choice of latent data \(\mathbf{Z}\), for example:

    • Latent data consist of NMF/NTF components [12] defined as

      $$\begin{aligned} c_{kjfn} \sim \mathscr {N}_c(0, w_{jfk} h_{jkn}), \qquad k = 1, \ldots , K_j \end{aligned}$$
      (4.25)

      in case of NMF spectral model (4.4), or as

      $$\begin{aligned} c_{kjfn} \sim \mathscr {N}_c(0, w_{fk} h_{kn} q_{jk}), \qquad k = 1, \ldots , K \end{aligned}$$
      (4.26)

      in case of NTF spectral model (4.6).

    • Latent data consist of so-called sub-sources [8] (see Sect. 4.7.2 below).

    • Latent data consist of point sources [15] \(s_{jfn}\) as in the narrowband approximation (4.8).

    • Latent data consist of spatial source images [27] \(\mathbf{y}_{jfn}\) as in (4.2).

    • Latent data consist of binary TF activations of the predominant source (see, e.g., [45] for details).

  2.

    Choice of the maximization-step updates in the case of a GEM algorithm, for example:

    • Closed-form updates in case of EM algorithm.

    • Alternating closed-form updates over subsets of parameters [27] (each subset of parameters is updated by a closed-form update, while the other parameters are fixed).

    • Multiplicative update (MU) rules [5] to update NMF/NTF spectral model parameters [8].

  3.

    Choice of initial parameters \(\varvec{\theta }^{(0)}\), for example:

    • Random parameter initialization [8].

    • Parameter initialization using the source separation results obtained by a different algorithm [12].

  4.

    Choice of number of EM algorithm iterations, for example:

    • Fixed number of iterations (the most common choice).

    • Iterating until some stopping criterion depending on the likelihood value is satisfied.

The so-called spatial image EM (SIEM) algorithm, where the latent data are the spatial source images, is given in detail in Chap. 7. In the following section we present in detail the so-called sub-source EM algorithm based on MU rules (SSEM/MU) [8], where the latent data are the sub-sources and MU rules are used for the NTF spectral model parameter updates within the M-step. Other variants of the EM and GEM algorithms may be found in the corresponding papers.

4.7.2 Detailed Presentation of SSEM/MU Algorithm

Recall that our model consists of time-invariant unconstrained full-rank spatial covariances \(\mathbf{R}_{jf}\) and spectral variances \(v_{jfn}\) structured with the NTF model (4.6). Thus, it can be parametrized as

$$\begin{aligned} \varvec{\theta } = \left\{ \{ \mathbf{R}_{jf} \}_{j,f}, \mathbf{Q}, \mathbf{W}, \mathbf{H} \right\} , \end{aligned}$$
(4.27)

with nonnegative matrices \(\mathbf{Q}\), \(\mathbf{W}\) and \(\mathbf{H}\) specified in Sect. 4.3.2.

The SSEM/MU algorithm presented below is a particular case of a more general algorithm from [8], though applied to a slightly different model (here the spectral variances are structured with the NTF model, while in [8] they are structured with the NMF model).

Each spatial \(I \times I\) covariance \(\mathbf{R}_{jf}\) being full-rank, its rank equals I. For each source j we introduce I so-called point sub-sources \(s_{ji,fn} \in \mathbb C\) (\(i = 1, \ldots , I\)) that share the same spectral variance \(v_{jfn}\); in other words, they are distributed as

$$\begin{aligned} s_{ji,fn} \sim \mathscr {N}_c(0, v_{jfn}). \end{aligned}$$
(4.28)

Moreover, each spatial covariance \(\mathbf{R}_{jf}\) can be non-uniquely represented as

$$\begin{aligned} \mathbf{R}_{jf} = \mathbf{A}_{jf} \mathbf{A}_{jf}^H, \end{aligned}$$
(4.29)

where \(\mathbf{A}_{jf}\) is an \(I \times I\) complex-valued matrix. By introducing a \(J \, I\)-length vector

$$\begin{aligned} \mathbf{s}_{fn} = \left[ s_{11,fn}, \ldots , s_{1I,fn}, s_{21,fn}, \ldots , s_{2I,fn}, \,\cdots \,, s_{J1,fn}, \ldots , s_{JI,fn} \right] ^T, \end{aligned}$$
(4.30)

and an \(I \times J \, I\) matrix

$$\begin{aligned} \mathbf{A}_{f} = \left[ \mathbf{A}_{1f}, \mathbf{A}_{2f}, \ldots , \mathbf{A}_{Jf} \right] , \end{aligned}$$
(4.31)

one can show [8] that the LGM modeling (4.3) is equivalent (up to the noise term \(\mathbf{b}_{fn}\)) to

$$\begin{aligned} \mathbf{x}_{fn} = \mathbf{A}_{f} \mathbf{s}_{fn} + \mathbf{b}_{fn}, \end{aligned}$$
(4.32)

with the \(s_{ji,fn}\) (the components of \(\mathbf{s}_{fn}\)) being mutually independent and distributed as in (4.28), and the noise term \(\mathbf{b}_{fn}\) being distributed as

$$\begin{aligned} \mathbf{b}_{fn} \sim \mathscr {N}_c(0, \varvec{\Sigma }_{\mathbf{b},fn}), \end{aligned}$$
(4.33)

with an isotropic covariance matrix \(\varvec{\Sigma }_{\mathbf{b},fn} = \sigma ^2_{\mathbf{b},f} \mathbf{I}_I\). The noise term \(\mathbf{b}_{fn}\) is needed for a so-called simulated annealing procedure that is necessary in this case (see [12] for details), where the noise variance \(\sigma ^2_{\mathbf{b},f}\) is usually decreased over the algorithm iterations.

Let us now compute the auxiliary function \(Q(\varvec{\theta }, \varvec{\theta }^{(\ell )})\) defined in (4.22). Below we sometimes omit the indexing of parameters with \((\ell )\); it will be clear from the context which parameters were estimated at the previous step and which are to be updated at the current step. The log-likelihood of the complete data \(\{ \mathbf{X}, \mathbf{Z} \}\) writes

$$\begin{aligned} \log p&(\mathbf{X}, \mathbf{Z} | \varvec{\theta }) = \log p (\mathbf{X} | \mathbf{Z}, \varvec{\theta }) + \log p (\mathbf{Z} | \varvec{\theta }) \nonumber \\&\overset{\mathrm{c}}{=} - \sum _{f,n} \mathrm{tr} \left[ \varvec{\Sigma }_{\mathbf{b},fn}^{-1} \left( \varvec{\Sigma }_{\mathbf{x},fn} - \mathbf{A}_{f} \varvec{\Sigma }^H_{\mathbf{xs},fn} - \varvec{\Sigma }_{\mathbf{xs},fn} \mathbf{A}^H_{f} + \mathbf{A}_{f} \varvec{\Sigma }_{\mathbf{s},fn} \mathbf{A}^H_{f} \right) \right] \nonumber \\&\qquad \qquad \qquad \qquad \qquad - \sum _{f,n} \log \left| \varvec{\Sigma }_{\mathbf{b},fn} \right| - I \sum _{j,f,n} d_{IS} (\xi _{jfn} | v_{jfn}), \end{aligned}$$
(4.34)

where

$$\begin{aligned} \varvec{\Sigma }_{\mathbf{x},fn} = \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} = \mathbf{x}_{fn} \mathbf{x}^H_{fn} \end{aligned}$$
(4.35)

is computed as in (4.18),

$$\begin{aligned} \varvec{\Sigma }_{\mathbf{xs},fn}= & {} \mathbf{x}_{fn} \mathbf{s}^H_{fn}, \end{aligned}$$
(4.36)
$$\begin{aligned} \varvec{\Sigma }_{\mathbf{s},fn}= & {} \mathbf{s}_{fn} \mathbf{s}^H_{fn}, \end{aligned}$$
(4.37)
$$\begin{aligned} \xi _{jfn}= & {} \frac{1}{I} \sum _{i=1}^I |{s}_{ji,fn}|^2, \end{aligned}$$
(4.38)

and \(d_{IS} (x | y) = \frac{x}{y} - \log \frac{x}{y} - 1\) is the scalar IS divergence (see Chap. 1).

By applying the conditional expectation operator \(\mathbb E_{\mathbf{S}| \mathbf{X}, \varvec{\theta }^{(\ell )}} \left[ \cdot \right] \), the auxiliary function \(Q(\varvec{\theta }, \varvec{\theta }^{(\ell )})\) then writes

$$\begin{aligned}&Q(\varvec{\theta }, \varvec{\theta }^{(\ell )}) \overset{\mathrm{c}}{=} - \sum _{f,n} \mathrm{tr} \left[ \varvec{\Sigma }_{\mathbf{b},fn}^{-1} \left( \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} - \mathbf{A}_{f} \widehat{\varvec{\Sigma }}^H_{\mathbf{xs},fn} - \widehat{\varvec{\Sigma }}_{\mathbf{xs},fn} \mathbf{A}^H_{f} + \mathbf{A}_{f} \widehat{\varvec{\Sigma }}_{\mathbf{s},fn} \mathbf{A}^H_{f} \right) \right] \nonumber \\&\qquad \qquad \qquad \qquad \qquad \qquad - \sum _{f,n} \log \left| \varvec{\Sigma }_{\mathbf{b},fn} \right| - I \sum _{j,f,n} d_{IS} (\hat{\xi }_{jfn} | v_{jfn}), \end{aligned}$$
(4.39)

with \(\widehat{\varvec{\Sigma }}_{\mathbf{xs},fn}\), \(\widehat{\varvec{\Sigma }}_{\mathbf{s},fn}\) and \(\hat{\xi }_{jfn}\) defined as

$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{xs},fn}= & {} \mathbb E_{\mathbf{S}| \mathbf{X}, \varvec{\theta }^{(\ell )}} \left[ \varvec{\Sigma }_{\mathbf{xs},fn} \right] , \end{aligned}$$
(4.40)
$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{s},fn}= & {} \mathbb E_{\mathbf{S}| \mathbf{X}, \varvec{\theta }^{(\ell )}} \left[ \varvec{\Sigma }_{\mathbf{s},fn} \right] , \end{aligned}$$
(4.41)
$$\begin{aligned} \hat{\xi }_{jfn}= & {} \mathbb E_{\mathbf{S}| \mathbf{X}, \varvec{\theta }^{(\ell )}} \left[ \xi _{jfn} \right] , \end{aligned}$$
(4.42)

and computed as follows:

$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{xs},fn}= & {} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \varvec{\varOmega }_{\mathbf{s},fn}^H, \end{aligned}$$
(4.43)
$$\begin{aligned} \widehat{\varvec{\Sigma }}_{\mathbf{s},fn}= & {} \varvec{\varOmega }_{\mathbf{s},fn} \widehat{\varvec{\Sigma }}_{\mathbf{x},fn} \varvec{\varOmega }_{\mathbf{s},fn}^H + (\mathbf{I}_{J \, I} - \varvec{\varOmega }_{\mathbf{s},fn} \mathbf{A}_{f}) \varvec{\Sigma }_{\mathbf{s},fn}, \end{aligned}$$
(4.44)
$$\begin{aligned} \hat{\xi }_{jfn}= & {} \frac{1}{I} \sum _{i = (j-1)I+1}^{j I} \widehat{\varvec{\Sigma }}_{\mathbf{s},fn}(i, i), \end{aligned}$$
(4.45)

where

$$\begin{aligned} \varvec{\varOmega }_{\mathbf{s},fn}= & {} \varvec{\Sigma }_{\mathbf{s},fn} \mathbf{A}_{f}^H \varvec{\Sigma }^{-1}_{\mathbf{x},fn}, \end{aligned}$$
(4.46)
$$\begin{aligned} \varvec{\Sigma }_{\mathbf{x},fn}= & {} \mathbf{A}_{f} \varvec{\Sigma }_{\mathbf{s},fn} \mathbf{A}_{f}^H + \varvec{\Sigma }_{\mathbf{b},fn}, \end{aligned}$$
(4.47)
$$\begin{aligned} \varvec{\Sigma }_{\mathbf{s},fn}= & {} \mathrm{diag} \left( [ \underbrace{v_{1,fn}, \ldots , v_{1,fn}}_{I\ \text {times}}, \underbrace{v_{2,fn}, \ldots , v_{2,fn}}_{I\ \text {times}}, \,\cdots \,, \underbrace{v_{J,fn}, \ldots , v_{J,fn}}_{I\ \text {times}} ] \right) . \end{aligned}$$
(4.48)

We now proceed with the M-step (4.23). Maximizing the auxiliary function (4.39) over \(\mathbf{A}_{f}\) leads to the following closed-form solution (note that, \(\mathbf{A}_{f}\) being time-invariant, the sufficient statistics are summed over n):

$$\begin{aligned} \mathbf{A}_{f} = \left( \sum _{n=1}^N \widehat{\varvec{\Sigma }}_{\mathbf{xs},fn} \right) \left( \sum _{n=1}^N \widehat{\varvec{\Sigma }}_{\mathbf{s},fn} \right) ^{-1}. \end{aligned}$$
(4.49)

Maximization of the auxiliary function (4.39) over \(\mathbf{Q}\), \(\mathbf{W}\) and \(\mathbf{H}\), i.e., the minimization of \(\sum _{j,f,n} d_{IS} (\hat{\xi }_{jfn} | v_{jfn})\) with \(v_{jfn}\) computed as in (4.6), does not admit a closed-form solution. As such, to update \(\mathbf{Q}\), \(\mathbf{W}\) and \(\mathbf{H}\), several iterations of the following MU rules [15] are applied:

$$\begin{aligned} q_{jk}\leftarrow & {} q_{jk} \left( \frac{\sum _{f,n} w_{fk} h_{kn} \hat{\xi }_{jfn} v_{jfn}^{-2}}{\sum _{f,n} w_{fk} h_{kn} v_{jfn}^{-1}} \right) , \end{aligned}$$
(4.50)
$$\begin{aligned} w_{fk}\leftarrow & {} w_{fk} \left( \frac{\sum _{j,n} h_{kn} q_{jk} \hat{\xi }_{jfn} v_{jfn}^{-2}}{\sum _{j,n} h_{kn} q_{jk} v_{jfn}^{-1}} \right) , \end{aligned}$$
(4.51)
$$\begin{aligned} h_{kn}\leftarrow & {} h_{kn} \left( \frac{\sum _{j,f} w_{fk} q_{jk} \hat{\xi }_{jfn} v_{jfn}^{-2}}{\sum _{j,f} w_{fk} q_{jk} v_{jfn}^{-1}} \right) . \end{aligned}$$
(4.52)

Applying these MU rules does not guarantee the maximization of the auxiliary function as in (4.23), but only its non-decrease as in (4.24). As such, this is in fact a GEM algorithm.

Algorithm 1 summarizes one iteration of the SSEM/MU algorithm derived above.

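For concreteness, the following NumPy sketch performs one such iteration (our transcription of Eqs. (4.43)-(4.52); the loop structure, shapes and number of inner MU iterations are our choices, and practical implementations such as FASST add renormalization and annealing schedules not shown here):

```python
import numpy as np

def ssem_mu_iteration(X, A, W, H, Q, sigma2_b, n_mu_iters=3, eps=1e-12):
    """One GEM iteration of the SSEM/MU algorithm.
    X: (F, N, I) mixture STFT; A: (F, I, J*I) stacked mixing matrices A_f;
    W: (F, K), H: (K, N), Q: (J, K) nonnegative NTF factors (updated in place);
    sigma2_b: (F,) annealing noise variances. Returns the updated A, W, H, Q."""
    F, N, I = X.shape
    J = Q.shape[0]

    v = np.einsum('fk,kn,jk->jfn', W, H, Q) + eps                  # Eq. (4.6)
    xi_hat = np.zeros((J, F, N))
    sum_Sxs = np.zeros((F, I, J * I), dtype=complex)
    sum_Ss = np.zeros((F, J * I, J * I), dtype=complex)

    # E-step: Eqs. (4.43)-(4.48), one TF point at a time.
    for f in range(F):
        for n in range(N):
            Sigma_s = np.diag(np.repeat(v[:, f, n], I))                         # (4.48)
            Sigma_x = A[f] @ Sigma_s @ A[f].conj().T + sigma2_b[f] * np.eye(I)  # (4.47)
            Omega = Sigma_s @ A[f].conj().T @ np.linalg.inv(Sigma_x)            # (4.46)
            Sigma_x_hat = np.outer(X[f, n], X[f, n].conj())                     # (4.35)
            Sxs = Sigma_x_hat @ Omega.conj().T                                  # (4.43)
            Ss = (Omega @ Sigma_x_hat @ Omega.conj().T
                  + (np.eye(J * I) - Omega @ A[f]) @ Sigma_s)                   # (4.44)
            sum_Sxs[f] += Sxs
            sum_Ss[f] += Ss
            xi_hat[:, f, n] = np.diag(Ss).real.reshape(J, I).mean(axis=1)       # (4.45)

    # M-step, closed form for A_f: Eq. (4.49), with the statistics summed over n.
    A = np.stack([np.linalg.solve(sum_Ss[f].conj().T,
                                  sum_Sxs[f].conj().T).conj().T for f in range(F)])

    # M-step, MU rules (4.50)-(4.52) for the NTF factors (a few inner iterations).
    for _ in range(n_mu_iters):
        v = np.einsum('fk,kn,jk->jfn', W, H, Q) + eps
        Q *= (np.einsum('fk,kn,jfn->jk', W, H, xi_hat / v**2)
              / np.einsum('fk,kn,jfn->jk', W, H, 1.0 / v))
        v = np.einsum('fk,kn,jk->jfn', W, H, Q) + eps
        W *= (np.einsum('kn,jk,jfn->fk', H, Q, xi_hat / v**2)
              / np.einsum('kn,jk,jfn->fk', H, Q, 1.0 / v))
        v = np.einsum('fk,kn,jk->jfn', W, H, Q) + eps
        H *= (np.einsum('fk,jk,jfn->kn', W, Q, xi_hat / v**2)
              / np.einsum('fk,jk,jfn->kn', W, Q, 1.0 / v))

    return A, W, H, Q
```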

4.7.3 Other Algorithms

Another very popular choice for multichannel NMF model parameter estimation is the majorization-minimization (MM) algorithm [46], used for example in [16, 17]. Note that the EM algorithm can be interpreted as a particular case of the MM algorithm.

4.8 Conclusion

In this chapter we have introduced multichannel NMF methods for audio source separation and discussed their potential advantages and disadvantages. Despite the quickly growing popularity of deep learning, which is now of great interest for audio source separation, multichannel NMF methods remain an important area of research and, in our opinion, cannot be completely replaced by deep learning-based methods in all situations. Indeed, in fully blind settings especially, where no training data are available, deep learning is no longer a suitable path, while multichannel NMF is still applicable.

As for further research on multichannel NMF, we would like to highlight the following possible paths, which have already started being explored. One research direction consists in proposing more sophisticated spatial and spectral models adapted to the mixing conditions and sources of interest, as well as new models going beyond the limitations of the LGM. Another direction consists in combining some aspects of multichannel NMF with deep learning.