
1 Introduction

The principal aim of blind source separation (BSS) is to extract the underlying source signals from only a set of observations. Owing to its diverse and promising applications, BSS has attracted substantial attention in both academia and industry. During the last decade, tremendous developments have been achieved in the application of BSS, particularly in wireless communication, medical signal processing, geophysical exploration, and image enhancement/recognition. The so-called cocktail-party problem within the BSS context refers to the task of extracting the original voice signals of individual speakers from the signals recorded by several microphones. In a similar example from radio communication, the observations correspond to the outputs of several antenna elements in response to several transmitters, which represent the original signals. In the analysis of medical signals, electroencephalography (EEG), magnetoencephalography (MEG), and electrocardiogram (ECG) data represent the observations, and BSS serves as a signal processing tool to assist noninvasive medical diagnosis. BSS has also been applied to data analysis in other areas such as telecommunication, finance, and seismology. Further evidence of these applications can be found in [16].

A review of the current literature shows three main ways of classifying BSS problems: linear versus nonlinear, instantaneous versus convolutive, and overcomplete versus underdetermined. In the first classification, linear algorithms dominate the BSS research field due to their simplicity in analysis and their explicit separability. Linear BSS assumes that the mixture is represented by a linear combination of the sources. Extensions of BSS for solving nonlinear mixtures have also been introduced; this model takes nonlinearly distorted signals into consideration and offers a more accurate representation of a realistic environment. In the second classification, when the observed signals consist of combinations of multiple time-delayed versions of the original sources and/or the mixed signals themselves, the system is referred to as a convolutive mixture; otherwise, the absence of time delays results in an instantaneous mixture of the observed signals. Finally, when the number of observed signals exceeds the number of sources, this is referred to as overcomplete BSS; conversely, when the number of observed signals is less than the number of sources, this becomes underdetermined BSS.

In general, and for many practical applications, the challenging case for source separation is when only one monaural recording is available. This leads to single channel blind source separation (SCBSS), where the problem can be stated as one observation composed of several unknown sources. In this work, we consider the case of two sources, namely

$$\begin{aligned} y(t)=x_1 (t)+x_2 (t) \end{aligned}$$
(8.1)

where \(t=1,2,\ldots ,T\) denotes the time index and the goal is to estimate the two sources \(x_1 (t)\) and \(x_2 (t)\) given only the observation signal \(y(t)\). Unlike the conventional BSS assumption that the sources are statistically independent, which is rather too restrictive, in this chapter the sources are characterized as nonstationary processes with time-varying spectra [7]. This assumption is practically justified since most signals encountered in applications are nonstationary with time-varying spectra; examples include speech, audio, EEG, stock market indices, and seismic traces.

Solutions to SCBSS using nonnegative matrix factorization (NMF) [8] have recently gained popularity. They exploit an appropriate time-frequency (TF) analysis of the mono input recording, yielding a TF representation that can be decomposed as

$$\begin{aligned} \left| \mathbf{Y} \right| ^{.2}\approx \mathbf{DH} \end{aligned}$$
(8.2)

where \(\left| \mathbf{Y} \right| ^{.2}\in \mathfrak {R}_+^{F\times T_s } \) is the power TF representation of the mixture \(y(t)\), which is factorized as the product of two nonnegative matrices \(\mathbf{D}\in \mathfrak {R}_+^{F\times I} \) and \(\mathbf{H}\in \mathfrak {R}_+^{I\times T_s } \). The superscript ‘\(.2\)’ denotes the element-wise squaring operation. F and \(T_s \) denote the total number of frequency units and time slots in the TF domain, respectively. If I is chosen to be \(I=T_s \), no benefit is achieved in terms of representation. Thus the idea is to determine \(I<T_s \) so that the matrix \(\mathbf{D}\) can be compressed and reduced to its integral components, i.e., it contains only a set of spectral basis vectors, while \(\mathbf{H}\) is an encoding matrix that describes the amplitude of each basis vector at each time point. Because NMF gives a parts-based decomposition [8, 9], it has recently been proposed for separating drums from polyphonic music [10] and for the automatic transcription of polyphonic music [11]. Commonly used cost functions for NMF are the generalized Kullback-Leibler (KL) divergence and the Least Square (LS) distance [8]. A sparseness constraint [12] can be added to these cost functions when optimizing \(\mathbf{D}\) and \(\mathbf{H}\). Other cost functions for audio spectrogram factorization have also been introduced, such as that of [13], which assumes multiplicative gamma-distributed noise in power spectrograms, while [14] attempts to incorporate phase into the factorization by using a probabilistic phase model. Notwithstanding the above, families of parameterized cost functions, such as the Beta divergence [15] and Csiszar's divergences [16], have also been presented for source separation. However, these methods have a crucial limitation: they explicitly use training knowledge of the sources [17]. As a consequence, they are only able to deal with a very specific set of signals and situations.
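For illustration, the decomposition in (8.2) can be realized with the standard multiplicative updates for the generalized KL divergence [8]. The following is a minimal NumPy sketch; the function name, initialization, and iteration count are our own illustrative choices rather than a reference implementation.

```python
import numpy as np

def nmf_kl(V, I, n_iter=200, eps=1e-12, seed=0):
    """Factorize a nonnegative F x Ts matrix V ~ D @ H with I components
    using multiplicative updates for the generalized KL divergence."""
    rng = np.random.default_rng(seed)
    F, Ts = V.shape
    D = rng.random((F, I)) + eps   # spectral basis vectors (columns of D)
    H = rng.random((I, Ts)) + eps  # temporal encodings (rows of H)
    ones = np.ones_like(V)
    for _ in range(n_iter):
        D *= ((V / (D @ H + eps)) @ H.T) / (ones @ H.T + eps)
        H *= (D.T @ (V / (D @ H + eps))) / (D.T @ ones + eps)
    return D, H

# Usage: V would be the power TF representation |Y|^.2 of the mixture.
```

With \(I<T_s \), the columns of D act as the spectral basis vectors and H encodes their amplitudes over time, as described above.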

Model-based techniques have also been proposed for SCBSS; these usually require training on a set of isolated recordings. The sources are trained using hidden Markov models (HMMs) based on Gaussian mixture models (GMMs), which are then combined in a factorial HMM to separate the mixture [18]. Good separation requires detailed source models that might use thousands of full spectral states; e.g., in [19] HMMs with 8,000 states were required to accurately represent one person's speech for a source separation task. The large state space is required because it attempts to capture every possible instance of the signal. These model-based techniques, however, are time-consuming, not only in training the prior parameters but also because they present many difficult challenges during the inference stage.

From the above, it is clear that existing solutions to SCBSS are still practically limited and fall short of the success enjoyed in other areas of source separation. In this chapter, a novel separation system is proposed and the contributions are summarized as follows:

(i) A separability analysis in the TF domain for SCBSS and the development of a quantitative performance measure to evaluate the degree of “separateness” of the monaural mixed signal.

(ii) A separation framework based on the cochleagram. Unlike the spectrogram, which has only uniform resolution, the gammatone filterbank produces a nonuniform TF domain (termed the cochleagram) in which each TF unit has a different resolution. We show that the mixed signal is more separable in the cochleagram than in the spectrogram and the log-frequency spectrogram.

(iii) Development of a two-dimensional NMF (NMF2D) signal model optimized under the Itakura-Saito (IS) divergence with Quasi-EM and MGD updates (IS-NMF2D). Two new algorithms have been developed to estimate the spectral and temporal features of the audio source model. The first algorithm is founded on the framework of Quasi-EM (Expectation-Maximization), while the second is based on the multiplicative gradient descent (MGD) update rule. Both algorithms have the unique property of scale invariance, whereby the lower-energy components in the TF domain are treated with the same importance as the higher-energy components. This is to be contrasted with other methods based on the LS distance [20] and KL divergence [21], which favor the high-energy components but neglect the low-energy ones.

The chapter is organized as follows: Sect. 8.2 introduces the TF matrix representation using the gammatone filterbank. Section 8.3 delves into the separability analysis of the single-channel mixture in the nonuniform TF domain. In Sect. 8.4, the two new algorithms are derived and the proposed separation system is developed. Experimental results and a series of performance comparisons with existing methods are presented in Sect. 8.5. Finally, Sect. 8.6 concludes the chapter.

2 Time-Frequency Representation

In the task of audio source separation, one critical decision is to choose a suitable TF domain to represent the time-varying contents of the signals. There are several types of TF representations; the most widely used are the spectrogram [22] and the log-frequency spectrogram (using the constant-Q transform) [23], as documented in the audio source separation research of the last few years [10–21]. In this work, however, we develop our separation algorithms using a TF representation based on the gammatone filterbank.

2.1 Gammatone Filterbank and Cochleagram

The gammatone filterbank [24] is a cochlear filtering model that decomposes an input signal into the time-frequency domain using a set of gammatone filters. The specific steps for generating the cochleagram are summarized in Table 8.1.

Table 8.1 Cochleagram computation

In [25, 26], it was noted that some crucial differences exist between these TF representations and how sound is analyzed by the ear. In particular, the ear's frequency subbands become wider for higher frequencies, whereas the classical spectrogram computed by the Short-Time Fourier Transform (STFT) has equally spaced bandwidth across all frequency channels. Since speech signals are highly nonstationary and nonperiodic while music changes continuously, the Fourier transform produces errors when complicated transient phenomena, such as a mixture of speech and music, are contained in the analyzed signal. Unlike the spectrogram, the log-frequency spectrogram possesses nonuniform TF resolution. However, it does not exactly match the nonlinear resolution of the cochlea, since its center frequencies are distributed logarithmically along the frequency axis and all filters have a constant Q factor [23]. The gammatone filters used in the cochlear model (3), on the other hand, are approximately logarithmically spaced with constant Q for frequencies from \({f_s }/{10}\) to \({f_s }/2\) (\(f_s \) denotes the sampling frequency) and approximately linearly spaced for frequencies below \({f_s }/{10}\). This characteristic results in a selective, nonuniform resolution in the TF representation of the analyzed audio signal. Figure 8.1 shows the frequency response of a general gammatone filterbank for \(f_s =16\) kHz. It is seen that higher frequencies correspond to wider frequency subbands, which closely resembles the human perception of frequency [27]. Therefore, the cochleagram is developed as an alternative TF analysis tool for source separation to overcome the limitations associated with the Fourier transform approach.
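To make the construction concrete, the following NumPy sketch builds a gammatone filterbank and a cochleagram along the lines described above, with the filter order \(h=4\), the decay \(v(f)=1.019ERB(f)\), and \(ERB(f)=24.7+0.108f\) reported in Sect. 8.5; the center frequencies are spaced on the ERB-rate scale. Gain normalization and the phase alignment needed for resynthesis are omitted, so this is a simplified stand-in for the exact procedure of Table 8.1, with all function names our own.

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth, ERB(f) = 24.7 + 0.108 f (Sect. 8.5)
    return 24.7 + 0.108 * f

def gammatone_filterbank(x, fs, n_ch=128, f_lo=50.0, h=4):
    """Filter a 1-D signal x through n_ch gammatone filters whose center
    frequencies are equally spaced on the ERB-rate scale up to fs/2."""
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)       # Hz -> ERB rate
    inv_erb_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    fc = inv_erb_rate(np.linspace(erb_rate(f_lo), erb_rate(fs / 2.0), n_ch))

    t = np.arange(int(0.128 * fs)) / fs         # 128-ms impulse responses
    out = np.zeros((n_ch, len(x)))
    for c, f in enumerate(fc):
        b = 1.019 * erb(f)                      # decay rate v(f) = 1.019 ERB(f)
        g = t ** (h - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)
        out[c] = np.convolve(x, g)[: len(x)]    # causal filtering per channel
    return out, fc

def cochleagram(x, fs, n_ch=128, frame_ms=20):
    """Per-channel energy over 20-ms frames with 50 % overlap."""
    y, _ = gammatone_filterbank(x, fs, n_ch)
    win = int(frame_ms * fs / 1000)
    hop = win // 2
    n_frames = 1 + (y.shape[1] - win) // hop
    C = np.zeros((n_ch, n_frames))
    for ts in range(n_frames):
        C[:, ts] = np.sum(y[:, ts * hop: ts * hop + win] ** 2, axis=1)
    return C
```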

3 Single Channel Source Separability Analysis

For separation, one generates the TF mask corresponding to each source and applies the generated mask to the mixture to obtain the estimated source TF representation. In particular, when the sources do not overlap in the TF domain, an optimum mask \(M_i^{\text {opt}} (f,t_s )\) exists which allows one to extract the ith original source from the mixture as

$$\begin{aligned} X_i (f,t_s )=M_i^{\text {opt}} (f,t_s )Y(f,t_s ) \end{aligned}$$
(8.5)

Given any TF mask \(M_i (f,t_s )\) such that \(0\le M_i (f,t_s )\le 1\) for all \((f,t_s )\), we define the separability for the target source \(x_i (t)\) in the presence of the interfering sources \(p_i (t)=\sum \limits _{j=1,j\ne i}^N {x_j (t)} \) as

$$\begin{aligned} S_{M_i }^{Y\rightarrow X_i ,P_i } =\frac{\left\| {M_i (f,t_s )X_i (f,t_s )} \right\| _F^2 }{\left\| {X_i (f,t_s )} \right\| _F^2 }-\frac{\left\| {M_i (f,t_s )P_i (f,t_s )} \right\| _F^2 }{\left\| {X_i (f,t_s )} \right\| _F^2 } \end{aligned}$$
(8.6)

where \(X_i \left( {f,t_s } \right) \) and \(P_i (f,t_s )\) are the TF representations of \(x_i (t)\) and \(p_i (t)\), respectively. \(\parallel \cdot \parallel _F\) is the Frobenius norm. We also define the separability of the mixture with respect to all the N sources as:

$$\begin{aligned} S_{M_1 ,\ldots ,M_N }^{Y\rightarrow X_1 ,\ldots ,X_N } =\frac{1}{N}\sum _{i=1}^N {\;S_{M_i }^{Y\rightarrow X_i ,P_i } } \end{aligned}$$
(8.7)

Equation (8.6) is equivalent to measuring the success of extracting the ith source \(X_i (f,t_s )\) from the mixture \(Y(f,t_s )\) given the TF mask \(M_i (f,t_s )\). Similarly, (8.7) measures the success of extracting all the N sources simultaneously from the mixture. To further analyze the separability, we invoke the following: (i) Preserved signal ratio (PSR) that determines how well the mask preserves the source of interest and (ii) Signal-to-interference ratio (SIR) that indicates how well the mask suppresses the interfering sources:

$$\begin{aligned} \begin{array}{l} \textit{PSR}_{M_i }^{X_i } =\frac{\left\| {M_i (f,t_s )X_i (f,t_s )} \right\| _F^2 }{\left\| {X_i (f,t_s )} \right\| _F^2 } \\ \textit{SIR}_{M_i }^{X_i } =\frac{\left\| {M_i (f,t_s )X_i (f,t_s )} \right\| _F^2 }{\left\| {M_i (f,t_s )P_i (f,t_s )} \right\| _F^2 } \\ \end{array} \end{aligned}$$
(8.8)

Using (8.8), it can be shown that (8.6) can be expressed as \(S_{M_i }^{Y\rightarrow X_i ,P_i } =\textit{PSR}_{M_i }^{X_i } -{\textit{PSR}_{M_i }^{X_i } }/{\textit{SIR}_{M_i }^{X_i } }\). Analyzing the terms in (8.6), we have

$$\begin{aligned} \begin{array}{ll} \textit{PSR}_{M_i }^{X_i } :=\left\{ {{\begin{array}{ll} 1,&{}\quad \hbox {if }\mathrm{supp}\, M_i^\mathrm{{opt}} =\mathrm{supp}\, M_i \\ {<}1,&{}\quad \hbox {if }\mathrm{supp}\, M_i^\mathrm{{opt}} \subset \mathrm{supp}\, M_i \\ \end{array} }} \right. \\ \textit{SIR}_{M_i }^{X_i } :=\left\{ {{\begin{array}{ll} \infty ,&{}\quad \hbox {if }\mathrm{supp}\left[ {M_i X_i } \right] \cap \mathrm{supp}\, P_i =\emptyset \\ \mathrm {finite},&{}\quad \hbox {if }\mathrm{supp}\left[ {M_i X_i } \right] \cap \mathrm{supp}\, P_i \ne \emptyset \\ \end{array}}} \right. \\ \end{array} \end{aligned}$$
(8.9)

where ‘supp’ denotes the support. When \(S_{M_i }^{Y\rightarrow X_i ,P_i } =1\) (i.e., \(\textit{PSR}_{M_i }^{X_i } =1\) and \(\textit{SIR}_{M_i }^{X_i } =\infty \)), the mixture \(y(t)\) is separable with respect to the \(i\)th source \(x_i (t)\). In other words, \(X_i (f,t_s )\) does not overlap with \(P_i (f,t_s )\) and the TF mask \(M_i (f,t_s )\) has perfectly separated the \(i\)th source \(X_i (f,t_s )\) from the mixture \(Y(f,t_s )\); this corresponds to \(M_i (f,t_s )=M_i^{\text {opt}} (f,t_s )\) in (8.5). Hence, this is the maximum attainable value of \(S_{M_i }^{Y\rightarrow X_i ,P_i } \). For all other cases of \(\textit{PSR}_{M_i }^{X_i } \) and \(\textit{SIR}_{M_i }^{X_i } \), we have \(S_{M_i }^{Y\rightarrow X_i ,P_i } <1\). Using this concept, we can extend the analysis to the case of separating N sources: a mixture is fully separable into all N sources if and only if \(S_{M_1 ,\ldots ,M_N }^{Y\rightarrow X_1 ,\ldots ,X_N } =1\) in (8.7). When \(S_{M_1 ,\ldots ,M_N }^{Y\rightarrow X_1 ,\ldots ,X_N } <1\), some of the sources overlap with each other in the TF domain and therefore cannot be fully separated. Thus, \(S_{M_1 ,\ldots ,M_N }^{Y\rightarrow X_1 ,\ldots ,X_N } \) provides a quantitative performance measure for evaluating how separable the mixture is in the TF domain. In our comparison, the following TF representations are used to test the mixture's separability: the spectrogram, the log-frequency spectrogram, and the cochleagram. In the log-frequency spectrogram, the frequency scale is logarithmic and grouped into 175 frequency bins in the range 50 Hz–8 kHz with 24 bins per octave, while the bandwidth follows the constant-Q rule [23]. To ensure a fair comparison, we generate the ideal binary mask (IBM) [27] directly from the original sources. To reiterate our aim, the separability analysis is undertaken without recourse to any separation algorithm, utilizing only the energy of the sources to ascertain the degree of “separateness” of the mixture in different TF domains. These results are shown in Fig. 8.1, where the symbols ‘M’ and ‘S’ denote music and speech, respectively.
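The measures in (8.6)-(8.8) are simple to compute once the TF images of the target and the interferers are available. A minimal NumPy sketch, with our own function names, is given below; averaging the per-source values over \(i=1,\ldots ,N\) then gives (8.7).

```python
import numpy as np

def ideal_binary_mask(Xi, Pi):
    """IBM: one where the target's power dominates the interferers' power."""
    return (np.abs(Xi) ** 2 > np.abs(Pi) ** 2).astype(float)

def separability(mask, Xi, Pi, eps=1e-12):
    """PSR, SIR and the separability S of (8.6), given F x Ts TF images
    Xi (target) and Pi (sum of interferers) and a mask in [0, 1]."""
    fro2 = lambda A: np.sum(np.abs(A) ** 2)        # squared Frobenius norm
    psr = fro2(mask * Xi) / fro2(Xi)
    sir = fro2(mask * Xi) / (fro2(mask * Pi) + eps)
    return psr, sir, psr - psr / sir               # S = PSR - PSR / SIR
```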

Fig. 8.1 Averaged separability performance

Fig. 8.2 Separability under different window lengths

In Fig. 8.1, three types of mixture have been used: (i) music mixed with music, (ii) speech mixed with music, and (iii) speech mixed with speech. The speech signals are selected from 10 male and 10 female utterances taken from the TIMIT database and are normalized to unit energy. The 10 music sources are selected from the RWC database [28] and also normalized to unit energy. Two sources are randomly chosen from the databases and the mixed signal is generated by adding them. All mixed signals are sampled at 16 kHz. TF representations using different window lengths have also been investigated, and the results are shown in Fig. 8.2.

Figure 8.2 shows the average separability results for all mixture types under different window lengths; the bracketed number shows the number of data points corresponding to each window length. It is clear that, for both the spectrogram and the log-frequency spectrogram, the STFT with a 1024-point window is the best setting for analyzing separability. The results of PSR, SIR, and separability for each TF domain are obtained by averaging over 300 realizations. Following the listening performance test proposed in [29], we take \(S_{M_i }^{Y\rightarrow X_i ,P_i } >0.8\) as the threshold for acceptable separation performance; all TF representations satisfy this condition. Even so, the spectrogram gives only a mediocre level of separability, with an average \(S_{M_1 ,M_2 }^{Y\rightarrow X_1 ,X_2 } \approx 0.86\), while the log-frequency spectrogram shows a better result with \(S_{M_1 ,M_2 }^{Y\rightarrow X_1 ,X_2 } \approx 0.94\). The cochleagram yields the best separability with \(S_{M_1 ,M_2 }^{Y\rightarrow X_1 ,X_2 } \approx 0.98\). In addition, the average SIR of the cochleagram is much higher than those of the spectrogram and the log-frequency spectrogram, which implies that the amount of interference between any two sources is lower in the cochleagram.

4 The Proposed Method

In this section, two new algorithms are developed, namely the Quasi-EM IS-NMF2D and the MGD IS-NMF2D. The former optimizes the parameters of the signal model using the Expectation-Maximization approach, whereas the latter is based directly on multiplicative gradient descent. To facilitate the derivation of these algorithms, we first consider the signal model in terms of the power TF representation.

4.1 Signal Models

Since the sources have time-varying spectra, it is befitting to adopt a model whose power spectra can be described separately in terms of time and frequency. Although the conventional NMF model can still be used, it needs a large number of spectral components and requires a clustering step to group and assign each spectral component to the appropriate source; as a result, it may not always yield optimal results. An alternative is the two-dimensional NMF model (NMF2D) [2, 3, 30, 31]. This model extends the basic NMF to a two-dimensional convolution of \(\mathbf{D}\) and \(\mathbf{H}\), i.e., \(\left| \mathbf{Y} \right| ^{.2}\approx \sum _{\tau ,\phi } {\mathop {\mathbf{D}^{\tau }}\limits ^{\downarrow \phi } \mathop {\mathbf{H}^{\phi }}\limits ^{\rightarrow \tau } } \), where the vertical arrow in \(\mathop {\mathbf{D}^{\tau }}\limits ^{\downarrow \phi } \) denotes the downward shift operator that moves each element in the matrix down by \(\phi \) rows, and the horizontal arrow in \(\mathop {\mathbf{H}^{\phi }}\limits ^{\rightarrow \tau } \) denotes the right shift operator that moves each element in the matrix to the right by \(\tau \) columns. In scalar notation, the \(\left( {f,t_s } \right) \)th element of \(\left| \mathbf{Y} \right| ^{.2}\) is given by \(\left| {\mathbf{Y}_{f,t_s } } \right| ^{2}\approx \sum _{i=1}^I {\sum _{\tau =0}^{\tau _{\max } } {\sum _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,i}^\tau \mathbf{H}_{i,t_s -\tau }^\phi } } } \), where \(\mathbf{D}_{{f}',{i}'}^{{\tau }'} \) is the \(\left( {{f}',{\tau }',{i}'} \right) \)th element of \(\mathbf{D}\) and \(\mathbf{H}_{{i}',{t}'_s }^{{\phi }'} \) is the \(\left( {{i}',{\phi }',{t}'_s } \right) \)th element of \(\mathbf{H}\). In source separation, this model compactly represents the characteristics of the nonstationary sources: a time-frequency profile convolved in both time and frequency with a time-frequency weight matrix. \(\mathbf{D}_i^\tau \) represents the spectral basis of the \(i\)th source in the TF domain and \(\mathbf{H}_i^\phi \) represents the corresponding temporal code for each spectral basis.
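The shift operators translate directly into code. The following NumPy sketch (helper names are ours) evaluates the NMF2D reconstruction \(\sum _{\tau ,\phi } {\mathop {\mathbf{D}^{\tau }}\limits ^{\downarrow \phi } \mathop {\mathbf{H}^{\phi }}\limits ^{\rightarrow \tau } } \):

```python
import numpy as np

def shift_down(A, phi):
    """Move the rows of A down by phi positions, zero-padding at the top."""
    if phi == 0:
        return A
    return np.vstack([np.zeros((phi, A.shape[1])), A[:-phi]])

def shift_right(A, tau):
    """Move the columns of A right by tau positions, zero-padding at the left."""
    if tau == 0:
        return A
    return np.hstack([np.zeros((A.shape[0], tau)), A[:, :-tau]])

def nmf2d_reconstruct(D, H):
    """Z = sum_{tau,phi} shift_down(D^tau, phi) @ shift_right(H^phi, tau),
    with D of shape (tau_max+1, F, I) and H of shape (phi_max+1, I, Ts)."""
    F, Ts = D.shape[1], H.shape[2]
    Z = np.zeros((F, Ts))
    for tau in range(D.shape[0]):
        for phi in range(H.shape[0]):
            Z += shift_down(D[tau], phi) @ shift_right(H[phi], tau)
    return Z
```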

The TF representation of the mixture in (8.1) is given by \(Y(f,t_s )=X_1 (f,t_s )+X_2 (f,t_s )\), where \(Y(f,t_s )\), \(X_1 (f,t_s )\), and \(X_2 (f,t_s )\) denote the TF components obtained by applying the gammatone filterbank to the mixture. The time slots are given by \(t_s =1,2,\ldots ,T_s \) and the frequencies by \(f=1,2,\ldots ,F\). Since each component is a function of \(t_s \) and \(f\), we represent this as an \(F\times T_s \) matrix \(\mathbf{Y}=\left[ {Y(f,t_s )} \right] _{t_s =1,2,\ldots ,T_s }^{f=1,2,\ldots ,F } \) and \(\mathbf{X}_i =\left[ {X_i (f,t_s )} \right] _{t_s =1,2,\ldots ,T_s }^{f=1,2,\ldots ,F } \). It is shown in Sect. 8.3 that the sources are almost perfectly separable in the cochleagram. This therefore enables us to express the power TF representation as \(\left| \mathbf{Y} \right| ^{.2}\approx \sum _{i=1}^I {\left| {\mathbf{X}_i } \right| ^{.2}} \), which we model as \(\left| {\mathbf{Y}_{f,t_s } } \right| ^{2}\approx \sum _{i=1}^I {\sum _{\tau =0}^{\tau _{\max } } {\sum _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,i}^\tau \mathbf{H}_{i,t_s -\tau }^\phi } } } \). The sources we seek to determine are \(\left\{ {{}\left| {X_i (f,t_s )} \right| ^{.2}} \right\} _{i=1}^I \), which will be obtained from the factorization as \(\left| {\tilde{X}_i (f,t_s )} \right| ^{.2}=\sum _{\tau =0}^{\tau _{\max } } {\sum _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,i}^\tau \mathbf{H}_{i,t_s -\tau }^\phi } } \). In the following, we propose two novel algorithms to estimate \(\mathbf{D}_{f,i}^\tau \) and \(\mathbf{H}_{i,t_s }^\phi \) from the mixture signal.

4.2 Algorithm 1: Quasi-EM Formulation of IS-NMF2D (Quasi-EM IS-NMF2D)

We consider the following generative model defined as:

$$ \mathbf{y}_{t_s } =\sum _{k=1}^K {\mathbf{c}_{k,t_s } } ,\quad \forall t_s =1,\ldots ,T_s ,\qquad \mathbf{c}_{k,t_s } =\left[ {c_{k,1,t_s } ,\ldots ,c_{k,F,t_s } } \right] ^{\mathbf{T}} $$
$$\begin{aligned} c_{k,f,t_s } \sim N_c \left( {0\;,\;\sum _{\tau ,\phi } {\mathbf{H}_{k,t_s -\tau }^\phi \mathbf{D}_{f-\phi ,k}^\tau } } \right) \end{aligned}$$
(8.10)

where \(\mathbf{y}_{t_s } \in C^{F\times 1}\), \(\mathbf{c}_{k,t_s } \in C^{F\times 1}\), and \(N_c \left( {u,\Sigma } \right) \) denotes the proper complex Gaussian distribution; the components \(\mathbf{c}_{1,t_s } ,\ldots ,\mathbf{c}_{K,t_s } \) are both mutually and individually independent. The Expectation-Maximization (EM) framework is developed for the ML estimation of \(\mathbf{\theta }=\left\{ {{}\mathbf{D}^{\tau },\mathbf{H}^{\phi }{}} \right\} \). Due to the additive structure of the generative model (8.10), the parameters describing each component \(\mathbf{C}_k =\left[ {\mathbf{c}_{k,1} ,\ldots ,\mathbf{c}_{k,T_s } } \right] \) can be updated separately. We now consider a partition of the parameter space \(\mathbf{\theta }=\bigcup _{k=1}^K {\mathbf{\theta }_k } \) with \(\mathbf{\theta }_k =\left\{ {{}\mathbf{D}_k^\tau ,\mathbf{H}_k^\phi {}} \right\} \), where \(\mathbf{D}_k^\tau \) is the \(k\)th column of \(\mathbf{D}^{\tau }\) and \(\mathbf{H}_k^\phi \) is the \(k\)th row of \(\mathbf{H}^{\phi }\). The EM algorithm works by formulating the conditional expectation of the negative log-likelihood of \(\mathbf{C}_k \) as

$$\begin{aligned} Q_k^{ML} \left( {\mathbf{\theta }{ }_k|\mathbf{{\theta }'}} \right) =-\int \limits _{\mathbf{C}_k } {p\!\left( {\mathbf{C}_k |\mathbf{Y},\mathbf{{\theta }'}} \right) \log p\!\left( {\mathbf{C}_k |\mathbf{\theta }{ }_k} \right) \;} d\mathbf{C}_k \end{aligned}$$
(8.11)

where \(\mathbf{{\theta }'}\) always contains the most recent parameter values of \(\left\{ {{}\mathbf{D}^{\tau }, \mathbf{H}^{\phi } } \right\} \).

4.2.1 Expressions of the E- and M-step

One iteration of the EM algorithm consists of computing the E-step and maximizing \(Q_k^{ML} \left( {\mathbf{\theta }_k|\mathbf{{\theta }'}} \right) \) in the M-step for \(k=1,\ldots ,K\). The negative hidden-data log-likelihood is defined as

$$\begin{aligned} -\log p\left( {\mathbf{C}_k |\mathbf{\theta }_k} \right)&=-\sum \limits _{t_s =1}^{T_s } {\sum \limits _{f=1}^F {\log N_c } } \left( {c_{k,f,t_s } \left| {\;0 ,\;\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } } \right. } \right) \\&\dot{=}\sum \limits _{t_s =1}^{T_s } {\sum \limits _{f=1}^F {\left[ {\log \left( {\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } } \right) +\frac{\left| { c_{k,f,t_s } } \right| ^{2}}{\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi }}} \right] } } \end{aligned}$$
(8.12)

where ‘\(\dot{=}\)’ in the second line denotes equality up to constant terms. Then, by virtue of (8.10), the hidden-data posterior also has a Gaussian form, \(p\left( {\mathbf{C}_k |\mathbf{Y},\mathbf{\theta }} \right) =\prod \limits _{t_s =1}^{T_s } {\prod \limits _{f=1}^F {N_c \left( {c_{k,f,t_s } \left| {\, u_{k,f,t_s }^{post} ,\lambda _{k,f,t_s }^{post} } \right. } \right) } } \), where \(u_{k,f,t_s }^{post} \) and \(\lambda _{k,f,t_s }^{post} \) are the posterior mean and variance of \(c_{k,f,t_s } \), given as:

$$ u_{k,f,t_s }^{post} =\frac{\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } }{\sum \limits _{\tau ,\phi ,l} {\mathbf{D}_{f-\phi ,l}^\tau \mathbf{H}_{l,t_s -\tau }^\phi } }\mathbf{Y}_{f,t_s } $$
$$\begin{aligned} \lambda _{k,f,t_s }^{post} =\frac{\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } }{\sum \limits _{\tau ,\phi ,l} {\mathbf{D}_{f-\phi ,l}^\tau \mathbf{H}_{l,t_s -\tau }^\phi } }\sum \limits _{\tau ,\phi ,l\ne k} {\mathbf{D}_{f-\phi ,l}^\tau \mathbf{H}_{l,t_s -\tau }^\phi } \end{aligned}$$
(8.13)

Thus, the E-step merely consists of computing the posterior power \(\mathbf{V}_k \) of component \(\mathbf{C}_k \), defined as \([\mathbf{V}_k ]_{f,t_s } =v_{k,f,t_s } =\left| { u_{k,f,t_s }^{post} } \right| ^{2}+\lambda _{k,f,t_s }^{post} \). The M-step can then be treated as a one-component NMF problem:

$$\begin{aligned} Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right)&\dot{=}\sum \limits _{t_s =1}^{T_s } {\sum \limits _{f=1}^F {\left[ {\log \left( {\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } } \right) +\frac{\left| {u_{k,f,t_s }^{pos{t}'} } \right| ^{2}+\lambda _{k,f,t_s }^{pos{t}'} }{\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } }} \right] } } \\&\dot{=}\sum \limits _{t_s =1}^{T_s } {\sum \limits _{f=1}^F {d_{IS} \left( {|u_{k,f,t_s }^{pos{t}'} |^{2}+\lambda _{k,f,t_s }^{pos{t}'} \;\left| {\;\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } } \right. } \right) }} \end{aligned}$$
(8.14)

where \(d_{IS} (\cdot |\cdot )\) is the IS divergence [32], formally defined as \(d_{IS} (a|b)=(a/b)-\log \left( {a/b} \right) -1\). The IS divergence has the property of scale invariance, i.e., \(d_{IS} (\kappa {}a|\kappa {}b)=d_{IS} (a|b)\) for any \(\kappa \). This implies that any low-energy components \((a,b)\) bear the same relative importance as the high-energy ones \((\kappa {}a,\kappa {}b)\). This is particularly important in situations where \(\left| \mathbf{Y} \right| ^{.2}\) is characterized by a large dynamic range, as is the case for audio short-term spectra.
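The scale invariance is easy to verify numerically; a brief check with arbitrary illustrative values is:

```python
import numpy as np

def d_is(a, b):
    """Itakura-Saito divergence d_IS(a|b) = a/b - log(a/b) - 1."""
    return a / b - np.log(a / b) - 1.0

a, b, kappa = 1e-3, 4e-3, 1e6          # a low-energy pair and a scale factor
print(d_is(a, b))                      # 0.6363...
print(d_is(kappa * a, kappa * b))      # identical: d_IS(ka|kb) = d_IS(a|b)
```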

4.2.2 Estimation of the Spectral Basis and Temporal Code Using Quasi-EM Method

The spectral basis and temporal code can be obtained from (8.14). The derivatives of \(g_{k,f,t_s } =\sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } \) with respect to \(\mathbf{D}_{f,k}^\tau \) and \(\mathbf{H}_{k,t_s }^\phi \) are given by:

$$\begin{aligned} \begin{array}{l} \frac{\partial g_{k,f,t_s } }{\partial \mathbf{D}_{{f}',{k}'}^{{\tau }'} }=\frac{\partial \sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } }{\partial \mathbf{D}_{{f}',{k}'}^{{\tau }'} }=\mathbf{H}_{{k}',t_s -{\tau }'}^{f-{f}'} \\ \frac{\partial g_{k,f,t_s } }{\partial \mathbf{H}_{{k}',{t}'_s }^{{\phi }'} }=\frac{\partial \sum \limits _{\tau ,\phi } {\mathbf{D}_{f-\phi ,k}^\tau \mathbf{H}_{k,t_s -\tau }^\phi } }{\partial \mathbf{H}_{{k}',{t}'_s }^{{\phi }'} }=\mathbf{D}_{f-{\phi }',{k}'}^{t_s -{t}'_s } \\ \end{array} \end{aligned}$$
(8.15)

The derivatives of (8.14) with respect to \(\mathbf{D}_{f,k}^\tau \) and \(\mathbf{H}_{k,t_s }^\phi \) are then obtained as

$$\begin{aligned} \begin{array}{rl} \frac{\partial Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) }{\partial \mathbf{D}_{{f}',{k}'}^{{\tau }'} } &{}=\frac{\partial }{\partial \mathbf{D}_{{f}',{k}'}^{{\tau }'} }\sum \limits _{f,t_s } {\left[ {\log \left( {g_{k,f,t_s } } \right) +\frac{{v}'_{k,f,t_s } }{g_{k,f,t_s } }} \right] } \\ &{}=\sum \limits _{\phi ,t_s } {\left( {\frac{g_{k,{f}'+\phi ,t_s } -{v}'_{k,{f}'+\phi ,t_s } }{g_{k,{f}'+\phi ,t_s }^2 }} \right) \;\mathbf{H}_{{k}',t_s -{\tau }'}^\phi } \\ \frac{\partial Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) }{\partial \mathbf{H}_{{k}',{t}'_s }^{{\phi }'} }&{}=\frac{\partial }{\partial \mathbf{H}_{{k}',{t}'_s }^{{\phi }'} }\sum \limits _{f,t_s } {\left[ {\log \left( {g_{k,f,t_s } } \right) +\frac{{v}'_{k,f,t_s } }{g_{k,f,t_s } }} \right] } \\ &{}=\sum \limits _{\tau ,f} {\left( {\frac{g_{k,f,{t}'_s +\tau } -{v}'_{k,f,{t}'_s +\tau } }{g_{k,f,{t}'_s +\tau }^2 }} \right) \;\mathbf{D}_{f-{\phi }',{k}'}^\tau } \end{array} \end{aligned}$$
(8.16)

Unlike the conventional EM algorithm, it is not possible to directly set \({\partial Q_k^{ML} \left( {\mathbf{\theta }_k|\mathbf{{\theta }'}} \right) }/{\partial \mathbf{D}_{{f}',{k}'}^{{\tau }'} }=0\) and \({\partial Q_k^{ML} \left( {\mathbf{\theta }_k|\mathbf{{\theta }'}} \right) }/{\partial \mathbf{H}_{{k}',{t}'_s }^{{\phi }'} }=0\) because of the nonlinear coupling between \(\mathbf{D}_{f,k}^\tau \) and \(\mathbf{H}_{k,t_s }^\phi \) via \({v}'_{k,f,t_s } \). Thus, closed-form expressions for estimating \(\mathbf{D}_{f,k}^ \tau \) and \(\mathbf{H}_{k,t_s }^\phi \) cannot be obtained. To overcome this problem, we use the following update rules and unify them as part of the M-step:

$$\begin{aligned} \mathbf{\theta }_k \leftarrow \mathbf{\theta }_k \cdot \left( {\frac{\left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _- }{\left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _+ }} \right) \end{aligned}$$
(8.17)

where \(\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) =\left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _+ -\left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _- \). For each \(\mathbf{D}_k^\tau \) and \(\mathbf{H}_k^\phi \) variables, we have:

$$\begin{aligned} \begin{array}{l} \left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _{{}-}^{{}\mathbf{D}} =\sum \limits _{\phi ,t_s } {\left( {g_{k,{f}'+\phi ,t_s } } \right) ^{-2}{v}'_{k,{f}'+\phi ,t_s } \mathbf{H}_{{k}',t_s -{\tau }'}^\phi } \\ \left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _{{}+}^{{}\mathbf{D}} =\sum \limits _{\phi ,t_s } {\left( {g_{k,{f}'+\phi ,t_s } } \right) ^{-1}\mathbf{H}_{{k}',t_s -{\tau }'}^\phi } \\ \end{array} \end{aligned}$$
(8.18)

and

$$\begin{aligned} \begin{array}{l} \left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _{{}-}^{{}\mathbf{H}} =\sum \limits _{\tau ,f} {\mathbf{D}_{f-{\phi }',{k}'}^\tau \left( {g_{k,f,{t}'_s +\tau } } \right) ^{-2}{v}'_{k,f,{t}'_s +\tau } } \\ \left[ {\nabla Q_k^{ML} \left( {\mathbf{\theta }_k |\mathbf{{\theta }'}} \right) } \right] _{{}+}^{{}\mathbf{H}} =\sum \limits _{\tau ,f} {\mathbf{D}_{f-{\phi }',{k}'}^\tau \left( {g_{k,f,{t}'_s +\tau } } \right) ^{-1}} \\ \end{array} \end{aligned}$$
(8.19)

Inserting (8.18) and (8.19) into (8.17) leads to

$$\begin{aligned} \mathbf{D}_{{f}',{k}'}^{{\tau }'} \leftarrow \mathbf{D}_{{f}',{k}'}^{{\tau }'} \frac{\sum \limits _{\phi ,t_s } {\left( {g_{k,{f}'+\phi ,t_s } } \right) ^{-2}{v}'_{k,{f}'+\phi ,t_s } \mathbf{H}_{{k}',t_s -{\tau }'}^\phi } }{\sum \limits _{\phi ,t_s } {\left( {g_{k,{f}'+\phi ,t_s } } \right) ^{-1}\mathbf{H}_{{k}',t_s -{\tau }'}^\phi } } \end{aligned}$$
(8.20)

Similarly, the update rule for \(\mathbf{H}_{{k}',{t}'_s }^{{\phi }'} \) reads

$$\begin{aligned} \mathbf{H}_{{k}',{t}'_s }^{{\phi }'} \leftarrow \mathbf{H}_{{k}',{t}'_s }^{{\phi }'} \frac{\sum \limits _{\tau ,f} {\mathbf{D}_{f-{\phi }',{k}'}^\tau \left( {g_{k,f,{t}'_s +\tau } } \right) ^{-2}{v}'_{k,f,{t}'_s +\tau } } }{\sum \limits _{\tau ,f} {\mathbf{D}_{f-{\phi }',{k}'}^\tau \left( {g_{k,f,{t}'_s +\tau } } \right) ^{-1}} } \end{aligned}$$
(8.21)

It can be verified that the above update rules have the advantage of ensuring that the nonnegativity constraints on \(\mathbf{D}_{f,k}^\tau \) and \(\mathbf{H}_{k,t_s }^\phi \) are maintained at every iteration.
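To make one Quasi-EM iteration concrete, the following NumPy sketch (function and variable names are ours) computes the E-step posterior statistics of (8.13) and the posterior power \(\mathbf{V}_k \); the M-step then applies the multiplicative updates (8.20) and (8.21) with \({v}'_{k,f,t_s } \) in place of the mixture power.

```python
import numpy as np

def e_step_posterior_power(Y, G, eps=1e-12):
    """E-step of Quasi-EM IS-NMF2D (Eq. 8.13).
    Y : F x Ts complex TF mixture.
    G : (K, F, Ts) stack where G[k] = sum_{tau,phi} D_{f-phi,k}^tau H_{k,ts-tau}^phi,
        i.e. the current model variance of latent component k.
    Returns V with V[k] = |u_post|^2 + lambda_post, the posterior power."""
    g_tot = G.sum(axis=0) + eps
    wiener = G / g_tot                      # per-component Wiener-like gain
    u_post = wiener * Y                     # posterior mean of c_k
    lam_post = wiener * (g_tot - G)         # posterior variance of c_k
    return np.abs(u_post) ** 2 + lam_post
```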

4.3 Algorithm 2: Multiplicative Gradient Descent Formulation of IS-NMF2D (MGD IS-NMF2D)

We consider the following generative model defined as:

$$\begin{aligned} \left| {\mathbf{Y}_{f,t_s } } \right| ^{2}\;\;\;=\;\left( {\sum _{i=1}^I {\sum _{\tau =0}^{\tau _{\max } } {\sum _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,i}^\tau \mathbf{H}_{i,t_s -\tau }^\phi } } } } \right) {\bullet }\mathbf{E}_{f,t_s } \end{aligned}$$
(8.22)

where \(\mathbf{E}_{f,t_s } \) is a scalar of multiplicative independent and identically distributed (i.i.d.) Gamma noise with unit mean, i.e., \(p(\mathbf{E}_{f,t_s } )=\xi (\mathbf{E}_{f,t_s } |\alpha ,\beta )\), where \(\xi (\mathbf{E}_{f,t_s } |\alpha ,\beta )\) denotes the Gamma probability density function (pdf) defined as \(\xi (\mathbf{E}_{f,t_s } |\alpha ,\beta )=\frac{\beta ^{\alpha }}{\Gamma (\alpha )}\left( {\mathbf{E}_{f,t_s } } \right) ^{\alpha -1}\exp \left( {-\beta \mathbf{E}_{f,t_s } } \right) ,\;\mathbf{E}_{f,t_s } \ge 0\). Next, we define \(\mathbf{D}=\left[ {\mathbf{D}^{1} \mathbf{D}^{2}\cdots \mathbf{D}^{\tau _{\max } }} \right] \) and \(\mathbf{H}=\left[ {\mathbf{H}^{1} \mathbf{H}^{2}\cdots \mathbf{H}^{\phi _{\max } }} \right] \). Under the i.i.d. noise assumption, the term \(-\log p\!\left( {\mathbf{Y}\;|\mathbf{D},\mathbf{H}} \right) \) becomes

$$ -\log p\left( {\mathbf{Y}\;|\mathbf{D},\mathbf{H}} \right) =-\sum \nolimits _{t_s =1}^{T_s } {\sum \nolimits _{f=1}^F {\log \left[ {\frac{1}{\sum \nolimits _{i=1}^I {\sum \nolimits _{\tau =0}^{\tau _{\max } } {\sum \nolimits _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,i}^\tau \mathbf{H}_{i,t_s -\tau }^\phi } } } }\;\xi \left( {\left. {\frac{|\mathbf{Y}|_{f,t_s }^{\cdot 2} }{\sum \nolimits _{i=1}^I {\sum \nolimits _{\tau =0}^{\tau _{\max } } {\sum \nolimits _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,i}^\tau \mathbf{H}_{i,t_s -\tau }^\phi } } } }} \right| \alpha ,\beta } \right) } \right] } } $$
$$\begin{aligned} \dot{=}\sum _{t_s =1}^{T_s } {\sum _{f=1}^F {d_{IS} \left( {|\mathbf{Y}|_{f,t_s }^{\cdot 2} \;\left| {\;\sum _{i=1}^I {\sum _{\tau =0}^{\tau _{\max } } {\sum _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,i}^\tau \mathbf{H}_{i,t_s -\tau }^\phi } } } } \right. } \right) } } \end{aligned}$$
(8.23)

where ‘\(\dot{=}\)’ in the second line denotes equality up to constant terms. Thus, the cost function is \(C_{IS}^{NMF2D} =-\log p\left( {\mathbf{Y}\;|\mathbf{D},\mathbf{H}} \right) \). The derivatives of (8.23) with respect to \(\mathbf{D}^\tau \) and \(\mathbf{H}^\phi \) are given by

$$\begin{aligned} \frac{\partial C_{IS}^{NMF2D} }{\partial \mathbf{D}_{{f}',{i}'}^{{\tau }'} }&=\frac{\partial }{\partial \mathbf{D}_{{f}',{i}'}^{{\tau }'} }\sum _{f,t_s } {\left( {\frac{\left| \mathbf{Y} \right| _{f,t_s }^2 }{\mathbf{Z}_{f,t_s } }-\log \frac{\left| \mathbf{Y} \right| _{f,t_s }^2 }{\mathbf{Z}_{f,t_s } }-1} \right) } \\&=-\sum _{\phi ,t_s } {\left( {\left( {\mathbf{Z}_{{f}'+\phi ,t_s } } \right) ^{-2}\left( {\left| \mathbf{Y} \right| _{{f}'+\phi ,t_s }^2 -\mathbf{Z}_{{f}'+\phi ,t_s } } \right) } \right) \mathbf{H}_{{i}',t_s -{\tau }'}^\phi } \end{aligned}$$
(8.24)
$$\begin{aligned} \frac{\partial C_{IS}^{NMF2D} }{\partial \mathbf{H}_{{i}',{t}'_s }^{{\phi }'} }&=\sum _{f,t_s } {\mathbf{D}_{f-{\phi }',{i}'}^{t_s -{t}'_s } \left( {\left( {\mathbf{Z}_{f,t_s } } \right) ^{-2}\left( {\mathbf{Z}_{f,t_s } -\left| \mathbf{Y} \right| _{f,t_s }^2 } \right) } \right) } \\&=-\sum _{\tau ,f} {\mathbf{D}_{f-{\phi }',{i}'}^\tau \left( {\left( {\mathbf{Z}_{f,{t}'_s +\tau } } \right) ^{-2}\left( {\left| \mathbf{Y} \right| _{f,{t}'_s +\tau }^2 -\mathbf{Z}_{f,{t}'_s +\tau } } \right) } \right) } \end{aligned}$$
(8.25)

where \(\mathbf{Z}=\sum \limits _\tau {\sum \limits _\phi {\mathop {\mathbf{D}^\tau }\limits ^{\downarrow \phi } \mathop {\mathbf{H}^\phi }\limits ^{\rightarrow \tau } } } \). The standard gradient descent approach gives

$$\begin{aligned} \mathbf{D}_{{f}',{i}'}^{{\tau }'} \leftarrow \mathbf{D}_{{f}',{i}'}^{{\tau }'} -\eta _D \frac{\partial C_{IS}^{NMF2D} }{\partial \mathbf{D}_{{f}',{i}'}^{{\tau }'} }\quad \hbox {and}\quad \mathbf{H}_{{i}',{t}'_s }^{{\phi }'} \leftarrow \mathbf{H}_{{i}',{t}'_s }^{{\phi }'} -\eta _H \frac{\partial C_{IS}^{NMF2D} }{\partial \mathbf{H}_{{i}',{t}'_s }^{{\phi }'} } \end{aligned}$$
(8.26)

where \(\eta _D \) and \(\eta _H \) are positive learning rates and can be obtained as

$$\begin{aligned} \eta _D =\frac{\mathbf{D}_{{f}',{i}'}^{{\tau }'} }{\sum \limits _{\phi ,t_s } {\left( {\mathbf{Z}_{{f}'+\phi ,t_s } } \right) ^{-1}\mathbf{H}_{{i}',t_s -{\tau }'}^\phi } }\quad \hbox {and}\quad \eta _H =\frac{\mathbf{H}_{{i}',{t}'_s }^{{\phi }'} }{\sum \limits _{\tau ,f} {\mathbf{D}_{f-{\phi }',{i}'}^\tau \left( {\mathbf{Z}_{f,{t}'_s +\tau } } \right) ^{-1}} } \end{aligned}$$
(8.27)

Inserting (8.27) into (8.26) gives the multiplicative gradient descent rules

$$\begin{aligned} \mathbf{D}_{{f}',{i}'}^{{\tau }'} \leftarrow \mathbf{D}_{{f}',{i}'}^{{\tau }'} \frac{\sum \limits _{\phi ,t_s } {\left( {\mathbf{Z}_{{f}'+\phi ,t_s } } \right) ^{-2}\left| \mathbf{Y} \right| _{{f}'+\phi ,t_s }^2 \mathbf{H}_{{i}',t_s -{\tau }'}^\phi } }{\sum \limits _{\phi ,t_s } {\left( {\mathbf{Z}_{{f}'+\phi ,t_s } } \right) ^{-1}\mathbf{H}_{{i}',t_s -{\tau }'}^\phi } } \end{aligned}$$
(8.28)

and

$$\begin{aligned} \mathbf{H}_{{i}',{t}'_s }^{{\phi }'} \leftarrow \mathbf{H}_{{i}',{t}'_s }^{{\phi }'} \frac{\sum \limits _{\tau ,f} {\mathbf{D}_{f-{\phi }',{i}'}^\tau \left( {\mathbf{Z}_{f,{t}'_s +\tau } } \right) ^{-2}\left| \mathbf{Y} \right| _{f,{t}'_s +\tau }^2 } }{\sum \limits _{\tau ,f} {\mathbf{D}_{f-{\phi }',{i}'}^\tau \left( {\mathbf{Z}_{f,{t}'_s +\tau } } \right) ^{-1}} } \end{aligned}$$
(8.29)

The key difference between the two algorithms is that the Quasi-EM IS-NMF2D algorithm prevents zeros in the factors, i.e., \(\mathbf{D}^\tau \) and \(\mathbf{H}^\phi \) cannot take entries equal to zero. This is not a property shared by the MGD IS-NMF2D algorithm, since zero coefficients are invariant under MGD updates. If the MGD IS-NMF2D algorithm attains a fixed-point solution with zero entries, it cannot be determined whether the limit point is a stationary point [33]. Consequently, the factorizations rendered by the two algorithms are not equivalent. For this reason, the Quasi-EM IS-NMF2D algorithm can be considered more reliable for updating \(\mathbf{D}^\tau \) and \(\mathbf{H}^\phi \). Both proposed algorithms are summarized in Table 8.2, where \(\psi =10^{-6}\) is the threshold for ascertaining convergence. Details of the source separation performance of these algorithms are given in Sect. 8.5.
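As a compact illustration of the MGD variant, the sketch below implements the updates (8.28) and (8.29) in NumPy. Names, initialization, and the fixed iteration count are our own simplifications; the pseudocode in Table 8.2 instead monitors convergence with the threshold \(\psi =10^{-6}\).

```python
import numpy as np

def sd(A, p):  # shift rows down by p (zero-pad at the top)
    return A if p == 0 else np.vstack([np.zeros((p, A.shape[1])), A[:-p]])

def su(A, p):  # shift rows up by p (zero-pad at the bottom)
    return A if p == 0 else np.vstack([A[p:], np.zeros((p, A.shape[1]))])

def sr(A, t):  # shift columns right by t (zero-pad at the left)
    return A if t == 0 else np.hstack([np.zeros((A.shape[0], t)), A[:, :-t]])

def sl(A, t):  # shift columns left by t (zero-pad at the right)
    return A if t == 0 else np.hstack([A[:, t:], np.zeros((A.shape[0], t))])

def is_nmf2d_mgd(V, I, tau_max, phi_max, n_iter=300, eps=1e-12, seed=0):
    """MGD IS-NMF2D: multiplicative updates (8.28)-(8.29) that decrease the
    IS divergence between the power TF matrix V = |Y|^.2 and the model Z."""
    rng = np.random.default_rng(seed)
    F, Ts = V.shape
    D = rng.random((tau_max + 1, F, I)) + eps   # stack of spectral bases D^tau
    H = rng.random((phi_max + 1, I, Ts)) + eps  # stack of temporal codes H^phi

    def model():
        Z = np.zeros((F, Ts))
        for tau in range(tau_max + 1):
            for phi in range(phi_max + 1):
                Z += sd(D[tau], phi) @ sr(H[phi], tau)
        return Z + eps

    for _ in range(n_iter):
        Z = model()
        R1, R2 = 1.0 / Z, V / Z ** 2            # Z^{-1} and |Y|^2 . Z^{-2}
        for tau in range(tau_max + 1):          # update each D^tau, Eq. (8.28)
            num = sum(su(R2, phi) @ sr(H[phi], tau).T for phi in range(phi_max + 1))
            den = sum(su(R1, phi) @ sr(H[phi], tau).T for phi in range(phi_max + 1))
            D[tau] *= num / (den + eps)         # zero entries remain zero (MGD)
        Z = model()
        R1, R2 = 1.0 / Z, V / Z ** 2
        for phi in range(phi_max + 1):          # update each H^phi, Eq. (8.29)
            num = sum(sd(D[tau], phi).T @ sl(R2, tau) for tau in range(tau_max + 1))
            den = sum(sd(D[tau], phi).T @ sl(R1, tau) for tau in range(tau_max + 1))
            H[phi] *= num / (den + eps)
    return D, H
```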

4.4 Estimation of Sources

The two matrices that we seek to separate from \(\left| {\mathbf{Y}_{f,t_s } } \right| ^{2}\) are \(\left| {\tilde{X}_1 (f,t_s )} \right| ^{.2}\) and \(\left| {{\tilde{X}}_2 (f,t_s )} \right| ^{.2}\). These matrices are estimated as \(\left| {\tilde{X}_1 (f,t_s )} \right| ^{.2}=\sum \limits _{\tau =0}^{\tau _{\max } } {\sum \limits _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,1}^\tau \mathbf{H}_{1,t_s -\tau }^\phi } } \) and \(\left| {\tilde{X}_2 (f,t_s )} \right| ^{.2}=\sum \limits _{\tau =0}^{\tau _{\max } } {\sum \limits _{\phi =0}^{\phi _{\max } } {\mathbf{D}_{f-\phi ,2}^\tau \mathbf{H}_{2,t_s -\tau }^\phi }}\) [29], which are then used to generate the binary mask as \(\mathbf{mask}_i (f,t_s )=1\) if \(\left| {\tilde{X}_i (f,t_s )} \right| ^{.2}>\left| {\tilde{X}_j (f,t_s )} \right| ^{.2}\) and zero otherwise. Finally, the estimated time-domain sources are obtained as \({\tilde{x}}_i =\hbox {Resynthesize} (\mathbf{mask}_i {\cdot } \mathbf{Y})\) for \(i=1,2\), where \({\tilde{x}}_i =[\tilde{x}_i (1),\ldots ,\tilde{x}_i (T)]^{\mathbf{T}}\) denotes the \(i\)th estimated source. The time-domain sources are resynthesized using the approach in [22] by weighting the mixture cochleagram with the mask and correcting the phase shifts introduced during the gammatone filtering.
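A sketch of the mask construction for two sources is given below, using our own helper names; the final resynthesis step, which weights the mixture cochleagram and corrects the gammatone phase shifts [22], is not reproduced here.

```python
import numpy as np

def sd(A, p):   # shift rows down by p, zero-padding at the top
    return A if p == 0 else np.vstack([np.zeros((p, A.shape[1])), A[:-p]])

def sr(A, t):   # shift columns right by t, zero-padding at the left
    return A if t == 0 else np.hstack([np.zeros((A.shape[0], t)), A[:, :-t]])

def source_power(D, H, i):
    """|X~_i|^.2 from the i-th column of each D^tau and i-th row of each H^phi."""
    X = np.zeros((D.shape[1], H.shape[2]))
    for tau in range(D.shape[0]):
        for phi in range(H.shape[0]):
            X += sd(D[tau][:, [i]], phi) @ sr(H[phi][[i], :], tau)
    return X

def binary_masks(D, H, n_src=2):
    """mask_i = 1 where source i's reconstructed power dominates."""
    P = np.stack([source_power(D, H, i) for i in range(n_src)])
    winner = P.argmax(axis=0)               # dominating source per TF unit
    return [(winner == i).astype(float) for i in range(n_src)]
```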

5 Experimental Results and Analysis

The proposed separation system is tested on recorded audio signals. All recordings and processing are conducted using a PC with an Intel Core 2 CPU 6600 @ 2.4 GHz and 2 GB RAM. Three types of mixtures are used: a mixture of music and speech, a mixture of different kinds of music, and a mixture of different kinds of speech. The speech sources (male and female) are selected from the TIMIT speech database, while the music sources (jazz and piano) are from the RWC database [28]. All mixtures are sampled at 16 kHz. In all cases, the sources are mixed with equal average power over the duration of the signals. For our proposed algorithms, the convolutive components are selected as follows:

(i) For the jazz and speech mixture, \(\tau =\left\{ {0,\ldots ,4} \right\} \) and \(\phi =\left\{ {0,\ldots ,4} \right\} \).

(ii) For the jazz and piano mixture, \(\tau =\left\{ {0,\ldots ,6} \right\} \) and \(\phi =\left\{ {0,\ldots ,9} \right\} \).

(iii) For the piano and speech mixture, \(\tau =\left\{ {0,\ldots ,6} \right\} \) and \(\phi =\left\{ {0,\ldots ,9} \right\} \).

(iv) For the speech and speech mixture, \(\tau =\left\{ {0,1} \right\} \) and \(\phi =\left\{ {0,1,2} \right\} \).

Table 8.2 Pseudocode for the Quasi-EM IS-NMF2D and MGD IS-NMF2D algorithms

These parameters are selected after conducting Monte Carlo tests over 100 realizations of audio mixtures. We evaluate the separation performance in terms of the Signal-to-Distortion Ratio (SDR), which unifies the Signal-to-Interference Ratio (SIR) and the Signal-to-Artifacts Ratio (SAR). MATLAB routines for computing these criteria are obtained from the SiSEC'08 webpage [34].
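The results in this chapter use the MATLAB BSS Eval routines from [34]. For readers who prefer Python, the mir_eval package offers a comparable implementation (a suggested alternative, not the toolchain used here); the signals below are random stand-ins for the true and estimated sources.

```python
import numpy as np
from mir_eval.separation import bss_eval_sources

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))            # 2 sources, 1 s at 16 kHz
estimated = reference + 0.1 * rng.standard_normal((2, 16000))

sdr, sir, sar, perm = bss_eval_sources(reference, estimated)
print("SDR (dB):", sdr)    # SDR unifies interference (SIR) and artifacts (SAR)
```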

5.1 Separation Performance Under Different TF Representations

In Sect. 8.3, the separability analysis was undertaken by using the IBM to determine the “separateness” of the mixture without recourse to any separation algorithm. In this section, the impact of the separation algorithm is analyzed: instead of the IBM, the Quasi-EM IS-NMF2D algorithm is used to estimate the mask according to Sect. 8.4, so that we are now investigating separation performance rather than mixture separability. Speech and music signals are used to generate the monaural mixture recordings. The separation performance is evaluated using three types of TF representation: (i) the spectrogram (STFT with a 1024-point Hamming-windowed FFT and 50 % overlap), (ii) the log-frequency spectrogram (as described in Sect. 8.3, with a 1024-point Hamming-windowed FFT), and (iii) the cochleagram based on a gammatone filterbank of 128 channels with filter order 4 (i.e., \(h=4\) in (4)), where each filter output is divided into 20 ms time frames with 50 % overlap. To validate the parameter settings of the cochleagram (e.g., \(h\) and \(v\)), we constructed an experiment based on three speech sources and tested the result by fixing the parameter \(h\) in (3) to unity; the experiment was then repeated while progressively increasing \(h\) from 2 to 10. Over this range, the optimal separability is obtained when \(h=4\). The parameter \(v\) determines the rate of decay of the impulse response of the gammatone filters. In most audio processing tasks, it is set to \(v(f)=1.019ERB(f)\), where \(ERB(f)=24.7+0.108f\) is the equivalent rectangular bandwidth of the filter with center frequency \(f\). A range of values of \(v\) has been tested, i.e., \(v(f)=(1.019+c)ERB(f)\) with \(c\) ranging from \(-\)0.5 to 0.5 in increments of 0.1. The results indicate that the optimal separability is obtained by setting \(c=0\); as \(c\) moves away from 0, the separability progressively deteriorates. This confirms the validity of setting \(v(f)=1.019ERB(f)\) for the cochleagram.

Fig. 8.3 Separation results using different TF representations, where ‘J’, ‘M’, ‘F’, ‘P’, and ‘S’ denote jazz, male speech, female speech, piano, and speech, respectively

Figure 8.3 compares our proposed algorithm based on the spectrogram, the log-frequency spectrogram, and the cochleagram under various audio mixtures. The separation results for all mixture types based on the spectrogram give an average SDR of 0.51 dB, while the log-frequency spectrogram gives an average SDR of 2.8 dB. However, significantly higher performance is attained with the cochleagram, with an average SDR of 8 dB; this corresponds to substantial improvement gains of 7.5 and 5.2 dB, respectively. The major reason for the large discrepancy is the mixing ambiguity between \(\left| {\mathbf{X}_1 } \right| ^{.2}\) and \(\left| {\mathbf{X}_2 } \right| ^{.2}\): the larger this ambiguity, the more TF units will be ambiguous, which decreases the probability of correctly assigning each unit to the sources and eventually results in poorer separation performance. To validate this, Fig. 8.4 shows the spectrogram of the original sources, the mixed signal, and the estimated sources using the proposed Quasi-EM IS-NMF2D algorithm. The figure indicates that the STFT lacks provision for further low-level information about a TF unit; the resulting spectrogram therefore fails to infer the dominating source. This leads to a high degree of ambiguity in the TF domain and causes a lack of uniqueness in extracting the spectral-temporal features of the sources.

Fig. 8.4 Separation results in the spectrogram

Similar to the above, Fig. 8.5 shows the separation results based on the log-frequency spectrogram. Compared with the spectrogram, the separation performance is better, since the log-frequency spectrogram provides nonuniform time-frequency resolution. However, the transform underlying the log-frequency spectrogram is still the Fourier transform, which may not be optimal. On the other hand, separation in the cochleagram leads to a significant SDR improvement: the cochleagram renders the mixed signal more separable and thus reduces the mixing ambiguity between \(\left| {\mathbf{X}_1 } \right| ^{.2}\) and \(\left| {\mathbf{X}_2 } \right| ^{.2}\).

This explains why the average performance of separating the mixture of jazz music and a female utterance is the highest among all the mixtures: both sources have very distinguishable TF patterns in the cochleagram. This is evident in Fig. 8.6, which shows the separation results in the cochleagram. The plot clearly shows that the spectral energy of the two audio sources is clustered at different frequencies in the cochleagram owing to their different fundamental frequencies, and these prominent features have been separated using our proposed Quasi-EM IS-NMF2D algorithm.

Fig. 8.5 Separation results in the log-frequency spectrogram

Fig. 8.6 Separation results in the cochleagram

Fig. 8.7 a, b Original spectral bases of jazz music and female utterance in the cochleagram. c, d The corresponding estimated spectral bases

Fig. 8.8 a, b Original spectral bases of jazz music and female utterance in the spectrogram. c, d The corresponding estimated spectral bases

The performance of source separation also depends on how accurately the spectral bases are estimated. Given the different types of TF representation, a question arises as to which set of estimated spectral bases yields the better approximation to the respective original sources' spectral bases. Figure 8.7 shows the original and estimated spectral bases \(\mathbf{D}_i^\tau \) for the above mixture when the factorization is performed in the cochleagram: panels (a) and (b) show the original spectral bases of the jazz music and the female utterance, respectively, while panels (c) and (d) show the estimated spectral bases. For comparison, we have also included the factorization results of the same mixture in the spectrogram and the log-frequency spectrogram, shown in Figs. 8.8 and 8.9, respectively. In sharp contrast with Fig. 8.7, the estimated spectral bases in Figs. 8.8 and 8.9 are quite dissimilar to the original spectral bases. Thus, the construction of the separating mask inevitably introduces errors in assigning the TF units to the respective sources, and the recovered sources are very coarse, with the very low SDR values seen in Fig. 8.3.

Fig. 8.9 a, b Original spectral bases of jazz music and female utterance in the log-frequency spectrogram. c, d The corresponding estimated spectral bases

5.2 Comparison Between Different Cost Functions

In the following, experiments are conducted to evaluate the efficiency of the proposed algorithm under different cost functions; here, we consider the Least Square (LS) distance and the Kullback-Leibler (KL) divergence. Figure 8.10 shows the separation results in the cochleagram based on the LS, KL, and IS cost functions. It is noted that the Quasi-EM IS-NMF2D algorithm outperforms the LS distance and KL divergence by average SDR margins of 3.1 and 1.8 dB, respectively. This is because the IS divergence holds the desirable property of scale invariance, so that low-energy components can be precisely estimated and bear the same relative importance as the high-energy ones. On the contrary, factorizations obtained with the LS distance and KL divergence tend to favor the high-energy components at the expense of disregarding the low-energy ones. In the cochleagram, the dynamic range of the mixture signal can be considerably large, so the dominating signal at a particular TF unit can manifest as either a low- or a high-energy component; in addition, these components tend to exist as clusters. As such, when either the LS distance or the KL divergence is used, clusters with low energy tend to be ignored in favor of the high-energy ones. This leads to mixing ambiguities, especially for the low-energy components, which, when subsumed together, cause significant loss of the spectral-temporal information of the sources. Figure 8.11 shows how the different cost functions impact the separation performance. It is clearly seen that the LS-NMF2D algorithm fails to determine the correct TF components of each source: panels (a) and (b) show a considerable level of mixing ambiguity (red boxed areas) that has not been accurately resolved by the LS-NMF2D algorithm. The KL-NMF2D exhibits better performance but ignores some low-energy TF components in the red boxed area of panel (c). On the other hand, the proposed algorithm successfully extracts the low-energy components of both the female speech and the jazz music with high accuracy.

Fig. 8.10 Separation results with different cost functions

Fig. 8.11 Separation results: a, b, c, d, and e, f denote the recovered female speech and jazz music in the cochleagram using the algorithms with different cost functions

5.3 Comparing with Different SCBSS Methods

We have made a comparison with the recently published EMD SCBSS method [35], which first decomposes the given signal into spectrally independent modes using the EMD algorithm and then applies ICA to extract statistically independent sources. All of the above single channel BSS methods are tested across all types of mixture and compared in terms of SDR. Table 8.3 summarizes the comparison results: the Quasi-EM IS-NMF2D with the cochleagram leads to the best separation performance for all mixture types. The EMD SCBSS also performs with relatively acceptable results compared with the Quasi-EM IS-NMF2D. It is interesting to point out, however, that the Quasi-EM IS-NMF2D with the cochleagram is less complex than the EMD SCBSS while simultaneously retaining a higher level of separation performance.

Table 8.3 Separation results using different SCBSS methods

5.4 Separating More than Two Sources

The proposed method can be extended to the case of \(i>2\) sources. If more than two sources are mixed in a single channel, the number of sources to be separated must be specified. Since the method is blind, the separability of the complex mixture depends highly on how accurately the spectral bases \(\mathbf{D}_i^\tau \) can be estimated from the TF mixture. Consequently, a set of distinguishable spectral bases for each source is, in the generic case, a necessary condition for good separation performance. Thus, we adopt three different types of sources, e.g., jazz, piano, and trumpet, to generate a complex mixture. The convolutive components in the proposed algorithm are selected as \(\tau =\left\{ {0,\ldots ,3} \right\} \) and \(\phi =\left\{ {0,\ldots ,31} \right\} \). Table 8.4 shows the overall separation results. It is seen that mixtures generated from music sources alone have been recovered quite successfully. Figure 8.12 shows an example of separating the mixture of jazz, piano, and trumpet music; the three music sources are almost completely separated using the proposed method. In addition, the separation performance deteriorates as the number of sources increases beyond two: more sources means more interference when separating each target source and hence a higher probability of error. Comparing the results in the table, mixtures containing speech yield somewhat poorer performance than mixtures of music sources only. One reason is the greater overlap in the TF domain between the speech and music sources: it is observed from Fig. 8.6 that music pitches tend to jump discretely while speech pitches do not. Consequently, this leads to less efficient estimation of the spectral bases from the mixture signal. In addition, we have tested the performance of the proposed method on recordings mixed with \(i>3\) sources; the proposed method works well for mixtures of music sources characterized by distinguishable spectral bases, but the performance degrades when the mixture contains speech sources.

Table 8.4 Separation results of three sources
Fig. 8.12 Decomposition results. a–c denote the original jazz, piano, and trumpet music, d is the mixture, and e–g denote the recovered sources using the proposed method

5.5 Separating Real Music Recording

In the final experiment, the proposed method is tested on a professionally produced music recording, the well-known song "You Raise Me Up" performed by Kenny G. The music consists of two excerpts of approximately 23 s each on a mono channel, resampled to 16 kHz. The song is an instrumental piece consisting of saxophone and piano. The \(\tau \) and \(\phi \) shifts are set to \(\tau _{\max } =8\) and \(\phi _{\max } =32\). Since the original source spatial images are not available for this experiment, the separation performance is assessed perceptually and informally by analyzing the log-frequency spectrograms of the estimated source images and listening to the separated sounds. This is a tough task, since the instruments play many different notes in the recording. Figure 8.13 shows the separation results for the saxophone and piano sounds: the high pitch of the continuous saxophone sound is shown in the middle panel, while the notes of the piano are evidently present in the bottom panel. Overall, our proposed method successfully separates this professionally produced music recording and gives a perceptually pleasant listening experience.

Fig. 8.13 Separation result for the song "You Raise Me Up" by Kenny G. Top recorded music. Middle separated saxophone sound. Bottom separated piano sound

6 Conclusion

In this chapter, a novel method for single channel audio source separation is proposed. Two algorithms for nonnegative matrix two-dimensional factorization optimized under the Itakura-Saito divergence are presented: Quasi-EM IS-NMF2D and MGD IS-NMF2D. Coupled with the theoretical support of signal separability in the TF domain, the separation system combining the gammatone filterbank with these algorithms has been shown to yield considerable success. The proposed method enjoys at least three significant advantages. First, it separates sources without requiring training knowledge and thus avoids strong constraints. Second, the cochleagram rendered by the gammatone filterbank has nonuniform TF resolution, which makes the mixed signal more separable and thus improves the efficiency of source separation. Finally, the method holds the desirable property of scale invariance, which enables low-energy components in the cochleagram to bear the same relative importance as the high-energy ones. The proposed cochleagram-based IS-NMF2D method, in particular using the Quasi-EM algorithm, yields significant improvements in source separation compared with other nonnegative matrix factorizations.