12.1 Introduction

In this chapter we introduce multimicrophone methods for speech separation and noise reduction that are based on beamforming. Traditionally, beamforming methods are adopted from classical array processing techniques, in which a beam of high response is steered towards the desired source while other directions are suppressed. These methods were mainly applied in the communications and radar domains. They usually assume free-field propagation, i.e., that the angle-of-arrival fully determines the source position, although several design methods take multi-path propagation into account.

Statistically optimal beamformers are powerful multichannel filtering tools that optimize a certain design criterion while adapting to the received data; hence they are usually referred to as data-dependent approaches. A plethora of optimization criteria have been proposed. The MVDR beamformer, also referred to as the Capon beamformer [1], minimizes the noise power at the output of the beamformer subject to a unit-gain constraint in the look direction. Frost [2] presented an adaptive implementation of the MVDR beamformer for wideband signals. Griffiths and Jim [3] proposed the GSC, which is an efficient decomposition of the MVDR beamformer into two branches: one satisfying the constraint on the desired source, and the other minimizing the noise (and interference).

Several researchers, e.g. [4] (see also Van Veen and Buckley [5]), have proposed modifications to the MVDR beamformer to deal with multiple linear constraints, denoted linearly constrained minimum variance (LCMV). Their work was motivated by the desire to apply further control to the array beampattern, beyond that of a steer-direction gain constraint. Hence, the LCMV can be applied to construct a beampattern satisfying certain constraints for a set of directions, while minimizing the array response in all other directions. Breed and Strauss [6] proved that the LCMV extension also has an equivalent GSC structure, which decouples the constraining and the minimization operations. The multichannel Wiener filter (MWF) is another important beamforming criterion; it minimizes the mean squared error (MSE) between the desired signal and the array output. It can be shown that the MWF decomposes into an MVDR beamformer followed by a single-channel Wiener post-filter [7]. A comprehensive analysis of beamformer criteria can be found in [5, 8].

Speech signals usually propagate in acoustic enclosures, e.g., rooms, and not in free-field. In the presence of obstacles, the sound wave is subject to diffractions and reflections, depending on its wavelength. Due to the typically small absorption coefficients of the obstacles, many successive wave reflections occur before their power decays. This induces multiple propagation paths between each source and each microphone, each with a different delay and attenuation factor. This phenomenon is often referred to as reverberation. The AIR (and its respective ATF) encompasses all these reflections and is usually a very long (a few thousand taps) and time-varying filter.

Due to this intricate propagation regime, resorting to beampatterns as a function of the angle-of-arrival implies reducing a complex multi-parameter problem to an arbitrary single-parameter problem. Classical beamformers, which construct their steering vector under the assumption of free-field propagation, are often prone to performance degradation when applied in reverberant environments. It is therefore very important to take the reverberation effects into account when designing beamformers.

To circumvent the simplified free-field assumption, it was proposed [9, 10] to substitute the delay-only steering vector by the (normalized) ATFs relating the source and the microphones. This concept was later extended to the multiple sources scenario for extracting the desired source(s) from a mixture of desired and interference sources [11].

The MWF is also widely applied for speech enhancement, especially in its more flexible form, the speech distortion weighted-MWF (SDW-MWF) [12], which introduces a tradeoff factor that controls the amount of speech distortion versus the level of the residual noise.

A recent review paper [13] surveys many beamforming design criteria and their relation to BSS techniques.

12.2 Background

In this section we formulate the problem of speaker separation using spatial filtering methods. The signals and their propagation models are defined in Sect. 12.2.1. The following Sects. 12.2.2 and 12.2.3 are dedicated to defining spatial filters and criteria for evaluating their performances, respectively.

12.2.1 Generic Propagation Model

The speech sources are typically modeled in the STFT as quasi-stationary complex random processes with zero mean and time-varying variance, with stationarity time of the order of tens of milliseconds. Let us consider the case of J speech sources, denoted:

$$\begin{aligned} s_j(n, f) \sim \mathscr {N}\left( 0, \phi _{s_j}\left( n, f\right) \right) \end{aligned}$$
(12.1)

for \(j=1,\ldots ,J\), where \(\phi _{s_j}(n,f)\) denotes the time-varying spectrum of the j-th signal, the indices \(n=0,1,\ldots \) and \(f=0,1,\ldots ,F-1\) denote the time-frame and frequency-bin, respectively, and F denotes the STFT window length.

Given a microphone array comprising M microphones, the received microphone signals are given in an \(M\times 1\) vector notation by

$$\begin{aligned} {\mathbf {x}}(n, f) = \sum _{j=1}^{J}{\mathbf {c}}_{j}(n,f)+{\mathbf {u}}(n,f) \end{aligned}$$
(12.2)

where \({\mathbf {c}}_{j}(n,f)\) for \(j=1,\ldots ,J\) denotes the J vectors of speech sources as received by the microphone array and \({\mathbf {u}}(n,f)\) denotes the \(M\times 1\) dimensional vector comprising the noise components received at the microphones. Modeling the speech sources as coherent point sources and the AIR as a time-invariant convolutive system, the speech components at the microphones are modeled in the STFT domain as a simple multiplication

$$\begin{aligned} {\mathbf {c}}_{j}(n,f) \triangleq {\mathbf {a}}_{j}(f) s_{j}(n, f) \end{aligned}$$
(12.3)

where

$$\begin{aligned} {\mathbf {a}}_{j}(f) = \left[ \begin{array}{ccc}a_{j1}(f),&\cdots&a_{jM}(f)\end{array}\right] ^{T} \end{aligned}$$
(12.4)

denotes the \(M\times 1\) dimensional vector of ATFs relating the j-th source and the microphone array. Note that we assume that the STFT length is longer than the effective length of the AIR, such that convolution in the time-domain can be approximated as multiplication in the STFT domain (theoretically, only cyclic-convolution in the time-domain transforms to multiplication in the STFT domain [14]). Note also that we assume that the AIRs are time-invariant, i.e., the sources and the enclosure are static. This assumption can be relaxed to slowly time-varying environments, in which case the separating algorithm needs to adapt faster than the system variations. However, for brevity we consider here time-invariant systems. The covariance matrices of the sources are given by:

$$\begin{aligned} \varvec{\varPhi }_{c_j}(n, f) \triangleq \text {E}\left[ {\mathbf {c}}_{j}(n, f){\mathbf {c}}_{j}^{H}(n, f)\right] = {\mathbf {a}}_{j}(f){\mathbf {a}}_{j}^{H}(f)\phi _{s_j}(n, f). \end{aligned}$$
(12.5)

The noise-field is also assumed stationary, and the covariance matrix of its components at the microphone array is defined as:

$$\begin{aligned} \varvec{\varPhi }_{u}(f) = \text {E}\left[ {\mathbf {u}}(n, f){\mathbf {u}}^{H}(n, f)\right] . \end{aligned}$$
(12.6)

Note that the noise-stationarity assumption can be relaxed to slowly time-varying statistics; however, for ease of notation and derivation we assume that the noise is stationary.
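To make the signal model concrete, the following minimal NumPy sketch generates synthetic STFT-domain signals according to (12.1)-(12.3). The array sizes, the random ATF entries and the noise level are hypothetical placeholders, not values taken from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
M, J = 4, 2        # microphones, sources (hypothetical sizes)
N, F = 100, 257    # time frames, frequency bins

# Hypothetical ATF vectors a_j(f): random complex entries stand in
# for true room transfer functions; shape (J, F, M).
a = rng.standard_normal((J, F, M)) + 1j * rng.standard_normal((J, F, M))

# Quasi-stationary sources, Eq. (12.1): zero-mean complex Gaussian
# with time-varying variances phi_{s_j}(n, f).
phi_s = rng.uniform(0.5, 2.0, size=(J, N, F))
s = np.sqrt(phi_s / 2) * (rng.standard_normal((J, N, F))
                          + 1j * rng.standard_normal((J, N, F)))

# Source images c_j(n, f) = a_j(f) s_j(n, f), Eq. (12.3).
c = a[:, None, :, :] * s[..., None]            # shape (J, N, F, M)

# Stationary noise u(n, f) and the mixture x(n, f), Eq. (12.2).
u = 0.1 * (rng.standard_normal((N, F, M)) + 1j * rng.standard_normal((N, F, M)))
x = c.sum(axis=0) + u                          # shape (N, F, M)
```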

12.2.2 Spatial Filtering

The spatial filter which is designed to extract the j-th speech source is denoted by \({\mathbf {w}}_{j}(n,f)\). Its corresponding output is defined by:

$$\begin{aligned} y_{j}(n,f) \triangleq {\mathbf {w}}_{j}^{H}(n,f){\mathbf {x}}(n,f). \end{aligned}$$
(12.7)

Note that generally the spatial filter may vary over time. By substituting (12.2) into (12.7), the output of the j-th spatial filter is decomposed into different components:

$$\begin{aligned} y_{j}(n,f) = \sum _{j'=1}^{J}d_{j,j'}(n,f)+v_{j}(n, f) \end{aligned}$$
(12.8)

where

$$\begin{aligned} d_{j,j'}(n,f) \triangleq {\mathbf {w}}_{j}^{H}(n,f){\mathbf {c}}_{j'}(n,f) \end{aligned}$$
(12.9)

is the component that corresponds to the \(j'\)-th source at the output of the j-th spatial filter and

$$\begin{aligned} v_{j}(n,f) \triangleq {\mathbf {w}}_{j}^{H}(n,f){\mathbf {u}}(n,f) \end{aligned}$$
(12.10)

is the noise component at the j-th output. The aim of the j-th spatial filter is to maintain the j-th speech source, i.e., \(d_{j,j}(n,f)\approx s_{j}(n,f)\), attenuate the other speech sources, i.e., \(d_{j,j'}(n,f)\approx 0\) for \(j'\ne j\), and reduce the noise, i.e., \(v_{j}(n,f)\approx 0\). Note that aiming to obtain the dry signal of the j-th source (the original source before the convolution with the AIR) is a cumbersome task, and that in many practical scenarios obtaining the desired source as picked up by one of the microphones, denoted the reference microphone, is sufficient. Let us assume that the first microphone is selected as the reference microphone, and therefore the desired source at the output of the j-th spatial filter is \(d_{j,j}(n,f) = a_{j1}(f)s_{j}(n,f)\). The RTF vector relating the received components of the \(j'\)-th source at all microphones with its component at the reference microphone is defined as [10]:

$$\begin{aligned} {\tilde{\mathbf {a}}}_{j'}(f) \triangleq \frac{{\mathbf {a}}_{j'}(f)}{a_{j'1}(f)} \end{aligned}$$
(12.11)

for \(j'=1,\ldots ,J\). In the following sections, for the sake of clarity, we present the derivations of the various spatial filtering criteria using the ATF vectors rather than the RTF vectors.

12.2.3 Second-Order Moments and Criteria

Let us consider the output of the j-th spatial filter, which aims to extract the j-th source while reducing noise and other interfering speakers, and define criteria to evaluate its performance. The difference between the output of the spatial filter and the desired signal is denoted the error signal. The variance of the error signal, also known as the mean squared error (MSE) and denoted \(\chi _{j}(n, f)\), can be decomposed into its various components:

$$\begin{aligned} \chi _{j}(n, f) \triangleq&\,\text {E}\left[ \left| y_j(n, f)-s_j(n, f)\right| ^2\right] \nonumber \\ =&\,\delta _j(n, f)+\sum _{j'\ne j}\psi _{d_{j,j'}}(n,f)+\psi _{v_{j}}(n,f) \end{aligned}$$
(12.12)

where

$$\begin{aligned} \delta _{j}(n,f) \triangleq \text {E}\left[ \left| s_{j}(n,f)-d_{j,j}(n,f)\right| ^2\right] =|1-{\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j}(f)|^2\phi _{s_j}(n, f) \end{aligned}$$
(12.13)

is the distortion of the j-th source component,

$$\begin{aligned} \psi _{d_{j,j'}}(n,f) \triangleq \text {E}\left[ \left| d_{j,j'}(n,f)\right| ^2\right] =\left| {\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j'}(f)\right| ^2\phi _{s_{j'}}(n, f) \end{aligned}$$
(12.14)

is the variance of the residual \(j'\)-th signal component, for \(j'\ne j\) and

$$\begin{aligned} \psi _{v_{j}}(n,f) \triangleq \text {E}\left[ \left| v_{j}(n,f)\right| ^2\right] ={\mathbf {w}}_{j}^{H}(n,f)\varvec{\varPhi }_{u}(f){\mathbf {w}}_{j}(n,f) \end{aligned}$$
(12.15)

is the variance of the residual noise component.

For evaluating the distortion level at the enhanced j-th signal, we define the signal-to-distortion ratio (SDR) as the power ratio of the desired speech component and its distortion:

$$\begin{aligned} \text {SDR}_{\text {o},j}(n, f) \triangleq&\frac{\phi _{s_j}(n,f)}{\delta _{j}(n,f)}\nonumber \\ =&\frac{1}{\left| 1-{\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j}(f)\right| ^2}. \end{aligned}$$
(12.16)

To evaluate the noise reduction of the spatial filter we define the signal-to-noise ratio (SNR) improvement, denoted \(\varDelta \text {SNR}_{j}\), which is the ratio of the SNR at the output and at the input, denoted \(\text {SNR}_{\text {o},j}\) and \(\text {SNR}_{\text {i},j}\):

$$\begin{aligned} \text {SNR}_{\text {i},j}(n, f) \triangleq&\frac{\text {trace}\left( \varvec{\varPhi }_{c_j}(n, f)\right) }{\text {trace}\left( \varvec{\varPhi }_{u}(f)\right) }\end{aligned}$$
(12.17a)
$$\begin{aligned} \text {SNR}_{\text {o},j}(n, f) \triangleq&\frac{\psi _{d_{j,j}}(n, f)}{\psi _{v_{j}}(f)}\end{aligned}$$
(12.17b)
$$\begin{aligned} \varDelta \text {SNR}_{j}(n, f) \triangleq&\frac{\text {SNR}_{\text {o},j}(n, f)}{\text {SNR}_{\text {i},j}(n, f)}\nonumber \\ =&\frac{\left| {\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j}(f)\right| ^2 / {\mathbf {w}}_{j}^{H}(n,f)\varvec{\varPhi }_{u}(f){\mathbf {w}}_{j}(n,f)}{\Vert {\mathbf {a}}_{j}(f)\Vert ^2 / \text {trace}\left( \varvec{\varPhi }_{u}(f)\right) }. \end{aligned}$$
(12.17c)

Note that the last expression of \(\varDelta \text {SNR}_{j}\) is obtained by substituting the expressions from (12.5), (12.14), (12.15), (12.17a), and (12.17b).

The reduction of interfering speakers is evaluated using the SIR improvement, denoted \(\varDelta \text {SIR}_{jj'}\), defined for a pair of desired and interfering speakers, denoted j and \(j'\) respectively, as the ratio of the output SIR to the input SIR, denoted \(\text {SIR}_{\text {o},jj'}\) and \(\text {SIR}_{\text {i},jj'}\):

$$\begin{aligned} \text {SIR}_{\text {i},jj'}(n, f) \triangleq&\frac{\text {trace}\left( \varvec{\varPhi }_{c_j}(n, f)\right) }{\text {trace}\left( \varvec{\varPhi }_{c_{j'}}(n, f)\right) }\end{aligned}$$
(12.18a)
$$\begin{aligned} \text {SIR}_{\text {o},jj'}(n, f) \triangleq&\frac{\psi _{d_{j,j}}(n, f)}{\psi _{d_{j,j'}}(n, f)}\end{aligned}$$
(12.18b)
$$\begin{aligned} \varDelta \text {SIR}_{jj'}(n, f) \triangleq&\frac{\text {SIR}_{\text {o},jj'}(n, f)}{\text {SIR}_{\text {i},jj'}(n, f)}\nonumber \\ =&\frac{\left| {\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j}(f)\right| ^2 / \left| {\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j'}(f)\right| ^2}{\Vert {\mathbf {a}}_{j}(f)\Vert ^2 / \Vert {\mathbf {a}}_{j'}(f)\Vert ^2}. \end{aligned}$$
(12.18c)

Note that the last expression of \(\varDelta \text {SIR}_{jj'}\) is obtained by substituting the expressions from (12.5), (12.14), (12.18a) and (12.18b).

Finally, in order to evaluate the total interference and noise reduction, the signal-to-interference-and-noise ratio (SINR) improvement, denoted \(\varDelta \text {SINR}_{j}\), is defined as the ratio of the SINR at the output and at the input, denoted \(\text {SINR}_{\text {o},j}\) and \(\text {SINR}_{\text {i},j}\):

$$\begin{aligned} \text {SINR}_{\text {i},j}(n, f) =&\frac{\Vert {\mathbf {a}}_{j}(f)\Vert ^2\phi _{s_j}(n,f)}{\sum _{j'\ne j}{\Vert {\mathbf {a}}_{j'}(f)\Vert ^2\phi _{s_{j'}}(n,f)}+\text {trace}\left( \varvec{\varPhi }_{u}(f)\right) }\end{aligned}$$
(12.19a)
$$\begin{aligned} \text {SINR}_{\text {o},j}(n, f) =&\frac{\psi _{d_{j,j}}(n, f)}{\sum _{j'\ne j}{\psi _{d_{j,j'}}(n, f)}+\psi _{v_{j}}(f)}\end{aligned}$$
(12.19b)
$$\begin{aligned} \varDelta \text {SINR}_{j}(n, f) =&\frac{\left| {\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j}(f)\right| ^2}{\Vert {\mathbf {a}}_{j}(f)\Vert ^2} \cdot \nonumber \\&\cdot \frac{\sum _{j'\ne j}{\Vert {\mathbf {a}}_{j'}(f)\Vert ^2\phi _{s_{j'}}(n,f)}+\text {trace}\left( \varvec{\varPhi }_{u}(f)\right) }{\sum _{j'\ne j}{\left| {\mathbf {w}}_{j}^{H}(n,f){\mathbf {a}}_{j'}(f)\right| ^2\phi _{s_{j'}}(n, f)}+{\mathbf {w}}_{j}^{H}(n,f)\varvec{\varPhi }_{u}(f){\mathbf {w}}_{j}(n,f)}. \end{aligned}$$
(12.19c)
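As an illustration of these criteria, the following sketch evaluates the SINR improvement of (12.19c) for a given narrowband spatial filter; the function and argument names are ours, and all quantities are assumed known per frequency-bin.

```python
import numpy as np

def sinr_improvement(w, a, phi_s, Phi_u, j):
    """Delta SINR_j of Eq. (12.19c) for one (n, f) bin.

    w: (M,) spatial filter, a: (J, M) ATF vectors,
    phi_s: (J,) source PSDs, Phi_u: (M, M) noise PSD matrix.
    """
    J = a.shape[0]
    desired_gain = np.abs(w.conj() @ a[j]) ** 2 / np.linalg.norm(a[j]) ** 2
    # Interference-plus-noise power at the input (per Eq. (12.19a))...
    in_power = sum(np.linalg.norm(a[k]) ** 2 * phi_s[k]
                   for k in range(J) if k != j) + np.trace(Phi_u).real
    # ...and at the output (per Eq. (12.19b)).
    out_power = sum(np.abs(w.conj() @ a[k]) ** 2 * phi_s[k]
                    for k in range(J) if k != j) + (w.conj() @ Phi_u @ w).real
    return desired_gain * in_power / out_power
```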

12.3 Matched Filter

In this section, the matched filter (MF) spatial filtering method is presented. Its design criterion is defined and explained in Sect. 12.3.1, and its performance is analyzed in Sect. 12.3.2. The MF-based spatial filter was first introduced in [15], where it was implemented in the time domain.

12.3.1 Design

As suggested by its name, the matched filter is designed to match the ATFs of the desired source (here denoted as the j-th source). Formally, it is defined as:

$$\begin{aligned} \mathbf {w}^{\text {MF}}_{j}(f) \triangleq \frac{{\mathbf {a}}_{j}(f)}{\Vert {\mathbf {a}}_{j}(f)\Vert ^2} \end{aligned}$$
(12.20)

where the scaling is designed to maintain a distortionless response towards the desired source, i.e.,

$$\begin{aligned} \left( \mathbf {w}^{\text {MF}}_{j}(f)\right) ^{H}{\mathbf {a}}_{j}(f)=1 \end{aligned}$$
(12.21)

and therefore \(d_{j,j}(n,f)=s_j(n,f)\).

This criterion can be shown to be optimal in the sense of maximizing the output SNR for the case of a single source contaminated by spatially white noise. The main advantage of this spatial filter lies in its simplicity, as it is independent of the noise and interference properties. In the special case of a desired source signal arriving from the far-field in an anechoic environment, the matched filter reduces to the well-known delay-and-sum (DS) beamformer.
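A minimal sketch of the MF of (12.20) follows, with the unit-gain property (12.21) checked numerically; the example ATF vector is purely illustrative.

```python
import numpy as np

def matched_filter(a_j):
    """MF of Eq. (12.20): w = a_j / ||a_j||^2."""
    return a_j / np.linalg.norm(a_j) ** 2

a_j = np.array([1.0, 0.5 + 0.5j, -0.3j])    # hypothetical ATF vector
w = matched_filter(a_j)
assert np.isclose(w.conj() @ a_j, 1.0)      # distortionless, Eq. (12.21)
```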

12.3.2 Performance

As stated in the previous section, the matched filter is designed to pass the j-th source undistorted. Hence, by substituting (12.21) in (12.13) we obtain that the distortion equals zero

$$\begin{aligned} \delta _{j}^{\text {MF}}(n,f) = 0 \end{aligned}$$
(12.22)

and by following (12.16) the SDR of the j-th source is infinite, i.e.:

$$\begin{aligned} \text {SDR}^{\text {MF}}_{\text {o},j}(n,f) \rightarrow \infty . \end{aligned}$$
(12.23)

Since the MF is designed independently of the noise and interference sound fields, the SIR and SNR improvements are incidental. The spectrum of the \(j'\)-th interfering source at the j-th output, \(\psi _{d_{j,j'}}^{\text {MF}}(n,f)\), and the corresponding SIR improvement of the j-th source with respect to the \(j'\)-th interfering source are:

$$\begin{aligned} \psi _{d_{j,j'}}^{\text {MF}}(n,f) =&\frac{\Vert {\mathbf {a}}_{j'}(f)\Vert ^2}{\Vert {\mathbf {a}}_{j}(f)\Vert ^2}\left| \rho _{jj'}(f)\right| ^2\phi _{s_{j'}}(n,f)\end{aligned}$$
(12.24a)
$$\begin{aligned} \varDelta \text {SIR}^{\text {MF}}_{jj'}(f) =&\frac{1}{\left| \rho _{jj'}(f)\right| ^2} \end{aligned}$$
(12.24b)

where \(\rho _{jj'}(f)\) is defined as the normalized projection of the desired source ATF onto the interfering source ATF (per frequency-bin):

$$\begin{aligned} \rho _{jj'}(f) \triangleq \frac{{\mathbf {a}}_{j}^{H}(f){\mathbf {a}}_{j'}(f)}{\Vert {\mathbf {a}}_{j}(f)\Vert \cdot \Vert {\mathbf {a}}_{j'}(f)\Vert }. \end{aligned}$$
(12.25)

Note that from the Cauchy-Schwarz inequality \(0\le |\rho _{jj'}(f)|^2\le 1\), and therefore the SIR improvement is bounded by \(1 \le \varDelta \text {SIR}^{\text {MF}}_{jj'}(f)< \infty \). The SIR improvement approaches its upper bound, with the \(j'\)-th source being nulled by the MF of the j-th source, when their corresponding ATFs are orthogonal (i.e., \(\rho _{jj'}(f)=0\)).

By substituting (12.20) in (12.15) and (12.17c) the spectrum of the noise at the j-th output, and the corresponding SNR improvement are given by:

$$\begin{aligned} \psi _{v_{j}}^{\text {MF}}(f) =&\frac{{\mathbf {a}}_{j}^{H}(f)\varvec{\varPhi }_{u}(f){\mathbf {a}}_{j}(f)}{\Vert {\mathbf {a}}_{j}(f)\Vert ^4}\end{aligned}$$
(12.26a)
$$\begin{aligned} \varDelta \text {SNR}^\text {MF}_{j}(f) =&\frac{\Vert {\mathbf {a}}_{j}(f)\Vert ^2\cdot \text {trace}\left( \varvec{\varPhi }_{u}(f)\right) }{{\mathbf {a}}_{j}^{H}(f)\varvec{\varPhi }_{u}(f){\mathbf {a}}_{j}(f)}. \end{aligned}$$
(12.26b)

12.4 Multichannel Wiener Filter

In this section we present the MWF and analyze its performance.

12.4.1 Design

Considering the problem of enhancing the j-th source, recall that the MSE of an arbitrary spatial filter \({\mathbf {w}}(f)\) is denoted \(\chi _{j}(n,f)\) and defined by (12.12). The MSE comprises the following components: (a) the distortion (denoted \(\delta _j(n,f)\)); (b) the residual interferers' spectra (denoted \(\psi _{d_{j,j'}}(n,f)\), for \(j'\ne j\)); (c) the residual noise spectrum (denoted \(\psi _{v_{j}}\left( n,f\right) \)). The MWF is designed to minimize the MSE:

$$\begin{aligned} \mathbf {w}^{\text {WF}}_{j}(n,f) \triangleq&\text {argmin}_{{\mathbf {w}}}{\chi _{j}(n, f)}\nonumber \\ =&\frac{\left( \sum _{j'\ne j}{\varvec{\varPhi }_{c_{j'}}(n, f)}+\varvec{\varPhi }_{u}(f)\right) ^{-1}{\mathbf {a}}_{j}\left( f\right) }{{\mathbf {a}}_{j}^{H}\left( f\right) \left( \sum _{j'\ne j}{\varvec{\varPhi }_{c_{j'}}(n, f)}+\varvec{\varPhi }_{u}(f)\right) ^{-1}{\mathbf {a}}_{j}\left( f\right) +1/\phi _{s_j}(n,f)}. \end{aligned}$$
(12.27)

Note that computing the MWF in (12.27) requires knowledge of: (a) the power spectral density (PSD) of the desired source (denoted \({\phi _{s_j}}(n,f)\)); (b) the ATFs of the desired source (denoted \({\mathbf {a}}_{j}(f)\)); (c) the PSD matrices of the interferers (denoted \(\varvec{\varPhi }_{c_{j'}}(n,f)\) for \(j'\ne j\)); (d) the PSD matrix of the noise (denoted \(\varvec{\varPhi }_{u}(f)\)). The estimation of these parameters is discussed in detail in Sect. 12.6.

Although practical methods exist for estimating the required parameters and implementing the MWF in the single-source case, some relaxation is required when considering the multiple-source case. The long-term averaged SOS MWF is defined similarly to (12.27), by replacing the instantaneous PSD of the desired source and the PSD matrices of the interferers with long-term averages:

$$\begin{aligned} \overline{\mathbf {w}}^{\text {WF}}_{j}(f) = \frac{\left( \sum _{j'\ne j}{\varvec{\overline{{\varPhi }}}_{c_{j'}}(f)}+\varvec{\varPhi }_{u}(f)\right) ^{-1}{\mathbf {a}}_{j}\left( f\right) }{{\mathbf {a}}_{j}^{H}\left( f\right) \left( \sum _{j'\ne j}{\varvec{\overline{{\varPhi }}}_{c_{j'}}(f)}+\varvec{\varPhi }_{u}(f)\right) ^{-1}{\mathbf {a}}_{j}\left( f\right) +1/{\overline{\phi }_{s_j}(f)}} \end{aligned}$$
(12.28)

where

$$\begin{aligned} {\bar{\phi }_{s_j}}(f) \triangleq&\frac{1}{\sum _{n}{1_{s_j}(n, f)}}\sum _{n}{1_{s_j}(n, f){\text {E}}\left[ |s_j(n,f)|^2\right] }\end{aligned}$$
(12.29a)
$$\begin{aligned} \varvec{\overline{{\varPhi }}}_{c_{j'}} \triangleq&\frac{1}{\sum _{n}{1_{s_{j'}}(n, f)}}\sum _{n}{1_{s_{j'}}(n, f){\text {E}}\left[ {\mathbf {c}}_{j'}(n,f){\mathbf {c}}_{j'}^{H}(n,f)\right] } \end{aligned}$$
(12.29b)

and \(1_{s_{j'}}(n,f)\) denotes an indicator function which equals 1 for time-frequency bins in which the \(j'\)-th source is active, for \(j'\in \left\{ 1,\ldots ,J\right\} \).
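For illustration, a per-bin MWF of (12.27) can be sketched as follows, assuming the required PSDs and ATFs (see Sect. 12.6) are already estimated; the names are ours.

```python
import numpy as np

def mwf(a_j, phi_sj, Phi_in):
    """MWF of Eq. (12.27) for one (n, f) bin.

    a_j: (M,) ATF vector of the desired source, phi_sj: its PSD,
    Phi_in: (M, M) PSD matrix of all interferers plus noise,
    i.e. sum over j' != j of Phi_{c_j'}, plus Phi_u.
    """
    Pinv_a = np.linalg.solve(Phi_in, a_j)                # Phi_in^{-1} a_j
    return Pinv_a / (a_j.conj() @ Pinv_a + 1.0 / phi_sj)
```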

12.4.2 Performance

By substituting (12.27) in (12.19c), the SINR improvement of the MWF can be shown to be

$$\begin{aligned} \varDelta \text {SINR}^{\text {WF}}_{j}(n,f) =&\frac{1}{\Vert {\mathbf {a}}_{j}(f)\Vert ^2}\left( \sum _{j'\ne j}{\Vert {\mathbf {a}}_{j'}(f)\Vert ^2\phi _{s_{j'}}(n,f)}+\text {trace}\left( \varvec{\varPhi }_{u}(f)\right) \right) \nonumber \\&\times {\mathbf {a}}_{j}^{H}(f)\left( \sum _{j'\ne j}{\phi _{s_{j'}}(n,f){\mathbf {a}}_{j'}(f){\mathbf {a}}^{H}_{j'}(f)}+\varvec{\varPhi }_{u}(f)\right) ^{-1}{\mathbf {a}}_{j}(f). \end{aligned}$$
(12.30)

Note that the MWF may introduce distortion to the desired source, as long as doing so minimizes the variance of the total error between the desired source and the MWF output. This distortion may become high, for example, in low-SIR cases when the number of interfering speech sources is larger than the number of microphones, i.e., \(J-1>M\).

12.5 Multichannel LCMV

The criterion for designing the LCMV spatial filter is defined in Sect. 12.5.1, and its performance is analyzed in Sect. 12.5.2.

12.5.1 Design

Let us consider the design of the LCMV spatial filter which enhances the j-th speech source. The LCMV is designed to satisfy a set of J linear constraints, one for each speech source, that are defined by:

$$\begin{aligned} \mathbf {A}^{H}(f)\mathbf {w}^{\text {LCMV}}_{j}(f)=\mathbf {g}_{j} \end{aligned}$$
(12.31)

where

$$\begin{aligned} \mathbf {A}(f) \triangleq \left[ \begin{array}{ccc}{\mathbf {a}}_{1}(f),&\cdots ,&{\mathbf {a}}_{J}(f)\end{array}\right] \end{aligned}$$
(12.32)

is the source ATFs matrix and

$$\begin{aligned} \mathbf {g}_{j} \triangleq \left[ \begin{array}{ccc}\mathbf {0}_{1\times \left( j-1\right) }&1&\mathbf {0}_{1\times \left( J-j\right) }\end{array}\right] ^{T} \end{aligned}$$
(12.33)

is the desired response for each of the sources and \(\mathbf {w}^{\text {LCMV}}_{j}(f)\) denotes the LCMV spatial filter at the f-th frequency-bin. Note that the desired response for the j-th source is \(g_{j,j}=1\), i.e., pass the j-th source undistorted, and the desired response for all other sources is \(g_{j,j'}=0\) for \(j'\ne j\), i.e., null all other sources.

The LCMV spatial filter is defined as the optimal solution of the following criterion:

$$\begin{aligned} \mathbf {w}^{\text {LCMV}}_{j}(f) \triangleq&\text {argmin}_{{\mathbf {w}}} {\mathbf {w}}^{H} \varvec{\varPhi }_{u}(f){\mathbf {w}};\; \text {s.t.}\; \mathbf {A}^{H}(f){\mathbf {w}}=\mathbf {g}_{j} \end{aligned}$$
(12.34)

which aims to minimize the power of the noise at the output of the spatial filter (defined by (12.15)) while satisfying the linear constraints set, defined in (12.31). The closed-form solution of the optimization problem in (12.34) is given by:

$$\begin{aligned} \mathbf {w}^{\text {LCMV}}_{j}(f) = \varvec{\varPhi }_{u}^{-1}(f)\mathbf {A}(f)\left( \mathbf {A}^{H}(f)\varvec{\varPhi }_{u}^{-1}(f)\mathbf {A}(f)\right) ^{-1}\mathbf {g}_{j}. \end{aligned}$$
(12.35)
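The closed-form solution (12.35) translates directly into a few lines of NumPy; this sketch assumes the ATF matrix and the noise PSD matrix are known, and uses 0-based indexing for the desired source.

```python
import numpy as np

def lcmv(A, Phi_u, j):
    """LCMV of Eq. (12.35); A: (M, J) ATF matrix, Phi_u: (M, M) noise
    PSD matrix, j: index of the desired source (0-based)."""
    J = A.shape[1]
    g = np.zeros(J, dtype=complex)
    g[j] = 1.0                                   # response vector, Eq. (12.33)
    Pinv_A = np.linalg.solve(Phi_u, A)           # Phi_u^{-1} A
    return Pinv_A @ np.linalg.solve(A.conj().T @ Pinv_A, g)
```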

An alternative form for implementing the LCMV, denoted GSC [3], conveniently separates the tasks of constraining the spatial filter and minimizing the noise variance. Additionally, the GSC can be efficiently implemented as a time-recursive procedure which tracks the noise statistics, adapts to them, and converges to the optimal LCMV solution.

The performance and behavior of the LCMV differ from those of the MWF. On the one hand, the MWF gives equal weight to the three sources of error, i.e., distortion, interfering speakers, and noise: the spatial filter is designed such that the sum of the three is minimized at the output. On the other hand, the LCMV maintains the desired signal undistorted and nulls (zero response) the interfering speech signals at the output. The remaining degrees of freedom (DoF) are used to minimize the noise at the output. By doing so, conceptually, the LCMV gives significantly higher weights to the distortion and interfering speech components compared to the weight of the noise component. In [16], the multiple speech distortion weighted-MWF (MSDW-MWF) criterion, which generalizes both the MWF and LCMV criteria, is defined. The latter enables component-specific weights for each of the error sources at the output of the spatial filter, extending the SDW-MWF to the multiple-speaker case.

12.5.2 Performance

By design, the LCMV satisfies a set of J linear constraints, one per speech source. The constraint that corresponds to the j-th desired source is designed to maintain a distortionless response towards this source, and therefore the distortion equals zero

$$\begin{aligned} \delta _{j}^{\text {LCMV}}(n,f) = 0 \end{aligned}$$
(12.36)

and correspondingly the SDR is infinite

$$\begin{aligned} \text {SDR}^{\text {LCMV}}_{\text {o},j}(n,f) \rightarrow \infty . \end{aligned}$$
(12.37)

Similarly, as the rest of the \(J-1\) constraints are associated with interfering speech sources and are designed to null them out, their corresponding SIRs are infinite:

$$\begin{aligned} \varDelta \text {SIR}^{\text {LCMV}}_{jj'}(f) \rightarrow \infty . \end{aligned}$$
(12.38)

By substituting (12.35) in (12.15), the noise variance at the output of the LCMV and the corresponding SINR improvement are:

$$\begin{aligned} \psi _{v_{j}}^{\text {LCMV}}(f) =&\mathbf {g}_{j}^{H}\left( \mathbf {A}^{H}(f)\varvec{\varPhi }_{u}^{-1}(f)\mathbf {A}(f)\right) ^{-1}\mathbf {g}_{j}\end{aligned}$$
(12.39a)
$$\begin{aligned} \varDelta \text {SINR}^{\text {LCMV}}_{j}(n,f) =&\frac{\sum _{j'\ne j}{\Vert {\mathbf {a}}_{j'}(f)\Vert ^2\phi _{s_{j'}}(n,f)}+\text {trace}\left( \varvec{\varPhi }_{u}(f)\right) }{\Vert {\mathbf {a}}_{j}(f)\Vert ^2\,\mathbf {g}_{j}^{H}\left( \mathbf {A}^{H}(f)\varvec{\varPhi }_{u}^{-1}(f)\mathbf {A}(f)\right) ^{-1}\mathbf {g}_{j}}. \end{aligned}$$
(12.39b)

12.6 Parameters Estimation

12.6.1 Multichannel SPP Estimators

Speech presence probability (SPP) estimation is a fundamental and crucial component of many speech enhancement algorithms, among them the spatial filters described in the previous sections, where the SPP governs the adaptation of various components contributing to the calculation of the spatial filter. Specifically, it can be used to govern the estimation of the noise and speech covariance matrices (see Sect. 12.6.2) and of the RTFs (see Sect. 12.6.3).

The problem of estimating the SPP is derived from the classic detection problem, also known as the radar problem; its goal is to identify the temporal-spectral activity pattern of speech contaminated by noise, i.e., to determine whether a time-frequency bin contains a noisy speech component or just noise. Contrary to the VAD problem, where low resolution is sufficient, high-resolution activity estimation in both time and frequency is required here for proper enhancement. Most single-channel SPP estimators are based on the non-stationarity of speech as opposed to the stationarity of the noise. However, in low-SNR cases the accuracy of the estimation degrades.

When utilized for controlling the gain in single-channel postfiltering, the estimated SPP is “tuned” to have a tendency towards speech. This relates to the single-channel processing tradeoff between speech distortion and noise reduction, and to the common understanding that speech distortion and artifacts are more detrimental for human listeners than an increased noise level. In contrast to single-channel enhancement, where the effect of SPP errors (i.e., false alarms and missed detections) is short-term (in time), in spatial processing the consequences of such errors can be grave and spread over a longer period. Missed detections of speech, and its false classification as noise, might lead to major distortion, also known as the self-cancellation phenomenon. On the other hand, false alarms, i.e., time-frequency bins containing noise which are mistakenly classified as desired speech, result in an increased noise level at the output of the spatial filter, since it is designed to pass them through.

Several contributions extend SPP estimation to utilize spatial information when using an array of microphones. Here we present some of these methods. In [17], which is presented in Sect. 12.6.1.1, the single-channel Gaussian signal model of both speech and noise is extended to multichannel input, yielding a multichannel SPP. In Sect. 12.6.1.2, the work of [18] is presented, which suggests incorporating the spatial information embedded in the direct-to-reverberant ratio (DRR) into the speech a priori probability (SAP), thereby utilizing the coherence property of the speech source under a diffuse-noise assumption. Multichannel SPP incorporating spatial diversity can be utilized to address complex scenarios with multiple speakers. In [19, 20], the authors extend the DRR-based SAP and incorporate estimated speaker positions to distinguish between different speakers, see Sect. 12.6.1.3.

12.6.1.1 Multichannel Gaussian Variables Model Based SPP

All derivations in this section refer to a specific time-frequency bin (n, f) and are replicated for all time-frequency bins. For brevity, the time and frequency indices are omitted in the rest of this section. The received microphone signals

$$\begin{aligned} \mathbf {x} = \mathbf {c}+\mathbf {u} \end{aligned}$$
(12.40)

and the speech and noise components thereof, are modeled as Gaussian random variables:

$$\begin{aligned} \mathbf {c} \sim&\mathscr {N}_{c}\left( \mathbf {0}, \varvec{\varPhi }_{c}\right) \end{aligned}$$
(12.41a)
$$\begin{aligned} \mathbf {u} \sim&\mathscr {N}_{c}\left( \mathbf {0}, \varvec{\varPhi }_{u}\right) \end{aligned}$$
(12.41b)

where \(\varvec{\varPhi }_{c}=\phi _{s}\mathbf {a}\mathbf {a}^{H}\) is the covariance matrix of the speech image at the microphone signals. Consequently, a multichannel Gaussian model is adopted for the noise-only and the noisy-speech hypotheses:

$$\begin{aligned} \mathbf {x}|\mathscr {H}_{u} \sim&\mathscr {N}_{c}\left( \mathbf {0}, \varvec{\varPhi }_{u}\right) \end{aligned}$$
(12.42a)
$$\begin{aligned} \mathbf {x}|\mathscr {H}_{s} \sim&\mathscr {N}_{c}\left( \mathbf {0}, \varvec{\varPhi }_{c}+\varvec{\varPhi }_{u}\right) . \end{aligned}$$
(12.42b)

It can be shown [17] that the SPP, defined as:

$$\begin{aligned} p \triangleq P\left( \mathscr {H}_{s}|\mathbf {x}\right) \end{aligned}$$
(12.43)

can be formulated as

$$\begin{aligned} p = \frac{\varLambda }{1+\varLambda } \end{aligned}$$
(12.44)

where \(\varLambda \) is the generalized likelihood ratio, which in our case equals

$$\begin{aligned} \varLambda =\frac{1-q}{q}\cdot \frac{1}{1+\text {tr}\left\{ \varvec{\varPhi }_{u}^{-1}\varvec{\varPhi }_{c}\right\} }\cdot \exp \left\{ \frac{\varvec{x}^{H}\varvec{\varPhi }_{u}^{-1}\varvec{\varPhi }_{c}\varvec{\varPhi }_{u}^{-1}\varvec{x}}{1+\text {tr}\left\{ \varvec{\varPhi }_{u}^{-1}\varvec{\varPhi }_{c}\right\} }\right\} . \end{aligned}$$
(12.45)

where q is the a priori speech absence probability. Define the multichannel SNR as

$$\begin{aligned} \xi \triangleq \text {tr}\left\{ \varvec{\varPhi }_{u}^{-1}{\varvec{\varPhi }}_{c}\right\} \end{aligned}$$
(12.46)

and also define

$$\begin{aligned} \beta \triangleq \mathbf {x}^{H}\varvec{\varPhi }_{u}^{-1}\varvec{\varPhi }_{c}\varvec{\varPhi }_{u}^{-1}\mathbf {x}. \end{aligned}$$
(12.47)

Substituting (12.45), (12.46) and (12.47) in (12.44) yields the multichannel SPP:

$$\begin{aligned} p = \left\{ 1+\frac{q}{1-q}\cdot \left( 1+\xi \right) \cdot \exp \left\{ -\frac{\beta }{1+\xi }\right\} \right\} ^{-1}. \end{aligned}$$
(12.48)
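A direct per-bin implementation of (12.44)-(12.48) might look as follows; it assumes \(\varvec{\varPhi }_{u}\) is Hermitian (so that \(\mathbf {x}^{H}\varvec{\varPhi }_{u}^{-1}=(\varvec{\varPhi }_{u}^{-1}\mathbf {x})^{H}\)), and the function name is ours.

```python
import numpy as np

def multichannel_spp(x, Phi_u, Phi_c, q):
    """Multichannel SPP of Eq. (12.48) for one (n, f) bin; q is the
    a priori speech absence probability."""
    xi = np.trace(np.linalg.solve(Phi_u, Phi_c)).real      # SNR, Eq. (12.46)
    Pinv_x = np.linalg.solve(Phi_u, x)                     # Phi_u^{-1} x
    beta = (Pinv_x.conj() @ Phi_c @ Pinv_x).real           # Eq. (12.47)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * np.exp(-beta / (1.0 + xi)))
```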

Note that the single-channel SPP (of the first microphone) can be derived as a special case of the multichannel SPP by substituting

$$\begin{aligned} \xi _{1} =&\frac{\varPhi _{c,11}}{\varPhi _{u,11}}\end{aligned}$$
(12.49a)
$$\begin{aligned} \beta _{1} =&\gamma _{1}\cdot \xi _{1} \end{aligned}$$
(12.49b)

with \(\gamma _{1}\triangleq \frac{|x_{1}|^2}{\varPhi _{u,11}}\) defined as the posterior SNR, and \(\varPhi _{c,11}\), \(\varPhi _{u,11}\) denoting the speech and noise variances at the first microphone, respectively. The multichannel SPP can be interpreted as a single-channel SPP applied to the output of an MVDR spatial filter designed to minimize the noise while maintaining a distortionless response towards the speech, with corresponding covariance matrices \(\varvec{\varPhi }_{u}\) and \(\varvec{\varPhi }_{c}\), respectively.

The improvement of using the multichannel SPP depends on the spatial properties of the noise and of the speech. Two interesting special cases are the spatially white noise case and the coherent noise case. In the first case of spatially white noise, the noise covariance matrix equals \(\varvec{\varPhi }_{u}=\phi _{u}\mathbf {I}\), where \(\mathbf {I}\) is the identity matrix. For this case the multichannel SNR equals \(M\cdot \xi _{1}\) and is higher than the single-channel SNR by a factor of the number of microphones (assuming that the SNRs at all microphones are equal). In the second case of a coherent noise, the noise covariance matrix equals \(\varvec{\varPhi }_{u}=\mathbf {a}_{u}\mathbf {a}_{u}^{H}\phi _{u,\text {c}}+\phi _{u,\text {nc}}\mathbf {I}\), where \(\mathbf {a}_{u}\) and \(\phi _{u,\text {c}}\) are the vector of ATFs relating the coherent interference and the microphone signals, and its respective variance. It is further assumed that the microphones also contain spatially white noise components with variance \(\phi _{u,\text {nc}}\). In this case, perfect speech detection is obtained, i.e., \(p|\mathscr {H}_{s}\rightarrow 1\) and \(p|\mathscr {H}_{u}\rightarrow 0\), regardless of the coherent noise power, assuming that the ATF vectors of the speech and the coherent noise are not parallel and that the spatially white sensor noise power \(\phi _{u,\text {nc}}\) is sufficiently low.

12.6.1.2 Coherence Based SAP

As presented in Sect. 12.6.1.1, computing the SPP requires the speech a priori probability (SAP), denoted q. The SAP can be either set to a constant [21] or derived from the received signals and updated adaptively according to past estimates of SPP and SNR [22, 23] (also known as the decision-directed approach). In [19] the multichannel generalization of the SPP (see Sect. 12.6.1.1 and [17]) is adopted and it is proposed to incorporate coherence information in the SAP.

Let us consider a scenario where a single desired speech component contaminated by a diffuse noise is received by a pair of omnidirectional microphones. The diffuse noise field can be modeled as an infinite number of equal power statistically independent interferences uniformly distributed over a sphere surrounding the microphone array. A well known result [24] is that the coherence of diffuse noise components received by a pair of microphones is

$$\begin{aligned} \gamma _{\text {diff}}(\ell , \lambda ) = \text {sinc}\left( \frac{2\pi \ell }{\lambda }\right) \end{aligned}$$
(12.50)

where \(\lambda \) is the wavelength and \(\ell \) is the microphone spacing. The direct-to-diffuse ratio (DDR) is defined as the SNR in this case, i.e., the power ratio of the directional speech received by the microphone and the diffuse noise. Heuristically, high and low DDR values are mapped to low and high SAP, respectively. The estimation of the DDR is based on a sound field model where the sound pressure at any position and time-frequency bin is modeled as a superposition of a direct sound, represented by a single monochromatic plane wave, and an ideal diffuse field; for more details please refer to [19, 25]. The DDR is estimated by:

$$\begin{aligned} \varGamma = \text {Re}\left\{ \frac{\gamma _{\text {diff}}-\hat{\gamma }}{\hat{\gamma }-\exp (\text {j}\hat{\theta })}\right\} \end{aligned}$$
(12.51)

where

$$\begin{aligned} \gamma \triangleq \frac{\text {E}\left[ x_{1} x_{2}^{*}\right] }{\sqrt{\text {E}\left[ |x_{1}|^2\right] \cdot \text {E}\left[ |x_{2}|^2\right] }} \end{aligned}$$
(12.52)

is the coherence between the microphone signals and \(\theta \triangleq \angle \left( c_1\cdot c_{2}^{*} \right) \) is the phase between the speech components received by the microphones. The coherence is computed from estimates of the auto-PSDs and cross-PSD of the microphones (see Sect. 12.6.2), and the phase \(\theta \) is approximated from the phase of the cross-PSD by:

$$\begin{aligned} \hat{\theta } = \angle \left( \text {E}\left[ x_{1} x_{2}^{*}\right] \right) \end{aligned}$$
(12.53)

assuming that both SNR and DDR are high.
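The DDR estimator (12.50)-(12.53) for a microphone pair can be sketched as follows from (cross-)PSD estimates; the function name is ours, and the mapping of the DDR to a SAP value is application-dependent and only hinted at here.

```python
import numpy as np

def ddr_estimate(phi_x1x2, phi_x1, phi_x2, ell, wavelength):
    """DDR of Eq. (12.51) from the cross-PSD phi_x1x2 and auto-PSDs
    phi_x1, phi_x2 of a microphone pair with spacing ell."""
    gamma_hat = phi_x1x2 / np.sqrt(phi_x1 * phi_x2)     # coherence, Eq. (12.52)
    # np.sinc(x) = sin(pi x)/(pi x), so sinc(2*pi*ell/lambda) = np.sinc(2*ell/lambda)
    gamma_diff = np.sinc(2 * ell / wavelength)          # Eq. (12.50)
    theta_hat = np.angle(phi_x1x2)                      # Eq. (12.53)
    return np.real((gamma_diff - gamma_hat) / (gamma_hat - np.exp(1j * theta_hat)))
```

A high estimated DDR would then be mapped, heuristically, to a low SAP q, and vice versa.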

12.6.1.3 Multiple Speakers Position Based SPP

Consider the J speakers scenario in which the microphone signals can be formulated as:

$$\begin{aligned} \mathbf {x} = \sum _{j=1}^{J}\mathbf {c}_{j}+\mathbf {u}. \end{aligned}$$
(12.54)

In [26], the authors propose to use an MWF for extracting a desired source from a multichannel convolutive mixture of sources. By incorporating position estimates into the SPP and classifying the dominant speaker per time-frequency bin, the PSD matrix of the “interference” components, comprising noise and interfering speakers, and the PSD matrix of the desired speaker components are estimated and utilized for constructing a spatial filter. Speaker positions are derived by triangulation of DOA estimates obtained from distributed sub-arrays of microphones with known positions.

Individual sources SPPs are defined as:

$$\begin{aligned} p_j\triangleq p\left( \mathscr {H}_{s_j}|\mathbf {x}\right) =p\left( \mathscr {H}_{s_j}|\mathbf {x},\mathscr {H}_{s}\right) p \end{aligned}$$
(12.55)

where \(\mathscr {H}_{s_j}\) denotes the hypothesis that the j-th speaker is active (per time-frequency point), and p is the previously defined SPP (for any speaker activity).

The conditional SPPs given the microphone signals are replaced by conditional SPPs given an estimated position of the dominant active speaker, denoted \(\hat{\varvec{\varTheta }}\), i.e., it is assumed that:

$$\begin{aligned} p\left( \mathscr {H}_{s_j}|\hat{\varvec{\varTheta }},\mathscr {H}_{s}\right) \approx p\left( \mathscr {H}_{s_j}|\mathbf {x},\mathscr {H}_{s}\right) . \end{aligned}$$
(12.56)

The estimated position, given that a specific speaker is active, is modeled as a mixture of Gaussian variables centered at the sources’ positions:

$$\begin{aligned} p\left( \hat{\varvec{\varTheta }}|\mathscr {H}_{s}\right) =\sum _{j=1}^{J}\pi _{j}\mathscr {N}\left( \hat{\varvec{\varTheta }};\varvec{\mu }_{j},\varvec{\varOmega }_{j}\right) \end{aligned}$$
(12.57)

where \(\varvec{\mu }_{j}\), \(\varvec{\varOmega }_{j}\) and \(\pi _{j}\) are the mean, covariance, and mixing coefficient of the Gaussian distribution corresponding to the estimated position of the j-th source, for \(j=1,\ldots ,J\). The parameters of the distribution of \(\hat{\varvec{\varTheta }}\) are estimated by an expectation maximization (EM) procedure given a batch of estimated positions. For a detailed explanation please refer to [26].
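Given fitted mixture parameters, the per-speaker responsibilities implied by (12.55)-(12.57) reduce to evaluating Gaussian densities; a minimal sketch, with parameters assumed to come from an EM fit as in [26] and function names of our choosing:

```python
import numpy as np

def speaker_posteriors(theta_hat, mu, Omega, pi):
    """p(H_{s_j} | theta_hat, H_s) under the mixture of Eq. (12.57).

    theta_hat: (D,) estimated position; mu: (J, D) means,
    Omega: (J, D, D) covariances, pi: (J,) mixing coefficients.
    """
    J, D = mu.shape
    dens = np.empty(J)
    for j in range(J):
        d = theta_hat - mu[j]
        quad = d @ np.linalg.solve(Omega[j], d)          # Mahalanobis term
        dens[j] = pi[j] * np.exp(-0.5 * quad) / np.sqrt(
            (2 * np.pi) ** D * np.linalg.det(Omega[j]))
    return dens / dens.sum()                             # normalized posteriors
```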

This work is further extended in [19], where an MWF is designed to extract sources arriving from a predefined “spot”, i.e., a bounded area, while suppressing all other sources outside of the spot. This method is denoted spotforming.

12.6.2 Covariance Matrix Estimators

The noise covariance matrix can be estimated by recursively averaging instantaneous covariance matrices weighted according to the SPP:

$$\begin{aligned} \varvec{\widehat{{\varPhi }}_{u}}\left( n,f\right) =&{\lambda _u'}(n, f)\varvec{\widehat{{\varPhi }}_{u}}\left( n-1,f\right) \nonumber \\&+\left( 1-{\lambda _u'}(n, f)\right) {\mathbf {x}}(n,f){\mathbf {x}^{H}}(n,f) \end{aligned}$$
(12.58)

where

$$\begin{aligned} {\lambda _u'}(n, f) \triangleq \left( 1-p(n, f)\right) {\lambda _u}+p(n, f) \end{aligned}$$
(12.59)

is a time-varying recursive averaging factor and \({\lambda _u}\) is selected such that its corresponding estimation period (\(\frac{1}{1-{\lambda _u}}\) frames) is shorter than the stationarity time of the noise. Alternatively, a hard binary weighting, obtained by applying a threshold to the SPP, can be used instead of the soft weighting.

The hypothesis that speaker j is present and the corresponding SPP are denoted, as in Sect. 12.6.1.3, by \({\mathscr {H}_{s_j}}(n,f)\) and \(p_j(n,f)\), respectively. Similarly to (12.58), the covariance matrix of the spatial image of source j, denoted \({\varvec{\varPhi }_{c_j}}(n,f)\), can be estimated by

$$\begin{aligned} \varvec{\widehat{{\varPhi }}_{c_j}}&(n,f) = {\lambda '_{c_j}}(n,f)\varvec{\widehat{{\varPhi }}_{c_j}}(n-1,f)\nonumber \\&+(1-{\lambda '_{c_j}}(n,f))({\mathbf {x}}(n,f){\mathbf {x}^{H}}(n,f)-\varvec{\widehat{{\varPhi }}_{u}}\left( n-1,f\right) ) \end{aligned}$$
(12.60)

where

$$\begin{aligned} {\lambda '_{c_j}}(n, f) \triangleq \left( 1-{p_j}(n, f)\right) {\lambda _c}+{p_j}(n, f) \end{aligned}$$
(12.61)

is a time-varying recursive-averaging factor, and \({\lambda _c}\) is selected such that its corresponding estimation period (\(\frac{1}{1-{\lambda _c}}\) frames) is shorter than the coherence time of the AIRs of speaker j, i.e., the time period over which the AIRs are assumed to be time-invariant. Note that: (1) usually the estimation period is longer than the speech nonstationarity time; therefore, although the spatial structure of \({\varvec{\varPhi }_{c_j}}(n,f)\) is maintained, the estimated variance is an average of the speech variances over multiple time periods, denoted \({\bar{\phi }_{s_j}}(n,f)\), rather than \({\phi _{s_j}}(n,f)\), the actual time-varying variance of the speaker; (2) the estimate \(\varvec{\widehat{{\varPhi }}_{c_j}}(n,f)\) keeps its past value when speaker j is absent.
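Both recursions (12.58)-(12.61) amount to one rank-one update per frame; a sketch follows, with the smoothing constants as hypothetical defaults.

```python
import numpy as np

def update_noise_cov(Phi_u, x, p, lambda_u=0.95):
    """One step of Eqs. (12.58)-(12.59); x: (M,) frame, p: its SPP."""
    lam = (1.0 - p) * lambda_u + p                 # Eq. (12.59)
    return lam * Phi_u + (1.0 - lam) * np.outer(x, x.conj())

def update_speech_cov(Phi_c, Phi_u, x, p_j, lambda_c=0.98):
    """One step of Eqs. (12.60)-(12.61) for speaker j."""
    lam = (1.0 - p_j) * lambda_c + p_j             # Eq. (12.61)
    return lam * Phi_c + (1.0 - lam) * (np.outer(x, x.conj()) - Phi_u)
```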

12.6.3 Procedures for Semi-blind RTF Estimation

Two common approaches for RTF estimation are the covariance subtraction [27, 28] and the covariance whitening [11, 29] methods. Here, for brevity, we assume a single-speaker scenario. Both approaches rely on estimated noisy-speech and noise-only covariance matrices, i.e., \(\varvec{\widehat{{\varPhi }}_{x_j}}(n,f)\) (where \(\varvec{{\varPhi }_{x_j}}(n,f)={\varvec{\varPhi }_{c_j}}(n,f)+{\varvec{\varPhi }_{u}}(f)\)) and \(\varvec{\widehat{{\varPhi }}_{u}}(n,f)\). Given the estimated covariance matrices, covariance subtraction estimates the speaker RTF by

$$\begin{aligned} {\tilde{\mathbf {a}}_{j,\text {CS}}}(f) \triangleq \frac{1}{{\mathbf {i}_{1}^{H}}(\varvec{\widehat{{\varPhi }}_{x_j}}(n,f)-\varvec{\widehat{{\varPhi }}_{u}}(n,f)){\mathbf {i}_{1}}}(\varvec{\widehat{{\varPhi }}_{x_j}}(n,f)-\varvec{\widehat{{\varPhi }}_{u}}(n,f)){\mathbf {i}_{1}} \end{aligned}$$
(12.62)

where \({\mathbf {i}_{1}}=[\begin{array}{cc}1&\mathbf {0}_{1\times (M-1)}\end{array}]^{T}\) is an \(M\times 1\) selection vector for extracting the component of the reference microphone, here assumed to be the first microphone.

The covariance whitening approach estimates the RTF by: (1) applying the generalized eigenvalue decomposition (GEVD) to \(\varvec{\widehat{{\varPhi }}_{x_j}}(n,f)\) with \(\varvec{\widehat{{\varPhi }}_{u}}(n,f)\) as the whitening matrix; (2) de-whitening the eigenvector corresponding to the strongest eigenvalue, denoted \({\dot{\mathbf {a}}_{j}}(f)\), i.e., computing \(\varvec{\widehat{{\varPhi }}_{u}}(n,f){\dot{\mathbf {a}}_{j}}(f)\); (3) normalizing the de-whitened eigenvector by its reference microphone component. Explicitly:

$$\begin{aligned} {\tilde{\mathbf {a}}_{j,\text {CW}}}(f) \triangleq ({\mathbf {i}_{1}^{H}}\varvec{\widehat{{\varPhi }}_{u}}(n,f){\dot{\mathbf {a}}_{j}}(f))^{-1}\varvec{\widehat{{\varPhi }}_{u}}(n,f){\dot{\mathbf {a}}_{j}}(f). \end{aligned}$$
(12.63)

A preliminary analysis and comparison of the covariance subtraction and covariance whitening methods can be found in [30].
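Both estimators can be written compactly with a generalized eigendecomposition routine; a sketch assuming Hermitian PSD matrix estimates and the first microphone as reference:

```python
import numpy as np
from scipy.linalg import eigh

def rtf_cs(Phi_x, Phi_u):
    """Covariance subtraction, Eq. (12.62); reference mic = index 0."""
    col = (Phi_x - Phi_u)[:, 0]     # (Phi_x - Phi_u) i_1
    return col / col[0]

def rtf_cw(Phi_x, Phi_u):
    """Covariance whitening, Eq. (12.63)."""
    _, V = eigh(Phi_x, Phi_u)       # GEVD with Phi_u as whitening matrix
    a = Phi_u @ V[:, -1]            # de-whiten the principal eigenvector
    return a / a[0]                 # normalize by the reference component
```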

Other methods utilize the speech nonstationarity property, assuming that the noise has slowly time-varying statistics. In [10], the problem of estimating the RTF of microphone i is formulated as a least squares (LS) problem, where the l-th equation utilizes \(\widehat{\phi }_{x_j,i1}^{l}(f)\), the estimated cross-PSD of microphone i and the reference microphone in the l-th time segment. This cross-PSD satisfies:

$$\begin{aligned} \widehat{\phi }_{x_j,i1}^{l}(f) = {\tilde{a}_{j,i}}(f)\widehat{\phi }_{x_j,11}^{l}(f)+\widehat{\phi }_{\dot{u}_i,x_{j}1}(f)+\varepsilon _{j,i}^{l}(f) \end{aligned}$$
(12.64)

where we use the relation \({\mathbf {x}}(n,f)={\tilde{\mathbf {a}}_{j}}(f) x_1(n,f)+\dot{\mathbf {u}}(n,f)\). The unknowns are \({\tilde{a}_{j,i}}(f)\), i.e., the required RTF, and \(\widehat{\phi }_{\dot{u}_i,x_{j}1}(f)\), which is a nuisance parameter; \(\varepsilon _{j,i}^{l}(f)\) denotes the error term of the l-th equation. Multiple LS problems, one for each microphone, are solved for estimating the RTF vector. Note that the latter method, also known as the nonstationarity-based RTF estimation, does not require a prior estimate of the noise covariance, since it simultaneously solves for the RTF and the noise statistics. Similarly, a weighted least squares (WLS) problem with exponential weighting can be defined and implemented using a recursive least squares (RLS) algorithm [31]. Considering speech sparsity in the STFT domain, in [28] the SPPs were incorporated into the weights of the WLS problem, resulting in a more accurate solution.
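The LS problem implied by (12.64) is a simple linear regression per microphone: the slope is the RTF entry and the intercept absorbs the stationary noise term. A sketch with hypothetical names:

```python
import numpy as np

def rtf_nonstationarity(phi_i1, phi_11):
    """LS estimate of one RTF entry from Eq. (12.64).

    phi_i1: (L,) complex cross-PSD estimates of mic i and the reference
    over L segments; phi_11: (L,) auto-PSD estimates of the reference.
    """
    # Design matrix: [auto-PSD, 1] -> [slope (RTF entry), intercept (noise term)]
    H = np.column_stack([phi_11, np.ones(len(phi_11))]).astype(complex)
    sol, *_ = np.linalg.lstsq(H, phi_i1, rcond=None)
    return sol[0]        # slope = RTF entry; sol[1] = nuisance noise term
```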

12.7 Examples

In this section, simple examples are used to illustrate the behaviors of, and differences between, the MF, MWF and LCMV spatial filters. The case of a narrowband signal in an anechoic environment is presented in Sect. 12.7.1, and the case of two speech sources in a reverberant environment is presented in Sect. 12.7.2.

12.7.1 Narrowband Signals at an Anechoic Environment

Consider the case of \(J=2\) narrowband sources occupying the f-th frequency-bin, propagating in an anechoic environment, and received by a uniform linear array (ULA) comprising M microphones with microphone spacing \(\ell \). Define a spherical coordinate system with its origin coinciding with the center of the ULA, rotated such that the ULA lies along the elevation angles \(\theta =\pm 90^{\circ }\) at an azimuth angle of \(\phi =0^{\circ }\). The sources are positioned in the far-field, at a large distance from the microphones and on the same plane as the microphones. The DOAs of the sources with respect to the microphone array are denoted \(\theta _{j}\) for \(j=1,2\). By adopting the far-field free-space propagation model, the ATF vectors of the sources are given by:

$$\begin{aligned} \mathbf {a}_{j}^{0}(f) = \left[ \begin{array}{cccc}1,&\exp \left( -\text {j}2\pi \frac{\ell \sin \left( \theta _{j}\right) }{\lambda }\right) ,&\cdots ,&\exp \left( -\text {j}2\pi \frac{\left( M-1\right) \ell \sin \left( \theta _{j}\right) }{\lambda }\right) \end{array}\right] ^{T} \end{aligned}$$
(12.65)

for \(j=1,2\) where \(\lambda \) is the wavelength corresponding to the f-th frequency-bin. The wavelength can be expressed as:

$$\begin{aligned} \lambda \triangleq \frac{\nu F}{f f_{s}} \end{aligned}$$
(12.66)

where the continuous frequency corresponding to the discrete frequency-bin f is \(\frac{f f_{s}}{F}\), \(f_{s}\) is the sample-rate, F is the length of the STFT window, and \(\nu \approx 343\) m/s is the sound velocity. Additive white Gaussian noise with covariance matrix

$$\begin{aligned} \varvec{\varPhi }_{u}^{0}(n, f) = \phi _{u} \mathbf {I} \end{aligned}$$
(12.67)

contaminates the received microphone signals. The setup of the sources and microphones is depicted in Fig. 12.1. Note that the ATF vector is independent of the azimuth angle \(\phi \); therefore, the beampattern and all performance measures of any spatial filter in this case have cylindrical symmetry. Next, we compare the performance of various spatial filters applied to this problem, namely the MF, MWF and LCMV. The performance criteria that we use are the SNR, SIR, SINR and SDR, which are evaluated empirically from the signals. For the MF spatial filter we derive simplified expressions for the performance criteria, whereas for the MWF and the LCMV spatial filters we use the previously defined generic expressions.
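The far-field steering vectors (12.65)-(12.66) for this setup are easily reproduced. The sketch below uses the example's values (M = 4, \(\ell = 10\) cm, 1715 Hz) as defaults and evaluates the normalized projection \(\rho \) that governs the MF SIR improvement (12.24b); the function name is ours.

```python
import numpy as np

def ula_steering(theta_deg, M=4, ell=0.10, freq_hz=1715.0, nu=343.0):
    """Far-field ULA ATF vector of Eq. (12.65) for DOA theta (degrees)."""
    lam = nu / freq_hz                                   # wavelength
    phase = 2 * np.pi * np.arange(M) * ell * np.sin(np.radians(theta_deg)) / lam
    return np.exp(-1j * phase)

a1 = ula_steering(0.0)       # desired source at broadside
a2 = ula_steering(10.0)      # interference at 10 degrees
rho = (a1.conj() @ a2) / (np.linalg.norm(a1) * np.linalg.norm(a2))
print(1 / abs(rho) ** 2)     # MF SIR improvement, Eq. (12.24b)
```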

Fig. 12.1 Setup of the narrowband signals in an anechoic environment example

Let us revisit the performance criteria of the MF for this case. By substituting the ATF vectors of (12.65), the normalized scalar product of the j-th and \(j'\)-th ATF vectors, denoted \(\rho _{jj'}(f)\) in (12.25), can be expressed as:

$$\begin{aligned} \rho ^{0}_{jj'}(f) =&\frac{1}{M}\sum _{i=1}^{M}\exp \left( \text {j}2\pi \frac{\left( i-1\right) \ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) \nonumber \\ =&\text {diric}\left( 2\pi \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) \cdot \exp \left( \text {j}\pi \frac{\left( M-1\right) \ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) \end{aligned}$$
(12.68)

where

$$\begin{aligned} \text {diric}\left( 2\pi \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) \triangleq \frac{\sin \left( M\pi \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) }{M \cdot \sin \left( \pi \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) } \end{aligned}$$
(12.69)

is the Dirichlet function, which in general has a period of \(4\pi \). Note that \(\left| \rho ^{0}_{jj'}(f)\right| ^2=1\) for

$$\begin{aligned} \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }=k \end{aligned}$$
(12.70)

where \(k=0,\pm 1,\pm 2,\ldots \) is any integer. Next, since \(\sin \left( \cdot \right) \) is bounded, \(-1\le \sin \left( \cdot \right) \le 1\), the left-hand side of (12.70) is bounded by:

$$\begin{aligned} -\frac{2\ell }{\lambda }\le \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\le \frac{2\ell }{\lambda }. \end{aligned}$$
(12.71)

Hence, in order to avoid the spatial aliasing phenomenon, where undesired directions are passed through the spatial filter without any attenuation, the well-known constraint on the ratio between the microphone spacing and the wavelength is:

$$\begin{aligned} \frac{\ell }{\lambda }<\frac{1}{2}. \end{aligned}$$
(12.72)

Furthermore, note that for

$$\begin{aligned} M\frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda } = k \end{aligned}$$
(12.73)

for any integer k that is not divisible by M, i.e., \(k \ne \iota M\) for any integer \(\iota \), we obtain \(\left| \rho ^{0}_{jj'}(f)\right| ^2=0\). Explicitly, in the range \(-\frac{\pi }{2}\le \theta _{j'}\le \frac{\pi }{2}\) there are \(M-1\) such DOAs that are perfectly attenuated by the MF, also referred to as nulls in the beampattern. By replacing \(\rho _{jj'}(f)\) with \(\rho ^{0}_{jj'}(f)\) in the power of the residual \(j'\)-th interference at the j-th output, see (12.24a), and in the corresponding SIR improvement of the MF, see (12.24b), the following simplified expressions are obtained:

$$\begin{aligned} \psi _{d_{j,j'}}^{0,\text {MF}}(n,f) =&\text {diric}^{2}\left( 2\pi \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) \phi _{s_{j'}}(n,f)\end{aligned}$$
(12.74a)
$$\begin{aligned} \varDelta \text {SIR}^{0,\text {MF}}_{jj'}(f) =&\left( \text {diric}^{2}\left( 2\pi \frac{\ell \left( \sin \left( \theta _j\right) -\sin \left( \theta _{j'}\right) \right) }{\lambda }\right) \right) ^{-1}. \end{aligned}$$
(12.74b)

Considering the spatially uncorrelated noise of (12.67), and substituting it into the noise power at the output of the j-th MF, see (12.26a), and into the corresponding SNR improvement, see (12.26b), these can be expressed in this special case as:

$$\begin{aligned} \psi _{v_{j}}^{0,\text {MF}}(f) = \frac{\phi _{u}(f)}{M}\end{aligned}$$
(12.75a)
$$\begin{aligned} \varDelta \text {SNR}^{\text {0,MF}}_{j}(f) = M. \end{aligned}$$
(12.75b)

The corresponding criteria for the MWF and multichannel LCMV are more complicated, and their derivation is omitted.

We compare the spatial filters in a specific scenario where: (a) the microphone array comprises \(M=4\) microphones with a microphone spacing of \(\ell =10\,\text {cm}\); (b) the desired source is the first source, which arrives from \(\theta _{1}=0^{\circ }\). In the following, we investigate the dependency of the performance on: (a) the SNR and interference-to-noise ratio (INR); (b) the interference DOA, \(\theta _{2}\); (c) the frequency. For simplicity, we consider two subsets of the above-mentioned parameters.

In the first parameter subset, the interference DOA and frequency are set to \(\theta _{2}=10^{\circ }\) and \(\frac{f\cdot f_s}{F}=1715\,\text {Hz}\), corresponding to \(\frac{\ell }{\lambda }=\frac{1}{2}\) (in other figures we explore the performance as a function of the frequency or wavelength). For this parameter selection, the performance measures of the spatial filters are compared as a function of the SNR and INR values, which are selected within the range \(\left[ -20\,\text {dB}, 30\,\text {dB}\right] \). The improvements in SNR, SIR and SINR and the SDR are depicted in Fig. 12.2a–d. We can observe in these figures the consequences of the different design criteria: (a) the MF is designed to maximize the SNR improvement for spatially white noise (as in this example) and obtains the highest SNR improvement in Fig. 12.2a; (b) the LCMV is designed to null out the interfering sources and therefore obtains an infinite SIR improvement, which is of course higher than the finite SIR improvements of the other methods depicted in Fig. 12.2b; (c) the MWF is designed to maximize the SINR improvement, i.e., to minimize the sum of the interference and noise powers at its output, as evidently seen in Fig. 12.2c. In the limit cases of \(\text {INR}\left[ \text {dB}\right] \rightarrow -\infty \), where the interference power is negligible, and \(\text {INR}\left[ \text {dB}\right] \rightarrow \infty \), where the noise power is negligible, the MWF coincides with the MF and the LCMV, respectively. This can be clearly seen in Fig. 12.2a–c, where the performance of the MWF converges to that of the MF and the LCMV for \(\text {INR}\left[ \text {dB}\right] \rightarrow -\infty \) and \(\text {INR}\left[ \text {dB}\right] \rightarrow \infty \), respectively. The MF and LCMV spatial filters are distortionless by design at any input SNR and INR levels, and therefore we do not depict their SDR. The SDR at the output of the MWF as a function of the input SNR and INR is depicted in Fig. 12.2d. The higher the input SNR, the higher the relative weight of the distortion component compared to the interference and noise components in the MSE in (12.27), which is the MWF design criterion, and correspondingly the higher the SDR of the MWF.

Fig. 12.2

Performance comparison, depending on the input SNR and INR, of various spatial filters in the narrowband case with 2 speech sources propagating in free space and spatially white noise received by a ULA comprising 4 microphones

In the second parameter subset, the input SNR and INR are both set to \(20\,\text {dB}\), and the interference DOA and the frequency are selected within the ranges of \(\left[ -90^{\circ }, 90^{\circ }\right] \) and \(\left[ 0\,\text {Hz}, 8000\,\text {Hz}\right] \), respectively. The SNR, SIR and SINR improvements, as well as the SDR, at the frequency \(1715\,\text {Hz}\) (for which \(\frac{\ell }{\lambda }=0.5\)), depending on the interference source DOA, are depicted in Fig. 12.3 for the various spatial filters. As in the previous example, and regardless of the DOA of the interference: (a) the MF is optimal in the sense of SNR improvement for spatially white noise, see Fig. 12.3a; (b) the LCMV is optimal in the sense of SIR improvement, see Fig. 12.3b, as it completely nulls out the interference and obtains an infinite SIR improvement, whereas for the MF and the MWF there is some residual interference at the output for almost all interference DOAs; (c) the MWF is optimal in the sense of SINR improvement, see Fig. 12.3c, although the SINR improvement of the LCMV is very similar for most interference DOAs.
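
Under the same assumptions, this DOA sweep can be approximated by continuing the sketch above; note that the singular DOA \(\theta _2=\theta _1\), at which the LCMV is undefined (see the discussion below), must be skipped:

```python
# Continuation of the sketch above: sweep the interference DOA as in this
# comparison, with SNR = INR = 20 dB. The singular DOA theta2 = theta1,
# at which the LCMV constraint set is degenerate, is skipped.
phi_s = phi_i = phi_u * 10 ** (20 / 10)
for deg in range(-90, 91, 5):
    if deg == 0:                     # theta2 == theta1: LCMV undefined
        continue
    a2 = steer(np.deg2rad(deg))      # rebind the module-level interference ATF
    for name, w in weights(phi_s, phi_i, phi_u).items():
        print(deg, name, improvements(w, phi_s, phi_i, phi_u))
```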

Fig. 12.3

Performance comparison of various spatial filters applied in the narrowband case (at the frequency \(1715\,\text {Hz}\)), where 2 speech sources propagating in free space and spatially white additive noise are received by a ULA comprising 4 microphones. The desired source arrives from \(\theta _1=0^{\circ }\) and the interfering source arrives from \(\theta _2\in \left[ -90^{\circ },90^{\circ }\right] \)

The main difference between the LCMV and the MWF can be observed when the interference DOA, \(\theta _2\), is close to that of the desired source, \(\theta _1=0^{\circ }\). The LCMV, which is designed to null the interference, “struggles” to satisfy its constraints as the interference and desired source DOAs become closer. As a result, the SNR improvement (which is a secondary objective for the LCMV) and, correspondingly, the SINR improvement are degraded (see Fig. 12.3a, c), and might even become negative, i.e., the noise power at the output might become higher than the noise power at the input, and in extreme cases even higher than the sum of the noise and interference powers at the input. Furthermore, the LCMV is not defined in the singular case of \(\theta _2=\theta _1\). In this specific case, the MWF cannot improve the SIR; however, it can still improve the SNR. Note in Fig. 12.3, however, that as the interference and desired source DOAs become closer, the SDR degrades. This is because the MWF converges in this case to the MF scaled by a single-channel Wiener filter, which introduces more distortion as the interference and noise powers increase.

Fig. 12.4

SINR improvement, depending on the interference DOA and frequency, of various spatial filters in the narrowband case with 2 speech sources propagating in free space and spatially white noise received by a ULA comprising 4 microphones

Another interesting observation regarding the SIR improvement (see Fig. 12.3b) is that for some DOAs (\(\theta _2\approx \pm 30^{\circ }\)) the SIR improvements of the MWF and the MF also diverge to infinity (as that of the optimal LCMV). The reason is that for these DOAs the interfering and desired ATF vectors are orthogonal (i.e., \(\rho _{12}^{0}(f)=0\), see (12.68)), so the corresponding SIR improvement is also infinite.
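
This orthogonality is straightforward to verify numerically with the steering vectors of the sketch above: at \(\frac{\ell }{\lambda }=\frac{1}{2}\) and \(M=4\), the inner product \(\mathbf{a}_1^{H}\mathbf{a}_2\) for \(\theta _2=\pm 30^{\circ }\) sums four quarter-turn phasors, which cancel:

```python
# Continuation: verify the orthogonality behind the infinite SIR improvement.
# At ell / lambda = 1/2 and M = 4, the steering vectors of 0 and +/-30 degrees
# are orthogonal: the inner product sums four quarter-turn phasors to zero.
a1 = steer(np.deg2rad(0.0))
for deg in (30.0, -30.0):
    print(deg, abs(a1.conj() @ steer(np.deg2rad(deg))))  # ~0 up to rounding
```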

The SINR improvements of the MF and the MWF, depending on the interference DOA and frequency, are depicted in Fig. 12.4a, b. Clearly, the MWF outperforms the MF in terms of SINR improvement. For brevity, we omit the SINR improvement of the LCMV, as it is almost everywhere similar to that of the MWF, except when the interference DOA approaches the desired source DOA. The red regions in the SINR improvement of the MF in Fig. 12.4a, similarly to the peaks in the SINR improvement of the MF in Fig. 12.3c, correspond to cases where the desired source and interference ATF vectors are orthogonal. Note that the positions of these peaks vary with frequency. The blue regions in the SINR improvement of the MF in Fig. 12.4a (except for the one around \(\theta _2\approx \theta _1\)) correspond to the spatial aliasing phenomenon. When the interference arrives from the DOA of the desired source, or from a DOA which corresponds to a grating lobe of the spatial filter, it cannot be attenuated without degrading the desired source. Hence, the SINR improvement at these DOAs is close to zero (blue color). Note that the positions of the grating lobes also vary with frequency. A similar phenomenon can be seen in the SINR improvement of the MWF in Fig. 12.4b. For the MWF, however, the areas in which the SINR improvement is close to zero are narrower than for the MF, and the areas of high SINR improvement cover almost the entire ranges of interference DOA and frequency.
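
The frequency dependence of the grating lobes can also be illustrated by continuing the sketch: under the free-field ULA model, the interference is indistinguishable from the desired source whenever \(\frac{\ell }{\lambda }\left( \sin \theta _2-\sin \theta _1\right) \) is an integer, which yields the following aliased DOAs (with \(\theta _1=0^{\circ }\)):

```python
# Continuation: predicted grating-lobe DOAs, i.e. interference directions the
# array cannot distinguish from theta1 = 0: (ell / lambda) * sin(theta2) = k
# for integer k. Their positions migrate as the frequency increases.
for f_hz in (1715.0, 3430.0, 6860.0):
    ratio = f_hz * ell / c                      # ell / lambda at f_hz
    sines = [k / ratio for k in range(-3, 4) if abs(k / ratio) <= 1]
    print(f_hz, sorted(round(float(np.degrees(np.arcsin(s))), 1) for s in sines))
```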

12.7.2 Speech Signals in a Reverberant Environment

In this section we compare the performance of the various spatial filters in a scenario simulated by convolving recorded speech signals from the WSJCAM0 database [32] with AIRs drawn from a database collected in reverberant enclosures [33]. A ULA comprising \(M=4\) microphones with a spacing of \(\ell =8\,\text {cm}\) picks up the signals of \(J=2\) speakers, a female and a male, located at a distance of \(1\,\text {m}\) from the array at DOAs of \(-90^{\circ }\) and \(75^{\circ }\), respectively, as well as diffuse noise generated using a diffuse noise simulator [34]. The SIR is set to \(0\,\text {dB}\) and the SNR is set to \(15\,\text {dB}\).

The signals are transformed to the STFT domain, where the MF, MWF and LCMV are designed to enhance the first speaker. Speech-free time-segments and single-talk time-segments of each of the speech sources are used as training segments, from which the parameters required by the various spatial filters are estimated: (a) the RTF vector \({\tilde{\mathbf {a}}}_1(f)\) of the first source for the MF; (b) the RTF vector \({\tilde{\mathbf {a}}}_1(f)\) and spectrum \(\overline{\phi }_{s1}(f)\) of the first source, the covariance matrix \(\varvec{\varPhi }_{c2}(f)\) of the second source and the covariance matrix \(\varvec{\varPhi }_u(f)\) of the noise for the MWF; (c) the RTF vectors \({\tilde{\mathbf {a}}}_1(f)\), \({\tilde{\mathbf {a}}}_2(f)\) of both sources and the covariance matrix \(\varvec{\varPhi }_u(f)\) of the noise for the LCMV. The output of each spatial filter is transformed back to the time domain, yielding the enhanced signal.

A reference microphone signal and the outputs of the various spatial filters, decomposed into their components (desired speech, interfering speech and noise), are depicted in Fig. 12.5. The corresponding spectrograms of the reference microphone and of the outputs of the spatial filters are depicted in Fig. 12.6. The performance measures of each of the spatial filters per frequency-bin, in terms of the SNR, SIR and SINR improvements as well as the SDR, are depicted in Fig. 12.7.

Considering the SIR and SINR improvements, it is clear from Figs. 12.5 and 12.7b, c that the MWF is slightly better than the LCMV, and that both are significantly better than the MF. While the MWF is expected to obtain the maximal SINR improvement, it is perhaps surprising that it outperforms the LCMV in terms of SIR improvement as well. The reason lies in the fact that the LCMV designates a single constraint for nulling the interfering source, thus assuming a rank-1 model for the interference, while the MWF utilizes the complete covariance matrix of the interfering source, thus allowing it to reduce interference of higher rank. Although, theoretically, the covariance matrix of a coherent point source is rank-1, in practice finite window lengths and variations in the AIR (the AIR might vary even when the source is static, due to slight variations in the enclosure) increase the rank of the matrix.

Considering the SNR improvement, note that the MWF and the LCMV are better than the MF at frequencies lower than approximately \(1000\,\text {Hz}\), whereas at higher frequencies the MF is better than the MWF and the LCMV. This result is attributed to the diffuse noise properties. At low frequencies, where \(\frac{\ell }{\lambda }<\frac{1}{2}\), the diffuse noise has a strong coherent component, which the data-dependent filters, the MWF and the LCMV, reduce efficiently. At higher frequencies the diffuse noise becomes spatially uncorrelated, in which case the MF is optimal and outperforms the MWF and the LCMV, which utilize their DoF to reduce the interfering speech.
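
To make the training-segment procedure concrete, the following self-contained Python sketch estimates, for a single frequency bin with synthetic data, the noise covariance from speech-free frames and the RTF of the desired source by covariance subtraction (one common estimator; the chapter's own estimators may differ), and then applies the resulting per-bin MWF. All names and values are illustrative:

```python
import numpy as np

def estimate_rtf(Phi_x, Phi_u, ref=0):
    """RTF via covariance subtraction: the principal eigenvector of
    (Phi_x - Phi_u), normalized to the reference microphone."""
    _, vecs = np.linalg.eigh(Phi_x - Phi_u)
    a = vecs[:, -1]                    # eigenvector of the largest eigenvalue
    return a / a[ref]

def mwf_per_bin(Phi_in, a, phi_s):
    """MWF as an MVDR (w.r.t. interference plus noise) followed by a
    single-channel Wiener gain."""
    Pinv_a = np.linalg.solve(Phi_in, a)
    w_mvdr = Pinv_a / (a.conj() @ Pinv_a)
    phi_res = 1.0 / np.real(a.conj() @ Pinv_a)   # residual noise PSD
    return (phi_s / (phi_s + phi_res)) * w_mvdr

# Synthetic single-bin exercise of the training-segment logic: X holds STFT
# frames (frames x mics); the first 500 frames are speech-free (noise only),
# the rest contain the desired source. All values are arbitrary.
rng = np.random.default_rng(0)
M, T = 4, 2000
a_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
a_true /= a_true[0]                    # reference-normalized "true" RTF
s = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) * np.sqrt(0.5)
u = (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))) * 0.1
noise_only = np.arange(T) < 500
X = u.copy()
X[~noise_only] += np.outer(s[~noise_only], a_true)

Phi_u = X[noise_only].conj().T @ X[noise_only] / noise_only.sum()
Phi_x = X[~noise_only].conj().T @ X[~noise_only] / (~noise_only).sum()
a_hat = estimate_rtf(Phi_x, Phi_u)
phi_s_hat = np.real(np.trace(Phi_x - Phi_u)) / np.real(a_hat.conj() @ a_hat)
w = mwf_per_bin(Phi_u, a_hat, phi_s_hat)   # no interferer here: Phi_in = Phi_u
y = X @ w.conj()                       # enhanced STFT frames of this bin
print(np.abs(a_hat - a_true).max())    # small RTF estimation error
```

In a full system this estimation and filtering would be repeated independently for every frequency bin, with the single-talk segments of the second source providing \(\varvec{\varPhi }_{c2}(f)\) in the same manner.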

Fig. 12.5

Input and output signals of various spatial filters in a simulated scenario with \(J=2\) speech signals contaminated by diffuse noise and received by an \(M=4\) microphone array in a reverberant environment. The signals are decomposed into their components: (1) the desired speaker (blue); (2) the interfering speaker (green); and (3) the noise (red)

Fig. 12.6

Input and output spectrograms of various spatial filters aiming to enhance the first source in a simulated scenario with \(J=2\) speech signals contaminated by diffuse noise and received by an \(M=4\) microphone array in a reverberant environment. The input spectrogram of the reference signal and its speech components are depicted in a, b, c, respectively, and the outputs of the MF, MWF and LCMV spatial filters are depicted in d, e, f, respectively

Fig. 12.7

Performance criteria per frequency-bin of various spatial filters in a simulated scenario with \(J=2\) speech signals contaminated by diffuse noise and received by an \(M=4\) microphone array in a reverberant environment

12.8 Summary

MMSE-based criteria for designing beamformers, also referred to as spatial filters, can be used in noise reduction and speech separation tasks. The following methods were presented and analyzed: (1) the MF, which maximizes the SNR at the output without distorting the speech signal, assuming spatially white noise; (2) the MWF, which minimizes the MSE between the output signal and the desired speech signal, and assigns equal weights to the desired speech distortion, the variance of the interfering speakers at the output, and the noise variance at the output; and (3) the LCMV, which minimizes the noise variance at the output while satisfying a set of constraints designed to maintain the desired speech undistorted and to null out the interfering speakers. Estimation methods for implementing the various beamformers were surveyed. Specifically, methods for estimating the RTFs of the speakers and the spatial covariance matrices of the noise and of the various speaker components were presented. The estimation methods are governed by the multichannel SPP, which was also presented. Finally, simple examples of applying the various beamformers to simulated narrowband signals in an anechoic environment and to speech signals in a real reverberant environment were presented and discussed.