1 Introduction

In speech-related applications, such as hearing aids, teleconferencing, and hands-free devices, it is often crucial to reduce the effect of undesired background noise while preserving the quality of the clean speech signal.

Many noise reduction algorithms are implemented in the short-time Fourier transform (STFT) domain, taking advantage of the well-known fast Fourier transform. Motivated by the central limit theorem, the general trend in STFT-based speech enhancement has been to model the real and imaginary parts of the discrete Fourier transform (DFT) coefficients by independent Gaussian distributions [10, 48]. However, due to the limited DFT frame length, which typically varies between 10 and 100 ms, the Gaussian assumption is not completely fulfilled [25, 29, 30, 31]. As a consequence, other distributions have been examined, and much research has been devoted to this topic [2, 4, 8, 11, 14, 25, 29, 30, 31, 43, 45].

In Martin [29], an analytical minimum mean square error (MMSE) estimator for the DFT coefficients was developed under a complex Gamma distribution for the speech signal, while the noise is modeled by either a complex Gaussian or a Laplacian distribution. It was shown that this Gamma-based estimator delivers a better signal-to-noise ratio (SNR) than the Gaussian-based MMSE estimator presented in Ephraim and Malah [10]. The work was later extended to consider a Laplacian distribution for the speech DFT coefficients [31], as well as the speech presence probability [30]. Furthermore, based on measured histograms, Lotter and Vary proposed a high-accuracy two-parametric function for the probability density function (PDF) of the speech spectral amplitude [25]. They showed that, in special cases, the two-parametric super-Gaussian function reduces to the Laplacian or Gamma assumption for the complex DFT coefficients [25]. Finally, they proposed two analytical maximum a posteriori (MAP) spectral amplitude estimators. Compared to the MMSE criterion, MAP-based estimators can be implemented more efficiently, since they do not require the computation of expensive Bessel or confluent hypergeometric functions [25].

Non-Gaussian speech enhancement methods, using either the MAP or the MMSE criterion, have received a great deal of attention due to their significant noise reduction improvement [2, 4, 8, 11, 14, 25, 29, 30, 31, 43, 45]. In addition to these criteria, other cost functions, such as the \(\beta \)-order, weighted Euclidean, and weighted Cosh MMSE of the spectral amplitude, have been studied as well [43, 45]. Joint Bayesian estimation of both the clean speech amplitude and phase was also proposed in [14], utilizing the super-Gaussian assumption for the speech DFT coefficients.

All of the algorithms mentioned above were introduced in the framework of single-microphone speech enhancement. Multimicrophone algorithms, however, benefit from spatial in addition to spectral information and thereby offer more degrees of freedom for noise reduction.

A Gaussian-based Bayesian estimator of the clean speech signal was introduced in the multimicrophone framework by Balan and Rosca [5]. Using the well-known Fisher–Neyman factorization, they demonstrated that the multimicrophone MMSE Bayesian estimator can be decomposed into a minimum variance distortionless response (MVDR) filter followed by the single-microphone Ephraim and Malah MMSE estimator [10]. Based on this decomposition, they developed multimicrophone MMSE estimators for the short-time spectral amplitude (STSA), the log spectral amplitude (LSA), and the speech phase. An extension of this work was presented in [19], where the magnitude of the DFT coefficients is modeled by a generalized Gamma distribution. It was shown that, assuming a Gaussian distribution for the noise DFT coefficients and a generalized Gamma distribution for the clean speech magnitude, the multimicrophone MMSE estimator can again be decomposed into two subsequent estimators (an MVDR filter plus a single-microphone MMSE estimator). The robustness of this multichannel MMSE estimator under erroneous estimation of the steering vectors was investigated in [18]. Furthermore, a direction-independent multimicrophone MAP amplitude estimator was proposed in [24].

The impact of phase estimation was ignored for several years based on the perceptual experiments reported in Wang and Lim [47]. However, the multimicrophone MMSE phase estimator [44] and more recent experiments [33] have shown that phase enhancement plays a substantial role in reducing background noise. Consequently, over the past decade much research has been devoted to phase processing for improving speech quality and intelligibility (see, e.g., [7, 14, 15, 21, 22, 36, 38, 40, 46, 49, 50]). A comprehensive survey of the most recent results can be found in [13].

In this paper, we derive two statistically optimal multimicrophone MAP algorithms for the joint estimation of the amplitude and phase of the clean speech signal. The algorithms are developed using the family of super-Gaussian distributions. The first part of this contribution extends the work presented by Lotter and Vary [25] to the multimicrophone case. The second part employs another super-Gaussian distribution, proposed in Gerkmann [14], which models the speech amplitude using a shape parameter.

We emphasize that in the proposed joint multimicrophone MAP estimators, the noise components are modeled by independent Gaussian distributions. This assumption is even more reasonable in wireless acoustic sensor networks (WASNs) than in traditional microphone arrays. The microphones in a WASN are randomly distributed in the room and communicate with each other via wireless links. Hence, they cover a larger area and exploit more spatial information than a traditional microphone array. This advantage becomes more significant when some microphones are located close to the desired speaker, providing signals with higher SNRs. By sharing information between the various microphones, considerable improvement can be obtained [6, 9, 27, 28, 37, 39]. The possibility of using separately manufactured devices (e.g., cell phones, laptops), along with the random locations of their microphones, makes the independence assumption for the noise signals particularly reasonable in these networks.

The remainder of this paper is organized as follows. The statistical properties of the signals are reviewed in Sect. 2. The proposed joint multimicrophone MAP estimators are introduced in Sect. 3. Simulation results are presented in Sect. 4, comparing the noise reduction performance of the proposed estimators with four benchmarks. Finally, concluding remarks are given in Sect. 5.

2 Statistical Properties of Speech Signal

Consider a WASN consisting of N microphones. In the STFT domain, the noisy signal of the n-th microphone at frame m and discrete frequency bin k, \( Y_{n}(m,k) \), is modeled as

$$\begin{aligned} Y_{n}(m,k) = X_{n}(m,k) + V_{n}(m,k), \quad n=1, \ldots , N, \end{aligned}$$
(1)

where \( X_{n}(m,k)\) and \( V_{n}(m,k) \) denote the desired speech signal and the additive noise signal, respectively. For convenience, in the following, the frame and frequency indices are only mentioned when referring to a specific time-frequency unit.

In vector notation, the vector of noisy signals at the different microphones, referred to as the noisy vector, is expressed as \( \textbf{y} = \textbf{x} + \textbf{v} \), with \( \textbf{y} = \left[ Y_{1}, \ldots , Y_{N} \right] ^T \), where T denotes transposition; \( \textbf{x}\) and \(\textbf{v}\) are defined similarly.

Since the speech and noise signals are usually generated by different sources, it is a common assumption to consider them uncorrelated. Therefore, the correlation matrix of the noisy vector can be expressed as \( \mathbf {\Phi }_{\textbf{y}} = \mathbb {E} \left\{ \textbf{y} \textbf{y} ^H \right\} = \mathbf {\Phi }_{\textbf{x}} + \mathbf {\Phi }_{\textbf{v}} \), where \(\mathbf {\Phi }_{\textbf{x}} \) and \( \mathbf {\Phi }_{\textbf{v}} \) denote the correlation matrices of speech and noise, respectively. In the simulation section, we explain in detail how these correlation matrices are computed.

In polar representation, \( Y_{n} = R_{n} e^{j\vartheta _{n}}, \ n=1, \ldots , N \), where \( R_{n} \) and \( \vartheta _{n} \) denote the spectral amplitude and phase of the noisy signal at the n-th microphone. Similarly, \( X_{n} = A_{n} e ^{j\alpha _{n}}, \ n=1, \ldots , N \), where \( A_{n} \) and \( \alpha _{n} \) denote the spectral amplitude and phase of the clean signal at the n-th microphone. The main goal of this contribution is to reduce the noise while preserving the clean speech signal, i.e., to estimate \( A_{1} \) and \( \alpha _{1} \) from the noisy vector \( \textbf{y} \). Note that, without loss of generality, we consider the first microphone as the reference and its clean speech signal as the desired signal.

3 Proposed Joint Multimicrophone MAP Estimators

The first proposed estimator utilizes the super-Gaussian distribution proposed in [25], which models the amplitude of the speech signal using two parameters.

3.1 Two-Parametric Joint Multimicrophone MAP (TPJMAP) Estimator Using Super-Gaussian Statistics

Based on measured histograms, Lotter and Vary [25] showed that the PDF of the speech amplitude at the n-th microphone can be modeled with high accuracy by the two-parametric function

$$\begin{aligned} p(A_n) = {\left\{ \begin{array}{ll} \dfrac{\mu ^{\nu +1} A_n^\nu }{\Gamma (\nu +1) \sigma ^{\nu +1}_x(n)} \exp \left( \dfrac{-\mu A_n}{\sigma _x(n)} \right) , \qquad &{} A_n > 0, \\ 0, \qquad &{} \text {else,} \end{array}\right. } \end{aligned}$$
(2)

where \(\sigma ^2_x(n)\) denotes the variance of the clean speech signal at the n-th microphone, and \( \Gamma (\cdot ) \) is the Gamma function. The parameters \( \mu \) and \( \nu \) shape the PDF. In special cases, e.g., \( \mu = 2.5 \), \( \nu = 1 \) or \( \mu = 1.5 \), \( \nu = 0.01 \), the PDF reduces to the cases where the real and imaginary parts of the clean speech signal follow a Laplacian or a Gamma distribution, respectively [25]. Furthermore, assuming a uniform distribution for the speech phase, the PDF of the phase at the n-th microphone is given by

$$\begin{aligned} p(\alpha _n) = \dfrac{1}{2\pi }, \qquad -\pi< \alpha _n < \pi . \end{aligned}$$
(3)

It is a common assumption to consider the amplitude and phase of the speech signal as independent variables; hence, the joint PDF is written as [25]

$$\begin{aligned} p(A_n, \alpha _n) = \dfrac{\mu ^{\nu +1} A_n^\nu }{2\pi \Gamma (\nu +1) \sigma ^{\nu +1}_x(n)} \exp \left( \dfrac{-\mu A_n}{\sigma _x(n)} \right) . \end{aligned}$$
(4)
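To make the model concrete, the following short Python sketch (our own illustration, assuming SciPy is available and setting \( \sigma _x(n) = 1 \)) evaluates the prior (2) for the two special parameter pairs quoted above and numerically checks its normalization.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def amplitude_prior(a, mu, nu, sigma_x):
    """Two-parametric super-Gaussian amplitude PDF of Eq. (2), for a > 0."""
    a = np.asarray(a, dtype=float)
    pdf = (mu ** (nu + 1) * a ** nu / (gamma_fn(nu + 1) * sigma_x ** (nu + 1))
           * np.exp(-mu * a / sigma_x))
    return np.where(a > 0, pdf, 0.0)

a = np.linspace(1e-4, 10.0, 10000)
p_laplace = amplitude_prior(a, mu=2.5, nu=1.0, sigma_x=1.0)   # Laplacian case
p_gamma = amplitude_prior(a, mu=1.5, nu=0.01, sigma_x=1.0)    # Gamma case
da = a[1] - a[0]
print(p_laplace.sum() * da, p_gamma.sum() * da)  # both close to 1
```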

Although the non-Gaussian properties of speech signals have attracted a lot of attention, it is still a common assumption to model the noise signal by a Gaussian distribution. The conditional PDF of the noisy signal \( Y_n \), given the amplitude and phase of the clean speech signal at the n-th microphone, \( (A_n, \alpha _n) \), is then given by [25]

$$\begin{aligned} p(Y_n \mid A_n,\alpha _n) = \dfrac{1}{\pi \sigma ^2_v(n)} \exp \left( -\dfrac{|Y_n-A_n e^{j\alpha _n} |^2}{\sigma ^2_v(n)} \right) , \end{aligned}$$
(5)

where \(\sigma ^2_v(n)\) denotes the variance of the noise signal at the n-th microphone.

Similar to [26], and considering a far-field propagation model, the speech amplitudes at the different microphones are related by a linear model; accounting also for the phase delays between the microphones, we have

$$\begin{aligned} {\left\{ \begin{array}{ll} Y_1 = A_{1} e^{j\alpha _{1}} + V_{1}, \\ Y_{2} = C_{2} A_{1} e^{j(\alpha _{1}-\beta _{2})} + V_{2}, \\ \vdots \\ Y_{N} = C_{N} A_{1} e^{j(\alpha _{1}-\beta _{N})} + V_{N}, \end{array}\right. } \end{aligned}$$
(6)

where \( \beta _{n} \) denotes the phase delay between the first and the n-th microphone, and \( C_{n} \) is modeled as a real deterministic value given by (see the proof in the Appendix)

$$\begin{aligned} C_{n}^2 = \dfrac{\mathbb {E} \left\{ X_{n}X_{n}^* \right\} }{\mathbb {E} \left\{ X_{1}X_{1}^* \right\} } = \dfrac{\sigma ^2_x(n)}{\sigma ^2_x(1)}, \qquad n=1,\cdots ,N. \end{aligned}$$
(7)

As mentioned before, since the microphones in a WASN are randomly distributed and may belong to different devices, it is reasonable to assume that the noise signals at the different microphones are independent. Therefore, the conditional PDF of the noisy vector \( \textbf{y} \), given the amplitude and phase of the clean speech signal at the first microphone, \( (A_1,\alpha _1) \), is obtained by multiplying the conditional PDFs of the noisy signals at the individual microphones, \( Y_{n},\ n=1,\ldots ,N \), i.e.,

$$\begin{aligned} \begin{aligned} p(\textbf{y} \mid A_1, \alpha _1)&= \prod _{n=1}^{N}p(Y_{n} \mid A_1,\alpha _1) \\&= \prod _{n=1}^{N}\dfrac{1}{\pi \sigma ^2_v(n)} \exp \left( -\sum _{n=1}^{N} \dfrac{|Y_{n} - C_{n}A_1 e^{j(\alpha _1-\beta _{n})} |^2}{\sigma ^2_v(n)} \right) , \end{aligned} \end{aligned}$$
(8)

where, obviously, \( C_{1} = 1 \) and \( \beta _{1} = 0 \). Note that the variances of the noise signals, \(\sigma ^2_v(n)\), and of the clean speech signals, \(\sigma ^2_x(n)\), are simply the diagonal elements of the corresponding correlation matrices, i.e.,

$$\begin{aligned} {\left\{ \begin{array}{ll} \sigma ^2_v(n) = \mathbf {\Phi }_{\textbf{v}}(n,n), \qquad n=1, \cdots , N, \\ \sigma ^2_x(n) = \mathbf {\Phi }_{\textbf{x}}(n,n), \qquad n=1, \cdots , N. \end{array}\right. } \end{aligned}$$
(9)

The main goal of the proposed two-parametric joint MAP (TPJMAP) estimator is to estimate both the spectral amplitude and the phase of the desired signal under the maximum a posteriori criterion, i.e., by maximizing the posterior distribution of \(A_1\) and \(\alpha _1\) given the noisy vector \(\textbf{y}\):

$$\begin{aligned} \begin{aligned} \hat{A}_1,\hat{\alpha }_1&= \arg \max _{A_1,\alpha _1} p(A_1,\alpha _1 \mid \textbf{y}) \\ {}&= \arg \max _{A_1,\alpha _1} \dfrac{p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1,\alpha _1)}{p(\textbf{y})}. \end{aligned} \end{aligned}$$
(10)

Since the denominator is independent of the amplitude and phase, and thus plays no role in the maximization, it suffices to maximize the numerator:

$$\begin{aligned} \begin{aligned} \hat{A}_1,\hat{\alpha }_1=\arg \max _{A_1,\alpha _1} p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1,\alpha _1). \end{aligned} \end{aligned}$$
(11)

Combining (8) and (4) yields

$$\begin{aligned} \begin{aligned}&p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1,\alpha _1) \\&\quad =\prod _{n=1}^{N}\dfrac{1}{\pi \sigma ^2_v(n)} \exp \left( -\sum _{n=1}^{N} \dfrac{|Y_{n}- C_{n}A_1 e^{j(\alpha _1 -\beta _{n})}|^2}{\sigma ^2_v(n)} \right) \\&\qquad \dfrac{\mu ^{\nu +1}A_1^\nu }{2\pi \Gamma (\nu +1)\sigma ^{\nu +1}_x(1)} \exp \left( \dfrac{-\mu A_1}{\sigma _x(1)} \right) , \end{aligned} \end{aligned}$$
(12)

Since \( \ln (\cdot ) \) is a monotonically increasing function, the maximization in (11) can be replaced by the maximization of its natural logarithm. Omitting the terms that have no effect on the optimization, we continue with

$$\begin{aligned} \begin{aligned}&\arg \max _{A_1,\alpha _1} \ln p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1,\alpha _1) \\&\quad = -\sum _{n=1}^{N} \dfrac{|Y_{n}- C_{n}A_1 e^{j(\alpha _1-\beta _{n})}|^2}{\sigma ^2_v(n)} + \nu \ln A_1 - \mu \dfrac{A_1}{\sigma _x(1)}. \end{aligned} \end{aligned}$$
(13)

3.1.1 Two-Parametric Joint Multimicrophone MAP Estimator to Extract the Phase of Clean Signal

As mentioned before, phase enhancement plays a substantial role in reducing background noise. One of the key aspects of this contribution is to estimate the speech phase and to show its considerable importance in improving speech quality and intelligibility. Hence, the first part of this work is dedicated to finding the statistically optimal phase under the MAP criterion, which is done by maximizing (13) with respect to \(\alpha _1\). Differentiating (13) with respect to \( \alpha _1 \) and setting the result equal to zero, we have

$$\begin{aligned} \begin{aligned}&\dfrac{\partial }{\partial \alpha _1} \ln p(\textbf{y} \mid A_1, \alpha _1) \, p(A_1, \alpha _1) \\&\quad =\sum _{n=1}^{N}\dfrac{\left[ Y_{n}-C_{n}A_1 e^{j(\alpha _1-\beta _{n})} \right] \left[ jC_{n}A_1 e^{-j(\alpha _1-\beta _{n})} \right] }{\sigma ^2_v(n)} \\&\qquad + \sum _{n=1}^{N}\dfrac{\left[ Y_{n}^*-C_{n}A_1 e^{-j(\alpha _1-\beta _{n})} \right] \left[ -jC_{n}A_1 e^{j(\alpha _1-\beta _{n})} \right] }{\sigma ^2_v(n)} = 0. \end{aligned} \end{aligned}$$
(14)

By substituting \( Y_{n} = R_{n} e^{j\vartheta _{n}} \), we finally obtain the following expression:

$$\begin{aligned} \sum _{n=1}^{N}\dfrac{C_{n}R_{n} \sin (\vartheta _{n}+\beta _{n}-\alpha _{1})}{\sigma ^2_v(n)} = \sum _{n=1}^{N}m_{n}\sin (\theta _{n}-\alpha _{1}) = 0, \end{aligned}$$
(15)

where \( \theta _{n} = \vartheta _{n} + \beta _{n} \) and

$$\begin{aligned} \begin{aligned} m_{n} = \frac{C_{n}R_{n}}{\sigma ^2_v(n)}=\dfrac{\sigma _x(n)}{\sigma _{x}(1)}\dfrac{R_{n}}{\sigma ^2_v(n)}=\dfrac{1}{\sigma _{x}(1)}\sqrt{\zeta _{n}\gamma _{n}}. \end{aligned} \end{aligned}$$
(16)

In (16), \( \zeta _{n} \) and \( \gamma _{n} \) denote the a priori and a posteriori SNRs at the n-th microphone, respectively. These values are computed as

$$\begin{aligned} \zeta _{n} = \dfrac{\sigma ^2_x(n)}{\sigma ^2_v(n)}, \qquad \gamma _{n} = \dfrac{R_{n}^2}{\sigma ^2_v(n)}. \end{aligned}$$
(17)

Using trigonometric identities [32], the sum of N sine terms can be written as a single term, i.e.,

$$\begin{aligned} \sum _{i=1}^{N}d_i\sin (\theta _i-\omega ) = d \sin (\theta -\omega ), \end{aligned}$$
(18)

where d and \( \theta \) are given by [32] (here \( \textrm{atan2}(\cdot ,\cdot ) \) denotes the four-quadrant arctangent of its two arguments)

$$\begin{aligned} {\left\{ \begin{array}{ll} d^2 = \sum _{i,j} d_i d_j \cos (\theta _i -\theta _j),\\ \theta = \textrm{atan2} \left( \sum _{i} d_i \sin (\theta _i), \, \sum _{i} d_i \cos (\theta _i) \right) , \end{array}\right. } \end{aligned}$$
(19)

and the optimal TPJMAP phase estimate is obtained from (15) as

$$\begin{aligned} \alpha _1 = \textrm{atan2} \left( \sum _{n=1}^{N}m_{n}\sin (\vartheta _{n}+\beta _{n}), \, \sum _{n=1}^{N}m_{n}\cos (\vartheta _{n}+\beta _{n}) \right) . \end{aligned}$$
(20)

We observe that in the single-microphone case, \( N = 1 \), (20) reduces to the phase of the noisy signal, in accordance with the results presented in Lotter and Vary [25].
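As an illustration, the following Python sketch (with hypothetical per-bin inputs, not the authors' implementation) evaluates the phase estimate (20) for a single time-frequency bin; the common factor \( 1/\sigma _x(1) \) in (16) cancels in the arctangent and is therefore omitted.

```python
import numpy as np

def tpjmap_phase(theta_noisy, beta, zeta, gamma):
    """TPJMAP phase estimate of Eq. (20) for one time-frequency bin.
    theta_noisy: noisy phases at the N microphones; beta: phase delays
    (beta[0] = 0); zeta, gamma: a priori/a posteriori SNRs of Eq. (17)."""
    m = np.sqrt(zeta * gamma)  # weights of Eq. (16), up to 1/sigma_x(1)
    s = np.sum(m * np.sin(theta_noisy + beta))
    c = np.sum(m * np.cos(theta_noisy + beta))
    return np.arctan2(s, c)    # four-quadrant arctangent

# Toy N = 3 example (all values assumed for illustration):
theta = np.array([0.40, 0.52, 0.31])  # noisy phases (rad)
beta = np.array([0.00, -0.10, 0.12])  # phase delays (rad)
zeta = np.array([2.0, 1.5, 1.0])      # a priori SNRs
gamma = np.array([3.0, 2.2, 1.6])     # a posteriori SNRs
print(tpjmap_phase(theta, beta, zeta, gamma))
# With N = 1 the estimate reduces to theta[0], the noisy phase.
```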

3.1.2 Two-Parametric Joint Multimicrophone MAP Estimator to Extract the Amplitude of Clean Signal

In this subsection, the optimal estimate of the clean speech amplitude is derived. For this purpose, we first substitute (20) into (13), and then maximize (13) with respect to \( A_1 \). Differentiating and setting the result equal to zero, we get

$$\begin{aligned} \begin{aligned}&\dfrac{\partial }{\partial A_1}\ln p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1,\alpha _1) \\&\quad =- \sum _{n=1}^{N}\dfrac{[Y_{n}-C_{n}A_1 e^{j(\alpha _1-\beta _{n})}][-C_{n} e^{-j(\alpha _1-\beta _{n})}]}{\sigma ^2_v(n)}\\&\qquad - \sum _{n=1}^{N}\dfrac{[Y_{n}^*-C_{n}A_1 e^{-j(\alpha _1-\beta _{n})}][-C_{n} e^{j(\alpha _1-\beta _{n})}]}{\sigma ^2_v(n)}\\&\qquad + \dfrac{\nu }{A_1}-\dfrac{\mu }{\sigma _x(1)}=0, \end{aligned} \end{aligned}$$
(21)

as a result (21) becomes

$$\begin{aligned} \sum _{n=1}^{N}\dfrac{C_{n} e^{-j(\alpha _1-\beta _{n})}Y_{n}}{\sigma ^2_v(n)} + \sum _{n=1}^{N}\dfrac{C_{n} e^{j(\alpha _1-\beta _{n})}Y_{n}^*}{\sigma ^2_v(n)} -2 \sum _{n=1}^{N}\dfrac{A_1C_n^2}{\sigma ^2_v(n)} + \dfrac{\nu }{A_1}-\dfrac{\mu }{\sigma _x(1)} = 0. \end{aligned}$$
(22)

Substituting the amplitude and phase of the noisy signal at the n-th microphone, i.e., \( Y_{n} = R_{n} e^{j\vartheta _{n}} \), the following expression is obtained:

$$\begin{aligned} 2\sum _{n=1}^{N}\dfrac{C_{n}R_{n} \cos (\vartheta _{n} + \beta _{n} - \alpha _1)}{\sigma ^2_v(n)} - 2 A_1 \sum _{n=1}^{N}\dfrac{C_{n}^2}{\sigma ^2_v(n)} + \dfrac{\nu }{A_1} - \dfrac{\mu }{\sigma _x(1)} = 0. \end{aligned}$$
(23)

Using (16), which states that \(\frac{C_{n}R_{n}}{\sigma ^2_v(n)}=\dfrac{1}{\sigma _{x}(1)}\sqrt{\zeta _{n}\gamma _{n}}\), and

$$\begin{aligned} \begin{aligned}&\sum _{n=1}^{N}\dfrac{C_{n}^2}{\sigma ^2_v(n)}=\dfrac{1}{\sigma ^2_{x}(1)}\sum _{n=1}^{N}\dfrac{\sigma ^2_x(n)}{\sigma ^2_v(n)} =\dfrac{1}{\sigma ^2_{x}(1)}\sum _{n=1}^{N}\zeta _{n}, \end{aligned} \end{aligned}$$
(24)

(23) can be expressed in terms of the a priori and a posteriori SNRs as follows:

$$\begin{aligned} \begin{aligned}&A_1^2 \left( 2 \sum _{n=1}^{N}\zeta _{n} \right) + A_1 \left( \mu \sigma _{x}(1) - 2 \sigma _{x}(1) \sum _{n=1}^{N} \sqrt{\zeta _{n} \gamma _{n}} \cos (\vartheta _{n} + \beta _{n} - \alpha _1) \right) \\&-\nu \sigma ^2_{x}(1) = 0, \end{aligned} \end{aligned}$$
(25)

which is a quadratic equation in \( A_1 \).

The TPJMAP amplitude estimate (i.e., the positive solution of the quadratic equation (25)) is obtained by applying a gain factor to the noisy amplitude at the first microphone, i.e., \( \hat{A}_{1} = G_T R_1 \), where the gain \( G_T \) is given by

$$\begin{aligned} G_T = \dfrac{A_1}{R_1} = \dfrac{\sqrt{\dfrac{\zeta _1}{\gamma _1}}}{4 \sum _{n=1}^{N} \zeta _{n}} \textrm{Re} \Bigg \{ 2 \Delta -\mu + \sqrt{\mu ^2 + 4 \Delta ^2 - 4 \mu \Delta + 8 \nu \left( \sum _{n=1}^{N} \zeta _{n} \right) } \Bigg \}, \end{aligned}$$
(26)

where \( \Delta = \sum _{n=1}^{N} \sqrt{\zeta _{n}\gamma _{n}} \cos (\varphi _n) \) and \( \varphi _n = \vartheta _{n} + \beta _{n} - \alpha _1 \).

Again, we observe that in the single-microphone case, \( N = 1 \), (26) coincides with the joint MAP amplitude estimate presented in Lotter and Vary [25]. Indeed, as stated earlier, the first part of this contribution extends the work of Lotter and Vary [25] to the multimicrophone case.
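A per-bin sketch of the gain (26) follows, reusing the assumed inputs of the phase example above; the default `mu` and `nu` are the values used later in the simulations, and the complex square root followed by taking the real part mirrors the \( \textrm{Re}\{\cdot \} \) operator in (26).

```python
import numpy as np

def tpjmap_gain(theta_noisy, beta, zeta, gamma, alpha1, mu=1.74, nu=0.126):
    """TPJMAP amplitude gain G_T of Eq. (26); apply as A1_hat = G_T * R_1."""
    phi = theta_noisy + beta - alpha1            # phi_n defined below Eq. (26)
    delta = np.sum(np.sqrt(zeta * gamma) * np.cos(phi))
    zsum = np.sum(zeta)
    root = np.sqrt(mu**2 + 4*delta**2 - 4*mu*delta + 8*nu*zsum + 0j)
    return np.sqrt(zeta[0] / gamma[0]) / (4*zsum) * np.real(2*delta - mu + root)
```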

3.2 One-Parametric Joint Multimicrophone MAP (OPJMAP) Estimator Using Super-Gaussian Statistics

Following Gerkmann [14], in this part we consider another super-Gaussian distribution, which models the amplitude of the speech signal using a shape parameter \(\eta \), as follows [14]:

$$\begin{aligned} p(A_n) = {\left\{ \begin{array}{ll} \dfrac{2}{\Gamma (\eta )} \left( \dfrac{\eta }{\sigma ^{2}_x(n)}\right) ^\eta A_n^{2\eta -1} \exp \left( \dfrac{-\eta A_n^2}{\sigma _x^2(n)} \right) , \qquad &{} A_n >0 \\ 0, \qquad &{} \text {else.} \end{array}\right. } \end{aligned}$$
(27)

Assuming a uniform distribution for the phase, the joint PDF is expressed as [14]

$$\begin{aligned} p(A_n, \alpha _n) = \dfrac{1}{\pi \Gamma (\eta )} \left( \dfrac{\eta }{\sigma ^{2}_x(n)}\right) ^\eta A_n^{2\eta -1} \exp \left( \dfrac{-\eta A_n^2}{\sigma _x^2(n)} \right) , \end{aligned}$$
(28)

considering a Gaussian distribution for the noise signal, it is easily seen that the derivation of the optimal phase estimate is identical to the one leading to (20); hence, we only derive the optimal amplitude here.
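As a quick sanity check (our own illustration, with \( \sigma ^2_x(n) = 1 \) assumed), the prior (27) can be evaluated and numerically normalized in the same way as the two-parametric prior; for \( \eta = 0.5 \) it reduces to a half-normal density.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def amplitude_prior_1p(a, eta, sigma_x2):
    """One-parametric super-Gaussian amplitude PDF of Eq. (27), for a > 0."""
    a = np.asarray(a, dtype=float)
    pdf = (2.0 / gamma_fn(eta) * (eta / sigma_x2) ** eta
           * a ** (2 * eta - 1) * np.exp(-eta * a ** 2 / sigma_x2))
    return np.where(a > 0, pdf, 0.0)

a = np.linspace(1e-4, 8.0, 8000)
p = amplitude_prior_1p(a, eta=0.5, sigma_x2=1.0)
print(p.sum() * (a[1] - a[0]))  # close to 1: properly normalized
```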

3.2.1 One-Parametric Joint Multimicrophone MAP Estimator to Extract the Amplitude of Clean Signal

Similar to the TPJMAP case, it suffices to maximize

$$\begin{aligned} \begin{aligned} \hat{A}_1=\arg \max _{A_1} p(\textbf{y} \mid A_1, \alpha _1) \, p(A_1, \alpha _1). \end{aligned} \end{aligned}$$
(29)

Substituting (28) into (29), we get

$$\begin{aligned} \begin{aligned}&p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1, \alpha _1) \\&\quad = \prod _{n=1}^{N} \dfrac{1}{\pi \sigma ^2_v(n)} \exp \left( -\sum _{n=1}^{N} \dfrac{|Y_{n}- C_{n}A_1 e^{j(\alpha _1-\beta _{n})} |^2}{\sigma ^2_v(n)} \right) \\&\qquad \dfrac{1}{\pi \Gamma (\eta )} \left( \dfrac{\eta }{\sigma ^{2}_x(1)} \right) ^\eta A_1^{2\eta -1} \exp \left( \dfrac{-\eta A_1^2}{\sigma _x^2(1)} \right) , \end{aligned} \end{aligned}$$
(30)

as explained before, applying \( \ln (\cdot ) \) and neglecting the terms that play no role in the optimization, we proceed with

$$\begin{aligned} \begin{aligned}&\arg \max _{A_1}\ln p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1,\alpha _1) \\&\quad = -\sum _{n=1}^{N} \dfrac{|Y_{n}- C_{n}A_1 e^{j(\alpha _1-\beta _{n})} |^2}{\sigma ^2_v(n)}+(2\eta -1) \ln A_1 \\&\qquad - \eta \dfrac{A_1^2}{\sigma _x^2(1)}. \end{aligned} \end{aligned}$$
(31)

Differentiating and setting the result equal to zero, we get

$$\begin{aligned} \begin{aligned}&\dfrac{\partial }{\partial A_1}\ln p(\textbf{y} \mid A_1,\alpha _1) \, p(A_1,\alpha _1) \\&\quad =-\sum _{n=1}^{N}\dfrac{[Y_{n}-C_{n}A_1 e^{j(\alpha _1-\beta _{n})}][-C_{n} e^{-j(\alpha _1-\beta _{n})}]}{\sigma ^2_v(n)}\\&\qquad -\sum _{n=1}^{N}\dfrac{[Y_{n}^*-C_{n}A_1 e^{-j(\alpha _1-\beta _{n})}][-C_{n} e^{j(\alpha _1-\beta _{n})}]}{\sigma ^2_v(n)}\\&\qquad +\dfrac{2\eta -1}{A_1}-\dfrac{2\eta A_1}{\sigma _x^2(1)}=0, \end{aligned} \end{aligned}$$
(32)

which leads to

$$\begin{aligned} \begin{aligned}&\sum _{n=1}^{N}\dfrac{C_{n} e^{-j(\alpha _1-\beta _{n})}Y_{n}}{\sigma ^2_v(n)} +\sum _{n=1}^{N}\dfrac{C_{n} e^{j(\alpha _1-\beta _{n})}Y_{n}^*}{\sigma ^2_v(n)}\\&-2 \sum _{n=1}^{N}\dfrac{A_1C_n^2}{\sigma ^2_v(n)} + \dfrac{2\eta -1}{A_1}-\dfrac{2\eta A_1}{\sigma _x^2(1)} = 0. \end{aligned} \end{aligned}$$
(33)

Substituting the amplitude and phase of the noisy signal at the n-th microphone, i.e., \( Y_{n} = R_{n} e^{j\vartheta _{n}} \), the following equation is obtained:

$$\begin{aligned} \begin{aligned}&2\sum _{n=1}^{N}\dfrac{C_{n}R_{n}\cos (\vartheta _{n}+\beta _{n}-\alpha _1)}{\sigma ^2_v(n)} -2A_1 \sum _{n=1}^{N}\dfrac{C_{n}^2}{\sigma ^2_v(n)}\\&\quad +\dfrac{2\eta -1}{A_1}-\dfrac{2\eta A_1}{\sigma _x^2(1)} = 0. \end{aligned} \end{aligned}$$
(34)

It readily follows that (34) is also a quadratic equation in \(A_1\), i.e.,

$$\begin{aligned} \begin{aligned}&A_1^2\left( 2\sum _{n=1}^{N}\zeta _{n}+2\eta \right) + A_1\left( -2\sigma _{x}(1)\sum _{n=1}^{N}\sqrt{\zeta _{n}\gamma _{n}}\cos (\vartheta _{n}+\beta _{n}-\alpha _1)\right) \\&-(2\eta -1)\sigma ^2_{x}(1)=0. \end{aligned} \end{aligned}$$
(35)

The OPJMAP amplitude estimate is likewise obtained by applying a gain factor to the noisy amplitude at the first microphone, i.e., \( \hat{A}_{1} = G_O R_1 \), where the gain \( G_O \) is obtained as

$$\begin{aligned} \begin{aligned} G_O =\dfrac{A_1}{R_1} = \dfrac{\sqrt{\dfrac{\zeta _1}{\gamma _1}}}{2\sum _{n=1}^{N}\zeta _{n}+2\eta } \textrm{Re} \Bigg \{\Delta + \sqrt{\Delta ^2 + 2(2\eta -1)\left( \sum _{n=1}^{N}\zeta _{n}+\eta \right) } \Bigg \}, \end{aligned} \end{aligned}$$
(36)

where \( \Delta = \sum _{n=1}^{N} \sqrt{\zeta _{n}\gamma _{n}} \cos (\varphi _n) \) and \( \varphi _n = \vartheta _{n}+\beta _{n}-\alpha _1 \).
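A companion sketch of the gain (36), under the same assumed per-bin inputs as the TPJMAP example; the complex square root again mirrors the \( \textrm{Re}\{\cdot \} \) operator, which matters when \( \eta < 0.5 \) can make the discriminant negative.

```python
import numpy as np

def opjmap_gain(theta_noisy, beta, zeta, gamma, alpha1, eta=0.5):
    """OPJMAP amplitude gain G_O of Eq. (36); apply as A1_hat = G_O * R_1."""
    phi = theta_noisy + beta - alpha1            # phi_n defined below Eq. (36)
    delta = np.sum(np.sqrt(zeta * gamma) * np.cos(phi))
    zsum = np.sum(zeta)
    root = np.sqrt(delta**2 + 2*(2*eta - 1)*(zsum + eta) + 0j)
    return np.sqrt(zeta[0] / gamma[0]) / (2*zsum + 2*eta) * np.real(delta + root)
```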

4 Simulation Results

In these experiments, we implemented the STFT with \(75\%\)-overlapping frames and a Hamming analysis window, at a sampling frequency of \( f_s = 16 \) kHz. We set \( \mu = 1.74 \), \( \nu = 0.126 \), and \( \eta = 0.5 \), as presented in Lotter and Vary [25] and Gerkmann [14], respectively. To avoid the effect of phase delay estimation errors, we consider a perfectly synchronized situation in which accurate estimates of \( \beta _{n} \) are provided.

4.1 Computation of Correlation Matrices

The correlation matrix of the noisy vector is commonly computed using a forgetting factor, e.g., \( \lambda _y \), by recursively estimating the matrix as a linear combination of the correlation matrix at the previous frame and the outer product of the noisy vector at the current frame [41], i.e.,

$$\begin{aligned} \hat{\mathbf {\Phi }}_{\textbf{y}}(m,k) = \lambda _y \hat{\mathbf {\Phi }}_{\textbf{y}}(m-1,k) + (1-\lambda _y) \textbf{y}(m,k) \textbf{y} ^H(m,k). \end{aligned}$$
(37)

To compute the correlation matrix of the noise signal, we use the speech presence probability (SPP), as presented in Souden et al. [41]. Using an SPP-based forgetting factor, the correlation matrix of the noise signal is recursively updated as follows:

$$\begin{aligned} \hat{\mathbf {\Phi }}_{\textbf{v}}(m,k) = \lambda _{v,spp} \hat{\mathbf {\Phi }}_{\textbf{v}}(m-1,k) + (1-\lambda _{v,spp}) \textbf{y}(m,k) \textbf{y}^H(m,k), \end{aligned}$$
(38)

with \( \lambda _{v,spp} = \lambda _v +(1-\lambda _v)\textrm{SPP}(m,k) \), where \( \textrm{SPP}(m,k) \) represents the SPP at the current frame and \( \lambda _v \) denotes the forgetting factor. As can be seen, the value of \( \textrm{SPP}(m,k) \) is needed to compute \(\hat{\mathbf {\Phi }}_{\textbf{v}}(m,k) \); on the other hand, \( \hat{\mathbf {\Phi }}_{\textbf{v}}(m,k)\) is required to compute \( \textrm{SPP}(m,k) \) [41]. The two quantities are thus mutually dependent, and an iterative algorithm was proposed in Souden et al. [41]: first, using the noise correlation matrix of the previous frame, initial estimates of \( \textrm{SPP}(m,k) \) and, correspondingly, \( \lambda _{v,spp} \) are obtained; in the second step, these initial values are used to update the noise correlation matrix. It was shown in Souden et al. [41] that two iterations are sufficient to obtain good estimates of both the SPP and the noise correlation matrix. In our simulations, we used the first 10 silent frames to obtain an initial estimate of the noise correlation matrix. Throughout the experiments, we set \( q = 0.5 \) to compute the SPP, as presented in Gerkmann et al. [17].

Since the speech and noise signals are assumed to be uncorrelated, the correlation matrix of the speech signal is computed as

$$\begin{aligned} \hat{\mathbf {\Phi }}_{\textbf{x}}(m,k) = \hat{\mathbf {\Phi }}_{\textbf{y}}(m,k)- \hat{\mathbf {\Phi }}_{\textbf{v}}(m,k). \end{aligned}$$
(39)

It should be noted that we set the negative eigenvalues of \( \hat{\mathbf {\Phi }}_{\textbf{x}}(m,k) \) to zero to ensure that the resulting matrix is positive semi-definite.
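The per-bin recursions (37)–(39) can be sketched as follows; this is a simplified illustration in which the SPP value is taken as given rather than computed iteratively as in Souden et al. [41], and the eigenvalue clipping implements the positive semi-definiteness safeguard mentioned above.

```python
import numpy as np

def update_correlations(Phi_y, Phi_v, y, spp, lam_y=0.92, lam_v=0.92):
    """One recursive update of the noisy/noise/speech correlation matrices
    for a single frequency bin; y is the length-N noisy STFT vector."""
    yyH = np.outer(y, y.conj())
    Phi_y = lam_y * Phi_y + (1 - lam_y) * yyH              # Eq. (37)
    lam_v_spp = lam_v + (1 - lam_v) * spp                  # SPP-based factor
    Phi_v = lam_v_spp * Phi_v + (1 - lam_v_spp) * yyH      # Eq. (38)
    Phi_x = Phi_y - Phi_v                                  # Eq. (39)
    w, V = np.linalg.eigh(Phi_x)                           # Hermitian eigendecomposition
    Phi_x = (V * np.maximum(w, 0.0)) @ V.conj().T          # clip negative eigenvalues
    return Phi_y, Phi_v, Phi_x
```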

4.2 Performance in Simulated Scenarios

To evaluate the noise reduction performance of the considered estimators, we simulated a rectangular room with dimensions \( 4.5 \times 4.5 \times 3 \) m (width \( \times \) length \( \times \) height) and set the reverberation time to \( RT_{60} = 200 \) ms.

Emphasizing that the proposed algorithms do not depend on the geometry or arrangement of the microphones, we performed 30 randomized trials. In each trial, we randomly chose the positions of the microphones, the clean speech source, and the interference source. The number of microphones was varied between \( N = 3 \) and \( N = 8 \). Figure 1 depicts an example of these configurations.

We have assessed the noise reduction performance of the considered estimators for a diverse set of speakers, including male, female, young, and old speakers. Utterances of eight male and eight female speakers from the TIMIT database [12] were used as the clean source signals. The simulation results are presented as averages over four randomized samples and the 30 trials.

Fig. 1: Description of the simulated acoustic scenario

To generate the noisy signals, we assume that the received clean speech signals at the different microphones are degraded by additive white Gaussian noise at full-band input SNRs ranging from \( -10 \) dB to 15 dB, where the full-band input SNR is defined as \( \mathrm{SNR} = 10 \log _{10} \dfrac{\sum _t |x_{1}(t) |^2}{\sum _t |v_{1}(t) |^2} \). In addition, the microphone signals are corrupted by interfering noises, including stationary pink and non-stationary babble noise (see Fig. 1), at a full-band input signal-to-interference ratio (SIR) of 5 dB. We have utilized the well-known image method [3] to generate a good approximation of the room impulse responses between the signal sources (speech and interference) and the microphones.
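For reference, the following small helper (with assumed signal names; our own illustration) scales a noise signal so that the full-band input SNR at the reference microphone, as defined above, matches a prescribed target.

```python
import numpy as np

def scale_noise_to_snr(x1, v1, target_snr_db):
    """Scale v1 so that 10*log10(sum|x1|^2 / sum|v1|^2) equals the target."""
    current_db = 10 * np.log10(np.sum(np.abs(x1)**2) / np.sum(np.abs(v1)**2))
    return v1 * 10 ** ((current_db - target_snr_db) / 20)
```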

Fig. 2: Performance of the proposed estimators in terms of a PESQ and b SegSNR improvement for different forgetting factors, in the case of stationary pink interference signal when SIR \( =5 \) dB and the SNR for additive white Gaussian noise is 5 dB

We first show how the performance of the proposed estimators (named TPJMAP-Super and OPJMAP-Super) varies with the forgetting factors \(\lambda _y\) and \(\lambda _v\), which are required to compute the correlation matrices of the noisy signal (37) and the noise signal (38), respectively. Figure 2 illustrates the performance of the proposed estimators in terms of the perceptual evaluation of speech quality (PESQ) [23] and the segmental SNR (SegSNR) [16] improvement between the enhanced signal and the noisy reference signal. As mentioned earlier, we consider the clean speech signal at the first microphone as the reference signal. PESQ is one of the best-known measures of speech quality. To account for both noise reduction and speech distortion, we also report the SegSNR improvement, which averages the SNR over speech-active segments. In this experiment, we implemented the STFT with 512 fast Fourier transform (NFFT) points and fixed the SNR for additive white Gaussian noise at 5 dB and the SIR at 5 dB for the stationary pink interference signal. It should also be noted that, as mentioned in Huang and Benesty [20], we set \(\lambda _y = \lambda _v = \lambda \) to obtain the same tracking characteristics for both correlation matrices.

Fig. 3: Performance of the proposed estimators in terms of a PESQ and b SegSNR improvement for different NFFTs, in the case of stationary pink interference signal when SIR \( =5 \) dB and the SNR for additive white Gaussian noise is 5 dB

We observe that although the TPJMAP-Super performs slightly better, both algorithms follow the same trend and achieve their best performance at \( \lambda = 0.92\). Due to the averaging process, a large \(\lambda \), which emphasizes previous samples, usually yields a reliable estimate of the correlation matrices. On the other hand, with a large \(\lambda \) we cannot track the short-term variations of the speech signal, which is inherently non-stationary. As expected, therefore, a moderate \(\lambda \) yields the best performance in terms of PESQ and SegSNR.

In Fig. 3, we examine how the NFFT affects the performance of the proposed algorithms. The NFFT determines the sampling resolution in the frequency domain, in which the proposed algorithms are derived. In this experiment, we fixed \(\lambda _y = \lambda _v = 0.92\), the SNR for additive white Gaussian noise at 5 dB, and the SIR at 5 dB for the stationary pink interference signal. Note that the higher the FFT resolution, the larger the gap between two consecutive frames in time. We observe that the best performance is achieved with a moderate NFFT of 512.

Therefore, in the next experiments, we fix \( \lambda _y = \lambda _v = 0.92\) and implement the STFT using NFFT\( = 512\).

Fig. 4: a Spectrogram of the reference clean speech signal, b the noisy signal, c the enhanced signal using the proposed TPJMAP-Super estimator, and d the enhanced signal using the proposed OPJMAP-Super, in the case of stationary pink interference signal when SIR \( =5 \) dB and the SNR for additive white Gaussian noise is 5 dB

Figure 4 illustrates the spectrograms of the reference clean speech signal and the noisy signal, along with the enhanced signals produced by the proposed estimators, in the presence of the stationary pink interference signal with SIR \( =5 \) dB and additive white Gaussian noise at SNR \( = 5 \) dB. The spectrograms confirm the merits of both proposed estimators: the noise is reduced while the clean speech signal is largely preserved. The TPJMAP-Super appears to provide a better estimate of the clean speech amplitude than the OPJMAP-Super, leading to more noise reduction, which is consistent with the previous experiments.

In the following, we also report experiments in which we compared the noise reduction performance of the proposed joint multimicrophone MAP estimators with four baseline methods: (1) the super-Gaussian MAP-based amplitude estimator [24], in which super-Gaussian statistics are used only for a multimicrophone MAP estimate of the amplitude while the phase is kept unchanged (named AMAP-Super); (2) the MMSE estimator presented by Trawicki and Johnson [44], in which both the amplitude and phase of the speech signal are derived assuming Gaussian statistics (named MMSE-Gaussian); (3) an MVDR filter followed by a super-Gaussian MMSE estimator (named MVDR-MMSE-Super) [11, 19]; and (4) the centralized multichannel Wiener filter (named CMWF), as presented in Bertrand and Moonen [6].

Fig. 5: Performance of the considered estimators in terms of a PESQ, b STOI, c SegSNR improvement, and d LSD, in the case of stationary pink interference signal when SIR \( =5 \) dB and the SNR for additive white Gaussian noise ranges from \( -10 \) dB to 15 dB

We also compare the noise reduction performance of the considered estimators in terms of two additional measures: short-time objective intelligibility (STOI) [42] and log-spectral distortion (LSD) [1]. STOI is used to evaluate speech intelligibility [35], while LSD quantifies speech distortion; the lower the LSD, the lower the speech distortion.

Figures 5 and 6 illustrate the performance of the considered estimators in terms of PESQ, STOI, SegSNR improvement, and LSD when the SNR for additive white Gaussian noise ranges from \( -10 \) dB to 15 dB, while maintaining SIR \( = 5 \) dB in the presence of stationary pink and non-stationary babble interference signals, respectively.

Fig. 6: Performance of the considered estimators in terms of a PESQ, b STOI, c SegSNR improvement, and d LSD, in the case of non-stationary babble interference signal when SIR \( =5 \) dB and the SNR for additive white Gaussian noise ranges from \( -10 \) dB to 15 dB

It is seen that the PESQ improvements obtained by the proposed estimators are considerably larger than those of the other methods for all input SNRs. Indeed, the proposed estimators exhibit superior performance by unifying the advantages of phase estimation and a more accurate super-Gaussian speech model. It is also observed that the TPJMAP-Super estimator, which uses the PDF of [25], provides a higher approximation accuracy than the OPJMAP-Super, which uses the PDF of [14], and consequently yields more improvement.

Compared to the AMAP-Super, which also considers super-Gaussian statistics, the proposed estimators benefit from the important effect of phase estimation on speech quality. Compared to the MMSE-Gaussian, which estimates both amplitude and phase under the MMSE criterion, the proposed estimators take advantage of the more accurate super-Gaussian modeling of the speech signal. Note also that, compared to the MMSE criterion, MAP-based estimators can be implemented more efficiently, since they do not require the computation of expensive Bessel or confluent hypergeometric functions; indeed, based on (25) and (35), the proposed JMAP estimators are obtained simply by solving quadratic equations. Besides, Wolfe and Godsill [48] have shown that MAP-based estimators are an excellent alternative to MMSE-based estimators in terms of comparative performance. The poor performance of the MVDR-MMSE-Super estimator can be explained from several viewpoints. First, the MVDR filter is highly sensitive to estimation errors in the noise correlation matrix. Second, this estimator must invert the \( N \times N \) noise correlation matrix, a problem that becomes more acute, and more error-prone, as the number of microphones in the network increases. Third, this estimator uses the one-parametric super-Gaussian function to model the statistical properties of the speech signal; as mentioned before, the two-parametric model appears to provide a higher approximation accuracy. Compared to the CMWF, the proposed estimators also provide larger improvements. Like the MVDR-MMSE-Super, the CMWF must invert the noise correlation matrix, which leads to estimation errors for large numbers of microphones. In addition, inverting an \( N \times N \) matrix requires \(\mathcal {O}(N^3)\) operations, while the complexity of the proposed estimators grows only linearly with the number of microphones (\(\mathcal {O}(N)\)).

Concerning STOI, Figs. 5b and 6b show that the proposed methods significantly increase the STOI and, consequently, speech intelligibility. This result is supported by Kazama et al. [21], who emphasized the perceptual importance of phase and showed that the enhanced phase plays a substantial role in improving speech intelligibility. Indeed, our proposed estimators incorporate both the super-Gaussian model and phase estimation.

Regarding SegSNR, Figs. 5c and 6c indicate that the TPJMAP-Super outperforms the rest for all input SNRs, followed by the OPJMAP-Super. These results demonstrate that the proposed estimators achieve a good trade-off between noise reduction and speech distortion, allowing a higher SegSNR improvement.

In terms of LSD, we observe that, after the TPJMAP-Super, the MVDR-MMSE-Super performs slightly better than the others at low SNRs. At mid and high SNRs, the compared estimators deliver approximately similar results.

A brief comparison of the considered algorithms is summarized in Table 1.

Table 1 Comparison of the considered algorithms for noise reduction in wireless acoustic sensor networks

4.3 Performance in Realistic Scenarios

To compare the performance of the estimators in a realistic scenario, we use data recorded in a laboratory at the University of Oldenburg [39]. The laboratory dimensions are \( x = 7 \) m, \( y = 6 \) m, \( z= 2.7 \) m, and the reverberation time of the room is \( RT_{60} \simeq 350 \) ms. The considered WASN contains four microphones: two of them, corresponding to the two sides of a hearing aid, are located in the middle of the room; one further microphone is placed at (\( x = 4.64 \) m, \( y = 2.63 \) m, \( z = 2 \) m) and another at (\( x = 2.36 \) m, \( y = 2.63 \) m, \( z = 2 \) m). The source speech signal, a 24-second utterance by a male speaker located at (\( x = 4.64\) m, \( y = 4.63 \) m, \( z = 2 \) m), is received by the microphones at a sampling frequency of \( f_s = 16 \) kHz. In addition, two different types of noise, factory and babble, are used to produce diffuse additive noise. These signals are generated by four loudspeakers located in the four corners of the room and are recorded at different SNRs together with the clean signal.

Fig. 7: Performance of the considered estimators in terms of a PESQ, b STOI, c SegSNR improvement, and d LSD for several full-band input SNRs, in the presence of factory noise, \(RT_{60} = 350\) ms

Fig. 8: Performance of the considered estimators in terms of a PESQ, b STOI, c SegSNR improvement, and d LSD for several full-band input SNRs, in the presence of babble noise, \(RT_{60} = 350\) ms

Figures 7 and 8 illustrate the performance of the considered estimators in terms of PESQ, STOI, SegSNR improvement, and LSD when the SNR for additive noise ranges from \( -10 \) dB to 15 dB in the presence of factory and babble noises, respectively.

Concerning PESQ, Figs. 7a and 8a show that although at SNR \( =-10 \) dB the considered estimators fail to improve the PESQ value, the proposed joint multimicrophone MAP estimators deliver considerable improvement at the other SNRs, emphasizing their ability to improve speech quality in realistic scenarios. It is also observed that, at low SNRs, the MMSE-Gaussian, which estimates both amplitude and phase under the Gaussian assumption, provides more improvement than the AMAP-Super, which uses super-Gaussian statistics but estimates only the amplitude of the clean signal and keeps the phase unchanged. At high SNRs, however, the AMAP-Super outperforms the MMSE-Gaussian.

In terms of STOI and SegSNR, we observe a similar trend: the proposed estimators achieve larger STOI and SegSNR improvements than the other methods for all SNRs. Also, while the MMSE-Gaussian achieves better results than the AMAP-Super at low SNRs, the AMAP-Super achieves a larger improvement at high SNRs.

Regarding LSD, we observe that the TPJMAP-Super produces the best performance at low SNRs. At mid and high SNRs, all estimators perform about the same.

5 Conclusion

In this work, we proposed two multimicrophone estimators that jointly estimate the amplitude and phase of the clean speech signal under the MAP criterion. The proposed estimators are based on two existing PDFs that model the amplitude of the clean speech signal with super-Gaussian statistics. For both, we provided closed-form solutions for the amplitude and the phase of the clean speech signal.

The performance of the proposed multimicrophone MAP estimators was investigated for different kinds of noise in terms of PESQ, STOI, segmental SNR improvement, and LSD. The simulation results confirmed the superiority of both proposed estimators in all situations. By taking advantage of phase estimation and a more accurate super-Gaussian speech model, the proposed estimators yield remarkable PESQ and STOI improvements. Moreover, by striking a good trade-off between noise reduction and speech distortion, the proposed algorithms achieve a higher output segmental SNR than the benchmarks in both simulated and realistic scenarios.

The proposed estimators were derived under the assumption of a uniform distribution for the speech spectral phase. Although they enhance the speech signal significantly, investigating the applicability of other distributions, especially the von Mises distribution [14], is a worthwhile direction for future work.