
1 Introduction

Nowadays, speech-based applications such as automatic speech recognition (ASR) systems are becoming increasingly important. However, the performance of in-car ASR systems deteriorates considerably because of background noise and other disturbances [1]. An essential step is therefore to develop effective speech enhancement methods that suppress the disturbing background noise and improve the intelligibility of the captured speech under noisy in-car conditions. In recent years, devices activated by human voice commands have received much attention from car manufacturers. However, the performance of currently available commercial products degrades substantially under real-world conditions. The noises originating from pumps, engines, audio equipment, wind, the road, air-conditioning, radio and communication systems are usually non-stationary and time-varying [1, 2]. Consequently, several speech enhancement techniques dealing with in-car noise have been developed in recent years to extract the original speech from the noisy observation. The present scheme involves two main algorithms: beamforming exploits the time correlation of the speech signals captured by the microphones, while the Kalman filter uses the different statistics of speech and noise to separate them. The purpose of combining beamforming and Kalman filtering is to exploit the strength of each individual method and to improve the quality of the speech. When the two methods are combined, a strongly interfering source is first suppressed by beamforming, and the remaining noise is removed by an adaptive Kalman filter [3]. Thus, they produce better performance when used jointly rather than independently.

2 Microphone Array-Based Speech Enhancement System

The in-car acoustic environment contains various sources of disturbance besides the speaker. To separate the speech signal of the speaker from these disturbances, a multichannel speech enhancement system based on a microphone array is used. By processing the microphone array signals appropriately, a direction-dependent sensitivity towards the source can be achieved. This technique is called beamforming.

2.1 Beamforming

Beamforming isolates a source arriving from a specific direction while still maintaining some directionality towards the received signal. The beamformer output can be represented as a linear combination of the sensor outputs

$$ R\left( l \right) = \mathop \sum \limits_{i = 1}^{M} w_{i} \left( l \right)Y_{i} \left( l \right) $$
(1)

where wi(l) is the weight applied to the ith sensor and M is the number of sensors. In vector notation, with \( Y(l) = \left[ {Y_{1} (l) \ldots Y_{M} (l)} \right]^{T} \), the output becomes

$$ \begin{aligned} R(l) & = w^{T} (l)Y(l) \\ w(l) & = \left[ {w_{1} (l) \ldots w_{M} (l)} \right]^{T} \\ \end{aligned} $$
(2)

Beamforming can be classified as conventional (fixed) or adaptive: in conventional beamforming the weights are fixed over time, whereas in adaptive beamforming the weights vary according to the acoustic surroundings.
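To make Eqs. (1) and (2) concrete, the following is a minimal Python/NumPy sketch of the beamformer output computation for a single frame. The array shapes, the number of sensors and the uniform delay-and-sum-style weights are illustrative assumptions, not parameters taken from the proposed scheme.

```python
import numpy as np

def beamformer_output(W, Y):
    """Eq. (1)/(2): R(l) = w(l)^T Y(l), evaluated at every frequency bin l.

    W : complex weights of shape (L, M), one weight vector per bin
    Y : sensor spectra of shape (L, M) for one frame of M microphones
    Returns the beamformed spectrum R of shape (L,).
    """
    return np.einsum('lm,lm->l', W, Y)

# Illustrative example: M = 4 sensors, L = 257 bins, uniform weights w = 1/M
# (i.e. a delay-and-sum beamformer applied to pre-steered signals).
L, M = 257, 4
Y = np.random.randn(L, M) + 1j * np.random.randn(L, M)   # stand-in sensor spectra
W = np.full((L, M), 1.0 / M, dtype=complex)
R = beamformer_output(W, Y)
print(R.shape)   # (257,)
```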

2.1.1 Fixed Beamforming

In conventional beamforming [4], the weights \( w_{i} \left( l \right) \) are fixed and determined by minimizing the signal power at the beamformer output, subject to a constraint that ensures that the desired signal is left undistorted [4], i.e., the optimal weights are the solution to

$$ \mathop {\hbox{min} }\limits_{w(l)} \;w^{*} (l)\Psi _{yy} (l)w(l)\quad {\text{subject}}\;{\text{to}}\;w^{*} (l)1 = 1 $$
(3)

where * denotes the complex conjugate transpose and Ψyy(l) is the M × M power spectral density matrix of the noisy speech signal, whose (i, j)th entry is \( E\left[ {Y_{i} (l)Y_{j}^{*} (l)} \right] \). The constraint of zero distortion in the desired direction is expressed by a vector of ones, since the array is assumed to be pre-steered [5] towards the desired signal direction. The solution of this constrained optimization is the minimum variance distortionless response (MVDR) beamformer [6]

$$ w\left( l \right) = \frac{{\Psi _{ww}^{ - 1} \left( l \right)1}}{{1^{T}\Psi _{ww}^{ - 1} \left( l \right)1}} $$
(4)

where Ψww(l) is the M × M noise PSD matrix whose (i, j)th entry is \( E\left[ {W_{i} (l)W_{j}^{*} (l)} \right] \). If the noise field is assumed to be homogeneous, the solution can be written in terms of the noise coherence matrix

$$ w\left( l \right) = \frac{{\Gamma _{ww}^{ - 1} \left( l \right)1}}{{1^{T} \Gamma _{ww}^{ - 1} \left( l \right)1}} $$
(5)

The (i, j)th entry of the M × M coherence matrix is

$$ \Gamma _{ij} \left( l \right) = \frac{{\Psi _{{w_{i} w_{j} }} \left( l \right)}}{{\sqrt {\Psi _{{w_{i} w_{i} }} \left( l \right)\Psi _{{w_{j} w_{j} }} \left( l \right)} }} $$
(6)
$$ \quad = \frac{{\Psi _{{w_{i} w_{j} }} \left( l \right)}}{{\Psi _{ww} \left( l \right)}} $$
(7)

In the above equation, \( \Psi _{{w_{i} w_{j} }} \left( l \right) \) is the cross power spectral density of the noise signals at the ith and jth sensors. Under the assumption of a homogeneous noise field, \( \Psi _{{w_{i} w_{i} }} \left( l \right) =\Psi _{ww} \left( l \right) \) for all i. For an incoherent noise field, \( \Gamma _{ww} = {\text{I}} \), \( w = \frac{1}{M} \) 1, and the minimum variance distortionless response beamformer reduces to a delay-and-sum beamformer, in which the sensor output signals are first delayed and then averaged. The pre-steering provides the delays, so the speech components at the various sensors add constructively while the noise components tend to cancel. In a delay-and-sum beamformer, the amplitude weights are fixed and the phase weights introduce the delays, whereas both the amplitude and phase weights vary in a filter-and-sum beamformer (FSB). FSBs are used to design beamformers with a specific directional pattern for microphone arrays. Most practical in-car noise fields are approximately diffuse, and the corresponding coherence function has the form

$$ \Gamma_{ij} (l) = {\text{sinc}}\left( {\frac{2\pi l}{k}\frac{{D_{ij} f_{s} }}{c}} \right) $$
(8)

where k is the frame length, \( f_{s} \) is the sampling frequency, \( D_{ij} \) is the distance between the ith and jth sensors in the array, c is the velocity of sound in air (c = 339 m/s), and \( {\text{sinc}}\left( z \right) = \sin \left( z \right)/z \).

When this coherence matrix is used in Eq. (5), the resulting beamformer is known as a superdirective beamformer (SDB). Although the SDB is well suited to diffuse noise fields, it has the disadvantage of amplifying uncorrelated noise at low frequencies. This can be overcome by incorporating a white noise gain constraint in the design.
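As an illustration of Eqs. (5)–(8), the sketch below builds the diffuse-field coherence matrix for a hypothetical linear microphone array and computes the corresponding superdirective (MVDR) weights; a small diagonal loading term stands in for the white noise gain constraint mentioned above. The array geometry, sampling rate and loading value are assumptions made only for this example.

```python
import numpy as np

def diffuse_coherence(freqs_hz, positions_m, c=339.0):
    """Diffuse-field coherence of Eq. (8): Gamma_ij(f) = sinc(2*pi*f*D_ij/c)."""
    D = np.abs(positions_m[:, None] - positions_m[None, :])       # pairwise distances D_ij
    # np.sinc(x) = sin(pi*x)/(pi*x), so the argument is passed divided by pi
    return np.sinc(2.0 * freqs_hz[:, None, None] * D[None, :, :] / c)

def sdb_weights(Gamma, mu=1e-2):
    """Superdirective weights per Eq. (5): w = Gamma^{-1} 1 / (1^T Gamma^{-1} 1).
    Diagonal loading mu*I limits white-noise amplification at low frequencies."""
    L, M, _ = Gamma.shape
    ones = np.ones(M)
    W = np.empty((L, M))
    for l in range(L):
        g_inv_1 = np.linalg.solve(Gamma[l] + mu * np.eye(M), ones)
        W[l] = g_inv_1 / (ones @ g_inv_1)
    return W

# Hypothetical uniform linear array: 4 microphones, 5 cm spacing, 8 kHz sampling.
fs, nfft = 8000, 512
freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
positions = np.arange(4) * 0.05
W = sdb_weights(diffuse_coherence(freqs, positions))
print(W.shape)   # (257, 4); each row sums to 1, i.e. the distortionless constraint holds
```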

2.1.2 Adaptive Beamforming

In adaptive beamforming, the weights are adjusted according to the acoustic surroundings. The optimal weights are obtained by minimizing the variance of the output signal. To ensure that the desired speech signal is not cancelled or distorted, a distortionless constraint is imposed on the desired signal. The generalized side lobe canceller (GSC) [6] is an efficient implementation of the linearly constrained minimum variance (LCMV) approach, which converts the constrained optimization problem into an unconstrained one and thereby allows an efficient update of the weights. The block diagram of the GSC is shown in Fig. 1, and a simplified sketch of the structure is given after the figure. The GSC consists of three modules: a beamformer (BF), a blocking matrix (BM) and a noise canceller (NC). The BF contains a pre-steering stage whose coefficients are designed to align the desired speech components, producing the speech reference YBF. The blocking matrix is orthogonal to the beamformer; it blocks the desired speech signal, and its outputs are called the noise reference signals. The noise references are formed by taking the differences between adjacent sensor signals. The noise canceller in Fig. 1 removes any residual noise in the speech reference that is correlated with the noise references.

Fig. 1

Implementation of the generalized side lobe canceller in the frequency domain
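The following is a simplified time-domain sketch of the GSC structure in Fig. 1, assuming the microphone signals have already been pre-steered: a delay-and-sum fixed beamformer forms the speech reference, adjacent-channel differences act as the blocking matrix, and an NLMS adaptive filter plays the role of the noise canceller. The step size, filter length and input shapes are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

def gsc(x, mu=0.1, q=16, eps=1e-8):
    """Generalized side lobe canceller sketch for pre-steered sensor signals.

    x  : array of shape (M, N) holding M pre-steered microphone signals
    mu : NLMS step size of the noise canceller
    q  : number of adaptive taps per noise reference
    Returns the enhanced signal of length N.
    """
    M, N = x.shape
    y_bf = x.mean(axis=0)                 # fixed beamformer (BF): speech reference
    u = x[:-1] - x[1:]                    # blocking matrix (BM): adjacent-channel differences
    w = np.zeros((M - 1, q))              # noise canceller (NC) weights
    out = np.zeros(N)
    for n in range(q, N):
        U = u[:, n - q:n][:, ::-1]        # last q samples of each noise reference
        y_nc = np.sum(w * U)              # estimate of the residual correlated noise
        e = y_bf[n] - y_nc                # enhanced sample = speech reference minus noise estimate
        out[n] = e
        w += mu * e * U / (np.sum(U * U) + eps)   # NLMS update of the NC weights
    return out

# Hypothetical usage with 4 channels of pre-steered noisy speech, 1 s at 8 kHz.
x = np.random.randn(4, 8000)
s_hat = gsc(x)
```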

3 Kalman Filter

When the speech signal is corrupted by additive noise, the observed signal y(l) is given as

$$ y(l) = x(l) + v(l) $$
(9)

where x(l) is the clean speech and v(l) is the additive noise.

A qth-order autoregressive (AR) model is used for the speech signal, in which the present sample x(l) is a linear combination of the previous q samples plus a driving noise term:

$$ x(l) = \sum\limits_{i = 1}^{q} {a_{i} x(l - i) + u(l)} $$
(10)

where x(l) is the lth sample of the clean signal, y(l) is the lth sample of the noisy speech and \( a_{i} \) is the ith autoregressive parameter. This can be written in the following state-space form, in which the sequences u(l) and v(l) are uncorrelated Gaussian white noise sequences with means \( {\bar{\text{u}}} \) and \( {\bar{\text{v}}} \) and variances \( \sigma_{u}^{2} \) and \( \sigma_{v}^{2} \), and X(l) is the state vector

$$ X(l) = [s(l - q + 1), \ldots ,s(l),v(l - q + 1), \ldots ,v(l)]^{T} $$
(11)

The Kalman filter updates the state vector estimate through the following recursion:

$$ e(l) = y(l) - H\hat{X}(l/l - 1) $$
(12)
$$ K(l) = Q(l/l - 1)H^{T} \left[ {HQ(l/l - 1)H^{T} } \right]^{ - 1} $$
(13)
$$ \hat{X}(l/l) = \hat{X}(l/l - 1) + K(l)e(l) $$
(14)
$$ Q(l/l) = [I - K(l)H]Q(l/l - 1) $$
(15)
$$ \hat{X}\left( {l + 1/l} \right) = F\left( l \right)\hat{X}\left( {l/l} \right) + G\bar{u} $$
(16)
$$ Q(l + 1/l) = F(l)Q(l/l)F^{T} (l) + GG^{T} \sigma_{u}^{2} $$
(17)

where \( \hat{x}\left( {l/l - 1} \right) \) is the minimum mean square estimate of the state vector X(l) given the past observations y(1), …, y(l − 1).

The predicted state error vector is \( \tilde{x}\left( {l/l - 1} \right) = X\left( l \right) - \hat{x}\left( {l/l - 1} \right) \).

  • \( Q(l/l - 1) = E[\tilde{x}(l/l - 1)\tilde{x}^{T} (l/l - 1)] \) is the predicted state error correlation matrix.

  • \( \hat{x}\left( {l/l} \right) \) is the filtered estimate of the state vector.

  • \( \tilde{x}\left( {l/l} \right) = X\left( l \right) - \hat{x}\left( {l/l} \right) \) is the filtered state error vector.

  • \( Q(l/l) = E[\tilde{x}(l/l)\tilde{x}^{T} (l/l)] \) is the filtered state error correlation matrix.

  • K(l) is the Kalman gain and e(l) is the innovation sequence.

  • The estimated speech signal can be retrieved from the state vector estimate (see the sketch after this list)

    $$ \hat{s}(l) = H\hat{x}\left( {l/l} \right) $$
    (18)
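To make the recursion of Eqs. (12)–(18) concrete, the sketch below applies a Kalman filter to noisy speech using the AR model of Eq. (10). For simplicity it uses the common formulation in which the state holds only the last q speech samples and the observation noise enters through its variance \( \sigma_{v}^{2} \), rather than the augmented state of Eq. (11); the AR coefficients and noise variances are assumed to be known (in practice they are estimated frame by frame).

```python
import numpy as np

def kalman_speech(y, a, sigma_u2, sigma_v2):
    """Kalman filtering of noisy speech following the recursion of Eqs. (12)-(18).

    y        : noisy speech samples, Eq. (9)
    a        : AR coefficients [a_1, ..., a_q] of Eq. (10), assumed known here
    sigma_u2 : variance of the driving noise u(l)
    sigma_v2 : variance of the observation noise v(l)
    """
    a = np.asarray(a, dtype=float)
    q = len(a)
    # State x(l) = [x(l-q+1), ..., x(l)]^T: F shifts the state and appends the AR prediction.
    F = np.vstack([np.hstack([np.zeros((q - 1, 1)), np.eye(q - 1)]), a[::-1][None, :]])
    G = np.zeros((q, 1)); G[-1, 0] = 1.0       # driving noise enters the newest sample
    H = np.zeros((1, q)); H[0, -1] = 1.0       # observation picks out the newest sample
    x = np.zeros((q, 1))                       # predicted state estimate
    Q = np.eye(q)                              # predicted error covariance Q(l/l-1)
    s_hat = np.zeros(len(y))
    for l in range(len(y)):
        e = y[l] - (H @ x).item()                          # innovation, Eq. (12)
        S = (H @ Q @ H.T).item() + sigma_v2                # innovation variance
        K = Q @ H.T / S                                    # Kalman gain, cf. Eq. (13)
        x = x + K * e                                      # filtered state, Eq. (14)
        Q = (np.eye(q) - K @ H) @ Q                        # filtered covariance, Eq. (15)
        s_hat[l] = (H @ x).item()                          # enhanced sample, Eq. (18)
        x = F @ x                                          # time update, Eq. (16) with zero-mean u
        Q = F @ Q @ F.T + G @ G.T * sigma_u2               # covariance prediction, Eq. (17)
    return s_hat

# Hypothetical usage with AR parameters estimated elsewhere (e.g. per 20 ms frame):
# s_hat = kalman_speech(y_frame, a_coeffs, sigma_u2, sigma_v2)
```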

4 Experimental Results

The efficiency of the proposed BEAM-KAL system is evaluated on speech corrupted with car interior noise at input SNRs of −6 and 3 dB [7]; the speech material is taken from the NOIZEUS database [7]. Perceptual evaluation of speech quality (PESQ) scores [8] for the proposed BEAM-KAL [9] are consistently better than those of the individual filters, even at the lower input SNR of −6 dB, as shown in Fig. 2. Time-domain plots and spectrograms of the noisy, beamformed, Kalman-filtered and BEAM-KAL enhanced speech signals are shown in Figs. 3 and 4, where the circles highlight the noise that has been removed. From these, it is clear that the BEAM-KAL combination is superior to the individual methods in terms of noise reduction. Speech enhancement must also satisfy the listener by improving speech quality, and hence subjective tests, such as informal A–B testing, are used to evaluate the performance. In A–B testing, a group of listeners is presented with a number of pairs of speech files (labelled A and B) and decides which of the two is better in each case. The obtained results show that the proposed system removes in-car noise effectively, with good speech quality and intelligibility even at low input SNR, and performs better than the beamformer and Kalman filter individually (Table 1).
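As an indication of how the PESQ scores in Fig. 2 could be reproduced, the snippet below relies on the third-party `pesq` Python package (an implementation of ITU-T P.862); the file names are hypothetical, and since the NOIZEUS recordings are sampled at 8 kHz the narrow-band mode is used.

```python
# Assumes `pip install pesq` and locally available wav files (names are placeholders).
from scipy.io import wavfile
from pesq import pesq

fs, clean = wavfile.read('sp01.wav')              # clean reference (hypothetical path)
_, enhanced = wavfile.read('sp01_beamkal.wav')    # BEAM-KAL output (hypothetical path)
score = pesq(fs, clean, enhanced, 'nb')           # narrow-band PESQ for 8 kHz material
print('PESQ (MOS-LQO):', score)
```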

Fig. 2

PESQ scores for the proposed filter and the individual filters

Fig. 3

Comparison of time-domain plots at 3 dB. a Noisy signal (blue). b Beamformer-enhanced signal (yellow). c Kalman-filter-enhanced signal (green). d Enhanced signal of the proposed BEAM-KAL (red); the black circle marks the noise that was removed

Fig. 4

Spectrograms of the speech sample sp01 from the NOIZEUS database corrupted with car noise. a Noisy speech at −6 dB SNR. b Beamformer-enhanced signal at −6 dB. c Kalman filter output (enhanced signal at −6 dB). d Output of the proposed BEAM-KAL at −6 dB input SNR

Table 1 Subjective A–B test

5 Conclusion

In this paper, a cascaded scheme, BEAM-KAL, based on the combination of a generalized side lobe canceller (GSC) beamformer and a Kalman filter was proposed for the enhancement of speech signals corrupted with car noise. Simulations were conducted, and the results of the proposed scheme were compared with those of the beamformer and the Kalman filter individually at various input SNRs. The proposed method is shown to outperform both beamforming and Kalman filtering. An objective speech quality measure, spectrogram analysis and subjective listening tests all show that the proposed method reduces noise and improves speech quality. Hence, this scheme provides a promising solution for real-time speech enhancement in noisy car environments.