Abstract
In-car communication systems are seriously degraded by noise such as engine sound and ambient noise, which lowers the quality of speech. In modern cars, considerable effort goes into reducing this background noise. In this paper, a cascaded speech enhancement scheme named BEAM-KAL is developed to improve the intelligibility and quality of speech. Multichannel beamforming is combined with a single-channel Kalman filter to enhance speech signals corrupted by in-car noise. In beamforming, a microphone array extracts the speech signal of interest from a desired direction, while noise arriving from other directions is attenuated. However, this technique does not appear to provide enough improvement by itself, so a Kalman filter is used for further enhancement. Experiments are performed with real recordings taken while driving in a noisy automobile environment. Performance is evaluated using SNR, PESQ and spectrograms, and the scheme is shown to improve the quality of speech.
1 Introduction
Nowadays, speech-based applications such as automatic speech recognition (ASR) systems are becoming very important. However, the performance of in-car ASR systems is severely degraded by background noise and other disturbances [1]. An essential step is therefore to develop effective speech enhancement methods that suppress the background noise and improve the intelligibility of speech captured under noisy in-car conditions. In recent years, car manufacturers have paid much attention to devices activated by human voice commands. However, the performance of currently available commercial products degrades substantially under real-world conditions. The noises originating from pumps, engines, audio equipment, wind, road, air-conditioning, radio and communication equipment are usually non-stationary and time-varying [1, 2]. Several speech enhancement techniques dealing with in-car noise have therefore been developed in recent years to recover the original speech from the noisy speech. The present scheme involves two main algorithms: beamforming exploits the time correlation of the speech signals captured by the microphones, and the Kalman filter uses the different statistics of the speech and noise signals to separate them. The purpose of combining beamforming and the Kalman filter is to exploit the strengths of each method and so improve the quality of speech. When the two methods are combined, a strongly interfering source signal is first separated by beamforming, and the remaining noise is then removed by an adaptive Kalman filter [3]. Thus, the two methods perform better jointly than independently.
2 Microphone Array-Based Speech Enhancement System
The in-car acoustic environment contains various sources of disturbance besides the speaker. To separate the speaker's speech from these disturbances, a multichannel speech enhancement system based on a microphone array is used. By processing the microphone array signals appropriately, direction-dependent sensitivity can be achieved. This technique is called beamforming.
2.1 Beamforming
Beamforming isolates a source from a specific direction while retaining some of the directionality of the received signal. The beamformer output can be represented as a linear combination of the sensor outputs,
$$ \sum\limits_{i = 1}^{M} w_{i}^{*} \left( l \right)X_{i} \left( l \right) = w^{*} \left( l \right)X\left( l \right) $$
where \( w_{i} \left( l \right) \) is the weight associated with the ith sensor, and M is the number of sensors. The data are described by the vector
$$ X\left( l \right) = \left[ {X_{1} \left( l \right), \ldots ,X_{M} \left( l \right)} \right]^{T} $$
Beamforming can be classified as conventional (fixed) or adaptive: in conventional beamforming the weights are fixed across time, whereas in adaptive beamforming they vary according to the acoustic surroundings.
2.1.1 Fixed Beamforming
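In conventional beamforming [4], the weights \( w_{i} \left( l \right) \) are fixed. They are determined by minimizing the signal power at the beamformer output, subject to a constraint ensuring that the desired signal is undistorted [4], i.e., the optimal weights are the solution to
$$ \mathop {\min }\limits_{w\left( l \right)} \; w^{*} \left( l \right)\Psi_{yy} \left( l \right)w\left( l \right)\quad {\text{subject to}}\quad w^{*} \left( l \right){\mathbf{1}} = 1 $$
where * represents the complex conjugate transpose and Ψyy(l) is the M × M power spectral density (PSD) matrix of the noisy speech signal, whose (i, j)th entry is \( E\left[ {X_{i} (l)X_{j}^{*} (l)} \right] \). The constraint of zero distortion in the desired direction is expressed by a vector of ones, \( {\mathbf{1}} \), because the array is assumed to be pre-steered [5] towards the desired signal direction. The solution of this constrained optimization is the minimum variance distortionless response (MVDR) beamformer [6]
$$ w\left( l \right) = \frac{{\Psi_{ww}^{ - 1} \left( l \right){\mathbf{1}}}}{{{\mathbf{1}}^{T} \Psi_{ww}^{ - 1} \left( l \right){\mathbf{1}}}} $$
where Ψww(l) is the M × M noise PSD matrix whose (i, j)th entry is \( E\left[ {W_{i} (l)W_{j}^{*} (l)} \right] \). Assuming a homogeneous noise field, the solution can be written in terms of the coherence matrix
$$ w\left( l \right) = \frac{{\Gamma_{ww}^{ - 1} \left( l \right){\mathbf{1}}}}{{{\mathbf{1}}^{T} \Gamma_{ww}^{ - 1} \left( l \right){\mathbf{1}}}} $$
The (i, j)th entry of the M × M coherence matrix is
$$ \Gamma_{{w_{i} w_{j} }} \left( l \right) = \frac{{\Psi_{{w_{i} w_{j} }} \left( l \right)}}{{\sqrt {\Psi_{{w_{i} w_{i} }} \left( l \right)\Psi_{{w_{j} w_{j} }} \left( l \right)} }} $$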
In the above equation, \( \Psi _{{w_{i} w_{j} }} \left( l \right) \) is the cross-spectral density of the noise signals at the ith and jth sensors. By the assumption of a homogeneous noise field, \( \Psi _{{w_{i} w_{i} }} \left( l \right) =\Psi _{ww} \left( l \right) \) for all i. For incoherent noise fields, \( \Gamma _{ww} = {\text{I}} \), so \( w = \frac{1}{M}{\mathbf{1}} \) and the MVDR beamformer reduces to a delay-and-sum beamformer, in which the sensor output signals are first delayed and then averaged. The delay corresponds to the pre-steering, so the speech components at the various sensors add constructively while the noise components partially cancel. In a delay-and-sum beamformer, the amplitude weights are fixed and the phase weights introduce the delay. In a filter-and-sum beamformer (FSB), both the amplitude and phase weights vary. FSBs are used to design beamformers with a specific directivity pattern for microphone arrays. Most noises fall into the category of diffuse noise fields, for which the coherence between the ith and jth sensors is a sinc function of the frequency index l, the frame length k, the distance Dij between the ith and jth sensors in the array, and the velocity of sound in air, c = 339 m/s, with \( \sin c\left( z \right) = \sin \left( z \right)/z. \)
When this diffuse-field expression is used in the coherence matrix, the resulting beamformer is known as a superdirective beamformer (SDB). Although the SDB is suited to diffuse noise fields, it has the disadvantage of amplifying uncorrelated noise at low frequencies. This can be overcome by incorporating a white noise gain constraint in the design.
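The relationship between the delay-and-sum, superdirective and white-noise-gain-limited designs can be illustrated with a minimal NumPy sketch. It assumes a pre-steered uniform linear array, writes the diffuse-field coherence in terms of physical frequency as sin(z)/z with z = 2πf·Dij/c, and uses diagonal loading as one common way of limiting white noise amplification. The function names, the 5 cm spacing, the 500 Hz test frequency and the loading value are illustrative assumptions, not values from the paper.

```python
import numpy as np

def diffuse_coherence(positions, freq_hz, c=339.0):
    """Coherence matrix of a diffuse noise field: sin(z)/z with z = 2*pi*f*D_ij/c."""
    d = np.abs(positions[:, None] - positions[None, :])  # pairwise distances D_ij (m)
    # np.sinc(t) computes sin(pi*t)/(pi*t), so t = 2*f*D/c gives sin(z)/z
    return np.sinc(2.0 * freq_hz * d / c)

def mvdr_weights(gamma, loading=0.0):
    """w = Gamma^{-1} 1 / (1^T Gamma^{-1} 1) for a pre-steered array."""
    m = gamma.shape[0]
    ones = np.ones(m)
    # Diagonal loading (assumed here) limits the amplification of uncorrelated noise
    g_inv_ones = np.linalg.solve(gamma + loading * np.eye(m), ones)
    return g_inv_ones / (ones @ g_inv_ones)

def white_noise_gain(w):
    """For distortionless weights (w^T 1 = 1), the white noise gain is 1 / ||w||^2."""
    return 1.0 / float(np.real(np.vdot(w, w)))

positions = np.array([0.0, 0.05, 0.10, 0.15])   # assumed 4-microphone linear array
w_das = mvdr_weights(np.eye(4))                 # incoherent field: delay-and-sum
gamma = diffuse_coherence(positions, 500.0)     # diffuse field at an assumed 500 Hz
w_sdb = mvdr_weights(gamma)                     # superdirective weights
w_sdb_loaded = mvdr_weights(gamma, loading=0.1) # with a white-noise-gain safeguard

print("delay-and-sum weights:", w_das)
print("WNG (SDB, loaded SDB):", white_noise_gain(w_sdb), white_noise_gain(w_sdb_loaded))
```

With an identity coherence matrix the weights reduce to 1/M, matching the delay-and-sum case described above; at low frequencies the unloaded superdirective weights have a much lower white noise gain than the loaded ones.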
2.1.2 Adaptive Beamforming
In adaptive beamforming, the weights are changed according to the acoustic surroundings. The optimal weights are obtained by minimizing the variance of the output signal. To ensure that the desired speech signal is not cancelled or distorted, a distortionless constraint is imposed on the desired signal. The generalized side lobe canceller (GSC) [6] is an efficient implementation of the linearly constrained minimum variance (LCMV) procedure, which converts the constrained optimization problem into an unconstrained one. The unconstrained form is better suited to updating the weights adaptively. The block diagram of the GSC is shown in Fig. 1. The GSC structure contains three modules: a beamformer (BF), a blocking matrix (BM) and a noise canceller (NC). The BF has a pre-steering module whose coefficients are designed to align the desired speech components YBF. The blocking matrix is orthogonal to the beamformer; it blocks the desired speech signal, and its outputs are called the noise reference signals. The noise references are formed by taking the differences between adjacent sensor signals. The noise canceller in Fig. 1 removes any residual noise in the speech reference that is correlated with the noise references.
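The three GSC modules can be sketched in a few lines of NumPy. This is a simplified time-domain illustration under stated assumptions, not the authors' implementation: the channels are assumed to be already pre-steered (time-aligned), the fixed beamformer is a plain average, the blocking matrix takes adjacent differences as described above, and the noise canceller is a normalized LMS (NLMS) filter, which is one common choice. The function name `gsc`, the step size `mu`, the filter length `taps` and the synthetic test signals are all hypothetical.

```python
import numpy as np

def gsc(x, mu=0.1, taps=8, eps=1e-8):
    """Generalized side lobe canceller sketch.

    x: array of shape (M, N) holding M pre-steered microphone signals.
    Returns the enhanced signal of length N.
    """
    m, n = x.shape
    y_bf = x.mean(axis=0)        # fixed beamformer: average of aligned channels
    u = x[:-1] - x[1:]           # blocking matrix: adjacent differences (M-1 noise references)
    w = np.zeros((m - 1, taps))  # adaptive noise-canceller coefficients
    out = np.zeros(n)
    for t in range(n):
        # Most recent `taps` samples of each noise reference, newest first
        lo = max(0, t - taps + 1)
        seg = u[:, lo:t + 1][:, ::-1]
        frame = np.zeros((m - 1, taps))
        frame[:, :seg.shape[1]] = seg
        e = y_bf[t] - np.sum(w * frame)                      # speech reference minus noise estimate
        w += mu * e * frame / (np.sum(frame * frame) + eps)  # NLMS update
        out[t] = e
    return out

# Synthetic check: identical speech on every channel, one noise source with different gains
rng = np.random.default_rng(0)
n = 4000
s = np.sin(2 * np.pi * 0.01 * np.arange(n))
noise = rng.standard_normal(n)
gains = np.array([1.0, 0.5, 0.3])
x = s[None, :] + gains[:, None] * noise[None, :]

enhanced = gsc(x)
half = n // 2
err_bf = np.mean((x.mean(axis=0)[half:] - s[half:]) ** 2)
err_gsc = np.mean((enhanced[half:] - s[half:]) ** 2)
print("residual error power, BF only vs GSC:", err_bf, err_gsc)
```

Because the speech is identical on all aligned channels, the adjacent differences contain only noise, so the canceller can subtract the correlated noise from the beamformer output without cancelling the speech.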
3 Kalman Filter
When a speech signal is corrupted with additive noise, the observed signal y(l) is given as
$$ y\left( l \right) = x\left( l \right) + v\left( l \right) $$
where x(l) is the clean speech and v(l) is the noise.
A qth-order autoregressive (AR) predictor is used to model the speech signal,
$$ x\left( l \right) = \sum\limits_{i = 1}^{q} a_{i} \left( l \right)x\left( {l - i} \right) + u\left( l \right) $$
so that the present sample x(l) is a linear combination of the previous q samples plus a noise term u(l).
Here x(l) is the lth sample of the clean signal, y(l) is the lth sample of the noisy speech, and \( a_{i} \left( l \right) \) is the ith autoregressive process parameter. This model can be expressed in state-space form, in which the sequences u(l) and v(l) are uncorrelated Gaussian white noise sequences with means \( {\bar{\text{u}}} \) and \( {\bar{\text{v}}} \) and variances \( \sigma_{u}^{2} \) and \( \sigma_{v}^{2} \), and x(l) is the q × 1 state vector.
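As one way to make the state-space form concrete, a common companion-form realization of the AR model is sketched below. The matrix names F and G and the ordering of the state vector are assumptions made for illustration; H denotes the observation vector that also appears in Eq. (18).

$$ {\mathbf{x}}\left( l \right) = F\,{\mathbf{x}}\left( {l - 1} \right) + G\,u\left( l \right),\qquad y\left( l \right) = H\,{\mathbf{x}}\left( l \right) + v\left( l \right) $$
$$ {\mathbf{x}}\left( l \right) = \left[ {x\left( {l - q + 1} \right), \ldots ,x\left( l \right)} \right]^{T} ,\quad F = \left[ {\begin{array}{*{20}c} 0 & 1 & \cdots & 0 \\ \vdots & {} & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ {a_{q} \left( l \right)} & {a_{q - 1} \left( l \right)} & \cdots & {a_{1} \left( l \right)} \\ \end{array} } \right],\quad G = H^{T} = \left[ {0, \ldots ,0,1} \right]^{T} $$

With this choice, the last row of F reproduces the AR recursion, and H picks out the current sample x(l) from the state vector.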
The Kalman filter provides recursive equations for updating the state vector estimate, where
- \( \hat{x}\left( {l/l - 1} \right) \) is the minimum mean square estimate of the state vector x(l) given the past l − 1 observations y(1), …, y(l − 1).
- \( \tilde{x}\left( {l/l - 1} \right) = x\left( l \right){-}\hat{x}\left( {l/l - 1} \right) \) is the predicted state error vector.
- \( Q(l/l - 1) = E[\tilde{x}(l/l - 1)\tilde{x}^{T} (l/l - 1)] \) is the predicted state error correlation matrix.
- \( \hat{x}\left( {l/l} \right) \) is the filtered estimate of the state vector.
- \( \tilde{x}\left( {l/l} \right) = x\left( l \right){-}\hat{x}\left( {l/l} \right) \) is the filtered state error vector.
- \( Q(l/l) = E[\tilde{x}(l/l)\tilde{x}^{T} (l/l)] \) is the filtered state error correlation matrix.
- K(l) is the Kalman gain and e(l) is the innovation sequence.
The estimated signal can be retrieved from the state vector estimate as
$$ \hat{s}(l) = H\hat{x}\left( {l/l} \right) $$(18)
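A minimal NumPy sketch of a Kalman recursion for an AR speech model is given below, using the quantities named in this section (predicted and filtered estimates, error correlation matrix Q, gain K(l), innovation e(l), observation vector H). It is a textbook-style recursion under simplifying assumptions: the AR coefficients and the variances are taken as known and constant, both noise sequences are zero-mean, and the state is in companion form. The function name `kalman_ar_enhance`, the matrices F and G, and the synthetic AR(2) test signal are hypothetical; in practice the AR parameters would have to be estimated from the noisy speech.

```python
import numpy as np

def kalman_ar_enhance(y, a, var_u, var_v):
    """Kalman filtering of noisy samples y with an AR(q) model, a = [a_1, ..., a_q]."""
    a = np.asarray(a, dtype=float)
    q = len(a)
    F = np.zeros((q, q))
    F[:-1, 1:] = np.eye(q - 1)   # shift older samples up the state vector
    F[-1, :] = a[::-1]           # last row: [a_q, ..., a_1]
    G = np.zeros(q); G[-1] = 1.0 # process noise enters the newest sample
    H = np.zeros(q); H[-1] = 1.0 # observation picks the newest sample
    x_hat = np.zeros(q)          # filtered state estimate x^(l/l)
    Q = np.eye(q)                # filtered error correlation Q(l/l)
    s_hat = np.zeros(len(y))
    for l, y_l in enumerate(y):
        x_pred = F @ x_hat                              # predicted state x^(l/l-1)
        Q_pred = F @ Q @ F.T + var_u * np.outer(G, G)   # predicted error correlation Q(l/l-1)
        e = y_l - H @ x_pred                            # innovation e(l)
        K = Q_pred @ H / (H @ Q_pred @ H + var_v)       # Kalman gain K(l)
        x_hat = x_pred + K * e                          # filtered state x^(l/l)
        Q = (np.eye(q) - np.outer(K, H)) @ Q_pred       # filtered error correlation Q(l/l)
        s_hat[l] = H @ x_hat                            # estimated sample, as in Eq. (18)
    return s_hat

# Synthetic check with an assumed stable AR(2) process observed in white noise
rng = np.random.default_rng(1)
a = np.array([1.6, -0.8])
var_u, var_v = 1.0, 4.0
n = 2000
x = np.zeros(n)
u = rng.standard_normal(n) * np.sqrt(var_u)
for l in range(2, n):
    x[l] = a[0] * x[l - 1] + a[1] * x[l - 2] + u[l]
y = x + rng.standard_normal(n) * np.sqrt(var_v)

s_hat = kalman_ar_enhance(y, a, var_u, var_v)
mse_noisy = np.mean((y - x) ** 2)
mse_kalman = np.mean((s_hat - x) ** 2)
print("MSE before and after filtering:", mse_noisy, mse_kalman)
```

Each pass through the loop performs one prediction step and one correction step, and the last line of the loop applies the read-out of Eq. (18).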
4 Experimental Results
The efficiency of the proposed BEAM-KAL system is tested on speech corrupted with car interior noise at input SNRs of −6 and 3 dB, using material taken from the NOIZEUS database [7]. Perceptual evaluation of speech quality (PESQ) scores [8] for the proposed BEAM-KAL [9] are consistently higher than those of the individual filters, including at the lower input SNR of −6 dB, as shown in Fig. 2. The time-domain waveforms and spectrograms of the noisy, beamformed, Kalman-filtered and BEAM-KAL enhanced speech signals at 3 dB are shown in Figs. 3 and 4, where the circles mark the regions of noise removal. From these, it is clear that the BEAM-KAL combination gives the greatest noise reduction of the methods compared. Speech enhancement must ultimately satisfy the listener by improving speech quality, so subjective tests, such as informal A–B testing, are also used to evaluate performance. In A–B testing, a group of listeners hears a number of pairs of speech files (labelled A and B) and decides which is better in each pair. The results show that the proposed system removes in-car noise effectively while maintaining speech quality and intelligibility, even at low input SNR, and performs better than beamforming or the Kalman filter alone (Table 1).
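The input SNR conditions used above can be reproduced with a short NumPy sketch that scales a noise recording to a target SNR and measures the SNR of a processed signal against the clean reference. This is a generic illustration, assuming the global (whole-signal) SNR definition; the function names and the synthetic signals are hypothetical, and PESQ is not computed here because it requires an ITU-T P.862 implementation.

```python
import numpy as np

def snr_db(clean, processed):
    """Global SNR in dB, treating (processed - clean) as the noise."""
    noise = processed - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def mix_at_snr(clean, noise, target_db):
    """Scale `noise` so that clean + scaled noise has the requested input SNR."""
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10.0 ** (target_db / 10.0)))
    return clean + gain * noise

# Synthetic stand-ins for a clean utterance and a car-noise recording
rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 0.02 * np.arange(8000))
car_noise = rng.standard_normal(8000)

noisy_low = mix_at_snr(clean, car_noise, -6.0)
noisy_high = mix_at_snr(clean, car_noise, 3.0)
print("input SNRs (dB):", snr_db(clean, noisy_low), snr_db(clean, noisy_high))
```

The same `snr_db` function can be applied to the output of each enhancement stage to obtain the output SNR for comparison.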
5 Conclusion
In this paper, a cascaded scheme, BEAM-KAL, based on a combination of a generalized side lobe canceller (GSC) beamformer and a Kalman filter, was proposed for the enhancement of speech signals corrupted with car noise. Experiments were conducted, and the results of the proposed scheme were compared with those of the beamformer and the Kalman filter individually at various input SNRs. The proposed method is shown to outperform both beamforming and the Kalman filter used alone. An objective speech quality measure, spectrogram analysis and informal subjective listening tests all showed that the proposed method reduces noise and thereby improves speech quality. Hence, this scheme provides a promising solution for real-time speech enhancement in noisy car environments.
References
1. Abut H, Hansen JHL, Takeda K (2005) DSP for in-vehicle and mobile systems. Springer
2. Poulat LD (2004) Robust speech recognition techniques evaluation for telephony server based in-car applications. In: ICASSP 2004, I-65–I-68
3. Paliwal K, Basu A (1987) A speech enhancement method based on Kalman filtering. In: Proceedings of IEEE international conference on acoustics speech and signal processing (ICASSP)
4. Van Veen BD, Buckley KM (1988) Beamforming: a versatile approach to spatial filtering. IEEE Sig Process Mag 5:4–24
5. Bitzer J, Simmer KU (2001) Superdirective microphone arrays. In: Brandstein MS, Ward DB (eds) Microphone arrays: signal processing techniques and applications. Springer, Berlin, pp 19–38 (Chapter 2)
6. Breed BR, Strauss J (2002) A short proof of the equivalence of LCMV and GSC beamforming. IEEE Sig Process Lett 9(6):168–169
7. Noizeus: a noisy speech corpus for evaluation of speech enhancement algorithms, http://www.utdallas.edu/~loizou/speech/noizeus
8. ITU-T P.862 (2000) Perceptual evaluation of speech quality (PESQ) and objective method for end-to-end speech quality assessment of narrow band telephone networks and speech codecs. ITU-T Recommendation, p 862
9. Ramesh Babu G, Rao R (2012) Combination of beamforming and Kalman filter techniques for speech enhancement. Int J Comput Sci Commun Netw 3(1):338–343
© 2021 Springer Nature Singapore Pte Ltd.
Ramesh Babu, G., Sridhar, G.V. (2021). Speech Enhancement Using Beamforming and Kalman Filter for In-Car Noisy Environment. In: Chowdary, P., Chakravarthy, V., Anguera, J., Satapathy, S., Bhateja, V. (eds) Microelectronics, Electromagnetics and Telecommunications. Lecture Notes in Electrical Engineering, vol 655. Springer, Singapore. https://doi.org/10.1007/978-981-15-3828-5_57
DOI: https://doi.org/10.1007/978-981-15-3828-5_57
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3827-8
Online ISBN: 978-981-15-3828-5
eBook Packages: Engineering, Engineering (R0)