
1 Introduction

The demand for speech enhancement systems continues to grow. Enhancing speech in noisy, reverberant conditions is a hard and challenging problem for the listener. When speech is captured with a distant microphone, the signal is corrupted by both noise and reverberation [1]. A room impulse response includes components at long delays, which give rise to reverberation and echoes. Reverberation is a convolutive distortion that induces long-term correlation between successive observations and can persist over a long reverberation time [2]. Noise and reverberation may be stationary or non-stationary, and both adversely affect speech quality and speech intelligibility [2]. Various speech enhancement techniques have been introduced.

Kalman filtering (KF) is one of them and is a reliable speech enhancement algorithm that exploits the minimum mean square error criterion [3]. However, most conventional KF-based speech enhancement techniques require access to clean speech and additive noise information to estimate the state-space model parameters, in particular the linear prediction coefficients and the additive noise variance, which is unrealistic in practice when only the noisy speech is available [4, 5]. The authors in [6] proposed a noise reduction scheme whose core is Kalman filtering, with the initial value for the KF determined by adaptive spectral subtraction (ASS). To obtain higher accuracy, the following procedure is used. First, the power spectrum of the clean speech is estimated from the noisy spectrum by the KF algorithm. The obtained power spectrum is then substituted as the initial value, and the Kalman filter calculation is repeated. This iteration yields a greater reduction in noise and can be run in 1.5–2.0 times real time. The noisy speech signal is taken as input, and a fast Fourier transform (FFT) is applied to obtain its power spectrum; using ASS, the clean power spectrum is estimated by subtracting the noise signal power from the mixed-signal spectrum [7].
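As an illustration, the spectral-subtraction step described above can be sketched as follows. This is a minimal sketch, not the implementation of [6, 7]: it assumes the noise power spectrum is estimated from a separate noise-only signal and that frames are non-overlapping.

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=256):
    """Power spectral subtraction: subtract an estimated noise power
    spectrum from the noisy-speech power spectrum, frame by frame."""
    # Average noise power spectrum over frames of a noise-only signal
    # (assumption: such a signal, or noise-only frames, is available).
    k = len(noise_only) // frame_len
    noise_frames = noise_only[:k * frame_len].reshape(k, frame_len)
    noise_psd = np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)) ** 2, axis=0)

    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        frame = noisy[start:start + frame_len]
        spec = np.fft.rfft(frame)
        power = np.abs(spec) ** 2
        # Subtract the noise power; floor at zero to avoid negative power.
        clean_power = np.maximum(power - noise_psd, 0.0)
        # Recombine the estimated magnitude with the noisy phase.
        clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
        out[start:start + frame_len] = np.fft.irfft(clean_spec, n=frame_len)
    return out
```

The zero flooring is the step that introduces the "musical noise" discussed in Sect. 2, since isolated spectral bins survive the subtraction at random.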

2 Related Works

In [8], a noisy speech signal is given as input, assumed stationary within each frame, and processed using three algorithms: spectral subtraction, Wiener filtering and Kalman filtering. The work suggests that spectral subtraction can be used only for stationary signals, whereas real-world signals are non-stationary; the Wiener filter is likewise suited to stationary signals but fails to suppress musical noise. To overcome these limitations, the paper recommends Kalman filtering. The UKF algorithm itself was first proposed in [9, 10]. The work in [11] notes that most approaches assume stationary additive white Gaussian noise, but the colored-noise assumption is believed to be more useful for speech denoising and speech dereverberation. The Kalman filter, because of its flexibility, is widely used for signal enhancement, but it incurs considerable numerical complexity when dealing with colored noise. Moreover, Kalman filtering is a model-based adaptive method in which both speech and noise are modeled as auto-regressive (AR) processes, so a major issue is the estimation of the AR parameters in the presence of noise. The traditional algorithm uses the EM technique to iteratively estimate the AR parameters, but its computational complexity is high. The method used in our work instead builds on spectral subtraction to estimate the AR parameters of the clean signal and the corresponding noise [12]; it is computationally efficient and easily implemented. The state-space model and Kalman filter equations were formulated, and the obtained results were compared with the Wiener filter (WF) method [13, 14].

The authors in [15] consider computer-based algorithms generally used for controlling and monitoring systems in which human, digital and analog interactions occur. The cyber-physical systems (CPS) scheme is used in many areas owing to its availability and connectivity and the large amount of storage and computing resources it offers; its limitation, however, is its large energy consumption. As in [16], when the spectral subtraction method is applied to parameter estimation, musical noise appears in the enhanced speech. To obtain Kalman filter output with better audible quality, a perceptual post-filter is placed at the output of the Kalman filter to reduce the musical noise level. The perceptual filter minimizes signal distortion while constraining the noise spectrum.

3 Methodology

3.1 Flow Process

In the time domain, the distorted speech, d_k(t), is given by d_k(t) = C_k(t) * r_k(t) + n_k(t), where C_k(t) is the clean speech component, r_k(t) is the reverberant speech component, and n_k(t) is the noise [2]. The time frame index is denoted by k. The algorithm treats each time frame on its own; k appears explicitly only in equations that involve multiple time frames [2]. Figure 1 shows the flow process of the algorithm.

Fig. 1 Flow process

The clean speech downloaded from the database is processed and reverberated using the chosen reverb parameters and convolution. The output of the first block in Fig. 1 is the reverberated speech with a given delay, whose magnitude varies with the chosen reverberation coefficient [17]. The reverberation is performed as in Eq. 1,

$$O\left( n \right) = I\left( n \right) + aO\left( {n - d} \right)$$
(1)

where I(n) is the input audio signal, O(n) is the output (echoed) audio signal, d is the echo delay (in samples), and a is the coefficient that governs the amount of echo fed back. The reverberated signal is then corrupted with a certain amount of additive white Gaussian noise, as shown in Fig. 1, giving the corrupted speech signal that must be denoised and dereverberated.
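The feedback recursion of Eq. 1 can be sketched directly in code. This is a minimal sketch assuming numpy; the coefficient and delay values used below are illustrative, not those used in the experiments.

```python
import numpy as np

def add_reverb(x, a, d):
    """Apply the feedback echo of Eq. 1: O(n) = I(n) + a * O(n - d),
    where a is the feedback coefficient and d the echo delay in samples."""
    y = x.astype(float).copy()
    for n in range(d, len(y)):
        # Each output sample feeds back a scaled, delayed copy of the output.
        y[n] += a * y[n - d]
    return y
```

Because the output is fed back, a single impulse produces a train of echoes at multiples of d whose amplitudes decay geometrically by a.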

The corrupted speech is then split into k smaller time frames of a fixed length, called the state spaces. For this conversion, we use three different windows: the rectangular window, the Hamming window and the Gaussian window [12]. The proposed algorithm treats each time frame, or state space, on its own. First, as in the third block of Fig. 1, each frame undergoes the unscented transform, in which the sigma points of the first state space are calculated, followed by the statistical mean and covariance of the present state. Then, the two main steps of the algorithm, the time update and the measurement update, are performed for the first state space. Being a recursive algorithm, the same is applied to all k state spaces, i.e., the sets of time update and measurement update equations given in Sect. 3.2 are implemented. The detailed equations of the algorithm are also given in Sect. 3.2.
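The framing step above can be sketched as follows. This is an illustrative sketch assuming numpy and non-overlapping frames; the frame length and the Gaussian window width are assumed values, since the section does not state them.

```python
import numpy as np

def frame_signal(x, frame_len=256, window="hamming"):
    """Chop a signal into k non-overlapping frames (the state spaces)
    and apply one of the three windows used in this work."""
    if window == "rectangular":
        w = np.ones(frame_len)
    elif window == "hamming":
        w = np.hamming(frame_len)
    elif window == "gaussian":
        # The width sigma is an assumed value; the text does not specify it.
        n = np.arange(frame_len)
        sigma = 0.4 * (frame_len - 1) / 2.0
        w = np.exp(-0.5 * ((n - (frame_len - 1) / 2.0) / sigma) ** 2)
    else:
        raise ValueError("unknown window: " + window)
    k = len(x) // frame_len
    frames = x[:k * frame_len].reshape(k, frame_len)
    return frames * w  # each row is one windowed state-space frame
```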

3.2 Unscented Kalman Filtering

3.2.1 Unscented Transform

The unscented transform (UT) is a method for estimating the mean and covariance of a random variable (RV) that undergoes a nonlinear transformation [3, 18]. Consider the propagation of an RV x through a function y = f(x), and let \(\overline{x}\) be the mean and \(P_{{\varvec{X}}}\) the covariance of x.

Figure 2 illustrates the unscented transform step of Fig. 1. Given the mean \(\overline{x}\) and the covariance Px of the random variable x, the sigma points are calculated and propagated through the nonlinear function; the weighted sample mean and weighted sample covariance are then computed for further processing [19].

Fig. 2 Diagram of UT

To estimate the mean and covariance of y, we form a matrix of 2L + 1 sigma vectors \(X_{i}\) according to Eqs. 2–4, as shown in Fig. 2.

$$X_{0} = \overline{x}$$
(2)
$$X_{i} = \overline{x} + \left( {\sqrt {\left( {L + {\uplambda }} \right)P_{{\varvec{X}}} } } \right)_{i} ,\quad i = 1, \ldots ,L$$
(3)
$$X_{i} = \overline{x} - \left( {\sqrt {\left( {L + {\uplambda }} \right)P_{{\varvec{X}}} } } \right)_{i} ,\quad i = L + 1, \ldots ,2L$$
(4)

where \({\uplambda } = \alpha^{2} \left( {L + \kappa } \right) - L\). Here \(\alpha\) governs the spread of the sigma points around \(\overline{x}\) and is generally set to a small positive value (e.g., \(10^{-4} \le \alpha \le 1\)), \(\kappa\) is a secondary scaling constant generally set to 0 or 3 - L, and \(\beta\) is used to incorporate prior knowledge of the distribution of x (for a Gaussian distribution, \(\beta = 2\) is optimal) [20]. \(\left( {\sqrt {\left( {L + {\uplambda }} \right)P_{{\varvec{X}}} } } \right)_{i}\) is the ith column of the matrix square root. These sigma vectors are propagated through the nonlinear function as in Eq. 5,

$$y_{i} = f\left( {X_{i} } \right),\quad i = 0,1,2, \ldots ,2L$$
(5)

and, using Eqs. 5–10, the weighted sample mean and covariance of the posterior sigma points are used to approximate the mean and covariance of y [21],

$$\overline{\user2{y}} \approx \mathop \sum \limits_{i = 0}^{2L} W_{i}^{\left( m \right)} y_{i}$$
(6)
$$P_{y} = \mathop \sum \limits_{i = 0}^{2L} W_{i}^{\left( c \right)} \left\{ {y_{i} - \overline{\user2{y}}} \right\}\left\{ {y_{i} - \overline{\user2{y}}} \right\}^{T}$$
(7)

with weights \(W_{i}\) given by

$$W_{0}^{\left( m \right)} = {\uplambda }/\left( {{\text{L}} + {\uplambda }} \right)$$
(8)
$$W_{0}^{\left( c \right)} = {\uplambda }/\left( {{\text{L}} + {\uplambda }} \right) + \left( {1 - \alpha^{2} + \beta } \right)$$
(9)
$$W_{i}^{\left( m \right)} = W_{i}^{\left( c \right)} = 1/\left\{ {2\left( {{\text{L}} + {\uplambda }} \right)} \right\}$$
(10)

A diagram representing the steps of the unscented transform is depicted in Fig. 2. Note that this differs substantially from Monte Carlo sampling methods, which require orders of magnitude more sample points to propagate an accurate distribution of the state [22, 23]. This deceptively simple approach yields approximations that are accurate to the third order for Gaussian inputs for all nonlinearities [14, 24]. For non-Gaussian inputs, the approximations are accurate to at least the second order, with the accuracy of the third- and higher-order moments determined by the choice of \(\alpha\) and \(\beta\).
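Eqs. 2–10 translate almost line by line into code. The sketch below is illustrative, assuming numpy and using the Cholesky factor as the matrix square root; for a linear f the UT reproduces the exact mean and covariance.

```python
import numpy as np

def unscented_transform(f, x_mean, Px, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate the mean x_mean and covariance Px of x through a
    nonlinear function f using the 2L+1 sigma points of Eqs. 2-4 and
    the weights of Eqs. 8-10; returns the approximate mean and
    covariance of y = f(x) per Eqs. 6-7."""
    L = len(x_mean)
    lam = alpha ** 2 * (L + kappa) - L
    # Matrix square root of (L + lam) * Px via Cholesky factorization.
    S = np.linalg.cholesky((L + lam) * Px)
    # Sigma points (Eqs. 2-4); rows of S.T are the columns of S.
    X = np.vstack([x_mean, x_mean + S.T, x_mean - S.T])
    # Weights (Eqs. 8-10).
    Wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (L + lam)
    Wc[0] = lam / (L + lam) + (1 - alpha ** 2 + beta)
    # Propagate each sigma point (Eq. 5) and form weighted statistics.
    Y = np.array([f(xi) for xi in X])
    y_mean = Wm @ Y          # Eq. 6
    dY = Y - y_mean
    Py = (Wc[:, None] * dY).T @ dY  # Eq. 7
    return y_mean, Py
```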

3.2.2 Unscented Kalman Filter Equations

The UKF is a straightforward extension of the UT to recursive estimation, where the state RV is redefined as the concatenation of the original state and the noise variables: \(x_{k}^{a} = \left[ {x_{k}^{T} V_{k}^{T} n_{k}^{T} } \right]\). The UT sigma point selection scheme (Eqs. 2–4) is applied to this augmented state RV to compute the corresponding sigma matrix, \(X_{k}^{a}\) [2]. The filter is initialized as shown in Eqs. 11–14. Note that no explicit calculation of Jacobians is necessary to implement this algorithm, and the overall number of computations is of the same order as for the EKF.

Initialize with

$$\hat{X}_{0} = {\mathbb{E}}\left[ {X_{0} } \right]$$
(11)
$$P_{0} = {\mathbb{E}}\left[ {\left( {X_{0} - \hat{X}_{0} } \right)\left( {X_{0} - \hat{X}_{0} } \right)^{T} } \right]$$
(12)
$$\hat{X}_{0}^{a} = {\mathbb{E}}\left[ {X^{a} } \right] = \left[{\begin{array}{*{20}c} \hat{X}_{0}^{T} & 0 & 0 \\ \end{array} } \right]$$
(13)
$$P_{0}^{a} = {\mathbb{E}}\left[ {\left( {X_{0}^{a} - \hat{X}_{0}^{a} } \right)\left( {X_{0}^{a} - \hat{X}_{0}^{a} } \right)^{T} } \right] = \left[ {\begin{array}{*{20}c} {P_{0} } & 0 & 0 \\ 0 & {R^{v} } & 0 \\ 0 & 0 & {R^{n} } \\ \end{array} } \right]$$
(14)

Calculation of sigma points:

$$X_{k - 1}^{a} = \left[ {\hat{X}_{k - 1}^{a} \hat{X}_{k - 1}^{a} + \gamma \sqrt {P_{k - 1}^{a} } \hat{X}_{k - 1}^{a} - \gamma \sqrt {P_{k - 1}^{a} } } \right]$$
(15)

The time update equations are given in Eqs. 16–20:

$$X_{k|k - 1}^{x} = F\left[ {X_{k - 1}^{x} , u_{k - 1} , X_{k - 1}^{v} } \right]$$
(16)
$$\hat{X}_{k}^{ - } = \mathop \sum \limits_{i = 0}^{2L} W_{i}^{\left( m \right)} X_{i , k|k - 1}^{x}$$
(17)
$$P_{k}^{ - } = \mathop \sum \limits_{i = 0}^{2L} W_{i}^{\left( c \right)} \left[ {X_{i, k|k - 1}^{ - } - \hat{X}_{k}^{ - } } \right]\left[ {X_{i, k|k - 1}^{ - } - \hat{X}_{k}^{ - } } \right]^{T}$$
(18)
$$y_{k|k - 1} = H[X_{k|k - 1}^{x} , X_{k - 1}^{n} ]$$
(19)
$$\hat{y}_{k}^{ - } = \mathop \sum \limits_{i = 0}^{2L} W_{i}^{\left( m \right)} y_{i, k|k - 1}$$
(20)

The measurement update equations are given in Eqs. 21–25:

$$P_{{\hat{y}_{k} \overline{y}_{k} }} = \mathop \sum \limits_{i = 0}^{2L} W_{i}^{\left( c \right)} \left[ {y_{i , k|k - 1} - \hat{y}_{k}^{ - } } \right]\left[ {y_{i , k|k - 1} - \hat{y}_{k}^{ - } } \right]^{T}$$
(21)
$$P_{{x_{k} y_{k} }} = \mathop \sum \limits_{i = 0}^{2L} W_{i}^{\left( c \right)} \left[ {x_{i , k|k - 1} - \hat{x}_{k}^{ - } } \right]\left[ {y_{i , k|k - 1} - \hat{y}_{k}^{ - } } \right]^{T}$$
(22)
$$K_{k} = P_{{x_{k} y_{k} }} P_{{\hat{y}_{k} \overline{y}_{k} }}^{ - 1}$$
(23)
$$\hat{x}_{k} = \hat{x}_{k}^{ - } + K_{k} \left( {y_{k} - \hat{y}_{k}^{ - } } \right)$$
(24)
$$P_{k} = P_{k}^{ - } - K_{k} P_{{\hat{y}_{k} \overline{y}_{k} }} K_{k}^{T}$$
(25)

where \(x^{a} = \left[ {x^{T} v^{T} n^{T} } \right]\), \(X^{a} = \left[ {\left( {X^{x} } \right)^{T} \left( {X^{v} } \right)^{T} \left( {X^{n} } \right)^{T} } \right]^{T}\), \(\gamma = \sqrt {\left( {{\text{L}} + {\uplambda }} \right)}\), \(R^{v}\) is the process noise covariance, \(R^{n}\) is the measurement noise covariance, and \(W_{i}\) are the weights calculated in Eqs. 8–10. The measurement update is applied in each time frame of the speech [2]. All the time frames are then concatenated to recover the denoised and dereverberated processed speech.
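The time and measurement updates above can be sketched compactly. This is a simplified illustration rather than the implementation used here: it uses the additive-noise form of the UKF, so the state is not augmented as in Eqs. 13–14, and it assumes the noise covariances \(R^{v}\) and \(R^{n}\) are known.

```python
import numpy as np

def ukf_step(x, P, y, F, H, Rv, Rn, alpha=1e-3, beta=2.0, kappa=0.0):
    """One UKF time + measurement update (shape of Eqs. 15-25),
    simplified to the additive-noise case (no state augmentation).
    x, P   : prior state mean (L,) and covariance (L, L)
    y      : new measurement vector
    F, H   : process and measurement functions
    Rv, Rn : process / measurement noise covariances (assumed known)."""
    L = len(x)
    lam = alpha ** 2 * (L + kappa) - L
    gamma = np.sqrt(L + lam)
    # Weights (Eqs. 8-10).
    Wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (L + lam)
    Wc[0] = Wm[0] + (1 - alpha ** 2 + beta)

    # Sigma points around the prior (Eq. 15).
    S = np.linalg.cholesky(P)
    X = np.vstack([x, x + gamma * S.T, x - gamma * S.T])

    # Time update (Eqs. 16-18); Rv enters additively here.
    Xp = np.array([F(xi) for xi in X])
    x_pred = Wm @ Xp
    dX = Xp - x_pred
    P_pred = (Wc[:, None] * dX).T @ dX + Rv

    # Redraw sigma points around the prediction, map through H (Eqs. 19-20).
    S = np.linalg.cholesky(P_pred)
    X = np.vstack([x_pred, x_pred + gamma * S.T, x_pred - gamma * S.T])
    Y = np.array([H(xi) for xi in X])
    y_pred = Wm @ Y
    dY = Y - y_pred

    # Measurement update (Eqs. 21-25).
    Pyy = (Wc[:, None] * dY).T @ dY + Rn
    Pxy = (Wc[:, None] * (X - x_pred)).T @ dY
    K = Pxy @ np.linalg.inv(Pyy)            # Kalman gain, Eq. 23
    x_new = x_pred + K @ (y - y_pred)       # Eq. 24
    P_new = P_pred - K @ Pyy @ K.T          # Eq. 25
    return x_new, P_new
```

In a frame-by-frame enhancement loop, this step would be repeated for each windowed state-space frame before the frames are concatenated back together.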

4 Experimental Results

In this section, we discuss the simulation results obtained with the UKF algorithm detailed in the previous section. The algorithm was applied to a few .wav files under various windowed processing schemes: the rectangular window, the Hamming window and the Gaussian window. The results obtained are plotted as waveforms. The algorithm was performed on two wave files, named Speech A.wav and Speech B.wav. The SNR of each was precalculated for later use in the comparisons. The waves were then reverberated, and observation noise was added to both; the observation noise used throughout is additive white Gaussian noise (AWGN).

After processing the two waveforms through the algorithm, we calculated the figure of merit and the correlation between the SNR of the processed output and the precalculated SNR of the clean speech, for three different numbers of iterations of the algorithm. The table below details the figure of merit and correlation for the two waveforms.

Table 1 shows the performance metrics, the FOM and the correlation between the input and the output of the proposed algorithm.

Table 1 Performance metrics

Table 2 shows the comparison between the SNR values for the different windows (rectangular, Hamming and Gaussian) used for chopping, and for the numbers of iterations performed, on both Speech A and Speech B.

Table 2 SNR comparison

5 Conclusion

In this project, a speech enhancement technique using Kalman filtering has been implemented. The objective was to design an effective method for processing noise-corrupted, reverberated speech in adverse environments. We were able to perform denoising and dereverberation on the corrupted speech. The proposed algorithm can be used for nonlinear systems, which most algorithms cannot handle, and it is time-efficient, so it can be used for speech of moderate length. The proposed unscented Kalman filtering algorithm uses three windows, rectangular, Hamming and Gaussian, for chopping the signal before processing, and Table 1 shows that the results differ significantly between windows for every iteration count. In any window, the performance increases slightly with the number of iterations up to a certain point; beyond that, both performance metrics taken in this report, the figure of merit and the correlation, fall. This is due to the repeated denoising and dereverberation, which damages the intelligibility of the desired output. Table 2 compares the SNRs of the outputs of the different windows under different numbers of iterations.