1 Introduction

In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of mismatches between the training and test environments. According to [1], approaches to the reverberation problem can be classified as front-end-based and back-end-based. Front-end-based approaches [1–10] attempt to reduce the effect of reverberation on the observed speech signal. Back-end-based methods attempt to modify the acoustic model and/or decoder so that they are suitable for reverberant environments [11, 12]. In this paper, we focus on front-end-based approaches for distant-talking speech recognition.

Many single-channel and multi-channel dereverberation methods have been proposed for robust distant-talking speech/speaker recognition [2–4, 13–17]. Compared with a microphone array, a single microphone is much easier and cheaper to implement in real applications. Several single-channel dereverberation approaches have been proposed [2–4, 13, 14]. Cepstral mean normalization (CMN) [18–20] may be considered the most general approach. It has been extensively examined and shown to be a simple and effective way of reducing reverberation by normalizing cepstral features. However, CMN-based dereverberation is not completely effective in environments with late reverberation. Several studies have focused on mitigating this problem [3, 4, 14]. A reverberation compensation method for speaker recognition using spectral subtraction [21], in which late reverberation is treated as additive noise, was proposed in [3]. A method based on multi-step linear prediction (MSLP) was proposed in [4, 14] for both single and multiple microphones. It first estimates late reverberation using long-term multi-step linear prediction and then suppresses it with subsequent spectral subtraction. Wölfel proposed a joint compensation of noise and reverberation that integrates an estimate of the reverberation energy, derived by an auxiliary model based on multi-step linear prediction, into a framework that tracks and removes nonstationary additive distortion with particle filters in a low-dimensional logarithmic power frequency domain [22].

Neural network (NN) based approaches have been proposed for feature transformation [23, 24]. Bottleneck features extracted by a multi-layer perceptron (MLP) can be used as a non-linear feature transformation [23]. However, deep MLPs with many hidden layers have a high computational cost, and layers far from the output learn little because the backpropagated error signal weakens with depth. Deep belief networks (DBNs), which employ an unsupervised pre-training method based on the restricted Boltzmann machine (RBM), have been proposed to obtain better initial values for deep networks [29]. DNNs with pre-training have shown better performance on automatic speech recognition than conventional MLPs without pre-training [29]. Recently, the denoising autoencoder (DAE), a type of deep neural network (DNN), has been shown to be effective in many noise reduction applications because it can learn higher-level representations and a more flexible feature mapping function [25, 26]. Ishii et al. applied a DAE to spectral-domain dereverberation [27] and found that the word accuracy of LVCSR on the JNAS database [28] improved from 61.4 % to 65.2 %. However, the suppressed spectral-domain feature must be converted to a cepstral-domain feature, and this improvement is not sufficient. Previously, we found that DNN-based [29] cepstral-domain feature mapping is effective for distant-talking speaker processing [30]. In this paper, we apply a denoising autoencoder to cepstral-domain dereverberation because many LVCSR systems adopt cepstral-domain features as the direct input.

The DAE uses its flexible mapping capability to learn a mapping from a window of distorted input features to the clean output features. Owing to limits on model complexity, the input window of the DAE cannot be made too large. In this study, we use a window of 9 frames, which covers roughly 0.1 s of speech. However, the effects of reverberation on speech features may last as long as 1 second. Therefore, the DAE alone is clearly not adequate to deal with reverberation distortion. In this paper, we apply temporal structure normalization (TSN) [31, 32] to complement the DAE for the dereverberation task. TSN was previously proposed to reduce the effects of the transmission channel and additive background noise on speech features for robust speech recognition. It is motivated by the observation that noise and channel distortion modify the temporal structure of speech features, so there is a need to restore the clean temporal structure. A further improvement in distant-talking speech recognition is expected from combining the cepstral-domain DAE with TSN-filter-based feature normalization. The proposed method is evaluated in both simulated and real reverberant environments.

The remainder of this paper is organized as follows: Section 2 describes the denoising autoencoder for cepstral-domain dereverberation. Temporal structure normalization is described in Section 3. The experimental results and discussions are presented in Section 4. Finally, Section 5 summarizes the paper.

2 Denoising Autoencoder for Cepstral-Domain Dereverberation

An autoencoder is a type of artificial neural network (NN) whose output is a reconstruction of its input, and it is often used for dimensionality reduction. DAEs share the same structure as autoencoders, but the input data is a noisy version of the output data. Denoising autoencoders use feature mapping to convert noisy input data into clean output, and they have been used for noise removal in the field of image processing [25]. Ishii et al. applied a DAE to spectral-domain dereverberation [27]. However, the suppressed spectral-domain feature must be converted to a cepstral-domain feature, and this improvement is not sufficient. Noting that many speech recognition systems adopt cepstral-domain features as the direct input, we expect that transforming cepstral-domain features may achieve better performance than transforming spectral-domain features. Cepstral-domain denoising-autoencoder-based dereverberation transforms the cepstrum of reverberant speech into that of clean speech. Through the Mel filterbank, cepstral features take into account the fact that the human auditory system has higher resolution at low frequencies than at high frequencies. Hence, the error between the dereverberated signal and the clean teacher signal computed on cepstral features automatically emphasizes low frequencies more than high frequencies. In contrast, if the error is computed on the spectrum, all frequencies are treated as equally important. Moreover, the dimension of spectral-domain features is greater than that of cepstral-domain features, which makes learning more difficult for a DAE with a deep architecture. Thus, DAE-based cepstral-domain dereverberation is expected to be more efficient than DAE-based spectral-domain dereverberation for speech recognition. In this paper, we therefore apply a denoising autoencoder to cepstral-domain dereverberation.

Given pairs of speech samples (clean speech and the corresponding reverberant speech), the DAE learns the non-linear conversion function that converts reverberant speech features into clean speech features. In general, reverberation depends on both the current and several previous observation frames. Therefore, in addition to the vector of the current frame, vectors of past frames are concatenated to form the input.

For the cepstral feature \(X_i\) of the observed reverberant speech at the \(i\)-th frame, the cepstral features of the \(N-1\) frames preceding the current frame are concatenated with the current frame to form a cepstral vector of \(N\) frames. The output \(O_i\) of the non-linear transformer based on the DAE is given by:

$$ O_{i} = f_{L}(\ldots f_{l}(\ldots f_{2}(f_{1}(X_{i},X_{i-1},\ldots,X_{i-N+1}))\ldots)\ldots) $$
(1)

where \(f_l\) is the non-linear transformation function in layer \(l\), and \(N\) is the number of frames used as the input features.
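
As a concrete illustration of Eq. (1), the following NumPy sketch maps a window of reverberant frames to one clean frame. The sigmoid hidden units and linear output layer are our assumptions (consistent with RBM pre-training and cepstral regression, but not stated explicitly above); the layer sizes follow the configuration chosen in Section 4.2.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_forward(frames, weights, biases):
    """Map a window of reverberant cepstral frames to one clean frame (Eq. 1).

    frames : (N, D) array, the current frame plus the N-1 preceding frames.
    weights, biases : per-layer parameters; hidden layers are sigmoid and
    the output layer is linear so it can produce arbitrary cepstral values.
    """
    x = frames.reshape(-1)                     # concatenate the N frames
    for W, b in zip(weights[:-1], biases[:-1]):
        x = sigmoid(W @ x + b)                 # non-linear transformation f_l
    return weights[-1] @ x + biases[-1]        # linear output layer

# Example with the configuration used later in the paper:
# N = 9 frames of 39 MFCCs in, three hidden layers of 1024 units, 39 out.
rng = np.random.default_rng(0)
sizes = [9 * 39, 1024, 1024, 1024, 39]
weights = [rng.standard_normal((m, n)) * 0.01
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(dae_forward(rng.standard_normal((9, 39)), weights, biases).shape)  # (39,)
```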

The topology of the cepstral-domain DAE for dereverberation is shown in Fig. 1. In this paper, the number of hidden layers is set to three. Details of parameter tuning for the DAE are discussed in Section 4.2.1. In Fig. 1, \(W_i\) (\(i = 1, 2\)) denotes the weight matrix of each layer, and \({W^{T}_{i}}\) denotes the transpose of \(W_i\). That is, \(W_1\) and \(W_2\) are the encoder matrices, and \({W^{T}_{1}}\) and \({W^{T}_{2}}\) are the decoder matrices, respectively.

Figure 1 Topology of the stacked denoising autoencoder for cepstral-domain dereverberation.

2.1 Training of DAE

2.1.1 Restricted Boltzmann Machine

To train a deep neural network, deep belief networks (DBNs) [29] are used for pre-training because they provide accurate initial values for deep networks.

An RBM is a bipartite graph, as shown in Fig. 2. It has a visible layer and a hidden layer, in which visible units that represent observations are connected to hidden units that learn to represent features through weighted connections. An RBM is restricted in that there are no visible-visible or hidden-hidden connections. Different types of RBMs are used for binary and real-valued inputs: Bernoulli-Bernoulli RBMs map binary stochastic variables to binary stochastic variables, while Gaussian-Bernoulli RBMs map real-valued stochastic variables to binary stochastic variables.

Figure 2 Graphical representation of the RBM.

In a Bernoulli-Bernoulli RBM, the weights on the connections and the biases of the individual units define a probability distribution over the joint states of the visible and hidden units via an energy function. The energy of a joint configuration is:

$$ E(\mathrm{v},\mathrm{h}|\theta) = -\sum\limits^{\mathcal{V}}_{i=1}\sum\limits^{\mathcal{H}}_{j=1}w_{ij}v_{i}h_{j}-\sum\limits^{\mathcal{V}}_{i=1}a_{i}v_{i}-\sum\limits^{\mathcal{H}}_{j=1}b_{j}h_{j} $$
(2)

where \(\theta = (w, a, b)\), \(w_{ij}\) represents the symmetric interaction term between visible unit \(i\) and hidden unit \(j\), and \(a_i\) and \(b_j\) are their bias terms. \(\mathcal{V}\) and \(\mathcal{H}\) are the numbers of visible and hidden units.

The probability that an RBM assigns to a visible vector v is:

$$ p(\mathrm{v}|\theta) = \frac{{\sum}_{\mathrm{h}}exp(-E(\mathrm{v},\mathrm{h}))}{{\sum}_{\mathrm{v}}{\sum}_{\mathrm{h}}exp(-E(\mathrm{v},\mathrm{h}))} $$
(3)

Since there are no hidden-hidden connections, the conditional distribution p(h|v,𝜃) is factorial and is given by:

$$ p(h_{j} = 1|\mathrm{v},\theta) = \sigma\left(b_{j} + \sum\limits^{\mathcal{V}}_{i=1}w_{ij}v_{i}\right) $$
(4)

where \(\sigma(x) = (1+\exp(-x))^{-1}\). Similarly, since there are no visible-visible connections, the conditional distribution \(p(\mathrm{v}|\mathrm{h},\theta)\) is factorial and is given by:

$$ p(v_{i} = 1|\mathrm{h},\theta) = \sigma\left(a_{i} + \sum\limits^{\mathcal{H}}_{j=1}w_{ij}h_{j}\right) $$
(5)

In a Gaussian-Bernoulli RBM, the energy of a joint configuration is:

$$ E(\mathrm{v},\mathrm{h}|\theta) = \sum\limits^{\mathcal{V}}_{i=1}\frac{(v_{i} - a_{i})^{2}}{2}-\sum\limits^{\mathcal{V}}_{i=1}\sum\limits^{\mathcal{H}}_{j=1}w_{ij}v_{i}h_{j}-\sum\limits^{\mathcal{H}}_{j=1}b_{j}h_{j} $$
(6)

The conditional distribution \(p(\mathrm{v}|\mathrm{h},\theta)\) is factorial and is given by:

$$ p(v_{i}|\mathrm{h},\theta) = \mathcal{N}\left(v_{i};a_{i} + \sum\limits^{\mathcal{H}}_{j=1}w_{ij}h_{j},1\right) $$
(7)

where \(\mathcal{N}(\mu, V)\) denotes a Gaussian with mean \(\mu\) and variance \(V\).

Maximum likelihood estimation of an RBM maximizes the log likelihood \(\log p(\mathrm{v}|\theta)\) with respect to the parameters \(\theta\). The weight update equation is given by:

$$ {\Delta}{w}_{ij} = \langle{v_{i}h_{j}}\rangle_{data} - \langle{v_{i}h_{j}}\rangle_{model} $$
(8)

where \(\langle\cdot\rangle_{data}\) is the expectation, over the training set, that \(v_i\) and \(h_j\) are on together, and \(\langle\cdot\rangle_{model}\) is the same expectation under the model distribution. Because computing \(\langle v_i h_j\rangle_{model}\) exactly is expensive, the contrastive divergence (CD) approximation is used to compute the gradient: \(\langle v_i h_j\rangle_{model}\) is approximated with a single step of Gibbs sampling.
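
As a minimal sketch of this procedure, the following NumPy function performs one CD-1 update for a Bernoulli-Bernoulli RBM using Eqs. (4), (5), and (8). The sampling details follow the standard CD recipe, and the default learning rate is taken from the pre-training setting in Section 4.1.3.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.002, rng=None):
    """One contrastive divergence (CD-1) update for a Bernoulli-Bernoulli RBM.

    v0 : (batch, V) binary visible data; W : (V, H) weights;
    a : (V,) visible biases; b : (H,) hidden biases.
    <v h>_model in Eq. (8) is approximated by one step of Gibbs sampling.
    """
    rng = rng or np.random.default_rng(0)
    # Positive phase: p(h = 1 | v0), Eq. (4), then sample binary hidden states.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visibles via Eq. (5), then p(h = 1 | v1).
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Gradient estimate <v h>_data - <v h>_model, averaged over the minibatch.
    batch = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```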

To obtain pre-trained RBMs, we trained all hidden layers using Bernoulli-Bernoulli RBMs. DBNs are hierarchically configured by stacking these pre-trained RBMs. In a DAE, the output is a reconstruction of the input, so the first half of the network's layers perform encoding and the second half perform decoding. \(W_1\) and \(W_2\) are learned automatically, and \({W^{T}_{1}}\) and \({W^{T}_{2}}\) are generated from \(W_1\) and \(W_2\) in Fig. 1.

2.1.2 Backpropagation Algorithm

After pre-training, a backpropagation algorithm is applied to adjust the parameters. Backpropagation modifies the weights of the network to reduce the error between the network output and the teacher signal when a pair of signals (the input signal and the ideal teacher signal, i.e., the cepstral features of clean speech) is given.
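
A minimal NumPy sketch of one such backpropagation step is given below, assuming sigmoid hidden layers and a linear output layer as in Eq. (1). For clarity it processes a single training example; the actual training in Section 4.1.3 uses minibatches of 256.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_step(x, t, weights, biases, lr=0.1):
    """One backpropagation step minimizing 0.5 * ||output - t||^2.

    x : (in_dim,) concatenated reverberant cepstral frames.
    t : (out_dim,) teacher signal, the clean-speech cepstral features.
    Hidden layers are sigmoid, the output layer is linear (cf. Eq. 1).
    """
    # Forward pass, storing each layer's input for the backward pass.
    acts = [x]
    for W, b in zip(weights[:-1], biases[:-1]):
        acts.append(sigmoid(W @ acts[-1] + b))
    y = weights[-1] @ acts[-1] + biases[-1]

    # Backward pass: start from the MSE derivative at the linear output.
    delta = y - t
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l])
        grad_b = delta
        if l > 0:  # propagate through the sigmoid derivative a * (1 - a)
            delta = (weights[l].T @ delta) * acts[l] * (1.0 - acts[l])
        weights[l] -= lr * grad_W
        biases[l] -= lr * grad_b
    return 0.5 * float((y - t) @ (y - t))   # squared error before the update
```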

3 Temporal Structure Normalization

Noise and reverberation distort speech features in multiple respects, e.g., the timbre of speech and also the temporal structure of speech over spans of up to 1 s. The DAE uses its flexible mapping capability to learn a mapping from a window of distorted input features to the clean output features. Owing to limits on model complexity, the input window of the DAE cannot be made too large. In this study, we use a window of 9 frames, which covers roughly 0.1 s of speech. However, the effects of reverberation on speech features may last as long as 1 s. For example, the evaluation data in this study has a T60 of up to 0.7 s. Therefore, the DAE alone is clearly not adequate to deal with reverberation distortion.

In this section, we describe a method called temporal structure normalization (TSN) to complement the DAE for the dereverberation task. TSN was previously proposed to reduce the effects of the transmission channel and additive background noise on speech features for robust speech recognition. It is motivated by the observation that noise and channel distortion modify the temporal structure of speech features, so there is a need to restore the clean temporal structure. In TSN, the temporal structure of speech features is represented by the modulation spectra of the speech signal, i.e., the power spectral density (PSD) functions of the feature trajectories. The temporal structure of clean features is restored by applying a linear filter to each feature trajectory, where the filter weights are designed independently for each feature trajectory and speech utterance so as to normalize the PSD of the trajectory to a reference PSD. The reference PSD of a feature trajectory is estimated as the mean PSD over clean utterances and represents the temporal structure of clean speech.
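
The sketch below illustrates this per-trajectory filtering with the settings used in Section 4.1.3 (Yule-Walker AR order 6, a 33-tap filter). The frequency-sampling FIR design and the omission of group-delay compensation are simplifying assumptions; the exact filter design of [31, 32] may differ.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import firwin2, lfilter

def ar_psd(x, order=6, nfft=64):
    """Yule-Walker AR estimate of the PSD of one feature trajectory."""
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)   # autocorrelation
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])    # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                          # prediction error
    w = np.linspace(0, np.pi, nfft)
    H = 1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return sigma2 / np.abs(H) ** 2

def tsn_filter(traj, ref_psd, taps=33):
    """Filter one trajectory so its PSD approaches the clean reference PSD."""
    gain = np.sqrt(ref_psd / np.maximum(ar_psd(traj, nfft=len(ref_psd)), 1e-12))
    freqs = np.linspace(0.0, 1.0, len(gain))   # normalized frequency, 1 = Nyquist
    h = firwin2(taps, freqs, gain)             # linear-phase FIR approximation
    return lfilter(h, [1.0], traj)
```

In practice, ref_psd would be the mean AR-based PSD computed over the clean training utterances, and the filter would be applied to each of the 39 MFCC trajectories independently.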

In our preliminary study, we found that the PSD functions of clean and reverberant speech features are very different. This is expected, as reverberation is known to have a blurring effect on the speech spectrum, which introduces a smoothing effect on the spectrum and consequently on the features. Therefore, it is natural to apply TSN to deal with the distortions that remain in the speech features after the DAE.

4 Experiments

4.1 Experimental Setup

4.1.1 Training Dataset

The training dataset provided by the REVERB challenge (Reverberant Voice Enhancement and Recognition Benchmark) [33] was used. This dataset consists of the clean WSJCAM0 [34] training set and a multi-condition (MC) training set. Reverberant speech is generated from the clean WSJCAM0 training data by convolving the clean utterances with measured room impulse responses and adding recorded background noise. The reverberation times of the measured impulse responses range from approximately 0.1 to 0.8 s. This training dataset was used for training the acoustic models. The clean speech was also used to train the reference PSD function of the TSN.

It should be noted that the recording rooms used for the multi-condition training data and test data were different.

4.1.2 Evaluation Test Set

It is important to note that the evaluation dataset consists of real recordings (RealData) and simulated data (SimData), part of which has characteristics similar to RealData in terms of reverberation time and microphone-speaker distance. This setup allows us to perform evaluations in terms of both practicality and robustness under various reverberant conditions. Specifically, the development (Dev.) test set and the final evaluation (Eval.) test set each consist of SimData and RealData: SimData is generated from the WSJCAM0 corpus [34], and RealData is taken from the MC-WSJ-AV corpus [35]. The development dataset was used to determine the optimal parameters for dereverberation and speech recognition. The details of the training and test datasets are shown in Tables 1 and 2.

Table 1 Quantity of data for the Dev. and Eval. sets of SimData and RealData and for the training dataset.
Table 2 Details of the SimData and RealData sets.

4.1.3 Experimental Conditions for LVCSR and Dereverberation

In this study, Mel frequency cepstral coefficients (MFCCs) were used as features for LVCSR. The feature dimension was 39: 12 MFCCs plus power, together with their delta and delta-delta coefficients. The MFCC features were normalized using the mean of the entire multi-condition training set. DAE training was carried out using stochastic mini-batch gradient descent with a mini-batch size of 256 samples. Fifty epochs with a learning rate of 0.002 were used for all layers during pre-training, and 100 epochs with a learning rate of 0.1 were used for all layers during fine-tuning.

The multi-step linear prediction (MSLP) algorithm generates an inverse filter from the prediction coefficients to estimate the inverse system [14]. We estimate the late reverberation components using this inverse filter and apply dereverberation by power spectral subtraction. For MSLP-based dereverberation, the step size and the order of linear prediction were set to 500 and 750, respectively. For the TSN filter, the Yule-Walker method is used to estimate the PSD functions of the feature trajectories. The order of the AR model for PSD estimation is set to 6 to obtain a proper level of detail. A filter length of 33 taps is used for the evaluation unless otherwise stated.
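
The following is a simplified single-channel sketch of this MSLP pipeline, with the paper's step size (500) and prediction order (750) as defaults. The least-squares solution of the delayed prediction problem and the subtraction flooring constant are illustrative assumptions rather than the exact implementation of [14].

```python
import numpy as np
from scipy.signal import stft, istft

def mslp_dereverb(x, delay=500, order=750, beta=1.0, floor=0.05,
                  nfft=512, hop=128, fs=16000):
    """Single-channel MSLP dereverberation sketch (cf. [4, 14]).

    A linear predictor estimates x[n] from samples at least `delay` samples
    in the past; the prediction approximates the late reverberation, which
    is then suppressed by power spectral subtraction with the observed phase.
    """
    t0 = delay + order - 1
    y = x[t0:]                                  # prediction targets
    # Data matrix of delayed samples x[t - delay - k], k = 0 .. order-1.
    A = np.stack([x[t0 - delay - k: len(x) - delay - k] for k in range(order)],
                 axis=1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    late = np.zeros_like(x)
    late[t0:] = A @ coef                        # estimated late reverberation

    _, _, X = stft(x, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    _, _, L = stft(late, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    power = np.maximum(np.abs(X) ** 2 - beta * np.abs(L) ** 2,
                       floor * np.abs(X) ** 2)  # power subtraction with flooring
    Y = np.sqrt(power) * np.exp(1j * np.angle(X))
    _, x_hat = istft(Y, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    return x_hat
```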

In this study, we used the speech recognition system provided by the REVERB challenge task [33], which is based on the hidden Markov model toolkit (HTK) [36]. As an acoustic model, it employs tied-state HMMs with eight Gaussian components per state, trained according to the maximum-likelihood criterion. We use a multi-condition training set for acoustic model training. This training set is generated from the clean WSJCAM0 training data by convolving the clean utterances with measured room impulse responses and adding recorded background noise. The reverberation times of the measured impulse responses range roughly from 0.1 to 0.8 s. Note that the recording rooms used for SimData, RealData, and the multi-condition training data are all different. CMLLR [37] converts the mean and variance of the Gaussian distribution in each state of the hidden Markov models (HMMs) using a regression matrix, in order to reduce the mismatch between the adaptation data and the model. The method obtains a transformation matrix for modifying the model parameters that maximizes the likelihood of the adaptation data. In this paper, we applied CMLLR for unsupervised model adaptation, i.e., environment adaptation.
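
For illustration, the sketch below applies a given CMLLR transform to the Gaussian parameters; estimating the regression matrix itself (by maximizing the adaptation-data likelihood, as HTK does) is beyond this sketch, and the model-space convention shown is one of several equivalent formulations.

```python
import numpy as np

def apply_cmllr(means, covs, A, b):
    """Apply a given CMLLR transform to the Gaussians of an HMM system.

    means : (num_gauss, D); covs : (num_gauss, D, D); A : (D, D); b : (D,).
    The same matrix A transforms means and covariances, which is what makes
    the transform "constrained":
        mu'    = A @ mu + b
        Sigma' = A @ Sigma @ A.T
    Equivalently, the inverse transform can be applied to the features.
    """
    new_means = means @ A.T + b
    new_covs = np.einsum("ij,njk,lk->nil", A, covs, A)
    return new_means, new_covs
```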

4.2 Experimental Results

4.2.1 Parameters Tuning for DAE

For DAE-based dereverberation, the feature vectors of the current frame and the previous eight frames of reverberant speech were used as input. The 39 MFCCs of the current frame of clean speech were used as the teacher signal for the output; i.e., the input dimension was 39 × 9 = 351. The optimal number of hidden layers and the number of units in each hidden layer were determined experimentally. Table 3 shows the speech recognition results for different numbers of hidden layers (with 1024 units in each hidden layer). Based on these results, the number of hidden layers of the DAE is set to 3 in the remainder of this paper. We confirmed that speech recognition performance degrades when the number of hidden layers is increased further. This can be explained by the complexity of DNNs: too many layers increase the number of parameters, and the network overfits when the training data is small. In this paper, the training data amounts to less than 20 hours, so a moderate number of layers is sufficient for this task. Table 4 shows the results for different numbers of units in each hidden layer. Based on these results, we set the number of units in each hidden layer to 1024.

Table 3 Word error rate of DAE-based dereverberation with different numbers of hidden layers (%).
Table 4 Word error rate of DAE-based dereverberation with different numbers of units in each hidden layer (%).

4.2.2 Comparison of Spectral-Domain DAE and Cepstral-Domain DAE

Ishii et al. applied a DAE to spectral-domain dereverberation [27] on the JNAS database [28]. However, the suppressed spectral-domain feature must be converted to a cepstral-domain feature, and this improvement is not sufficient. In our study, we applied a DAE to cepstral-domain dereverberation on the REVERB challenge task [33]. In this section, we compare spectral-domain and cepstral-domain DAE-based dereverberation on this task. The results are compared in Table 5. They indicate that the cepstral-domain DAE is better than the spectral-domain DAE for reverberant speech recognition on the REVERB challenge task.

Table 5 Word error rate by spectral-domain DAE and cepstral-domain DAE (%).

A possible reason for the better results of the cepstral-domain DAE is that the mean square error (MSE) of the cepstral features is used as the cost function, which is more relevant to the speech recognition task than the MSE of the spectrum. Through the Mel filterbank, cepstral features take into account the fact that the human auditory system has higher resolution at low frequencies than at high frequencies. Hence, the MSE on cepstral features automatically emphasizes low frequencies more than high frequencies. In contrast, if the MSE on the spectrum is used, all frequencies are treated as equally important.

4.2.3 Results by Combining DAE and TSN

Tables 6 and 7 show the speech recognition results on the Dev. and Eval. datasets. We compared three dereverberation methods (MSLP, DAE, and TSN) and their combinations. CMN was applied in all methods. When DAE-based cepstral-domain dereverberation was compared with CMN-based dereverberation, single-channel MSLP-based dereverberation, and TSN-filter-based dereverberation, a remarkable improvement was achieved. The DAE worked especially well under strong reverberation, i.e., with the far-field microphone in “Room 2” and “Room 3” of SimData. The performance with CMLLR-based environment adaptation was better than that without CMLLR. However, the results of the DAE were worse than the baseline in “Room 1”. This trend is also seen in Table 5. We consider that the late reverberation in “Room 1” is relatively small, and the early reverberation can be suppressed effectively by CMN. Hence, the merit of suppressing late reverberation under lightly reverberant conditions is not very large. DAE+TSN introduces some distortion during dereverberation owing to the mismatch between the “Room 1” condition and the training condition, so the performance of DAE+TSN is worse than that of CMN when RT60 is very small. The results indicate that the proposed methods work better under heavy reverberation than under light reverberation.

Table 6 Word error rate of each method for Dev. dataset (%).
Table 7 Word error rate of each method for Eval. dataset (%).

For the Dev. dataset, the average word error rate (WER) in SimData was improved from 25.16 % with CMN to 22.05 % with the cepstral-domain DAE with CMLLR-based environment adaptation. In RealData, the WER was improved from 47.48 % to 42.51 % with CMLLR-based environment adaptation. In SimData, by combining the cepstral-domain DAE and the TSN filter with environment adaptation, the WER was reduced from the 25.16 % baseline to 21.19 %, i.e., a relative error reduction of 15.8 %. In RealData, the WER was reduced from 47.48 % to 41.26 %, i.e., a relative error reduction of 13.1 %. The TSN filter did not work well on its own because of the reverberation. However, when combined with the cepstral-domain DAE, the improvement from the TSN filter increased. We consider that the noise reduction capability of the TSN filter was improved by the dereverberation of the cepstral-domain DAE.

For the Eval. dataset, a trend similar to that of the Dev. dataset was obtained. The WER in SimData was improved from 25.82 % with CMN to 22.92 % with the cepstral-domain DAE with CMLLR-based environment adaptation. In RealData, the WER was improved from 48.56 % to 45.17 % with CMLLR-based environment adaptation. The proposed combination of DAE and TSN achieved the best speech recognition performance; that is, the combination of DAE and TSN outperformed the other dereverberation methods on both the Dev. and Eval. datasets.

We found that the combinations with MSLP did not produce good results. MSLP works well for dereverberation on its own, but not in combination with the other dereverberation methods.

5 Conclusions

In this paper, we proposed a robust distant-talking speech recognition method that combines a cepstral-domain DAE and a TSN filter. The proposed method was evaluated in simulated and real distant-talking environments. DAE-based cepstral-domain dereverberation achieved a remarkable improvement over CMN-based dereverberation, MSLP-based dereverberation, and TSN-filter-based feature normalization in both environments. Furthermore, speech recognition performance was improved by combining the cepstral-domain DAE and the TSN filter. In SimData of the Dev. dataset, combining the cepstral-domain DAE and the TSN filter with environment adaptation reduced the WER from the 25.16 % baseline to 21.19 %, i.e., a relative error reduction of 15.8 %. In RealData of the Dev. dataset, the WER was reduced from 47.48 % to 41.26 %, i.e., a relative error reduction of 13.1 %. A similar trend was obtained for the Eval. dataset. In SimData of the Eval. dataset, the WER was reduced from 25.82 % to 22.54 %, i.e., a relative error reduction of 12.7 %. In RealData of the Eval. dataset, the WER was reduced from 48.56 % to 44.03 %, i.e., a relative error reduction of 9.33 %.