1 Introduction

In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of mismatches between the training and test environments. According to [1], approaches to the reverberation problem can be classified as front-end-based and back-end-based. Front-end-based approaches [1–10] attempt to reduce the effect of reverberation on the observed speech signal. Back-end-based methods attempt to modify the acoustic model and/or decoder so that they are suitable for reverberant environments [11, 12]. In this paper, we focus on front-end-based approaches for distant-talking speech recognition.

Many single-channel and multi-channel dereverberation methods have been proposed for robust distant-talking speech/speaker recognition [2–4, 13–17]. Compared with a microphone array, a single microphone is much easier and cheaper to implement in real applications. Several single-channel dereverberation approaches have been proposed [2–4, 13, 14]. Cepstral mean normalization (CMN) [18–20] may be considered the most general approach. It has been extensively examined and shown to be a simple and effective way of reducing reverberation by normalizing cepstral features. However, CMN-based dereverberation is not completely effective in environments with late reverberation. Several studies have focused on mitigating this problem [3, 4, 14]. A reverberation compensation method for speaker recognition using spectral subtraction [21], in which late reverberation is treated as additive noise, was proposed in [3]. A method based on multi-step linear prediction (MSLP) was proposed in [4, 14] for both single and multiple microphones. It first estimates late reverberation using long-term multi-step linear prediction and then suppresses it with subsequent spectral subtraction. Wölfel proposed a joint compensation of noise and reverberation that integrates an estimate of the reverberation energy, derived by an auxiliary model based on multi-step linear prediction, into a framework that tracks and removes nonstationary additive distortion with particle filters in a low-dimensional logarithmic power frequency domain [22].

Neural network (NN) based approaches have been proposed for feature transformation [23, 24]. Bottleneck features extracted by a multi-layer perceptron (MLP) can be used as a non-linear feature transformation [23]. However, deep MLPs with many hidden layers have a high computational cost, and layers far from the output learn little because the backpropagated error signal weakens with depth. Deep belief networks (DBNs), which employ an unsupervised pre-training method based on the restricted Boltzmann machine (RBM), have been proposed to obtain better initial values for deep networks [29]. DNNs with pre-training have shown better performance on automatic speech recognition than conventional MLPs without pre-training [29]. Recently, the denoising autoencoder (DAE), a type of deep neural network (DNN), has been shown to be effective in many noise reduction applications because it can learn higher-level representations and a more flexible feature mapping function [25, 26]. Ishii et al. applied a DAE to spectral-domain dereverberation [27] and found that the word accuracy of LVCSR on the JNAS database [28] improved from 61.4 % to 65.2 %. However, the suppressed spectral-domain feature must be converted to a cepstral-domain feature, and this improvement is not sufficient. Previously, we found that DNN-based [29] cepstral-domain feature mapping is effective for distant-talking speaker processing [30]. In this paper, we apply a denoising autoencoder to cepstral-domain dereverberation because many LVCSR systems adopt cepstral-domain features as the direct input.

The DAE uses its flexible mapping capability to learn a mapping from a window of distorted input features to the clean output features. Owing to limits on model complexity, the input window of the DAE cannot be made too large. In this study, we use a window of 9 frames, which covers roughly 0.1 s of speech. However, the effects of reverberation on speech features may last as long as 1 second. Therefore, the DAE alone is clearly not adequate to deal with reverberation distortion. In this paper, we apply temporal structure normalization (TSN) [31, 32] to complement the DAE for the dereverberation task. TSN was previously proposed to reduce the effects of the transmission channel and additive background noise on speech features for robust speech recognition. It is motivated by the observation that noise and channel distortion modify the temporal structure of speech features, so there is a need to restore the clean temporal structure. A further improvement in distant-talking speech recognition is expected from combining the cepstral-domain DAE with TSN-filter-based feature normalization. The proposed method is evaluated in both simulated and real reverberant environments.

The remainder of this paper is organized as follows: Section 2 describes the denoising autoencoder for cepstral-domain dereverberation. Temporal structure normalization is described in Section 3. The experimental results and discussions are presented in Section 4. Finally, Section 5 summarizes the paper.

2 Denoising Autoencoder for Cepstral-Domain Dereverberation

An autoencoder is a type of artificial neural network (NN) whose output is a reconstruction of its input, and it is often used for dimensionality reduction. DAEs share the same structure as autoencoders, but the input data is a noisy version of the output data. Denoising autoencoders use feature mapping to convert noisy input data into clean output, and they have been used for noise removal in the field of image processing [25]. Ishii et al. applied a DAE to spectral-domain dereverberation [27]. However, the suppressed spectral-domain feature must be converted to a cepstral-domain feature, and this improvement is not sufficient. Noting that many speech recognition systems adopt cepstral-domain features as the direct input, we expect that transforming cepstral-domain features may achieve better performance than transforming spectral-domain features. Cepstral-domain denoising-autoencoder-based dereverberation transforms the cepstrum of reverberant speech into that of clean speech. Through the Mel filterbank, cepstral features take into account the fact that the human auditory system has higher resolution at low frequencies than at high frequencies. Hence, the error between the dereverberated signal and the clean teacher signal computed on cepstral features automatically emphasizes low frequencies more than high frequencies. In contrast, if the error is computed on the spectrum, all frequencies are treated as equally important. Moreover, the dimension of spectral-domain features is greater than that of cepstral-domain features, which makes learning more difficult for a DAE with a deep architecture. Thus, DAE-based cepstral-domain dereverberation is expected to be more efficient than DAE-based spectral-domain dereverberation for speech recognition. In this paper, we therefore apply a denoising autoencoder to cepstral-domain dereverberation.

Given pairs of speech samples (clean speech and the corresponding reverberant speech), the DAE learns the non-linear conversion function that converts reverberant speech features into clean speech features. In general, reverberation depends on both the current and several previous observation frames. Therefore, in addition to the vector of the current frame, vectors of past frames are concatenated to form the input.

For the cepstral feature \(X_i\) of the observed reverberant speech at the \(i\)-th frame, the cepstral features of the \(N-1\) frames preceding the current frame are concatenated with the current frame to form a cepstral vector of \(N\) frames. The output \(O_i\) of the non-linear transformer based on the DAE is given by:

$$ O_{i} = f_{L}(\ldots f_{l}(\ldots f_{2}(f_{1}(X_{i},X_{i-1},\ldots,X_{i-N+1}))\ldots)\ldots) $$
(1)

where \(f_l\) is the non-linear transformation function in layer \(l\), and \(N\) is the number of frames used as the input features.
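
As a concrete illustration of Eq. (1), the following NumPy sketch maps a window of reverberant frames to one clean frame. The sigmoid hidden units and linear output layer are our assumptions (consistent with RBM pre-training and cepstral regression, but not stated explicitly above); the layer sizes follow the configuration chosen in Section 4.2.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_forward(frames, weights, biases):
    """Map a window of reverberant cepstral frames to one clean frame (Eq. 1).

    frames : (N, D) array, the current frame plus the N-1 preceding frames.
    weights, biases : per-layer parameters; hidden layers are sigmoid and
    the output layer is linear so it can produce arbitrary cepstral values.
    """
    x = frames.reshape(-1)                     # concatenate the N frames
    for W, b in zip(weights[:-1], biases[:-1]):
        x = sigmoid(W @ x + b)                 # non-linear transformation f_l
    return weights[-1] @ x + biases[-1]        # linear output layer

# Example with the configuration used later in the paper:
# N = 9 frames of 39 MFCCs in, three hidden layers of 1024 units, 39 out.
rng = np.random.default_rng(0)
sizes = [9 * 39, 1024, 1024, 1024, 39]
weights = [rng.standard_normal((m, n)) * 0.01
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(dae_forward(rng.standard_normal((9, 39)), weights, biases).shape)  # (39,)
```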

The topology of the cepstral-domain DAE for dereverberation is shown in Fig. 1. In this paper, the number of hidden layers is set to three. Details of parameter tuning for the DAE are discussed in Section 4.2.1. In Fig. 1, \(W_i\) (\(i = 1, 2\)) denotes the weight matrix of each layer, and \({W^{T}_{i}}\) denotes the transpose of \(W_i\). That is, \(W_1\) and \(W_2\) are the encoder matrices, and \({W^{T}_{1}}\) and \({W^{T}_{2}}\) are the decoder matrices, respectively.

Figure 1 Topology of the stacked denoising autoencoder for cepstral-domain dereverberation.

2.1 Training of DAE

2.1.1 Restricted Boltzmann Machine

To train a deep neural network, deep belief networks (DBNs) [29] are used for pre-training because they provide accurate initial values for deep networks.

An RBM is a bipartite graph, as shown in Fig. 2. It has a visible layer and a hidden layer, in which visible units that represent observations are connected to hidden units that learn to represent features through weighted connections. An RBM is restricted in that there are no visible-visible or hidden-hidden connections. Different types of RBMs are used for binary and real-valued inputs: Bernoulli-Bernoulli RBMs map binary stochastic variables to binary stochastic variables, while Gaussian-Bernoulli RBMs map real-valued stochastic variables to binary stochastic variables.

Figure 2 Graphical representation of the RBM.

In a Bernoulli-Bernoulli RBM, the weights on the connections and the biases of the individual units define a probability distribution over the joint states of the visible and hidden units via an energy function. The energy of a joint configuration is:

$$ E(\mathrm{v},\mathrm{h}|\theta) = -\sum\limits^{\mathcal{V}}_{i=1}\sum\limits^{\mathcal{H}}_{j=1}w_{ij}v_{i}h_{j}-\sum\limits^{\mathcal{V}}_{i=1}a_{i}v_{i}-\sum\limits^{\mathcal{H}}_{j=1}b_{j}h_{j} $$
(2)

where \(\theta = (w, a, b)\), \(w_{ij}\) represents the symmetric interaction term between visible unit \(i\) and hidden unit \(j\), and \(a_i\) and \(b_j\) are their bias terms. \(\mathcal{V}\) and \(\mathcal{H}\) are the numbers of visible and hidden units.

The probability that an RBM assigns to a visible vector v is:

$$ p(\mathrm{v}|\theta) = \frac{{\sum}_{\mathrm{h}}exp(-E(\mathrm{v},\mathrm{h}))}{{\sum}_{\mathrm{v}}{\sum}_{\mathrm{h}}exp(-E(\mathrm{v},\mathrm{h}))} $$
(3)

Since there are no hidden-hidden connections, the conditional distribution p(h|v,𝜃) is factorial and is given by:

$$ p(h_{j} = 1|\mathrm{v},\theta) = \sigma\left(b_{j} + \sum\limits^{\mathcal{V}}_{i=1}w_{ij}v_{i}\right) $$
(4)

where \(\sigma(x) = (1+\exp(-x))^{-1}\). Similarly, since there are no visible-visible connections, the conditional distribution \(p(\mathrm{v}|\mathrm{h},\theta)\) is factorial and is given by:

$$ p(v_{i} = 1|\mathrm{h},\theta) = \sigma\left(a_{i} + \sum\limits^{\mathcal{H}}_{j=1}w_{ij}h_{j}\right) $$
(5)

In a Gaussian-Bernoulli RBM, the energy of a joint configuration is:

$$ E(\mathrm{v},\mathrm{h}|\theta) = \sum\limits^{\mathcal{V}}_{i=1}\frac{(v_{i} - a_{i})^{2}}{2}-\sum\limits^{\mathcal{V}}_{i=1}\sum\limits^{\mathcal{H}}_{j=1}w_{ij}v_{i}h_{j}-\sum\limits^{\mathcal{H}}_{j=1}b_{j}h_{j} $$
(6)

The conditional distribution \(p(\mathrm{v}|\mathrm{h},\theta)\) is factorial and is given by:

$$ p(v_{i}|\mathrm{h},\theta) = \mathcal{N}\left(v_{i};a_{i} + \sum\limits^{\mathcal{H}}_{j=1}w_{ij}h_{j},1\right) $$
(7)

where \(\mathcal{N}(\mu, V)\) denotes a Gaussian with mean \(\mu\) and variance \(V\).

Maximum likelihood estimation of an RBM maximizes the log likelihood \(\log p(\mathrm{v}|\theta)\) with respect to the parameters \(\theta\). The weight update equation is given by:

$$ {\Delta}{w}_{ij} = \langle{v_{i}h_{j}}\rangle_{data} - \langle{v_{i}h_{j}}\rangle_{model} $$
(8)

where \(\langle\cdot\rangle_{data}\) is the expectation, over the training set, that \(v_i\) and \(h_j\) are on together, and \(\langle\cdot\rangle_{model}\) is the same expectation under the model distribution. Because computing \(\langle v_i h_j\rangle_{model}\) exactly is expensive, the contrastive divergence (CD) approximation is used to compute the gradient: \(\langle v_i h_j\rangle_{model}\) is approximated with a single step of Gibbs sampling.
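
As a minimal sketch of this procedure, the following NumPy function performs one CD-1 update for a Bernoulli-Bernoulli RBM using Eqs. (4), (5), and (8). The sampling details follow the standard CD recipe, and the default learning rate is taken from the pre-training setting in Section 4.1.3.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.002, rng=None):
    """One contrastive divergence (CD-1) update for a Bernoulli-Bernoulli RBM.

    v0 : (batch, V) binary visible data; W : (V, H) weights;
    a : (V,) visible biases; b : (H,) hidden biases.
    <v h>_model in Eq. (8) is approximated by one step of Gibbs sampling.
    """
    rng = rng or np.random.default_rng(0)
    # Positive phase: p(h = 1 | v0), Eq. (4), then sample binary hidden states.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visibles via Eq. (5), then p(h = 1 | v1).
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Gradient estimate <v h>_data - <v h>_model, averaged over the minibatch.
    batch = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```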

To obtain pre-trained RBMs, we trained all hidden layers using Bernoulli-Bernoulli RBMs. DBNs are hierarchically configured by stacking these pre-trained RBMs. In a DAE, the output is a reconstruction of the input, so the first half of the network's layers perform encoding and the second half perform decoding. \(W_1\) and \(W_2\) are learned automatically, and \({W^{T}_{1}}\) and \({W^{T}_{2}}\) are generated from \(W_1\) and \(W_2\) in Fig. 1.

2.1.2 Backpropagation Algorithm

After pre-training, a backpropagation algorithm is applied to adjust the parameters. Backpropagation modifies the weights of the network to reduce the error between the network output and the teacher signal when a pair of signals (the input signal and the ideal teacher signal, i.e., the cepstral features of clean speech) is given.
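
A minimal NumPy sketch of one such backpropagation step is given below, assuming sigmoid hidden layers and a linear output layer as in Eq. (1). For clarity it processes a single training example; the actual training in Section 4.1.3 uses minibatches of 256.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_step(x, t, weights, biases, lr=0.1):
    """One backpropagation step minimizing 0.5 * ||output - t||^2.

    x : (in_dim,) concatenated reverberant cepstral frames.
    t : (out_dim,) teacher signal, the clean-speech cepstral features.
    Hidden layers are sigmoid, the output layer is linear (cf. Eq. 1).
    """
    # Forward pass, storing each layer's input for the backward pass.
    acts = [x]
    for W, b in zip(weights[:-1], biases[:-1]):
        acts.append(sigmoid(W @ acts[-1] + b))
    y = weights[-1] @ acts[-1] + biases[-1]

    # Backward pass: start from the MSE derivative at the linear output.
    delta = y - t
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l])
        grad_b = delta
        if l > 0:  # propagate through the sigmoid derivative a * (1 - a)
            delta = (weights[l].T @ delta) * acts[l] * (1.0 - acts[l])
        weights[l] -= lr * grad_W
        biases[l] -= lr * grad_b
    return 0.5 * float((y - t) @ (y - t))   # squared error before the update
```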

3 Temporal Structure Normalization

Noise and reverberation distort speech features in multiple respects, e.g., the timbre of speech and also the temporal structure of speech over spans of up to 1 s. The DAE uses its flexible mapping capability to learn a mapping from a window of distorted input features to the clean output features. Owing to limits on model complexity, the input window of the DAE cannot be made too large. In this study, we use a window of 9 frames, which covers roughly 0.1 s of speech. However, the effects of reverberation on speech features may last as long as 1 s. For example, the evaluation data in this study has a T60 of up to 0.7 s. Therefore, the DAE alone is clearly not adequate to deal with reverberation distortion.

In this section, we describe a method called temporal structure normalization (TSN) to complement the DAE for the dereverberation task. TSN was previously proposed to reduce the effects of the transmission channel and additive background noise on speech features for robust speech recognition. It is motivated by the observation that noise and channel distortion modify the temporal structure of speech features, so there is a need to restore the clean temporal structure. In TSN, the temporal structure of speech features is represented by the modulation spectra of the speech signal, i.e., the power spectral density (PSD) functions of the feature trajectories. The temporal structure of clean features is restored by applying a linear filter to each feature trajectory, where the filter weights are designed independently for each feature trajectory and speech utterance so as to normalize the PSD of the trajectory to a reference PSD. The reference PSD of a feature trajectory is estimated as the mean PSD over clean utterances and represents the temporal structure of clean speech.
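
The sketch below illustrates this per-trajectory filtering with the settings used in Section 4.1.3 (Yule-Walker AR order 6, a 33-tap filter). The frequency-sampling FIR design and the omission of group-delay compensation are simplifying assumptions; the exact filter design of [31, 32] may differ.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import firwin2, lfilter

def ar_psd(x, order=6, nfft=64):
    """Yule-Walker AR estimate of the PSD of one feature trajectory."""
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)   # autocorrelation
    a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])    # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                          # prediction error
    w = np.linspace(0, np.pi, nfft)
    H = 1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
    return sigma2 / np.abs(H) ** 2

def tsn_filter(traj, ref_psd, taps=33):
    """Filter one trajectory so its PSD approaches the clean reference PSD."""
    gain = np.sqrt(ref_psd / np.maximum(ar_psd(traj, nfft=len(ref_psd)), 1e-12))
    freqs = np.linspace(0.0, 1.0, len(gain))   # normalized frequency, 1 = Nyquist
    h = firwin2(taps, freqs, gain)             # linear-phase FIR approximation
    return lfilter(h, [1.0], traj)
```

In practice, ref_psd would be the mean AR-based PSD computed over the clean training utterances, and the filter would be applied to each of the 39 MFCC trajectories independently.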

In our preliminary study, we found that the PSD functions of clean and reverberant speech features are very different. This is expected, as reverberation is known to have a blurring effect on the speech spectrum, which introduces a smoothing effect on the spectrum and consequently on the features. Therefore, it is natural to apply TSN to deal with the distortions that remain in the speech features after the DAE.

4 Experiments

4.1 Experimental Setup

4.1.1 Training Dataset

The training dataset provided by the REVERB challenge (Reverberant Voice Enhancement and Recognition Benchmark) [33] was used. This dataset consists of the clean WSJCAM0 [34] training set and a multi-condition (MC) training set. Reverberant speech is generated from the clean WSJCAM0 training data by convolving the clean utterances with measured room impulse responses and adding recorded background noise. The reverberation times of the measured impulse responses range from approximately 0.1 to 0.8 s. This training dataset was used for training the acoustic models. The clean speech was also used to train the reference PSD function of the TSN.

It should be noted that the recording rooms used for the multi-condition training data and test data were different.

4.1.2 Evaluation Test Set

It is important to note that the evaluation dataset consists of real recordings (RealData) and simulated data (SimData), part of which has characteristics similar to RealData in terms of reverberation time and microphone-speaker distance. This setup allows us to perform evaluations in terms of both practicality and robustness under various reverberant conditions. Specifically, the development (Dev.) test set and the final evaluation (Eval.) test set each consist of SimData and RealData: SimData is generated from the WSJCAM0 corpus [34], and RealData is taken from the MC-WSJ-AV corpus [35]. The development dataset was used to determine the optimal parameters for dereverberation and speech recognition. The details of the training and test datasets are shown in Tables 1 and 2.

Table 1 Quantity of data for the Dev. and Eval. sets of SimData and RealData and for the training dataset.
Table 2 Details of the SimData and RealData sets.

4.1.3 Experimental Conditions for LVCSR and Dereverberation

In this study, Mel frequency cepstral coefficients (MFCCs) were used as features for LVCSR. The feature dimension was 39: 12 MFCCs plus power, together with their delta and delta-delta coefficients. The MFCC features were normalized using the mean of the entire multi-condition training set. DAE training was carried out using stochastic mini-batch gradient descent with a mini-batch size of 256 samples. Fifty epochs with a learning rate of 0.002 were used for all layers during pre-training, and 100 epochs with a learning rate of 0.1 were used for all layers during fine-tuning.

The multi-step linear prediction (MSLP) algorithm generates an inverse filter from the prediction coefficients to estimate the inverse system [14]. We estimate the late reverberation components using this inverse filter and apply dereverberation by power spectral subtraction. For MSLP-based dereverberation, the step size and the order of linear prediction were set to 500 and 750, respectively. For the TSN filter, the Yule-Walker method is used to estimate the PSD functions of the feature trajectories. The order of the AR model for PSD estimation is set to 6 to obtain a proper level of detail. A filter length of 33 taps is used for the evaluation unless otherwise stated.
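
The following is a simplified single-channel sketch of this MSLP pipeline, with the paper's step size (500) and prediction order (750) as defaults. The least-squares solution of the delayed prediction problem and the subtraction flooring constant are illustrative assumptions rather than the exact implementation of [14].

```python
import numpy as np
from scipy.signal import stft, istft

def mslp_dereverb(x, delay=500, order=750, beta=1.0, floor=0.05,
                  nfft=512, hop=128, fs=16000):
    """Single-channel MSLP dereverberation sketch (cf. [4, 14]).

    A linear predictor estimates x[n] from samples at least `delay` samples
    in the past; the prediction approximates the late reverberation, which
    is then suppressed by power spectral subtraction with the observed phase.
    """
    t0 = delay + order - 1
    y = x[t0:]                                  # prediction targets
    # Data matrix of delayed samples x[t - delay - k], k = 0 .. order-1.
    A = np.stack([x[t0 - delay - k: len(x) - delay - k] for k in range(order)],
                 axis=1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    late = np.zeros_like(x)
    late[t0:] = A @ coef                        # estimated late reverberation

    _, _, X = stft(x, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    _, _, L = stft(late, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    power = np.maximum(np.abs(X) ** 2 - beta * np.abs(L) ** 2,
                       floor * np.abs(X) ** 2)  # power subtraction with flooring
    Y = np.sqrt(power) * np.exp(1j * np.angle(X))
    _, x_hat = istft(Y, fs=fs, nperseg=nfft, noverlap=nfft - hop)
    return x_hat
```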

In this study, we used the speech recognition system provided by the REVERB challenge task [33], which is based on the hidden Markov model toolkit (HTK) [36]. As an acoustic model, it employs tied-state HMMs with eight Gaussian components per state, trained according to the maximum-likelihood criterion. We use a multi-condition training set for acoustic model training. This training set is generated from the clean WSJCAM0 training data by convolving the clean utterances with measured room impulse responses and adding recorded background noise. The reverberation times of the measured impulse responses range roughly from 0.1 to 0.8 s. Note that the recording rooms used for SimData, RealData, and the multi-condition training data are all different. CMLLR [37] converts the mean and variance of the Gaussian distribution in each state of the hidden Markov models (HMMs) using a regression matrix, in order to reduce the mismatch between the adaptation data and the model. The method obtains a transformation matrix for modifying the model parameters that maximizes the likelihood of the adaptation data. In this paper, we applied CMLLR for unsupervised model adaptation, i.e., environment adaptation.
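
For illustration, the sketch below applies a given CMLLR transform to the Gaussian parameters; estimating the regression matrix itself (by maximizing the adaptation-data likelihood, as HTK does) is beyond this sketch, and the model-space convention shown is one of several equivalent formulations.

```python
import numpy as np

def apply_cmllr(means, covs, A, b):
    """Apply a given CMLLR transform to the Gaussians of an HMM system.

    means : (num_gauss, D); covs : (num_gauss, D, D); A : (D, D); b : (D,).
    The same matrix A transforms means and covariances, which is what makes
    the transform "constrained":
        mu'    = A @ mu + b
        Sigma' = A @ Sigma @ A.T
    Equivalently, the inverse transform can be applied to the features.
    """
    new_means = means @ A.T + b
    new_covs = np.einsum("ij,njk,lk->nil", A, covs, A)
    return new_means, new_covs
```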

4.2 Experimental Results

4.2.1 Parameters Tuning for DAE

For DAE-based dereverberation, the feature vectors of the current frame and the previous eight frames of reverberant speech were used as input. The 39 MFCCs of the current frame of clean speech were used as the teacher signal for the output; i.e., the input dimension was 39 × 9 = 351. The optimal number of hidden layers and the number of units in each hidden layer were determined experimentally. Table 3 shows the speech recognition results for different numbers of hidden layers (with 1024 units in each hidden layer). Based on these results, the number of hidden layers of the DAE is set to 3 in the remainder of this paper. We confirmed that speech recognition performance degrades when the number of hidden layers is increased further. This can be explained by the complexity of DNNs: too many layers increase the number of parameters, and the network overfits when the training data is small. In this paper, the training data amounts to less than 20 hours, so a moderate number of layers is sufficient for this task. Table 4 shows the results for different numbers of units in each hidden layer. Based on these results, we set the number of units in each hidden layer to 1024.

Table 3 Word error rate of DAE-based dereverberation with different numbers of hidden layers (%).
Table 4 Word error rate of DAE-based dereverberation with different numbers of units in each hidden layer (%).

4.2.2 Comparison of Spectral-Domain DAE and Cepstral-Domain DAE

Ishii et al. applied a DAE to spectral-domain dereverberation [27] on the JNAS database [28]. However, the suppressed spectral-domain feature must be converted to a cepstral-domain feature, and this improvement is not sufficient. In our study, we applied a DAE to cepstral-domain dereverberation on the REVERB challenge task [33]. In this section, we compare spectral-domain and cepstral-domain DAE-based dereverberation on this task. The results are compared in Table 5. They indicate that the cepstral-domain DAE is better than the spectral-domain DAE for reverberant speech recognition on the REVERB challenge task.

Table 5 Word error rate by spectral-domain DAE and cepstral-domain DAE (%).

A possible reason for the better results of the cepstral-domain DAE is that the mean square error (MSE) of the cepstral features is used as the cost function, which is more relevant to the speech recognition task than the MSE of the spectrum. Through the Mel filterbank, cepstral features take into account the fact that the human auditory system has higher resolution at low frequencies than at high frequencies. Hence, the MSE on cepstral features automatically emphasizes low frequencies more than high frequencies. In contrast, if the MSE on the spectrum is used, all frequencies are treated as equally important.

4.2.3 Results by Combining DAE and TSN

Tables 6 and 7 show the speech recognition results on the Dev. and Eval. datasets. We compared three dereverberation methods (MSLP, DAE, and TSN) and their combinations. CMN was applied in all methods. When DAE-based cepstral-domain dereverberation was compared with CMN-based dereverberation, single-channel MSLP-based dereverberation, and TSN-filter-based dereverberation, a remarkable improvement was achieved. The DAE worked especially well under strong reverberation, i.e., with the far-field microphone in “Room 2” and “Room 3” of SimData. The performance with CMLLR-based environment adaptation was better than that without CMLLR. However, the results of the DAE were worse than the baseline in “Room 1”. This trend is also seen in Table 5. We consider that the late reverberation in “Room 1” is relatively small, and the early reverberation can be suppressed effectively by CMN. Hence, the merit of suppressing late reverberation under lightly reverberant conditions is not very large. DAE+TSN introduces some distortion during dereverberation owing to the mismatch between the “Room 1” condition and the training condition, so the performance of DAE+TSN is worse than that of CMN when RT60 is very small. The results indicate that the proposed methods work better under heavy reverberation than under light reverberation.

Table 6 Word error rate of each method for Dev. dataset (%).
Table 7 Word error rate of each method for Eval. dataset (%).

For the Dev. dataset, the average word error rate (WER) in SimData was improved from 25.16 % with CMN to 22.05 % with the cepstral-domain DAE with CMLLR-based environment adaptation. In RealData, the WER was improved from 47.48 % to 42.51 % with CMLLR-based environment adaptation. In SimData, by combining the cepstral-domain DAE and the TSN filter with environment adaptation, the WER was reduced from the 25.16 % baseline to 21.19 %, i.e., a relative error reduction of 15.8 %. In RealData, the WER was reduced from 47.48 % to 41.26 %, i.e., a relative error reduction of 13.1 %. The TSN filter did not work well on its own because of the reverberation. However, when combined with the cepstral-domain DAE, the improvement from the TSN filter increased. We consider that the noise reduction capability of the TSN filter was improved by the dereverberation of the cepstral-domain DAE.

For the Eval. dataset, a trend similar to that of the Dev. dataset was obtained. The WER in SimData was improved from 25.82 % with CMN to 22.92 % with the cepstral-domain DAE with CMLLR-based environment adaptation. In RealData, the WER was improved from 48.56 % to 45.17 % with CMLLR-based environment adaptation. The proposed combination of DAE and TSN achieved the best speech recognition performance; that is, the combination of DAE and TSN outperformed the other dereverberation methods on both the Dev. and Eval. datasets.

We found that the combinations with MSLP did not produce good results. MSLP works well for dereverberation on its own, but not in combination with the other dereverberation methods.

5 Conclusions

In this paper, we proposed a robust distant-talking speech recognition method that combines a cepstral-domain DAE and a TSN filter. The proposed method was evaluated in simulated and real distant-talking environments. DAE-based cepstral-domain dereverberation achieved a remarkable improvement over CMN-based dereverberation, MSLP-based dereverberation, and TSN-filter-based feature normalization in both environments. Furthermore, speech recognition performance was improved by combining the cepstral-domain DAE and the TSN filter. In SimData of the Dev. dataset, combining the cepstral-domain DAE and the TSN filter with environment adaptation reduced the WER from the 25.16 % baseline to 21.19 %, i.e., a relative error reduction of 15.8 %. In RealData of the Dev. dataset, the WER was reduced from 47.48 % to 41.26 %, i.e., a relative error reduction of 13.1 %. A similar trend was obtained for the Eval. dataset. In SimData of the Eval. dataset, the WER was reduced from 25.82 % to 22.54 %, i.e., a relative error reduction of 12.7 %. In RealData of the Eval. dataset, the WER was reduced from 48.56 % to 44.03 %, i.e., a relative error reduction of 9.33 %.