
1 Introduction

In recent years, Deep Neural Networks (DNNs) have proven their efficiency in solving a wide variety of classification and regression tasks [14]. In particular, DNNs have been used as acoustic models for Automatic Speech Recognition (ASR), significantly outperforming the previous state-of-the-art methods based on Gaussian Mixture Models (GMMs) [9]. Improvements brought by neural networks have progressively reduced the Word Error Rate (WER) to a level where some studies argue that ASR can now achieve near human-level performance [31]. Despite these recent improvements, dealing with noisy and reverberant conditions remains a major challenge for ASR [29]. Several techniques have been developed to address this problem, including, for example, feature enhancement, where features are cleaned at the front-end of the ASR system.

In this work, we use Multi-Task Learning (MTL) to improve ASR performance in noisy and reverberant acoustic conditions. MTL consists of training a single system, specifically a DNN, to solve multiple tasks that are different but related, as opposed to the traditional Single-Task Learning (STL) architecture where the system is trained on only one task [2]. MTL has previously been applied in a variety of situations where ASR is the main task and different auxiliary tasks are added. However, only a few MTL auxiliary tasks have been found to be helpful for the main ASR task when speech is corrupted by noise and reverberation. Generating the clean-speech features as an auxiliary task is one of the most effective such approaches [5, 15, 17, 23]. We explore this idea further here by generating the noise features alone as an auxiliary task, as well as generating the noise and clean-speech features separately as two additional auxiliary tasks. The core idea is to increase the acoustic model’s awareness of the noisy environment and of how it corrupts speech. To evaluate these auxiliary tasks, we use the simulated part of the CHiME4 dataset [29]. While the CHiME4 dataset contains both real and simulated data, only the simulated part can be used here, since we need to extract clean-speech and noise features to train the MTL system.

This paper is organized as follows. First, we present the state-of-the-art in MTL for ASR in Sect. 2. We then describe the MTL mechanism in depth in Sect. 3. Details of the experimental setup used to evaluate the noise estimation auxiliary task are presented in Sect. 4, with the results and analysis presented in Sect. 5. Finally, the conclusion and ideas for future work are discussed in Sect. 6.

2 Related Work

Many speech and language processing problems, including speech synthesis [10, 30], speaker verification [4], and spoken language understanding [16], have benefited from MTL training. In the case of ASR, whether applying an STL or MTL architecture, the main task consists of training the acoustic model to estimate the phone-state posterior probabilities. These probabilities are then fed as input to a Hidden Markov Model (HMM) that deals with the temporal structure of speech. The use of MTL for ASR has already been tested with a variety of auxiliary tasks. Early studies used MTL with gender classification as an auxiliary task [17, 26], the goal being to increase the acoustic model’s awareness of the impact of the speaker gender on the speech. As explained previously, the goal of the main task is to predict phone-state probabilities; some studies investigate a broader level of classes as the auxiliary task, directly predicting the phone probability instead of the probability of the HMM state [1, 25]. A related auxiliary task consists of classifying even broader phonetic classes (e.g. fricative, plosive, nasal, ...) but has shown poor performance [26]. Another approach consists of classifying graphemes as the auxiliary task, where graphemes are the symbolic representation of speech (e.g. any alphabet), as opposed to the phonemes that directly describe the sound [3, 26]. In order to increase the generalization ability of the network, recent studies have also focused on increasing its speaker-awareness. This is done by recognizing the speaker or by estimating the associated i-vector [6] of each speaker as an auxiliary task [19, 20, 27, 28], instead of concatenating the i-vector to the input features. Adapting the acoustic model to a particular speaker can also benefit from MTL [11]. Additional information about these methods can be found in [18].

Most of the previously cited methods do not particularly focus on ASR in noisy and reverberant conditions; nonetheless, robust ASR is an active field of interest as well. Some studies have focused solely on improving ASR in reverberant acoustic environments by generating de-reverberated speech as an auxiliary task, using reverberated speech as input during training [8, 22]. Another approach that tackles the noise problem in ASR with MTL consists of recognizing the type of noise corrupting the speech, where a single noise type among several possible types is added to each sentence of the clean speech [12, 24]. This approach does not seem to have a real positive impact on the main ASR task, however. The MTL task that shows the highest improvement consists of generating the clean-speech features as an auxiliary task [15, 17, 23]. Of course, in order to generate the targets needed to train this auxiliary task, access to the clean speech is required to extract the features, and this can only be done with simulated noisy and reverberant data. It is also possible to use an MTL system as a feature extractor for robust ASR, where a bottleneck layer is added and its activations are used as input to a traditional STL/ASR system [13].

Though previous studies have proposed recognizing the type of noise, or generating the clean-speech features, to the best of our knowledge, there have been no attempts to estimate the noise features alone as an auxiliary task, or to estimate both the noise and speech features separately in an MTL setup.

3 Multi-Task Learning

Initially introduced in 1997, multi-task learning consists of training a single system (here a neural network) to solve multiple tasks that are different but still related [2]. In the MTL nomenclature, the main task is the principal task, i.e. the task that would be used alone in an STL architecture, whereas at least one auxiliary task is added to help improve the network’s convergence to the benefit of the main task. An MTL architecture with one main task and N auxiliary tasks is shown in Fig. 1 as an example.

Fig. 1. A Multi-Task Learning system with one main task and N auxiliary tasks.

All MTL systems share two essential characteristics: (a) the same input features are used for training both the main and the auxiliary tasks; (b) the parameters (weights and biases) of all neurons, and more generally the internal structure of the network, are shared among the main and auxiliary tasks, with the exception of the output layer. Furthermore, these parameters are updated by backpropagating a weighted combination of the errors associated with each task:

$$\begin{aligned} \epsilon _{MTL}=\epsilon _{Main} + \sum _{n=1}^N\lambda _n*\epsilon _{Auxiliary_n}, \end{aligned}$$
(1)

where \(\epsilon _{MTL}\) is the sum of all the task errors to be minimized, \(\epsilon _{Main}\) and \(\epsilon _{Auxiliary_n}\) are the errors obtained from the main and auxiliary tasks respectively, \(\lambda _n\) is a nonnegative weight associated with each of the auxiliary tasks, and N is the total number of auxiliary tasks added to the main task. The value \(\lambda _n\) controls the influence of the auxiliary task with respect to the main task. If the \(n^{th}\) auxiliary task has a \(\lambda _n\) close to 1, the main task and the auxiliary task will contribute equally to the error estimation. On the other hand, if \(\lambda _n\) is close to 0, the system effectively reduces to single-task learning due to the very small (or nonexistent) influence of the auxiliary task. The auxiliary tasks are frequently removed during testing, keeping only the main task. Selecting an auxiliary task that is relevant to the main task is the crucial point for obtaining good convergence of the main task. Sharing the parameters of the system among multiple tasks may lead to better results than computing and training each task independently [2].
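As an illustration, a minimal sketch of Eq. (1) in PyTorch is given below, assuming a cross-entropy loss for the main task and quadratic (mean squared error) losses for the auxiliary tasks, as used later in Sect. 4.3; the function name and tensor shapes are illustrative.

```python
import torch.nn.functional as F

def mtl_loss(main_logits, main_targets, aux_preds, aux_targets, lambdas):
    """Weighted multi-task error of Eq. (1).

    main_logits:  (batch, num_states) unnormalized phone-state scores
    main_targets: (batch,) phone-state labels
    aux_preds / aux_targets: lists of (batch, feat_dim) tensors, one per auxiliary task
    lambdas: list of nonnegative weights, one per auxiliary task
    """
    loss = F.cross_entropy(main_logits, main_targets)          # epsilon_Main
    for lam, pred, target in zip(lambdas, aux_preds, aux_targets):
        loss = loss + lam * F.mse_loss(pred, target)           # lambda_n * epsilon_Auxiliary_n
    return loss
```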

4 Experimental Setup

In this section, we will present the tools and methods used to evaluate the new auxiliary task that we propose for robust ASR.

4.1 Database

In order to evaluate noise estimation as an auxiliary task for robust ASR, we use the CHiME4 database [29]. This database was released in 2016 for a speech recognition and separation challenge in reverberant and noisy environments. It is composed of 1-channel, 2-channel, and 6-channel microphone array recordings. Four different noisy environments (café, street junction, public transport, and pedestrian area) were used to record real acoustic mixtures through a tablet device equipped with a 6-microphone array. The WSJ0 database [7] is used to create the simulated data: WSJ0 provides clean-speech recordings to which noise recorded in the four noisy environments described above is added. For the noise estimation auxiliary task, we use features extracted from these noise-only recordings as targets for training. As these targets cannot be obtained for real data, we only use the simulated data in this study.

All datasets (training, development, and test sets) consist of 16-bit WAV files sampled at 16 kHz. The training set consists of 83 speakers uttering 7138 simulated sentences, which is the equivalent of \(\sim \)15 h of training data. The development set consists of 1640 utterances (\(\sim \)2.8 h) uttered by 4 speakers. Finally, 4 additional speakers compose the test set, with 1320 utterances corresponding to approximately 4.5 h of recordings.

In this work, we investigate noise and clean-speech estimation as auxiliary tasks; we therefore use only the noise recorded from a single channel (channel 5) during training. The test and development set noises are randomly selected from all channels, making the task harder but also challenging the generalization ability of the setup.

4.2 Features

The features used as input for training the MTL system as well as targets for the noise and/or clean-speech estimation tasks are obtained through the following traditional ASR pipeline:

1. Using the raw audio WAV files, 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) features are extracted and normalized through Cepstral Mean-Variance Normalization (CMVN).
2. For each frame, the adjacent ±3 frames are spliced.
3. These 91-dimensional feature vectors are reduced through a Linear Discriminant Analysis (LDA) transformation to a 40-dimensional feature space.
4. The final step consists of projecting the features through a feature-space speaker adaptation transformation known as feature-space Maximum Likelihood Linear Regression (fMLLR).

Finally, the 40-dimensional features computed through this pipeline are spliced one more time with the surrounding ±5 frames to form the input features fed to the acoustic model, thus giving additional temporal context to the network during training. For the auxiliary tasks’ targets, the same pipeline is followed to generate the clean-speech and noise features, but without the final ±5 splicing. Alignments from the clean speech are reused for the transformations applied to the noisy features.
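The first steps of this pipeline can be sketched in Python as follows (a minimal sketch, assuming librosa for MFCC extraction and 25 ms / 10 ms framing; the LDA and fMLLR transforms, which require frame-level alignments and speaker statistics, are left to the standard Kaldi recipes).

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Step 1: 13-dimensional MFCCs (25 ms window, 10 ms hop assumed)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
    return mfcc.T  # (frames, 13)

def cmvn(feats):
    """Step 1 (cont.): per-utterance cepstral mean-variance normalization."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def splice(feats, context=3):
    """Step 2: concatenate each frame with its +/- `context` neighbours,
    giving 13 * 7 = 91-dimensional vectors for context = 3."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)]
                      for i in range(2 * context + 1)])

# Example usage on a (hypothetical) simulated utterance:
# feats_91 = splice(cmvn(extract_mfcc("utterance.CH5.wav")))
```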

4.3 Training the Acoustic Model

Training and testing of these MTL auxiliary tasks were done using the nnet3 version of the Kaldi toolkit [21].

We use a classic feed-forward deep neural network acoustic model to evaluate the performance of this new auxiliary task. The DNN is composed of 4 hidden layers, each consisting of 1024 neurons activated through Rectified Linear Units (ReLU). The main task used for STL and MTL computes 1972 phone-state posterior probabilities after a softmax output layer. The DNN is trained over 14 epochs using the cross-entropy loss function for the main task and a quadratic loss function for the auxiliary tasks (as they are regression tasks), with an initial learning rate of 0.0015 that is progressively reduced to 0.00015. Stochastic Gradient Descent (SGD) is used to update the parameters of the network through backpropagation of the error derivatives. The size of the mini-batches used to process the input features is set to 512. These parameters were selected through empirical observations.
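For reference, the following is a minimal PyTorch sketch of such an acoustic model, with a shared 4 × 1024 ReLU trunk, a main head over the 1972 phone states, and one 40-dimensional regression head per auxiliary task; the 440-dimensional input corresponds to the 40-dimensional features spliced with ±5 frames, and all names are illustrative.

```python
import torch
import torch.nn as nn

class MTLAcousticModel(nn.Module):
    """Feed-forward MTL acoustic model: shared hidden layers, one
    classification head (phone-state posteriors) and regression heads
    for the auxiliary noise / clean-speech feature targets."""

    def __init__(self, input_dim=440, hidden_dim=1024, num_layers=4,
                 num_states=1972, aux_dims=(40,)):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.shared = nn.Sequential(*layers)
        self.main_head = nn.Linear(hidden_dim, num_states)  # softmax applied in the loss
        self.aux_heads = nn.ModuleList(nn.Linear(hidden_dim, d) for d in aux_dims)

    def forward(self, x):
        h = self.shared(x)
        return self.main_head(h), [head(h) for head in self.aux_heads]

# At test time only the main head is kept for decoding, e.g.:
# model = MTLAcousticModel(aux_dims=(40, 40))   # noise + clean-speech estimation
# posteriors, _ = model(torch.randn(512, 440))  # one mini-batch of 512 frames
```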

The same experiments were also conducted using other deep learning architectures, including Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) cells and Time-Delay Neural Networks (TDNN). However, the feed-forward DNN showed similar or better results than these more complex architectures on the simulated data of CHiME4. Moreover, the computational time for the RNN-LSTM network was much higher than for the feed-forward DNN. While the complexity and temporality of the main and auxiliary tasks did not require a more complex acoustic model here, we note that for some auxiliary tasks, having a more complex network can be crucial for the convergence of the auxiliary task, as is the case for speaker classification, for instance [19].

During decoding, the most likely transcriptions are obtained from the phone-state probabilities estimated by the feed-forward network, which are used by the HMM system in combination with a language model. The language model is the 3-gram Kneser-Ney (KN) language model trained on the WSJ 5K standard corpus.

4.4 Baseline

The baseline of our system is obtained by training the setup presented in the previous section in a single-task learning manner. We compute the word error rate for both the development and test sets over all four noisy environments of the simulated CHiME4 data. The results are shown in Table 1. A significant mismatch between the recording environments of the development and test sets can be noticed, explaining the higher WER on the test set. For the rest of this paper, we report only the Average results, as the trends and evolution of the WER are similar over all four noisy environments.

Table 1. Word error rate in % on the development and test sets of the CHiME4 dataset, used as baseline. Average is the mean WER over all 4 environmental noises and Overall is the mean WER over the development and test sets.

5 Results

In this section, we investigate the improvement brought by the new MTL auxiliary task, namely regenerating the noise contained in the corrupted sentence, in comparison to STL. We also combine this auxiliary task with the more traditional clean-speech generation auxiliary task.

5.1 Noise Features Estimation

In order to evaluate the impact of estimating the noise features as an auxiliary task in our MTL setup, we vary the value of \(\lambda _{noise}\), thus varying the influence of this auxiliary task with respect to the main ASR task. The results obtained for values of \(\lambda _{noise}\) between 0 (STL) and 0.5 are presented in Table 2. There is a small but consistent improvement of the WER for \(\lambda _{noise}=0.05\), over both the development and test sets. For smaller values (\(\lambda _{noise}=0.01\)), the improvement is nearly insignificant, as the value of \(\lambda _{noise}\) brings the training too close to STL (\(\lambda _{noise}=0\)), while for values of \(\lambda _{noise}\) that are too high (\(\lambda _{noise}\ge 0.15\)), the WER is worse than for STL, as the influence of the auxiliary task overshadows the main ASR task.

Table 2. Average word error rate (in %) of the Multi-Task Learning architecture when the auxiliary task is noise feature estimation, where \(\lambda _{noise}\) is the weight attributed to the noise estimation auxiliary task during training. The baseline, which is the Single-Task Learning architecture, is obtained for \(\lambda _{noise}=0\). The Overall values are computed over both datasets.

In order to further highlight these observations, we present the relative WER improvement brought by MTL in comparison to STL in Fig. 2. An improvement is obtained for values of \(\lambda _{noise}\) between 0.01 and 0.1. The highest improvement is obtained for \(\lambda _{noise}=0.05\), with a relative improvement over STL of up to 1.9% on the development set, for instance. Larger values of \(\lambda _{noise}\) degrade performance on the main speech recognition task.
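For clarity, the relative improvement reported here can be read as the relative WER reduction with respect to the STL baseline (assuming the standard definition):

$$\begin{aligned} \Delta _{rel} = \frac{WER_{STL} - WER_{MTL}}{WER_{STL}} \times 100\%. \end{aligned}$$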

Fig. 2. Evaluation of the relative improvement of the word error rate brought by multi-task learning in comparison to single-task learning, with \(\lambda _{noise}\) the weight attributed to the noise estimation auxiliary task. The Overall values are computed over both the development and test datasets.

Fig. 3. Evolution of the task errors over the training epochs. The Main Task curve is the speech recognition error computed through the cross-entropy loss function, whereas the Auxiliary Task curve corresponds to the noise estimation error obtained through the quadratic loss function.

As discussed in Sect. 4.3, training is done over 14 epochs. In order to show that the ASR improvement is not merely the result of introducing a small amount of noise into the system, but rather that both tasks are converging, we present the errors over these 14 epochs in Fig. 3, highlighting the error reduction obtained on both tasks’ loss functions over time.

Despite the persistence of the relative improvement for small values of \(\lambda _{noise}\), it can be noted that this improvement is quite small. This can be explained by several considerations. First, this auxiliary task is less directly related to the main task than, for instance, clean-speech generation, meaning that the convergence of the auxiliary task may not significantly help the main task. Second, the auxiliary task is in fact quite hard here, as the Signal-to-Noise Ratio (SNR) is always in favor of the clean speech and not the noise, making it difficult to estimate the noise alone. Finally, the features extracted following the pipeline presented in Sect. 4.2, as well as the fMLLR transformation used in this context, are most likely not optimal for representing noise.

Table 3. Average word error rate in % on the development and test sets of the CHiME4 dataset, when different auxiliary tasks are applied. Overall is the mean WER over the development and test sets.
Fig. 4. Evaluation of the relative improvement of the word error rate brought by multi-task learning in comparison to single-task learning, with different auxiliary tasks. The Overall values are computed over both the development and test datasets.

Despite these considerations, using noise estimation as an auxiliary task appears to help the main ASR task when \(\lambda _{noise}\) is properly selected. Additionally, an MTL setup is easy to implement and does not require extensive additional computational time in comparison to STL, as the same network is trained for both tasks. Finally, the targets for this particular auxiliary task, noise estimation, are easy to obtain, since we have access to the noise when generating the simulated data.

5.2 Combining Noise and Clean-Speech Features Estimation

Instead of separately generating the clean-speech or noise features as the auxiliary task, we investigate here the combination of both tasks in the MTL framework. To do so, we first repeat the same experiment as in Sect. 5.1, but generating only the clean-speech features as the auxiliary task. After varying the value of \(\lambda _{speech}\), we found that the best WER is obtained for \(\lambda _{speech}=0.15\). The obtained results are shown in Table 3 and, as in the previous section, the relative improvement brought by the different auxiliary tasks (and their combination) in comparison to STL is presented in Fig. 4.

The results show that, as expected, a better WER is obtained when using clean-speech estimation as the auxiliary task rather than noise estimation, with an overall relative improvement of 2.9% (versus 1.5% in the previous experiment). Interestingly, however, using both the clean-speech and noise estimation auxiliary tasks leads to even better performance, with a 3.9% overall relative improvement and more than 1% absolute improvement on the test set. This result highlights the fact that the network learns different and valuable information from each auxiliary task in order to improve the main task. Once again, implementing these auxiliary tasks is simple and does not require significant additional computational time in comparison to classic single-task learning architectures.

6 Conclusion

In this paper, we have studied multi-task learning acoustic modeling for robust speech recognition. While most previous studies focus on clean-speech generation as the auxiliary task, we propose and investigate here a different but related auxiliary task: noise estimation. This auxiliary task consists of generating the features extracted from the audio file containing only the noise that is later added to the clean speech to create the simulated noisy data. After showing that an improvement can be obtained with this auxiliary task, we combined it with the clean-speech estimation auxiliary task, resulting in one main task and two auxiliary tasks. A relative WER improvement of almost 4% is obtained thanks to the association of these two auxiliary tasks, in comparison to the classic single-task learning architecture. Training and testing here were done only on the simulated data of the CHiME4 dataset, as the clean-speech and noise audio are required separately to train the auxiliary tasks, making it impossible to train with real data. In future work, we would like to find a way to integrate real data into the training and re-evaluate the impact of these two auxiliary tasks. We would also like to use other types of features that may be more suitable for capturing noise variations, as the features currently used are designed to best capture the diversity of speech.