
1 Introduction

Recently, multilayer neural networks (deep neural networks, DNNs) have found widespread use for acoustic modeling in speech recognition [1]. In many cases DNNs demonstrate better generalization capabilities than conventional Gaussian mixture models (GMMs). However, when the training and testing (usage) conditions of the DNN mismatch, the recognition quality may degrade significantly. To compensate for this mismatch, various techniques are used to improve the quality of the speech and decrease the influence of noise.

This research is concerned with methods to improve DNN-based acoustic models using bottleneck features [2] and speech data augmentation [3].

The initial training dataset includes clean headset recordings, whereas the trained acoustic model is intended to be used for recognition in noisy open space or in a meeting room.

The general problem that arises when the training and testing corpora mismatch is to construct a recognition system that is robust to acoustic environment variability.

To solve this problem, techniques are employed that compensate for the mismatch between the testing and training corpora by means of:

  1. special features (application of noise-robust features such as PNCC [4] and RASTA [5], feature normalisation [6], and feature compensation, i.e. correction of features in the frequency domain: spectral subtraction [7], Wiener filtering [8]) or transformation of acoustic model parameters (standard statistical techniques such as maximum a posteriori (MAP) estimation [9], SAT+CMLLR [10]);

  2. a priori knowledge about the environment (utilization of stereo data [11] to train a mapping from noisy to clean speech, where the benefit depends on how close the training corpus is to the testing environment; multi-condition training; construction of noise dictionaries (cluster adaptive training, CAT [12]); combination of pre-trained acoustic models using non-negative matrix factorisation (NMF) [13]);

  3. application of explicit and implicit noise models (vector Taylor series [14]);

  4. addition of various kinds of noise with different SNRs, which may occur in the testing corpus (data augmentation) [15,16,17].

Many of the above approaches use a priori information to estimate parameters for specific conditions and fail when no environment-specific data are available. The approach based on data augmentation provides a considerable advantage because it works well even when no target-domain data are available.

There are several ways to augment the training data: semi-supervised training [15], multi-lingual training [18], transformation of acoustic data [19], speech synthesis [20, 21].

The semi-supervised training approach assumes the use of the text produced by an automatic speech recognition system to train acoustic models. The advantage of this approach is that we are able to use, say, radio or TV broadcasts featuring various kinds of speakers and noises; the obvious drawback is the presence of recognition errors in the texts.

The important advantage of synthesized datasets lies in the ability to approximate the required recognition conditions and to obtain the necessary amount of training data. In addition, this method makes it possible to obtain a precise alignment of the noised data using the known text transcriptions and the corresponding clean recordings.

The methods based on transformation of acoustic features include variation of the vocal tract length at the stage of extracting the standard features [17] and stochastic feature mapping (SFM) [20].

The family of techniques based on transformations of the recordings includes methods such as audio signal speed alteration [19], noise addition, and the introduction of artificial reverberation into the recordings [22].

To transform the data we apply artificial reverberation using binaural room impulse responses (BRIR) [21] and add several kinds of noise (street noise, office or home noise, babble) at various signal-to-noise ratios (SNR). The initial training dataset includes headset recordings. The problem consists in training an acoustic model which can be applied both to headset and to distant-microphone recordings under various noise and reverberation conditions. We demonstrate that a bottleneck feature extractor trained on the augmented training datasets is more robust to noise and increases the recognition accuracy.

In the second section, we describe the acoustic features and the DNN structure used in training. The third section describes the training and test datasets, as well as the datasets resulting from data augmentation. The fourth section presents the results and discussion of the study, and the conclusions follow in the fifth section.

2 Bottleneck Features and DNN Structure

Bottleneck features extracted from a multilayer neural network have found wide use in automatic speech recognition systems. Such features have been successfully used in [23, 24] to solve the recognition problem under mismatch between the testing and training corpora. All acoustic models in this paper are trained on this kind of features. The bottleneck features are generated by a DNN which has a hidden layer of smaller dimension than the other layers.

In this paper we consider two bottleneck feature extractors:

  1. an extractor trained on the initial training dataset, which includes clean headset voice recordings only;

  2. an extractor trained on the same corpus after applying data augmentation.

Figure 1 shows the general structure of the deep neural networks used for training. The first DNN is trained on plain MFCC features [25] (the left and right context length is 15) to produce the bottleneck features. The network contains four fully connected hidden layers of dimension 2048 and a bottleneck layer of dimension 80.

The second DNN is trained on bottleneck features with a context of length 5 and left/right spacing of 3. The network contains four fully connected layers of dimension 2048 and a final classification layer with 2857 outputs.
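To make the two-stage structure concrete, the following PyTorch sketch mirrors the dimensions stated above; the MFCC dimension, the sigmoid activations, and the number of training targets of the first network are not specified in the text and are assumptions for illustration only.

# Minimal sketch of the two DNNs; dimensions from the text, other choices assumed.
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    """First DNN: spliced MFCCs in, 80-dim bottleneck layer before the output."""
    def __init__(self, mfcc_dim=13, context=15, hidden=2048,
                 bottleneck=80, targets=2857):
        super().__init__()
        in_dim = mfcc_dim * (2 * context + 1)          # +/-15 frames of context
        layers = []
        for _ in range(4):                             # four 2048-dim hidden layers
            layers += [nn.Linear(in_dim, hidden), nn.Sigmoid()]
            in_dim = hidden
        self.hidden = nn.Sequential(*layers)
        self.bottleneck = nn.Linear(hidden, bottleneck)
        self.out = nn.Linear(bottleneck, targets)

    def forward(self, x):
        return self.out(self.bottleneck(self.hidden(x)))

    def extract(self, x):
        # Bottleneck activations used as features for the second DNN.
        return self.bottleneck(self.hidden(x))

class AcousticModel(nn.Module):
    """Second DNN: spliced bottleneck features in, 2857-way classification layer out."""
    def __init__(self, bottleneck=80, frames=5, hidden=2048, targets=2857):
        super().__init__()
        in_dim = bottleneck * frames                   # 5 bottleneck frames spaced by 3
        layers = []
        for _ in range(4):
            layers += [nn.Linear(in_dim, hidden), nn.Sigmoid()]
            in_dim = hidden
        self.net = nn.Sequential(*layers, nn.Linear(hidden, targets))

    def forward(self, x):
        return self.net(x)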

Fig. 1. The general DNN structure used to train the acoustic model

3 Speech Datasets for Training and Testing

In order to decrease the mismatch between the training and testing conditions, we apply various transformations to the initial sound files while keeping the state alignment unaltered. The difficulty is to construct a corpus which matches the reverberation and noise conditions, which are unknown at training time. Since this objective is unattainable, we augment the training dataset with a range of variations to make our acoustic model more robust.

The training and test datasets are compiled from the recordings made by the Speech Technology Center. The sets contain phonetically rich sentences recorded with the use of a headset and distant microphones.

We consider the following ways to augment the training dataset:

  1. application of noises corresponding to certain acoustic conditions (babble, office, home, car, street) with SNR from a fixed interval;

  2. artificial reverberation of the speech recordings.

For convenience we label the training datasets with abbreviations that reflect the properties of the data contained in them. The training set C (clean data) contains only clean headset recordings of more than a thousand different speakers. The set NB (noise, babble) includes a subset of recordings from C mixed with office, street, and car noises and background speech (babble). The background recordings were scaled before mixing them with the clean data to produce the desired signal-to-noise ratio, as sketched below.
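As an illustration of this scaling step, the following minimal sketch mixes a clean waveform with a noise recording at a given target SNR; it assumes 1-D NumPy arrays at the same sample rate, and the function name is ours, not part of any toolkit.

import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(speech_power / (scale**2 * noise_power)) == target_snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    return speech + scale * noise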

For artificial reverberation, we use BRIRs, which contain information about the size of the room in which the recording is made, the distance to the sound source, and its direction. A BRIR includes three basic components:

$$ h(t) = h_{\mathrm {dp}}(t) + h_{\mathrm {ee}}(t) + h_{\mathrm {rev}}(t), $$

where

\(h_{\mathrm {dp}}(t)\) reproduces the sound passing directly from the source to the microphone; it depends on the azimuth and height of the source and the microphone, and its energy decreases as \(1/r^2\), where r is the distance between the source and the microphone;

\(h_{\mathrm {ee}}(t)\) is the early echo related to reflections; it carries information about the geometry of the room, its volume, and the number and positions of the walls;

\(h_{\mathrm {rev}}(t)\) is the reverberation tail; it contains a large number of higher-order reflections and dispersions.
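For illustration, applying artificial reverberation then amounts to convolving the clean signal with the impulse response. The following minimal sketch does this for a single channel (with a binaural BRIR the same convolution is applied per channel); the peak renormalization step is an assumption, not a prescribed part of the procedure.

import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, brir):
    # Convolve with the impulse response and keep the original utterance length.
    wet = fftconvolve(speech, brir, mode="full")[:len(speech)]
    # Renormalize so the reverberated signal keeps the original peak level.
    peak = np.max(np.abs(wet)) + 1e-12
    return wet * (np.max(np.abs(speech)) / peak)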

We use two kinds of BRIR:

  1. the distance from the source to the microphone is 3 m, the azimuth is 0°, and the room dimensions are \(24\times 15\times 4.5\) m, which gives a reverberation time of 0.5 s;

  2. the distance from the source to the microphone is 5.5 m, the azimuth is 90°, and the room dimensions are \(24\times 15\times 4.5\) m, which gives a reverberation time of 0.8 s.

Detailed description of the training and test datasets is presented in Table 1.

Table 1. The description of the training and test datasets
Table 2. The description of the training and test datasets derived from real recordings

The test datasets are divided into five groups based on the SNR and noise types. Each test dataset contains recordings of several dozen speakers who were not included in the training sets. The first three groups contain recordings made with a close-range (T1), medium-range (T2) and long-range (T3) microphone, respectively, in office, domestic, and street environments. T4 contains background speech. T5 contains headset recordings with a high SNR. The SNR is calculated as in [27], with speech/non-speech decisions made by our voice activity detection (VAD) algorithm; a simplified sketch is given below. RT60 denotes the reverberation time, i.e., the time required for reflections of a direct sound to decay by 60 dB.
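As a rough illustration only (the exact procedure of [27] is not reproduced here), SNR can be estimated from per-frame energies and VAD decisions as follows; the function and array names are ours.

import numpy as np

def estimate_snr_db(frame_power, vad_mask):
    """frame_power: per-frame energies (NumPy array); vad_mask: boolean speech decisions."""
    # Non-speech frames estimate the noise power, speech frames the speech+noise power.
    noise_power = np.mean(frame_power[~vad_mask]) + 1e-12
    speech_power = np.mean(frame_power[vad_mask]) - noise_power
    return 10 * np.log10(max(speech_power, 1e-12) / noise_power)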

The concluding table in this paper compares acoustic models trained with data augmentation on real-life datasets, which consist of recordings of dialogues in a meeting room and in a noisy open space during peak visitor traffic. The recordings are characterized by a low SNR (10 dB on average), the presence of background speech, and noise of various kinds (the sales register printer, electronic queue alerts, phone rings, etc.).

The information concerning the datasets compiled from real-life data is presented in Table 2.

R1 and R2 were recorded at the same time and in the same place but with different devices.

4 Experimental Results

In order to test the acoustic models which utilize the data augmentation techniques, we train several DNNs on bottleneck features. All networks contain 4 fully connected hidden layers of dimension 2048 and are trained with discriminative pre-training [29]. Table 3 shows how the word accuracy (recognition accuracy, WAcc) depends on the properties of the training datasets compiled from clean, noisy, and reverberated recordings. Only the most interesting results are included in Table 3.

Word accuracy is defined as follows:

$$ WAcc = 1-WER=\frac{N-S-D-I}{N}, $$

where WER is the word error rate, N is the number of words in the reference, S is the number of substitutions, D is the number of deletions, and I is the number of insertions.
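For reference, the following minimal sketch computes WAcc from a reference and a hypothesis word sequence via a standard Levenshtein alignment, so that the counted substitutions, deletions, and insertions correspond to the formula above.

def word_accuracy(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # WER = (S + D + I) / N, so WAcc = 1 - edit_distance / N.
    return 1.0 - dp[len(ref)][len(hyp)] / max(len(ref), 1)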

Table 3. Dependence of the recognition accuracy on the properties of the training datasets

As a baseline we used the model trained with plain MFCC features on a subset of the C dataset (280 of 353 h).

Table 3 shows that adding augmented data substantially improves recognition accuracy and that bottleneck features are more robust to speaker and environment variability.

In Table 4, comparison results on real-life test sets are given.

Table 4. The recognition accuracy on real-life test cases

One can see that on different test cases the increase in recognition accuracy over the baseline model is substantial, varying from 12 to 40%.

The test set Real_3 is more challenging, so the recognition accuracy gain obtained with the proposed methods is smaller than on Real_2. The recordings in Real_3 contain background speech and specific kinds of noise that we did not use during the augmentation process. The background speech is loud enough to pass the voice activity detection algorithm, and the acoustic models, having become more robust to noisy environments, recognize it; since the reference texts contain only the words of the target speaker, this produces a larger number of insertions. Some reduction in WER may be achieved with a VAD algorithm tuned to work in adverse noisy environments.

The presented recognition accuracy values are low, but they are sufficient to successfully perform keyword search and solve certain speech analytics tasks.

We publish a Kaldi recipe (Footnote 1) for building a speech recognition system for the Russian language. It is based on the publicly available Voxforge speech corpus and may serve as a starting point for studying data augmentation and other techniques aimed at producing effective ASR solutions.

5 Conclusions

In this research, it has been shown experimentally that the application of data augmentation methods substantially increases the robustness of DNN-based acoustic models. The bottleneck features themselves are more robust to perturbations of the acoustic conditions, but when the extractor is trained on the augmented datasets the recognition accuracy increases even more. The increase in recognition accuracy has been found to be as high as 45% on some test cases. Experiments with real-life recordings in a quiet meeting room and in a noisy open space with low SNR demonstrate that even when only clean close-range microphone recordings are available for training, certain data transformations allow us to significantly increase the recognition accuracy.