1 Introduction

Spontaneous conversational speech recognition is one of the most difficult tasks in the field of automatic speech recognition (ASR). The difficulties are due to the following characteristics of spontaneous conversational speech: high channel and speaker variability, presence of additive and non-linear distortions, accents and emotional speech, diversity of speaking styles, speech rate variability, reductions and weakened articulation.

There is a large number of studies on English spontaneous speech recognition, such as [15]. The systems proposed in these papers demonstrate high effectiveness, which makes it possible to use them in commercial applications. As far as we know, the state-of-the-art English spontaneous speech recognition system [4] achieves a word error rate (WER) of 8.0 % on the Switchboard part and 14.1 % on the CallHome part of the HUB5 2000 evaluation set. These impressive results were obtained by combining various effective acoustic and language modeling techniques.

Our goal is to build a speaker-independent system for high-quality Russian spontaneous speech recognition. At present, none of the existing Russian spontaneous speech recognition systems provide recognition accuracy comparable with the above-mentioned English systems. We would like to highlight two reasons for this. First, there are no training and evaluation datasets available for the Russian language comparable to the Switchboard and Fisher English speech corpora and the HUB5 2000 evaluation set. Second, Russian is a highly inflected language whose number of unique word forms is several times larger than in English. Moreover, the Russian language is characterized by a relatively free word order in a sentence. This considerably complicates the recognition task [6]. Our previous system achieved a WER of 25.1 % [7]. In this work we present a set of recent improvements to this system.

The rest of this paper is organized as follows. Section 2 contains the experimental setup description. Section 3 presents the acoustic modeling approach based on speaker-dependent bottleneck features. Section 4 describes deep BLSTM acoustic models and score fusion of DNN and BLSTM acoustic models (AMs). Section 5 presents the experiments on hypothesis rescoring with language models (LMs) based on Recurrent Neural Networks (RNNs). Finally, Sect. 6 concludes the paper and discusses future work.

2 Experimental Setup

For the experiments we used the Kaldi speech recognition toolkit [8]. AM training was performed on a 390 h Russian spontaneous speech dataset (telephone channel, several hundred speakers). The test set consisted of about 1 h of Russian telephone conversations. Both the training and test sets are the same as in our previous work [7].

The language model training data consisted of two datasets. The first one contained the transcriptions of the AM training dataset. The second one was a large collection (about 200 M words) of texts from Internet forum discussions, books and subtitles from the OpenSubtitles site. The baseline 3-gram language model with a vocabulary of 214 K words was built with the SRILM toolkit [9]. It was obtained by interpolating 3-gram LMs trained on the first and second datasets with modified Kneser-Ney smoothing. The size of this model was reduced to 4.16 M bigrams and 2.49 M trigrams by pruning.
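
To make the interpolation step concrete, the sketch below shows plain linear interpolation of the word probabilities of two component LMs. In our setup the models were actually built and combined with SRILM; the probabilities and the weight in the example are purely illustrative.

```python
# Minimal sketch of linear LM interpolation: P(w|h) = lam * P1(w|h) + (1 - lam) * P2(w|h).
# The actual models were interpolated with SRILM; values here are illustrative.
def interpolate(p_transcripts: float, p_web: float, lam: float) -> float:
    """Interpolated probability of a word given its history under two component LMs."""
    return lam * p_transcripts + (1.0 - lam) * p_web

# Example: a word with probability 0.02 under the transcription LM and 0.005 under
# the web-text LM; lam would be tuned on held-out data.
p = interpolate(0.02, 0.005, lam=0.6)
```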

3 Speaker-Dependent Bottleneck Features

Here we describe the acoustic modeling approach based on speaker-dependent bottleneck (SDBN) features. This approach was proposed in our previous works [7, 10]. Its underlying idea is to extract high-level features from a DNN model that is adapted to the speaker and acoustic environment by means of i-vectors. The extracted features are then used to train another acoustic model (Fig. 1).

Fig. 1. Speaker-dependent bottleneck approach scheme

Our approach consists of the following main steps:

  1. Training the DNN model on the source features using the Cross-Entropy (CE) criterion.

  2. Expanding the input layer of the DNN trained at the first step and retraining it with the input feature vector appended with an i-vector. The regularizing term

    $$\begin{aligned} R = \lambda \sum _{l=1}^{L} \sum _{i=1}^{N_l} \sum _{j=1}^{N_{l-1}} (\mathbf {W}_{ij}^l - \mathbf {\bar{W}}_{ij}^l)^2 \end{aligned}$$
    (1)

    is added to the CE criterion to penalize parameter deviation from the source model. Here \(\mathbf {W}^{l}\) and \(\mathbf {\bar{W}}^{l}\) are the weight matrices of the l-th layer \((1 \le l \le L)\) of the current and the source DNNs, \(N_l\) is the size of the l-th layer, and \(N_0\) is the dimension of the input feature vector.

  3. Transforming the last hidden layer into two layers (a minimal sketch of this factorization is given after this list). The first one is a bottleneck layer with the weight matrix \(\mathbf {W}_{bn}\), a zero bias vector and a linear activation function. The second one is a non-linear layer with the weight matrix \(\mathbf {W}_{out}\), the original bias vector \(\mathbf {b}\), the activation function f and the dimension of the source layer:

    $$\begin{aligned} \mathbf {y} = f(\mathbf {W} \mathbf {x} + \mathbf {b}) \approx f(\mathbf {W}_{out}(\mathbf {W}_{bn} \mathbf {x} + \mathbf {0}) + \mathbf {b}). \end{aligned}$$
    (2)

    These layers are formed by applying Singular Value Decomposition (SVD) to the weight matrix \(\mathbf {W}\) of the source layer:

    $$\begin{aligned} \mathbf {W} = \mathbf {U} \mathbf {S} \mathbf {V}^T \approx \mathbf {\tilde{U}}_{bn} \mathbf {\tilde{V}}_{bn}^T = \mathbf {W}_{out} \mathbf {W}_{bn}, \end{aligned}$$
    (3)

    where the subscript \(bn\) denotes the reduced (bottleneck) dimension.

  4. Retraining the network formed at the previous step using the CE criterion with the penalty (1) for parameter deviation from the original values.

  5. Discarding all layers after the bottleneck and extracting high-level SDBN features using the resulting DNN.

  6. Training the GMM-HMM acoustic model using the constructed SDBN features and generating the senone alignment of the training data.

  7. Training the final DNN-HMM acoustic model using the SDBN features and the generated alignment.
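
The sketch below illustrates the SVD factorization of step 3 with NumPy. The layer size and the 120-dimensional bottleneck match the configuration described below; the choice to absorb the singular values into \(\mathbf {W}_{out}\) is an illustrative assumption, not a description of our exact implementation.

```python
import numpy as np

# Minimal sketch of step 3: split the last hidden layer's weight matrix W into
# W_out (N x bn_dim) and W_bn (bn_dim x N) via a truncated SVD, so that W ~= W_out @ W_bn.
def svd_bottleneck(W: np.ndarray, bn_dim: int = 120):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_bn, s_bn, Vt_bn = U[:, :bn_dim], s[:bn_dim], Vt[:bn_dim, :]
    W_out = U_bn * s_bn   # one common choice: absorb the singular values into W_out
    W_bn = Vt_bn          # linear bottleneck layer with a zero bias vector
    return W_out, W_bn

W = np.random.randn(1536, 1536)                 # last hidden layer of the source DNN
W_out, W_bn = svd_bottleneck(W)
rel_error = np.linalg.norm(W - W_out @ W_bn) / np.linalg.norm(W)  # truncation error
```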

The extractor of 120-dimensional SDBN features was trained using the presented approach. Training was carried out on 23-dimensional log mel filterbank energy (FBANK) features with Cepstral Mean Normalization (CMN), appended with the first and second order derivatives. These features were taken with a temporal context of 11 frames (± 5) and appended with an i-vector. We applied 50-dimensional i-vectors extracted with a Universal Background Model with 512 Gaussians, which was trained with our toolset [11] on the full 390 h training set. We applied the following configuration of the basic network: 6 hidden layers with 1536 sigmoidal neurons each and an output softmax layer with about 13000 neurons corresponding to the senones of the GMM-HMM acoustic model. DNN parameters were updated using the Nesterov Accelerated Gradient algorithm with a momentum of 0.7. Extractor training was initialized using the algorithm presented in [12].
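
For clarity, the sketch below assembles the extractor's input as described above: 23 FBANK coefficients plus first and second derivatives (69 dimensions per frame), spliced over ± 5 frames and appended with a 50-dimensional i-vector, giving 11 × 69 + 50 = 809 dimensions per frame. The array names and the edge padding are illustrative assumptions.

```python
import numpy as np

# Sketch of the extractor input: 69-dim FBANK+derivative frames spliced over +-5 frames
# and appended with a 50-dim utterance i-vector (names and edge padding are illustrative).
def splice_with_ivector(feats: np.ndarray, ivector: np.ndarray, context: int = 5) -> np.ndarray:
    T, _ = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    spliced = np.hstack([padded[t:t + T] for t in range(2 * context + 1)])  # (T, 69 * 11)
    return np.hstack([spliced, np.tile(ivector, (T, 1))])                   # (T, 759 + 50)

feats = np.random.randn(300, 69)                 # one utterance of FBANK + derivatives
ivec = np.random.randn(50)
inputs = splice_with_ivector(feats, ivec)        # shape (300, 809)
```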

Training of the DNN on the constructed SDBN features (SDBN-DNN) was performed with a temporal context of 31 frames, taking every 5th frame. We applied the following DNN configuration: 4 sigmoidal hidden layers with 2048 neurons each and an output softmax layer with about 13000 neurons corresponding to the senones of the GMM-HMM model trained on the same SDBN features. The training was carried out with the CE criterion and the state-level Minimum Bayes Risk (sMBR) sequence-discriminative criterion.
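
The 31-frame context taken every 5th frame corresponds to offsets −15, −10, …, +15, i.e. 7 spliced frames and a 7 × 120 = 840-dimensional input. A small illustrative sketch (the exact frame indexing and edge handling are assumptions):

```python
import numpy as np

# Sketch of the SDBN-DNN input context: a 31-frame window subsampled every 5th frame.
OFFSETS = range(-15, 16, 5)                      # -15, -10, -5, 0, 5, 10, 15

def splice_subsampled(sdbn: np.ndarray, t: int) -> np.ndarray:
    """Concatenate the 120-dim SDBN frames at the subsampled offsets around frame t."""
    T = len(sdbn)
    return np.concatenate([sdbn[min(max(t + o, 0), T - 1)] for o in OFFSETS])

sdbn = np.random.randn(300, 120)                 # SDBN features for one utterance
x = splice_subsampled(sdbn, t=100)               # shape (840,)
```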

Table 1. SDBN results

Table 1 compares the SDBN-DNN with a DNN trained in a speaker-adaptive manner using i-vectors (DNN-ivec). It can be seen that the SDBN approach provides a significant gain. Note that the SDBN-DNN WER of 19.5 % is much lower than the result of our previous system (25.1 % WER). This is due to the larger SDBN feature extractor, more careful tuning of the AM training procedure and the larger language model.

4 Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks

Acoustic models based on deep Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks demonstrate high effectiveness in various ASR tasks [5, 13]. In this section we describe our experiments with these models carried out with the nnet3 setup of the Kaldi speech recognition toolkit.

We used the BLSTM architecture with projection layers described in [13]. The following network configuration was applied: 3 forward and 3 backward layers, 1024 cell and hidden dimensions, 128 recurrent and non-recurrent projection dimensions. Training examples consisted of chunks of 20 frames with an additional left context of 40 frames and a right context of 40 frames. We performed 8 epochs of cross-entropy training with an initial learning rate of 0.0003 and a final learning rate of 0.00003. Model parameters were updated using the BPTT algorithm with a momentum of 0.5. The models obtained at the iterations of the last epoch were combined into the final BLSTM model. For BLSTM training we used 23-dimensional log mel filterbank energy (FBANK) features with CMN, appended with the first and second order derivatives and the 50-dimensional i-vectors described above. Training data alignments prepared with the SDBN-DNN acoustic model were used for the training.
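
A minimal sketch of how such chunked training examples could be cut from an utterance; the chunk and context sizes are taken from the text, while the slicing and edge handling are illustrative assumptions.

```python
import numpy as np

# Sketch: 20-frame chunks with 40 extra frames of left and right context,
# utterance edges clamped by repeating the first/last frame.
def make_chunks(feats: np.ndarray, chunk: int = 20, left: int = 40, right: int = 40):
    T = feats.shape[0]
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    for start in range(0, T, chunk):
        # left context + central chunk + right context; labels cover the central chunk
        yield padded[start:start + left + chunk + right]

feats = np.random.randn(300, 119)                # FBANK + derivatives + 50-dim i-vector
examples = list(make_chunks(feats))
```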

Table 2. Deep BLSTM acoustic models and score fusion results

4.1 Score Fusion of SDBN-DNN and BLSTM Acoustic Models

The underlying idea of the score fusion technique is to combine the benefits of both different model architectures and different input features. In this subsection we analyze the effectiveness of this technique applied to the SDBN-DNN and BLSTM acoustic models. We used log-likelihoods (LLH) determined by the formula

$$\begin{aligned} \text {LLH} = \alpha \, \log \left( \frac{\text {P}_{1}(\mathbf {s}|\mathbf {x})}{\text {P}_{1}(\mathbf {s})}\right) + (1-\alpha ) \log \left( \frac{\text {P}_{2}(\mathbf {s}|\mathbf {x})}{\text {P}_{2}(\mathbf {s})}\right) \end{aligned}$$
(4)

for decoding with the fusion of these acoustic models. Here \(\text {P}_{1}(\mathbf {s}|\mathbf {x})\) and \(\text {P}_{2}(\mathbf {s}|\mathbf {x})\) are the posterior probabilities of state \(\mathbf {s}\) given an input vector \(\mathbf {x}\) on the current frame, and \(\text {P}_{1}(\mathbf {s})\) and \(\text {P}_{2}(\mathbf {s})\) are the prior probabilities of state \(\mathbf {s}\) for the SDBN-DNN and BLSTM models, respectively. We estimated the prior probability of state \(\mathbf {s}\) as the average posterior probability calculated with the corresponding model on the training data. The \(\alpha \) value was set to 0.5. The results of the deep BLSTM acoustic model and the score fusion are given in Table 2.
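
A minimal sketch of Eq. (4) applied frame by frame (the array names are illustrative; in practice this computation happens inside the decoder):

```python
import numpy as np

# Frame-level fusion of two models' scaled log-likelihoods according to Eq. (4).
# post_*: per-frame senone posteriors P(s|x); prior_*: senone priors P(s),
# estimated as the average posteriors of each model on the training data.
def fused_loglikes(post_dnn, prior_dnn, post_blstm, prior_blstm, alpha=0.5):
    llh_dnn = np.log(post_dnn) - np.log(prior_dnn)
    llh_blstm = np.log(post_blstm) - np.log(prior_blstm)
    return alpha * llh_dnn + (1.0 - alpha) * llh_blstm
```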

5 RNN-based Language Models

In this section we describe the experiments with more sophisticated language models based on recurrent neural networks. Word lattices obtained in the decoding pass with the 3-gram LM and the best SDBN-DNN and BLSTM score fusion from Subsect. 4.1 were taken as the starting point for these experiments.

We trained two RNN-based language models on shuffled utterances from the transcriptions of the AM training dataset. To speed up the training we used a vocabulary of the 45 K most frequent words. All other words were replaced with the \({<}\mathrm {UNK}{>}\) token. The utterances were divided into two parts: a validation set (15 K utterances) and a training set (the remaining 243 K utterances).
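
A small sketch of this preprocessing; the 45 K cut-off and the \({<}\mathrm {UNK}{>}\) token are as stated above, while the corpus handling itself is illustrative.

```python
from collections import Counter

# Build the 45 K-word vocabulary from the training transcriptions and map
# out-of-vocabulary words to <UNK> (corpus handling is illustrative).
def build_vocab(sentences, size=45000):
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, _ in counts.most_common(size)}

def map_unk(sentence, vocab, unk="<UNK>"):
    return " ".join(w if w in vocab else unk for w in sentence.split())

# vocab = build_vocab(train_sentences)
# train = [map_unk(s, vocab) for s in train_sentences]
```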

Table 3. Rescoring results

Fig. 2. System architecture

The first RNN-based LM was the Recurrent Neural Network Language Model (RNNLM) [14], which significantly outperforms n-gram LMs in various speech recognition tasks. We applied the following RNNLM configuration: 256 neurons in the hidden layer and 200 classes in the output layer.

The second RNN-based LM was an LSTM recurrent neural network LM (LSTM-LM) trained with dropout regularization [15]. We trained two LSTM-LMs using the TensorFlow toolkit [16]: the “medium” (2 layers with 650 units each, 50 % dropout on the non-recurrent connections) and “large” (2 layers with 1500 units each, 65 % dropout on the non-recurrent connections) configurations from the paper [15].

The trained RNNLM and both LSTM-LMs were applied for hypothesis rescoring. We generated 100-best lists from the word lattices using Kaldi scripts. For the rescoring we took a weighted sum of the n-gram LM and RNN-based LM scores. If a sentence contained a word missing from the 45 K RNN vocabulary, we added the unigram score of this word from the 3-gram model to the RNN score. The results of the rescoring are given in Table 3. It can be seen that the RNNLM provided a substantial improvement over the n-gram LM, and the LSTM-LMs improved further over the RNNLM.
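
The following sketch illustrates this rescoring of an n-best list; the scoring callables, the interpolation weight and the LM scale are illustrative assumptions rather than our exact values.

```python
# Sketch of 100-best rescoring: the LM part of each hypothesis score is a weighted sum
# of the 3-gram and RNN-based LM log-scores; out-of-vocabulary words are mapped to <UNK>
# for the RNN, and their 3-gram unigram log-scores are added to the RNN score.
def rescore(nbest, ngram_lm, rnn_lm, rnn_vocab, unigram, lm_weight=12.0, lam=0.5):
    scored = []
    for words, am_score in nbest:               # (word list, acoustic score)
        ngram_score = ngram_lm(words)           # sentence log-probability
        rnn_score = rnn_lm([w if w in rnn_vocab else "<UNK>" for w in words])
        rnn_score += sum(unigram(w) for w in words if w not in rnn_vocab)
        lm_score = lam * ngram_score + (1.0 - lam) * rnn_score
        scored.append((words, am_score + lm_weight * lm_score))
    return max(scored, key=lambda h: h[1])      # best rescored hypothesis
```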

6 Conclusion

The architecture of our system is depicted in Fig. 2. The system achieves a WER of 16.4 %, which is an absolute reduction of 8.7 % (a relative reduction of 34.7 %) over our previous system.

We see several ways to further improve our system. First, techniques for improving BLSTM acoustic models, such as sequence-discriminative training and dropout regularization, can lead to a substantial WER reduction. Second, a significant acoustic model improvement can be obtained by means of data augmentation [17]. Last but not least, we plan to carry out experiments with other promising language model architectures and to investigate more elaborate ways of applying sophisticated language models than simple n-best rescoring.