
1 Introduction

A language model (LM) is one of the main components of a speech recognition system. Nowadays, neural networks (NNs) are widely used for language modeling. As shown in many papers, NN-based LMs outperform standard n-gram models [1, 2]. For language modeling, recurrent NNs (RNNs) are preferable because they can store the whole context preceding a given word, in contrast to feedforward NNs, which store a context of restricted length.

A long short-term memory (LSTM) network is an RNN that contains special units called memory blocks. Each memory block is composed of a memory cell, which stores the temporal state of the network, and multiplicative units called gates (an input gate, an output gate, and a forget gate) that control the information flow [3].
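For reference, one time step of such a memory block can be sketched as follows (a minimal NumPy illustration of the standard gate equations; the weight and gate names are ours and chosen for readability):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x       -- input vector at time t
    h_prev  -- hidden state from time t-1
    c_prev  -- memory cell state from time t-1
    W, U, b -- dicts of input weights, recurrent weights, and biases
               for the gates 'i', 'f', 'o' and the cell candidate 'g'
    """
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell state
    c = f * c_prev + i * g          # new memory cell state
    h = o * np.tanh(c)              # new hidden state (block output)
    return h, c
```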

In our research, we used an LSTM-based LM for N-best list rescoring in an automatic speech recognition (ASR) system. The paper is organized as follows: Sect. 2 surveys the application of LSTMs to language modeling, Sect. 3 describes our LSTM-based LMs, and Sect. 4 presents experimental results of N-best list rescoring with LSTM-based LMs.

2 Related Work

LSTMs are widely used in speech recognition systems at the N-best or lattice rescoring stage. In [4], a comparison of LMs based on n-grams, feedforward NNs, RNNs, and LSTMs in terms of perplexity and word error rate (WER) is presented. LMs were created for English and French. The paper showed that LSTM-based LMs applied to lattice rescoring outperform the other types of LMs. In addition, an experimental analysis of the relationship between the perplexity of NN-based LMs and WER was performed. It showed that WER decreases with decreasing perplexity, analogous to the correlation between perplexity and WER for n-gram LMs.

In [5], an LSTM-based LM was used for lattice rescoring in a YouTube speech recognition task. The proposed model decreased WER by 8% compared with the result obtained with the n-gram model.

Automatic speech recognition of conversational Finnish and Estonian speech with an LSTM LM is described in [6]. The authors tried subword-based and full-word-based language modeling and investigated the use of classes for language modeling. The LSTM LM was used for lattice rescoring. For both languages, the best results were obtained with class-based subword models.

Czech language modeling using LSTMs is presented in [7]. The baseline was a 5-gram Kneser-Ney model with a 120 K vocabulary. The LSTM LMs were trained with a limited vocabulary consisting of the 10 K most frequent words. An LSTM LM interpolated with the baseline model was used for rescoring a 1000-best list. Experiments were performed on a corpus of Czech spontaneous speech recorded from phone calls. The LSTM LM increased speech recognition accuracy by 3.7% relative compared to the result obtained with the baseline model.

A comparison of LMs based on LSTMs and gated recurrent units (GRUs) is presented in [8]. In lattice rescoring experiments on an English speech recognition task, the LSTM-based LM outperformed the GRU-based LM in terms of both perplexity and WER. Experiments with a Highway network based on GRUs also showed a WER improvement, but a similar investigation based on LSTMs was not conducted.

In [9], a system that uses LSTMs for both acoustic and language modeling is presented. The system uses CNN-BLSTM acoustic models and a 4-gram LM for decoding and lattice rescoring. An LSTM-based LM was applied for 500-best list rescoring. The relative WER reduction obtained after rescoring was about 20%.

Russian language modeling with LSTMs is described in [10]. The baseline 3-gram LM was trained on transcriptions of telephone conversations (390 h of speech) as well as on a text corpus (about 200 M words) containing materials from Internet forum discussions, books, etc. The vocabulary of the baseline model contains 214 K words. The NN-based LMs were trained on only a part of the text corpus, and for them a vocabulary of the 45 K most frequent words was used. The LSTM-based LM was used for rescoring a 100-best list. The relative WER reduction was 8%.

In our previous research on Russian language modeling [11, 12], we experimented with LMs based on an RNN with one hidden layer, created using the RNNLM toolkit [13]. We obtained a relative WER reduction of 14% compared to the result obtained with our 3-gram model. The current research is aimed at investigating another type of RNN for language modeling.

3 LSTM Language Models for Russian

For training the LSTM language models, we used the TheanoLM toolkit [14]. We trained the LMs on a text corpus compiled from on-line Russian newspapers [15]. The vocabulary size was 150 K word-forms. We created NN LMs consisting of a projection layer, which maps words to embeddings of a specified dimension, one hidden LSTM layer, and a hierarchical softmax output layer. Hierarchical softmax factors the output probabilities into a product of multiple softmax functions [16]. Thus, the output layer is factorized into two levels, both performing normalization over an equal number of choices [6], which makes it possible to use a very large vocabulary for language modeling. The NN LM architecture is presented in Fig. 1, where w_t is the input word at time t, h_t is the hidden layer state, and c_t is the LSTM cell state.

Fig. 1. LSTM-based LM architecture
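As a concrete illustration of the two-level factorization, the output probability of a word is computed as the product of a class probability and a within-class word probability, P(w | h) = P(c(w) | h) · P(w | c(w), h). A minimal sketch follows (our own illustration; the word-to-class mapping and the weight layout are assumptions, not the exact TheanoLM implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hierarchical_softmax_prob(h, word, word_to_class, class_words,
                              W_class, W_word):
    """P(word | h) factored as P(class | h) * P(word | class, h).

    h             -- hidden state of the top LSTM layer
    word_to_class -- maps a word id to its class id
    class_words   -- maps a class id to the list of word ids in that class
    W_class       -- class-level softmax weights, shape (n_classes, dim)
    W_word        -- per-class word-level softmax weights
    """
    c = word_to_class[word]
    p_class = softmax(W_class @ h)[c]                      # first level: over classes
    members = class_words[c]
    p_word = softmax(W_word[c] @ h)[members.index(word)]   # second level: within the class
    return p_class * p_word
```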

We tried NNs with LSTM layer sizes of 256 and 512 and projection layer sizes of 100, 500, and 1000. The LSTM-based LMs were trained with the stochastic gradient descent (SGD) optimization method. The stopping criterion was "no-improvement": the learning rate is halved when the validation-set perplexity stops improving, and training is stopped when the perplexity does not improve at all with the current learning rate [14]. The maximum number of training epochs was 15. The initial learning rate was 1.
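The "no-improvement" schedule can be paraphrased as the following training loop (a sketch of the criterion described above, not the TheanoLM source; train_one_epoch and validation_perplexity are hypothetical placeholders):

```python
def train_no_improvement(model, max_epochs=15, initial_lr=1.0):
    """SGD training with the "no-improvement" stopping criterion."""
    lr = initial_lr
    best_ppl = float('inf')
    halved = False                          # was the rate already halved without improvement?
    for epoch in range(max_epochs):
        train_one_epoch(model, lr)          # placeholder: one SGD pass over the training data
        ppl = validation_perplexity(model)  # placeholder: perplexity on the validation set
        if ppl < best_ppl:
            best_ppl, halved = ppl, False   # improvement: keep the current learning rate
        elif not halved:
            lr /= 2.0                       # first stall: halve the learning rate and retry
            halved = True
        else:
            break                           # still no improvement at the halved rate: stop
    return model
```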

We also made a linear interpolation of the LSTM-based LM with the baseline LM. As the baseline, we used a 3-gram LM with Kneser-Ney discounting trained on the same text corpus using the SRI Language Modeling Toolkit (SRILM) [17]. Perplexities of the obtained LMs, computed on held-out text data, are presented in Table 1. An interpolation coefficient of 1.0 means that only the LSTM-based LM was used. The perplexity of the baseline model was 553.

Table 1. Perplexities of LSTM LMs

The lowest perplexity was obtained with the NN with a projection layer size of 1000 and a hidden layer size of 512. Interpolation with the 3-gram model gave an additional improvement in perplexity. An interpolation coefficient of 0.7 provided the best result. Thus, the relative reduction in perplexity was 46% compared with the perplexity of the baseline model.
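For completeness, the linear interpolation and the held-out perplexity computation can be sketched as follows (a minimal illustration; p_lstm and p_ngram are hypothetical callables returning the per-word probabilities of the two models):

```python
import math

def interpolated_logprob(word, history, p_lstm, p_ngram, lam=0.7):
    """log P(word | history) of the linearly interpolated model:
    lam * P_LSTM + (1 - lam) * P_3gram."""
    p = lam * p_lstm(word, history) + (1.0 - lam) * p_ngram(word, history)
    return math.log(p)

def perplexity(sentences, p_lstm, p_ngram, lam=0.7):
    """Perplexity of the interpolated model on held-out text."""
    log_prob, n_words = 0.0, 0
    for sentence in sentences:
        for t, word in enumerate(sentence):
            log_prob += interpolated_logprob(word, sentence[:t],
                                             p_lstm, p_ngram, lam)
            n_words += 1
    return math.exp(-log_prob / n_words)
```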

4 Experiments

4.1 Experimental Setup

For training the acoustic models and testing the speech recognition system, we used our own corpora of continuous Russian speech recorded at SPIIRAS. The total duration of the speech data is more than 30 h. The corpus is described in detail in [18].

We used hybrid DNN/HMM acoustic models based on a time-delay neural network (TDNN) with 5 hidden layers and a time context of [−8, 8]. The acoustic models were trained using the open-source Kaldi toolkit [19]. Mel-frequency cepstral coefficients (MFCCs) were used as input to the NNs. For speaker adaptation, a 100-dimensional i-vector [20] was appended to the 40-dimensional MFCC input. A detailed description of our acoustic models is given in [12]. We obtained a WER of 17.62% with our baseline 3-gram model, and a WER of 15.13% after rescoring a 500-best list with an RNN LM with one hidden layer interpolated with the 3-gram model.
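The input feature assembly can be sketched as follows (a minimal illustration of appending the utterance-level i-vector to each MFCC frame; the frame-level splicing performed by the TDNN itself is omitted):

```python
import numpy as np

def assemble_input(mfcc_frames, ivector):
    """Append the 100-dim i-vector to each 40-dim MFCC frame.

    mfcc_frames -- array of shape (n_frames, 40)
    ivector     -- array of shape (100,)
    Returns an array of shape (n_frames, 140) fed to the TDNN.
    """
    tiled = np.tile(ivector, (mfcc_frames.shape[0], 1))
    return np.concatenate([mfcc_frames, tiled], axis=1)
```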

The LSTM-based LMs were applied for rescoring the 500-best list of hypotheses and for selecting the best recognition hypothesis for the pronounced phrase. The interpolated LMs were used for rescoring as well. The obtained speech recognition results are presented in Table 2.
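The selection of the best hypothesis after rescoring can be sketched as follows (a minimal illustration; the LM scale factor and the lm_logprob callable are assumptions, in practice the acoustic scores come from the decoder and the LM scores from the interpolated LM):

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=10.0):
    """Pick the best hypothesis from an N-best list after LM rescoring.

    nbest      -- list of (words, acoustic_logprob) pairs from the decoder
    lm_logprob -- hypothetical callable: total LM log-probability of a word sequence
    lm_weight  -- LM scale factor (an illustrative value, tuned in practice)
    """
    def total_score(hyp):
        words, am_score = hyp
        return am_score + lm_weight * lm_logprob(words)

    return max(nbest, key=total_score)[0]
```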

Table 2. WER after 500-best list rescoring (%)

As can be seen from the table, the application of LSTM-based LMs improves the speech recognition results. An additional improvement was achieved by interpolating the LSTM-based LM with the baseline LM. The lowest WER (14.06%) was obtained using the NN with a projection layer size of 500 and a hidden layer size of 512, interpolated with the baseline model with an interpolation coefficient of 0.7, although this model was not the best in terms of perplexity. This may be connected with the fact that the text material used for estimating perplexity differs from the texts of the speech corpus recordings.

We then experimented with the optimization method for NN training. We tried the Nesterov Momentum [21], AdaGrad [22], and Adam [23] optimization methods and compared them with SGD in terms of the perplexity and WER of the created models. We trained models with 512 units in the hidden layer and 500 units in the projection layer, because the LSTM with these parameters gave the best results in terms of WER in our previous experiments with SGD. Initial learning rates were chosen according to the recommendations of the TheanoLM toolkit. The results of the experiments comparing the optimization methods in terms of perplexity and WER are presented in Tables 3 and 4, respectively.
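For reference, the Nesterov momentum update can be sketched as follows (a generic formulation rather than the TheanoLM implementation; the momentum value shown is illustrative):

```python
def nesterov_momentum_step(theta, velocity, grad_fn, lr=1.0, momentum=0.9):
    """One Nesterov momentum update of a parameter vector.

    theta    -- current parameters (e.g. a NumPy array)
    velocity -- accumulated update from previous steps
    grad_fn  -- hypothetical callable returning the gradient at a given point
    """
    lookahead = theta + momentum * velocity               # gradient at the look-ahead point
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    theta = theta + velocity
    return theta, velocity
```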

Table 3. Results of experiments with LMs trained using different optimization methods in terms of perplexity
Table 4. Results of experiments with LMs trained using different optimization methods in terms of WER (%)

Only the Nesterov Momentum method slightly outperformed SGD in terms of both perplexity and WER. Thus, the best results (perplexity of 289, WER of 14.01%) were obtained with the LSTM LM trained using the Nesterov Momentum optimization method and interpolated with the baseline LM with an interpolation coefficient of 0.7.

We then trained NNs with 2 and 3 LSTM layers using the parameters of the best 1-layer LSTM. In these NNs, we applied dropout at a rate of 0.3 between the LSTM layers. The obtained results are presented in Table 5.
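Dropout between stacked LSTM layers can be sketched as follows (our own illustration; lstm_step denotes a single-layer cell update such as the one sketched in Sect. 1, and dropout is applied only to the activations passed between layers, not to the recurrent connections):

```python
import numpy as np

def stacked_lstm_step(x, states, layers, rng, dropout=0.3, train=True):
    """One time step through a stack of LSTM layers with dropout between layers.

    x      -- input to the first layer (the word embedding at time t)
    states -- list of (h, c) pairs, one per layer
    layers -- list of parameter dicts with keys 'W', 'U', 'b' per layer
    rng    -- NumPy random generator, e.g. np.random.default_rng()
    """
    out = x
    new_states = []
    for i, params in enumerate(layers):
        h, c = lstm_step(out, *states[i],
                         params['W'], params['U'], params['b'])
        new_states.append((h, c))
        out = h
        if train and i < len(layers) - 1:
            mask = (rng.random(out.shape) > dropout) / (1.0 - dropout)
            out = out * mask                  # inverted dropout between layers
    return out, new_states
```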

Thus, the best result was obtained with the NN LM with 2 LSTM layers interpolated with the baseline LM with an interpolation coefficient of 0.8; in this case, the WER was 13.80%. Further increasing the number of hidden layers led to an increase in WER, which may be caused by overfitting (Table 5).

Table 5. Results of experiments with LMs with different number of LSTM layers

5 Conclusions and Future Work

In this paper, we investigated LSTM-based LMs for a Russian speech recognition task. We tried NNs with different hidden layer sizes, projection layer sizes, optimization methods, and numbers of hidden layers. The LSTM-based LMs were applied for N-best list rescoring. The lowest WER was achieved with the NN with 2 hidden layers, 512 units in each hidden layer, and a projection layer of 500 units, trained with the Nesterov Momentum optimization method. We achieved a 22% relative reduction in WER using the LSTM LM with respect to the baseline 3-gram model. In further research, we are going to investigate other RNN topologies for language modeling.