
1 Introduction

For automatic speech recognition (ASR), a language model (LM) is needed. The most widely used type is the n-gram model, which estimates the probability of a word sequence in a text; commonly, a 3-gram model is employed. Using n-gram LMs with a longer context leads to the data sparseness problem. LMs based on recurrent neural networks (RNNs) estimate probabilities using the entire preceding history, which is their advantage over n-gram models.

In our research we used an RNN LM for N-best list rescoring in an ASR system. In Sect. 2 we give a survey of the use of NNs for LM creation; in Sect. 3 we describe the RNN LM; in Sect. 4 we present our baseline LM; Sect. 5 describes our RNN LMs; and experiments on using the RNN LM for N-best list rescoring in Russian speech recognition are presented in Sect. 6.

2 Related Work

The use of NNs for LM training was first presented in [1]. RNNs were first used for language modeling in [2]. In [3], a comparison of LMs based on feed-forward and recurrent NNs was made: on the test set, the RNN LM showed a 0.4 % absolute word error rate (WER) reduction compared with the feed-forward NN.

In [4], strategies for NN LM training on large data sets were presented: (1) reduction of the number of training epochs; (2) reduction of the number of training tokens; (3) reduction of the vocabulary size; (4) reduction of the size of the hidden layer; (5) parallelization. It was shown that when the data are sorted by relevance, faster convergence during training and better overall performance are observed. A maximum entropy model trained as a part of the NN LM was proposed, leading to a significant reduction of computational complexity; a 10 % relative reduction compared with the baseline 4-gram model was obtained.

In [5], it was proposed to call the RNN LM to compute an LM score only if the newly hypothesized word has a reasonable score; cache-based RNN inference was also proposed in order to reduce runtime. Three approaches for exploiting succeeding-word information in RNN LMs were proposed in [6]. In order to speed up training, noise contrastive estimation was investigated for RNN LMs in [7]; it does not require normalization at the output layer and thereby allows faster training. A novel RNN LM dealing with multiple time-scale contexts was presented in [8], where several context lengths were considered in one LM. In [9], paraphrastic RNN LMs, which use multiple automatically generated paraphrase variants, were investigated. In [10], the Long Short-Term Memory (LSTM) NN architecture was explored for modeling the English and French languages. An investigation of jointly trained maximum entropy and RNN LMs for Code-Switching speech is presented in [11], where it was proposed to integrate part-of-speech and language identifier information into the RNN LM. In [12], a discriminative training method for RNN LMs was proposed, using the log-likelihood ratio of the ASR hypotheses and the references as the discriminative criterion.

An RNN LM for Russian was first used in [13]. The RNN LM was trained on a text corpus containing 40M words with a vocabulary of about 100K words. The obtained model was interpolated with the baseline 3-gram and factored LMs, and the resulting LM was used for rescoring a 500-best list, which gave a 7.4 % relative WER reduction.

Despite the increasing popularity of NNs for language modeling, there are only a few studies on NN-based LMs for Russian. In this work, we investigate the use of RNNs for creating Russian LMs.

3 Artificial Neural Networks for Language Modeling

We used the same structure of the RNN LM as in [2]; it is presented in Fig. 1. The RNN consists of an input layer x, a hidden (or context) layer s, and an output layer y. The input to the network at time t is the vector x(t), which is a concatenation of the vector w(t), representing the current word at time t, and the vector s(t-1), which is the output of the hidden layer obtained at the previous step. The size of w(t) is equal to the vocabulary size. The output layer y(t) has the same size as w(t) and represents the probability distribution of the next word given the previous word w(t) and the context vector s(t-1). The size of the hidden layer is chosen empirically and usually amounts to 30–500 units [2].

Fig. 1. General structure of the recurrent neural network.

The input, hidden, and output layers are computed as follows [2]:

$$ x\left( t \right) = w\left( t \right) + s\left( {t - 1} \right) $$
$$ s_{j} \left( t \right) = f\left( {\mathop \sum \limits_{i} x_{i} \left( t \right)u_{ji} } \right) $$
$$ y_{k} \left( t \right) = g\left( {\mathop \sum \limits_{j} s_{j} \left( t \right)v_{kj} } \right), $$

where f(z) is the sigmoid activation function:

$$ f\left( z \right) = \frac{1}{{1 + e^{ - z} }} $$

and g(z) is the softmax function:

$$ g\left( {z_{m} } \right) = \frac{{e^{{z_{m} }} }}{{\mathop \sum \nolimits_{k} e^{{z_{k} }} }} $$

NN training is carried out over several epochs. Usually, the backpropagation algorithm with stochastic gradient descent is used for training.
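For illustration, the forward pass defined by the equations above can be sketched in a few lines of NumPy. This is a minimal sketch with toy sizes and randomly initialized weights rather than the parameters of our actual models; the concatenated input x(t) is split into its word and context parts, which is equivalent to applying a single matrix to the concatenation.

```python
import numpy as np

V, H = 10, 5                                 # toy vocabulary and hidden-layer sizes
rng = np.random.default_rng(0)
U_w = rng.normal(scale=0.1, size=(H, V))     # weights applied to the one-hot word w(t)
U_s = rng.normal(scale=0.1, size=(H, H))     # weights applied to the context s(t-1)
V_out = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_id, s_prev):
    """One time step: returns P(next word | history) and the new hidden state s(t)."""
    w = np.zeros(V)
    w[word_id] = 1.0                          # 1-of-V encoding of the current word
    s = sigmoid(U_w @ w + U_s @ s_prev)       # hidden (context) layer s(t)
    y = softmax(V_out @ s)                    # distribution over the next word
    return y, s

s = np.zeros(H)
for word_id in [3, 7, 1]:                     # a toy sequence of word indices
    y, s = rnn_step(word_id, s)
print(round(float(y.sum()), 6))               # the output distribution sums to 1.0
```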

In order to speed up training, it was suggested in [14] to factorize the output layer. Words are mapped to classes according to their frequencies. First, the probability distribution over classes is computed; then, the probability distribution over the words belonging to a specific class is computed. In this case, the word probability is computed as follows:

$$ P\left( {w_{i} |h_{i} } \right) = P\left( {c_{i} |s\left( t \right)} \right)P\left( {w_{i} |c_{i} ,s\left( t \right)} \right), $$

where \( c_{i} \) is the class of the given word \( w_{i} \), and \( h_{i} \) is its history (the preceding words).
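A minimal sketch of this class factorization is given below. The frequency-based class assignment and the two output weight matrices are illustrative placeholders, not the actual factorization produced by the toolkit.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, H, C = 10, 5, 2                              # toy vocabulary, hidden and class sizes
word2class = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # frequency-based word classes
W_class = rng.normal(scale=0.1, size=(C, H))    # hidden -> class scores
W_word = rng.normal(scale=0.1, size=(V, H))     # hidden -> word scores

def word_probability(word_id, s):
    """P(w | h) = P(c | s(t)) * P(w | c, s(t)) for the class c of word_id."""
    c = word2class[word_id]
    p_class = softmax(W_class @ s)[c]           # distribution over classes
    members = np.where(word2class == c)[0]      # words belonging to class c
    p_in_class = softmax(W_word[members] @ s)   # distribution within the class
    return p_class * p_in_class[list(members).index(word_id)]

s = np.abs(rng.normal(size=H))                  # a toy hidden state
print(word_probability(7, s))
```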

4 Training Textual Corpus and Baseline Language Model

For the language model creation, we collected and automatically processed a Russian text corpus compiled from a number of on-line newspapers. The procedure of preliminary text processing and normalization is described in [15]. First, the texts were divided into sentences. Then, text written in any kind of brackets was deleted, and sentences consisting of fewer than six words were removed. A word beginning with an uppercase letter was converted to lowercase; if a whole word was written in uppercase letters, it was lowercased only when the lowercased form existed in the vocabulary. After text normalization the corpus contains over 350M words and more than 1M unique word-forms.
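A simplified Python sketch of these normalization steps is given below; the regular expression, the vocabulary, and the handling of edge cases are illustrative and do not reproduce the exact pipeline of [15].

```python
import re

vocabulary = {"дума", "россия", "закон"}       # hypothetical lowercase vocabulary

def normalize_sentence(sentence):
    """Returns the normalized sentence, or None if it should be discarded."""
    # delete text written in any kind of brackets
    sentence = re.sub(r"[(\[{][^)\]}]*[)\]}]", " ", sentence)
    words = sentence.split()
    # discard sentences consisting of fewer than six words
    if len(words) < 6:
        return None
    normalized = []
    for w in words:
        if w.isupper():
            # fully uppercase word: lowercase it only if the result is in the vocabulary
            normalized.append(w.lower() if w.lower() in vocabulary else w)
        elif w[:1].isupper():
            # word beginning with an uppercase letter: convert to lowercase
            normalized.append(w.lower())
        else:
            normalized.append(w)
    return " ".join(normalized)
```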

For the statistical text analysis, we used the SRI Language Modeling Toolkit (SRILM) [16]. During LM creation we used the Kneser-Ney discounting method and did not apply any n-gram cutoff. We created various 3-gram LMs with different vocabulary sizes; the best speech recognition results were obtained with the 150K vocabulary [17], so this vocabulary was chosen for further experiments with N-best list rescoring. The perplexity of this baseline model was 553.

5 Creation of Language Models Based on Recurrent Neural Networks

For the creation of the RNN LMs we used the Recurrent Neural Network Language Modeling Toolkit (RNNLM toolkit) [18]. We factorized the output layer of the RNN and created LMs with 100 and 500 classes. We created models with different numbers of units in the hidden layer: 100, 300, and 500 [19, 20].

Then we made a linear interpolation of the RNN LMs with the baseline 3-gram model. In this case, the probability score is computed as follows:

$$ P_{IRNN} \left( {w_{i} |h_{i} } \right) = \lambda P_{RNN} \left( {w_{i} |h_{i} } \right) + (1 - \lambda )P_{BL} \left( {w_{i} |h_{i} } \right) $$

where \( P_{RNN} \left( {w_{i} |h_{i} } \right) \) is the probability computed by the RNN LM, \( P_{BL} \left( {w_{i} |h_{i} } \right) \) is the probability computed by the baseline 3-gram model, and λ is the interpolation coefficient.
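As an illustration, the log probability of a whole hypothesis under the interpolated model can be computed as below; the per-word probabilities and the value of λ are toy values.

```python
import math

def sentence_logprob(p_rnn_words, p_bl_words, lam):
    """Sum over the words of a hypothesis of log(lam * P_RNN(w|h) + (1 - lam) * P_BL(w|h))."""
    return sum(math.log(lam * p_r + (1.0 - lam) * p_b)
               for p_r, p_b in zip(p_rnn_words, p_bl_words))

# toy per-word probabilities from the RNN LM and from the baseline 3-gram LM
print(sentence_logprob([0.02, 0.10, 0.005], [0.01, 0.08, 0.010], lam=0.5))
```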

LMs are evaluated by perplexity, which is computed on held-out text data. Perplexity can be considered a measure of how many different equally probable words can, on average, follow any given word; lower perplexity indicates a better LM [21]. Perplexities of the obtained models, computed on a text corpus of 33M words, are presented in Table 1. An interpolation coefficient of 1.0 means that only the RNN LM was used. The table shows that the RNN LMs have lower perplexities than the 3-gram LM.
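For reference, the perplexity of an LM on a held-out text of N words is computed from the probabilities the model assigns to the words given their histories:

$$ PPL = \exp \left( { - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \ln P\left( {w_{i} |h_{i} } \right)} \right) $$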

Table 1. Perplexities of RNN LMs interpolated with 3-gram LM.

6 Experiments

The architecture of the Russian ASR system with the developed RNN LMs is presented in Fig. 2. The system works in two modes [15]: training and recognition. In the training mode, the acoustic models of speech units, the LMs, and the phonemic vocabulary of word-forms used by the recognizer are created.

Fig. 2. Architecture of the Russian ASR system with RNN LMs.

For training the speech recognition system we used our own corpus of spoken Russian, Euronounce-SPIIRAS [22]. The database consists of 16,350 utterances pronounced by 50 native Russian speakers (25 male and 25 female); each speaker pronounced more than 300 phonetically balanced and meaningful phrases. The total duration of the speech data is about 21 h. For acoustic modeling, we applied continuous density Hidden Markov Models (HMMs).

To test the ASR system we used a speech corpus containing 500 phrases pronounced by 5 different speakers (each speaker said the same 100 phrases). The phrases were taken from the materials of an on-line newspaper and were not contained in the training data.

For automatic speech recognition, we applied the open-source Julius engine ver. 4.2 [23]. At the speech decoding stage, the baseline 3-gram language model was used and an N-best list of hypotheses was created. Then the RNN LM was applied for rescoring the obtained N-best list and for selecting the best recognition hypothesis for the pronounced phrase.
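Schematically, the rescoring step can be sketched as follows; the score combination, the LM weight, and the toy hypotheses are illustrative placeholders rather than the exact scoring used by the decoder.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """nbest: list of (word_sequence, acoustic_score) pairs from the first pass.
    lm_logprob: function returning the (interpolated) LM log probability of a word sequence."""
    rescored = [(words, ac_score + lm_weight * lm_logprob(words))
                for words, ac_score in nbest]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[0][0]                       # best hypothesis after rescoring

# toy 2-best list: the LM prefers the second hypothesis and changes the ranking
nbest = [(["он", "принял", "закон"], -120.0), (["он", "принял", "законы"], -120.5)]
print(rescore_nbest(nbest, lambda ws: -5.0 if ws[-1] == "законы" else -8.0))
```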

The WER obtained with the baseline 3-gram LM was 26.54 %. We produced a 50-best list and rescored it using the RNN LMs as well as the RNN LMs interpolated (+) with the baseline model using various interpolation coefficients. The obtained results are summarized in Table 2.

Table 2. WER obtained after rescoring N-best lists with RNN LMs (%).

The table shows that in most cases the rescoring decreased the WER in comparison with the baseline model, except for the RNN LMs with 100 hidden units used without interpolation with the baseline model. Application of RNNs with 100 classes gave better results than RNNs with 500 classes. The lowest WER, 22.87 %, was achieved using the RNN LM with 500 hidden units and 100 classes interpolated with the 3-gram model using an interpolation coefficient of 0.5.

Our results are consistent with those obtained in [13], although we used a training set of 350 million words, which is about 10 times larger than in [13]. The WER obtained in [13] with the help of an RNN was 32.9 %. Our results are better and support the hypothesis that RNN-based LMs improve speech recognition accuracy.

7 Conclusion

In this paper, we have described the application of RNN LMs to rescoring N-best hypothesis lists of an ASR system. The advantage of RNN LMs over n-gram LMs is that they are able to store an arbitrarily long history of a given word. We tried RNNs with various numbers of units in the hidden layer and also tested the linear interpolation of the RNN LM with the baseline 3-gram LM. We achieved a 14 % relative reduction of WER using the RNN LM with respect to the baseline model.