
1 Introduction

For automatic speech recognition (ASR), a language model (LM) is needed. The most widely used type is the n-gram model, which estimates the probability of a word sequence in a text; commonly, a 3-gram model is employed. Using n-gram LMs with a longer context leads to the data sparseness problem. LMs based on recurrent neural networks (RNNs) estimate probabilities using the entire preceding history, which is their advantage over n-gram models.

In our research we used an RNN LM for N-best list rescoring in an ASR system. In Sect. 2 we give a survey of the use of NNs for LM creation; in Sect. 3 we describe the RNN LM; in Sect. 4 we present our baseline LM; Sect. 5 describes our RNN LMs; and experiments on using the RNN LM for N-best list rescoring in Russian speech recognition are presented in Sect. 6.

2 Related Work

The use of NNs for LM training was first presented in [1]. RNNs were first used for language modeling in [2]. In [3], a comparison of LMs based on feed-forward and recurrent NNs was made: on the test set, the RNN LM showed a 0.4 % absolute word error rate (WER) reduction compared with the feed-forward NN.

In [4], strategies for NN LM training on large data sets were presented: (1) reduction of the number of training epochs; (2) reduction of the number of training tokens; (3) reduction of the vocabulary size; (4) reduction of the size of the hidden layer; (5) parallelization. It was shown that when the data are sorted by relevance, faster convergence during training and better overall performance are observed. A maximum entropy model trained as a part of the NN LM was proposed, leading to a significant reduction of computational complexity; a 10 % relative reduction compared with the baseline 4-gram model was obtained.

In [5], it was proposed to call the RNN LM to compute an LM score only if the newly hypothesized word has a reasonable score; cache-based RNN inference was also proposed in order to reduce runtime. Three approaches for exploiting succeeding-word information in RNN LMs were proposed in [6]. In order to speed up training, noise contrastive estimation was investigated for RNN LMs in [7]; it does not require normalization at the output layer and thereby allows faster training. A novel RNN LM dealing with multiple time-scale contexts was presented in [8], where several context lengths were considered in one LM. In [9], paraphrastic RNN LMs, which use multiple automatically generated paraphrase variants, were investigated. In [10], the Long Short-Term Memory (LSTM) NN architecture was explored for modeling the English and French languages. An investigation of jointly trained maximum entropy and RNN LMs for Code-Switching speech is presented in [11], where it was proposed to integrate part-of-speech and language identifier information into the RNN LM. In [12], a discriminative training method for RNN LMs was proposed, using the log-likelihood ratio of the ASR hypotheses and the references as the discriminative criterion.

An RNN LM for Russian was first used in [13]. The RNN LM was trained on a text corpus containing 40M words with a vocabulary of about 100K words. The obtained model was interpolated with the baseline 3-gram and factored LMs, and the resulting LM was used for rescoring a 500-best list, which gave a 7.4 % relative WER reduction.

Despite the increasing popularity of NNs for language modeling, there are only a few studies on NN-based LMs for Russian. In this work, we investigate the use of RNNs for creating Russian LMs.

3 Artificial Neural Networks for Language Modeling

We used the same structure of the RNN LM as in [2]; it is presented in Fig. 1. The RNN consists of an input layer x, a hidden (or context) layer s, and an output layer y. The input to the network at time t is the vector x(t), which is a concatenation of the vector w(t), representing the current word at time t, and the vector s(t-1), which is the output of the hidden layer obtained at the previous step. The size of w(t) is equal to the vocabulary size. The output layer y(t) has the same size as w(t) and represents the probability distribution of the next word given the previous word w(t) and the context vector s(t-1). The size of the hidden layer is chosen empirically and usually amounts to 30–500 units [2].

Fig. 1. General structure of the recurrent neural network.

The input, hidden, and output layers are computed as follows [2]:

$$ x\left( t \right) = w\left( t \right) + s\left( {t - 1} \right) $$
$$ s_{j} \left( t \right) = f\left( {\mathop \sum \limits_{i} x_{i} \left( t \right)u_{ji} } \right) $$
$$ y_{k} \left( t \right) = g\left( {\mathop \sum \limits_{j} s_{j} \left( t \right)v_{kj} } \right), $$

where f(z) is the sigmoid activation function:

$$ f\left( z \right) = \frac{1}{{1 + e^{ - z} }} $$

and g(z) is the softmax function:

$$ g\left( {z_{m} } \right) = \frac{{e^{{z_{m} }} }}{{\mathop \sum \nolimits_{k} e^{{z_{k} }} }} $$

NN training is carried out over several epochs. Usually, the backpropagation algorithm with stochastic gradient descent is used for training.
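For illustration, the forward pass defined by the equations above can be sketched in a few lines of NumPy. This is a minimal sketch with toy sizes and randomly initialized weights rather than the parameters of our actual models; the concatenated input x(t) is split into its word and context parts, which is equivalent to applying a single matrix to the concatenation.

```python
import numpy as np

V, H = 10, 5                                 # toy vocabulary and hidden-layer sizes
rng = np.random.default_rng(0)
U_w = rng.normal(scale=0.1, size=(H, V))     # weights applied to the one-hot word w(t)
U_s = rng.normal(scale=0.1, size=(H, H))     # weights applied to the context s(t-1)
V_out = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(word_id, s_prev):
    """One time step: returns P(next word | history) and the new hidden state s(t)."""
    w = np.zeros(V)
    w[word_id] = 1.0                          # 1-of-V encoding of the current word
    s = sigmoid(U_w @ w + U_s @ s_prev)       # hidden (context) layer s(t)
    y = softmax(V_out @ s)                    # distribution over the next word
    return y, s

s = np.zeros(H)
for word_id in [3, 7, 1]:                     # a toy sequence of word indices
    y, s = rnn_step(word_id, s)
print(round(float(y.sum()), 6))               # the output distribution sums to 1.0
```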

In order to speed up training, it was suggested in [14] to factorize the output layer. Words are mapped to classes according to their frequencies. First, the probability distribution over classes is computed; then, the probability distribution over the words belonging to a specific class is computed. In this case, the word probability is computed as follows:

$$ P\left( {w_{i} |h_{i} } \right) = P\left( {c_{i} |s\left( t \right)} \right)P\left( {w_{i} |c_{i} ,s\left( t \right)} \right), $$

where \( c_{i} \) is the class of the given word \( w_{i} \), and \( h_{i} \) is its history (the preceding words).
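A minimal sketch of this class factorization is given below. The frequency-based class assignment and the two output weight matrices are illustrative placeholders, not the actual factorization produced by the toolkit.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, H, C = 10, 5, 2                              # toy vocabulary, hidden and class sizes
word2class = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # frequency-based word classes
W_class = rng.normal(scale=0.1, size=(C, H))    # hidden -> class scores
W_word = rng.normal(scale=0.1, size=(V, H))     # hidden -> word scores

def word_probability(word_id, s):
    """P(w | h) = P(c | s(t)) * P(w | c, s(t)) for the class c of word_id."""
    c = word2class[word_id]
    p_class = softmax(W_class @ s)[c]           # distribution over classes
    members = np.where(word2class == c)[0]      # words belonging to class c
    p_in_class = softmax(W_word[members] @ s)   # distribution within the class
    return p_class * p_in_class[list(members).index(word_id)]

s = np.abs(rng.normal(size=H))                  # a toy hidden state
print(word_probability(7, s))
```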

4 Training Textual Corpus and Baseline Language Model

For the language model creation, we collected and automatically processed a Russian text corpus compiled from a number of on-line newspapers. The procedure of preliminary text processing and normalization is described in [15]. First, the texts were divided into sentences. Then, text written in any kind of brackets was deleted, and sentences consisting of fewer than six words were removed. A word beginning with an uppercase letter was converted to lowercase; if a whole word was written in uppercase letters, it was lowercased only when the lowercased form existed in the vocabulary. After text normalization the corpus contains over 350M words and more than 1M unique word-forms.
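A simplified Python sketch of these normalization steps is given below; the regular expression, the vocabulary, and the handling of edge cases are illustrative and do not reproduce the exact pipeline of [15].

```python
import re

vocabulary = {"дума", "россия", "закон"}       # hypothetical lowercase vocabulary

def normalize_sentence(sentence):
    """Returns the normalized sentence, or None if it should be discarded."""
    # delete text written in any kind of brackets
    sentence = re.sub(r"[(\[{][^)\]}]*[)\]}]", " ", sentence)
    words = sentence.split()
    # discard sentences consisting of fewer than six words
    if len(words) < 6:
        return None
    normalized = []
    for w in words:
        if w.isupper():
            # fully uppercase word: lowercase it only if the result is in the vocabulary
            normalized.append(w.lower() if w.lower() in vocabulary else w)
        elif w[:1].isupper():
            # word beginning with an uppercase letter: convert to lowercase
            normalized.append(w.lower())
        else:
            normalized.append(w)
    return " ".join(normalized)
```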

For the statistical text analysis, we used the SRI Language Modeling Toolkit (SRILM) [16]. During LM creation we used the Kneser-Ney discounting method and did not apply any n-gram cutoff. We created various 3-gram LMs with different vocabulary sizes; the best speech recognition results were obtained with the 150K vocabulary [17], so this vocabulary was chosen for further experiments with N-best list rescoring. The perplexity of this baseline model was 553.

5 Creation of Language Models Based on Recurrent Neural Networks

For the creation of the RNN LMs we used the Recurrent Neural Network Language Modeling Toolkit (RNNLM toolkit) [18]. We factorized the output layer of the RNN and created LMs with 100 and 500 classes. We created models with different numbers of units in the hidden layer: 100, 300, and 500 [19, 20].

Then we made a linear interpolation of the RNN LMs with the baseline 3-gram model. In this case, the probability score is computed as follows:

$$ P_{IRNN} \left( {w_{i} |h_{i} } \right) = \lambda P_{RNN} \left( {w_{i} |h_{i} } \right) + (1 - \lambda )P_{BL} \left( {w_{i} |h_{i} } \right) $$

where \( P_{RNN} \left( {w_{i} |h_{i} } \right) \) is the probability computed by the RNN LM, \( P_{BL} \left( {w_{i} |h_{i} } \right) \) is the probability computed by the baseline 3-gram model, and λ is the interpolation coefficient.
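As an illustration, the log probability of a whole hypothesis under the interpolated model can be computed as below; the per-word probabilities and the value of λ are toy values.

```python
import math

def sentence_logprob(p_rnn_words, p_bl_words, lam):
    """Sum over the words of a hypothesis of log(lam * P_RNN(w|h) + (1 - lam) * P_BL(w|h))."""
    return sum(math.log(lam * p_r + (1.0 - lam) * p_b)
               for p_r, p_b in zip(p_rnn_words, p_bl_words))

# toy per-word probabilities from the RNN LM and from the baseline 3-gram LM
print(sentence_logprob([0.02, 0.10, 0.005], [0.01, 0.08, 0.010], lam=0.5))
```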

LMs are evaluated by perplexity, which is computed on held-out text data. Perplexity can be considered a measure of how many different equally probable words can, on average, follow any given word; lower perplexity indicates a better LM [21]. Perplexities of the obtained models, computed on a text corpus of 33M words, are presented in Table 1. An interpolation coefficient of 1.0 means that only the RNN LM was used. The table shows that the RNN LMs have lower perplexities than the 3-gram LM.
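For reference, the perplexity of an LM on a held-out text of N words is computed from the probabilities the model assigns to the words given their histories:

$$ PPL = \exp \left( { - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \ln P\left( {w_{i} |h_{i} } \right)} \right) $$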

Table 1. Perplexities of RNN LMs interpolated with 3-gram LM.

6 Experiments

The architecture of the Russian ASR system with the developed RNN LMs is presented in Fig. 2. The system works in two modes [15]: training and recognition. In the training mode, the acoustic models of speech units, the LMs, and the phonemic vocabulary of word-forms used by the recognizer are created.

Fig. 2. Architecture of the Russian ASR system with RNN LMs.

For training the speech recognition system we used our own corpus of spoken Russian, Euronounce-SPIIRAS [22]. The database consists of 16,350 utterances pronounced by 50 native Russian speakers (25 male and 25 female); each speaker pronounced more than 300 phonetically balanced and meaningful phrases. The total duration of the speech data is about 21 h. For acoustic modeling, we applied continuous density Hidden Markov Models (HMMs).

To test the ASR system we used a speech corpus containing 500 phrases pronounced by 5 different speakers (each speaker said the same 100 phrases). The phrases were taken from the materials of an on-line newspaper and were not contained in the training data.

For automatic speech recognition, we applied the open-source Julius engine ver. 4.2 [23]. At the speech decoding stage, the baseline 3-gram language model was used and an N-best list of hypotheses was created. Then the RNN LM was applied for rescoring the obtained N-best list and for selecting the best recognition hypothesis for the pronounced phrase.
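Schematically, the rescoring step can be sketched as follows; the score combination, the LM weight, and the toy hypotheses are illustrative placeholders rather than the exact scoring used by the decoder.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=1.0):
    """nbest: list of (word_sequence, acoustic_score) pairs from the first pass.
    lm_logprob: function returning the (interpolated) LM log probability of a word sequence."""
    rescored = [(words, ac_score + lm_weight * lm_logprob(words))
                for words, ac_score in nbest]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[0][0]                       # best hypothesis after rescoring

# toy 2-best list: the LM prefers the second hypothesis and changes the ranking
nbest = [(["он", "принял", "закон"], -120.0), (["он", "принял", "законы"], -120.5)]
print(rescore_nbest(nbest, lambda ws: -5.0 if ws[-1] == "законы" else -8.0))
```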

The WER obtained with the baseline 3-gram LM was 26.54 %. We produced a 50-best list and rescored it using the RNN LMs as well as the RNN LMs interpolated (+) with the baseline model using various interpolation coefficients. The obtained results are summarized in Table 2.

Table 2. WER obtained after rescoring N-best lists with RNN LMs (%).

The table shows that in most cases the rescoring decreased the WER in comparison with the baseline model, except for the RNN LMs with 100 hidden units used without interpolation with the baseline model. Application of RNNs with 100 classes gave better results than RNNs with 500 classes. The lowest WER, 22.87 %, was achieved using the RNN LM with 500 hidden units and 100 classes interpolated with the 3-gram model using an interpolation coefficient of 0.5.

Our results are consistent with those obtained in [13], although we used a training set of 350 million words, which is about 10 times larger than in [13]. The WER obtained in [13] with the help of an RNN was 32.9 %. Our results are better and support the hypothesis that RNN-based LMs improve speech recognition accuracy.

7 Conclusion

In this paper, we have described the application of RNN LMs to rescoring N-best hypothesis lists of an ASR system. The advantage of RNN LMs over n-gram LMs is that they are able to store an arbitrarily long history of a given word. We tried RNNs with various numbers of units in the hidden layer and also tested the linear interpolation of the RNN LM with the baseline 3-gram LM. We achieved a 14 % relative reduction of WER using the RNN LM with respect to the baseline model.