1 Introduction

Deep neural networks (DNNs) are widely used in automatic speech recognition (ASR) systems. For acoustic modeling, a DNN is usually combined with hidden Markov models (HMMs) in a hybrid DNN/HMM model. In such systems, the HMMs model the long-term dependencies, and the DNNs provide discriminative training. The DNN is trained to predict the a posteriori probability of each context-dependent state given the acoustic observations. During decoding, the output probabilities are divided by the prior probability of each state, forming a “pseudo-likelihood” that is used in place of the state emission probabilities in the HMM [1]. For language modeling, NNs are mainly used for lattice or N-best list rescoring.
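The division by the state priors can be illustrated with a minimal sketch. The function name, the toy posteriors, and the toy priors below are assumptions made for illustration only; they are not taken from the described system.

```python
import numpy as np

def pseudo_log_likelihoods(log_posteriors, log_priors):
    """Turn DNN state posteriors into scaled likelihoods for HMM decoding.

    log_posteriors: (frames, states) array of log p(s | x_t) from the softmax.
    log_priors:     (states,) array of log p(s).
    """
    # By Bayes' rule, p(x_t | s) = p(s | x_t) * p(x_t) / p(s). The term p(x_t)
    # is the same for all states at a given frame, so it is dropped.
    return log_posteriors - log_priors

# Toy values (assumed): 2 frames, 3 context-dependent states.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
scaled = pseudo_log_likelihoods(np.log(posteriors), np.log(priors))
```

Working in the log domain turns the division into a subtraction, which is how decoders typically combine scores.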

In this paper, we study a Russian large vocabulary continuous speech recognition (LVCSR) system that uses NNs for both acoustic and language modeling. The process of speech decoding with an NN-based acoustic model (AM) and language model (LM) is illustrated in Fig. 1. We used hybrid DNN/HMMs with different topologies as AMs. Speech decoding with N-best list generation was performed using the baseline 3-gram model. Then an RNN LM was applied to rescore the obtained N-best list of hypotheses and to select the best recognition hypothesis for each pronounced phrase. In addition, we performed rescoring using a linear interpolation of the RNN and n-gram LMs.

Fig. 1. Decoding of speech signal with NN-based AM and LM.

The paper is organized as follows: Sect. 2 surveys the application of DNNs to both acoustic and language modeling; Sect. 3 describes our DNN-based AMs; Sect. 4 presents our baseline 3-gram and RNN LMs; Sect. 5 presents the speech recognition experiments with the NN-based AMs and LMs.

2 Related Work

Different types of NNs can be used for acoustic modeling in ASR: feedforward deep neural networks (DNNs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), deep belief networks (DBNs), time delay neural networks (TDNNs), long short-term memory (LSTM) networks, and bidirectional LSTM (BLSTM) networks [2, 3].

TDNN-based AMs were presented in [4], where they yielded a relative word error rate (WER) reduction of 2.6%. A TDNN for keyword spotting is described in [5]. The use of LSTM in a hybrid DNN/HMM system was presented in [6]; LSTM allowed the authors to reduce WER compared to the DNN-based system. A BLSTM recurrent neural network was studied in [7], which examines different optimization methods, batching, truncated backpropagation, and regularization techniques such as dropout. The best BLSTM model gave a relative WER improvement of over 15% compared to the best feed-forward baseline.

In [8], TDNN was combined with LSTM by interleaving TDNN and LSTM layers. It was shown that this architecture efficiently models a longer temporal context. A TDNN-LSTM architecture was also applied in [9] to a graphemic ASR system, where it outperformed the DNN-based system by 18.6% relative. Compared to the TDNN and LSTM systems, the relative reduction was 7.1% and 6.4%, respectively.

For language modeling, RNNs are generally used. In an RNN, the hidden layer represents the entire preceding context, as opposed to feedforward NNs, which use a preceding context of fixed length for word prediction. RNNs for language modeling were introduced in [10]. A parallel RNN with part-of-speech (POS) tags is presented in [11]. The proposed model consists of two RNNs: a word RNN and a POS RNN. The hidden state of the word RNN is also affected by the output of the POS RNN state. An LSTM-based LM was used in [12]. There are also RNN LMs that use information about both preceding and succeeding words. Usually, bidirectional RNNs are used for this purpose [13]. In [14], the authors proposed a unidirectional RNN structure that uses a feedforward unit to model a finite number of succeeding words.

Some studies explore the use of NNs for both acoustic and language modeling. For example, [15] describes an improvement of the Microsoft ASR system. The system used a CNN-BLSTM AM and a 4-gram LM for decoding and lattice rescoring, and an LSTM-based LM was applied for 500-best list rescoring.

There are a few studies on the application of DNNs in Russian speech recognition systems. Examples of Russian ASR systems with DNN-based acoustic models are presented in [16, 17]. RNN LMs for Russian are proposed in [18, 19].

3 Acoustic Modeling with NNs

We tried three types of NNs for acoustic modeling: feedforward DNN, TDNN, and LSTM. The AMs were trained using the open-source Kaldi toolkit [20]. Mel-frequency cepstral coefficients (MFCCs) were used as input to the NNs. For speaker adaptation, a 100-dimensional i-vector [21] was appended to the 40-dimensional MFCC input.

We used Dan’s implementation [22] of DNN training in Kaldi and experimented with feed-forward DNNs with the p-norm activation function [23]. The output was a softmax layer with the dimension equal to the number of context-dependent states (1609 in our case). We created DNNs with different numbers of hidden layers and different input/output dimensions. The system was trained for 15 epochs with the learning rate decreasing from 0.02 to 0.004 and then for 5 epochs with a constant final learning rate (0.004). Our hybrid DNN/HMM system is described in more detail in [24].
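As an illustration of what an input/output dimension pair means for the p-norm nonlinearity, the sketch below reduces a 900-dimensional input to 90 outputs by taking the p-norm over groups of 10 values. The choice p = 2, the function name, and the random input are assumptions made for illustration; this is not the Kaldi implementation.

```python
import numpy as np

def pnorm(x, group_size=10, p=2.0):
    """p-norm nonlinearity: each output is the p-norm of one group of inputs."""
    groups = x.reshape(-1, group_size)          # (outputs, group_size)
    return np.sum(np.abs(groups) ** p, axis=1) ** (1.0 / p)

rng = np.random.default_rng(0)
hidden_in = rng.standard_normal(900)            # assumed p-norm input dimension
hidden_out = pnorm(hidden_in)                   # 900 / 10 = 90 outputs
```

With non-overlapping groups, the ratio of input to output dimension (900/90 here) gives the group size.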

A TDNN is a feed-forward DNN whose nodes are modified by time delays. TDNNs are efficient at modeling temporal dynamics in speech and can capture long-term dependencies between acoustic events. In [4], a sub-sampling technique was proposed for TDNNs that speeds up training and makes the training time comparable to that of a standard feed-forward DNN. With this technique, hidden activations are computed at only a few time steps instead of all time steps. Instead of splicing together contiguous temporal windows of frames at each layer, it is proposed to splice together no more than two frames.

We created TDNNs with different numbers of hidden layers, various temporal contexts, and various splice indexes. The p-norm nonlinearity was also used in the hidden layers. An example of a TDNN architecture with time context [−7, +4] using sub-sampling is presented in Fig. 2. The input layer splices together frames in the context [−1, 1]. At the first hidden layer, sub-sampling {−2, 1} is performed, which means that the inputs at the current frame minus 2 and at the current frame plus 1 are spliced together. Then, at the second hidden layer, sub-sampling {−4, 2} is applied. Our TDNN system is described in detail in [25].

Fig. 2. An example of TDNN architecture with sub-sampling for network context [−7, 4].
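The overall network context follows from the per-layer splice offsets: the left contexts add up, and so do the right contexts. The sketch below reproduces the [−7, 4] context of the example; the function name and the list representation of the splice indexes are assumptions made for illustration.

```python
def network_context(splice_indexes):
    """Return (left, right) input context implied by per-layer splice offsets."""
    left = sum(-min(layer) for layer in splice_indexes)
    right = sum(max(layer) for layer in splice_indexes)
    return -left, right

# Input layer splices frames -1..1, then sub-sampled layers {-2, 1} and {-4, 2}.
layers = [[-1, 0, 1], [-2, 1], [-4, 2]]
context = network_context(layers)  # (-7, 4)
```

Here the left context is 1 + 2 + 4 = 7 frames and the right context is 1 + 1 + 2 = 4 frames.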

An LSTM network contains special units called memory blocks. Each memory block is composed of a memory cell, which stores the temporal state of the network, and multiplicative units called gates, which control the information flow. There are three gates: an input gate, an output gate, and a forget gate [26]. An example of the memory block is presented in Fig. 3 [27], where x_t is the input vector at time t and h_t is the output vector.

Fig. 3. An example of an LSTM memory block.
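The interaction of the memory cell and the three gates can be sketched as a single time step. This is the common textbook LSTM formulation without peephole connections or projection layers, which may differ from the variant trained in Kaldi; the weight layout, dimensions, and random values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM memory block.

    W has shape (4 * n, d + n) and b has shape (4 * n,), where d is the input
    dimension and n is the cell dimension.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell update
    c_t = f * c_prev + i * g       # memory cell keeps the temporal state
    h_t = o * np.tanh(c_t)         # output vector h_t
    return h_t, c_t

rng = np.random.default_rng(0)
d, n = 4, 3                        # toy dimensions (assumed)
W = rng.standard_normal((4 * n, d + n)) * 0.1
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((5, d)):
    h, c = lstm_step(x_t, h, c, W, b)
```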

We created LSTMs and BLSTMs with 3 layers and tried cell dimensions of 512, 1024, and 2048. The output state label was delayed by 5 frames. The LSTM delays were −1, −2, and −3 at layers 1, 2, and 3, respectively. The BLSTM used recurrent connections with delays of −1 (forward) and 1 (backward) at layer 1, −2 (forward) and 2 (backward) at layer 2, and −3 (forward) and 3 (backward) at layer 3. The LSTMs and BLSTMs were trained for 3 epochs.

4 Language Modeling Using NN

The text corpus for LM training and evaluation was taken from online newspapers. The training corpus contains over 350 M words after text normalization. The corpus for perplexity estimation contained 33 M words. The vocabulary size was 150 K word-forms. Transcriptions were generated automatically by applying transcription rules to the list of word-forms with marked stressed vowels [28]. The baseline 3-gram model with Kneser-Ney discounting was created using the SRI Language Modeling Toolkit (SRILM) [29].

The topology of the RNN LM is presented in Fig. 4. We used the same architecture as in [10]. The RNN consists of an input layer, a hidden (or context) layer, and an output layer. The input layer is a concatenation of the vector representing the current word and the output vector of the hidden layer at the previous time step. The hidden layer encodes the entire preceding context. The output layer represents the probability distribution of the next word given the previous word and the preceding context. The size of the hidden layer is chosen empirically; it usually consists of 30–500 units [10].

Fig. 4. Recurrent neural network topology.
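A single step of such a simple recurrent LM can be sketched as follows. The matrix names U, W, and V, the toy vocabulary and hidden sizes, and the random weights are assumptions made for illustration; class-based factorization of the output layer is omitted here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(word_id, s_prev, U, W, V):
    """One step of a simple recurrent LM.

    U: (hidden, vocab) input weights, W: (hidden, hidden) recurrent weights,
    V: (vocab, hidden) output weights.
    """
    # Multiplying U by a one-hot word vector just selects one column of U.
    s_t = 1.0 / (1.0 + np.exp(-(U[:, word_id] + W @ s_prev)))
    y_t = softmax(V @ s_t)         # distribution over the next word
    return s_t, y_t

rng = np.random.default_rng(0)
vocab, hidden = 10, 5              # toy sizes (assumed)
U = rng.standard_normal((hidden, vocab)) * 0.1
W = rng.standard_normal((hidden, hidden)) * 0.1
V = rng.standard_normal((vocab, hidden)) * 0.1
s = np.zeros(hidden)
for w in [1, 4, 2]:                # a toy word-id sequence
    s, y = rnnlm_step(w, s, U, W, V)
```

Because the hidden state is fed back at every step, it can in principle carry information from the whole preceding word sequence.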

To create the RNN LM, we used the Recurrent Neural Network Language Modeling Toolkit (RNNLM toolkit) [30]. To speed up training, the output layer was factorized [31]. We created RNNs with different numbers of hidden units and classes. The models are described and evaluated in detail in [32]. For the current experiments, we used an RNN with 500 hidden units and 100 classes. We also linearly interpolated the RNN and 3-gram LMs. Perplexities of the models are presented in Table 1. An interpolation coefficient of 0 means that only the 3-gram model was used; an interpolation coefficient of 1.0 means that only the RNN LM was used.

Table 1. Perplexities of interpolated RNN and 3-gram LMs.
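The linear interpolation and the perplexity measure can be sketched as follows. The per-word probabilities are made-up toy values, and the function names are assumptions made for illustration.

```python
import math

def interpolate(p_rnn, p_ngram, lam):
    """Linear interpolation: lam = 0 uses only the n-gram, lam = 1 only the RNN."""
    return lam * p_rnn + (1.0 - lam) * p_ngram

def perplexity(word_probs):
    """Perplexity of a word sequence from its per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# Toy per-word probabilities (assumed) from each model for a 3-word text.
p_rnn = [0.20, 0.05, 0.10]
p_ngram = [0.10, 0.10, 0.02]
mixed = [interpolate(r, n, 0.5) for r, n in zip(p_rnn, p_ngram)]
ppl = perplexity(mixed)
```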

5 Experiments

5.1 Speech Corpora

For training and testing the Russian ASR system, we used our own speech corpora recorded at SPIIRAS. The speech data were recorded with two professional Oktava MK-012 condenser microphones in clean acoustic conditions, at a 44.1 kHz sampling rate with 16 bits per sample. The signal-to-noise ratio was about 35 dB. For the recognition experiments, all the audio data were down-sampled to 16 kHz.

The training speech corpus consists of three parts. The first part contains recordings of phonetically rich and meaningful phrases and texts; this database was developed within the framework of the EuroNounce project [33]. The second part consists of recordings of a phonetically representative text presented in [34]; it also contains phrases taken from Appendix G to the Russian State Standard P 50840-95 [35]. The third part is the audio data of the audio-visual speech corpus HAVRUS [36]. The total duration of the speech data is more than 30 h. To test the system, we used another speech dataset consisting of 500 phrases pronounced by 5 speakers [37]. The phrases were taken from the materials of a Russian online newspaper (Fontanka.ru) that was not present in the training speech and text data. A detailed description of the corpora is presented in [25].

5.2 Speech Recognition Results with 3-Gram LM

First, we carried out experiments on Russian speech recognition using DNN/HMM AMs. The results are presented in Table 2. The best result (WER = 20.71%) was obtained when the DNN had 6 hidden layers and the input/output dimension was 900/90. Increasing the number of hidden layers and units increased the WER, which may be caused by the limited amount of training data and model overfitting.

Table 2. WER with feed-forward DNN models (%).

Then we carried out experiments with TDNN/HMM AMs. Table 3 presents the results. The lowest WER of 17.62% was achieved by the TDNN with 5 hidden layers and time context [−8, 8] (TDNN2). Using models with a larger temporal context increased the WER, which may also be caused by overfitting.

Table 3. WER with TDNN models (%).

The results obtained with LSTMs and BLSTMs (Table 4) are approximately the same as those of the feed-forward DNNs. This may be because LSTMs overfit easily, so the model parameters should be tuned more carefully.

Table 4. WER with LSTM models (%).

5.3 Speech Recognition Results with RNN LM

The best models from the previous experiments were used for the experiments with N-best list rescoring. We rescored four 500-best lists obtained with the following AMs: (1) DNN with 6 hidden layers and input/output dimension equal to 900/90; (2) TDNN2 with input/output dimension equal to 600/60; (3) LSTM with 1024 units in one hidden layer; (4) BLSTM with 1024 units in the hidden layer. For rescoring, we used the RNN-based LM alone as well as the RNN interpolated with the 3-gram model using different interpolation coefficients. The results are summarized in Table 5. The lowest WER of 15.13% was achieved using the TDNN-based AM and the RNN LM interpolated with the 3-gram model with an interpolation coefficient of 0.5.

Table 5. WER after 500-best list rescoring (%).
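N-best list rescoring can be sketched as replacing the LM score of each hypothesis and picking the hypothesis with the best combined score. The field names, the LM weight, the word penalty, and the toy scores below are hypothetical and are not the settings used in the experiments.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: list
    am_logprob: float   # acoustic score from the first decoding pass
    lm_logprob: float   # new LM score, e.g. from an interpolated RNN + 3-gram LM

def rescore(nbest, lm_weight=10.0, word_penalty=0.0):
    """Return the hypothesis with the highest combined log score."""
    def total(h):
        return h.am_logprob + lm_weight * h.lm_logprob + word_penalty * len(h.words)
    return max(nbest, key=total)

# Toy 2-best list (assumed scores): the second hypothesis wins after rescoring
# because its better LM score outweighs its slightly worse acoustic score.
nbest = [
    Hypothesis(["a", "b"], am_logprob=-100.0, lm_logprob=-5.0),
    Hypothesis(["a", "c"], am_logprob=-101.0, lm_logprob=-4.0),
]
best = rescore(nbest)
```

In practice, the LM weight and word penalty would be tuned on held-out data.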

6 Conclusions and Future Work

In this paper, we described our NN-based very large vocabulary continuous Russian speech recognition system. For acoustic modeling, we trained hybrid DNN/HMM models with different DNN topologies. For language modeling, we used an RNN at the N-best list rescoring stage. The system was trained and tested on our own speech and text corpora. The lowest WER was achieved with TDNN/HMMs as the AM and rescoring of the 500-best list with the RNN LM interpolated with the 3-gram model. We achieved a relative WER reduction of 27% compared to our best result obtained with the baseline feedforward DNN/HMM AM and 3-gram LM. In future work, we plan to investigate other topologies of DNNs for acoustic and language modeling.
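The relative reduction can be checked from the two WER values reported above (20.71% for the best feedforward DNN/HMM with the 3-gram LM and 15.13% after rescoring). The function name is an assumption made for illustration.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

reduction = relative_reduction(20.71, 15.13)  # about 26.9, i.e. roughly 27%
```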