
1 Introduction

In recent years, the development of end-to-end systems has become the main trend in research on automatic speech recognition (ASR). End-to-end ASR systems transform an input speech signal into a sequence of letters using a single deep neural network (DNN). This reduces the processing time and the amount of memory required compared to standard ASR systems consisting of independent components. However, training an end-to-end system requires much more data than a standard system. This drawback makes it difficult to create an end-to-end ASR system for low-resourced languages. One way to overcome it is to use transfer learning methods.

Transfer learning consists in transferring the knowledge obtained on one or several initial tasks in order to improve training on the target task. There are several ways of applying transfer learning. The most common of them are [1, 2]: (1) instance-based (instances of the non-target domain are added to the target training dataset with appropriate weights); (2) feature-based, which can be asymmetric (original features are transformed to match the target features) or symmetric (source and target features are transformed into a new feature representation); (3) mapping-based (instances from the target and non-target domains are mapped into a new data space with better similarity); (4) network-based (a pre-trained network, including its structure and parameters, is transferred to the target domain and subsequently fine-tuned on the target data); (5) adversarial-based (adversarial technology is used to find transferable features that are suitable for both domains).

Transfer learning is very effective for training DNNs. In the context of speech recognition, the idea of transfer learning is based on the fact that the features learned by the lower layers of a DNN do not depend on the language, while language-specific features are learned by the higher layers [3]. When creating an end-to-end ASR system for an under-resourced language, transfer learning is mostly performed by pre-training the model on data of a non-target language and then fine-tuning it on data of the target language. The parameters of the lower layers of the DNN can be frozen, which means that they are not updated during fine-tuning. The transfer learning scheme is presented in Fig. 1.

Fig. 1. The scheme of the network-based transfer learning method.
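To make the scheme concrete, the following is a minimal PyTorch sketch of network-based transfer learning with lower-layer freezing. The tiny two-layer model, the layer names, and the hyperparameters are illustrative assumptions and do not correspond to the actual network used in this work.

    import torch
    import torch.nn as nn

    class TinyASRModel(nn.Module):
        """Toy model: a lower (language-independent) layer and an upper (language-specific) layer."""
        def __init__(self, feat_dim=80, hidden=256, n_tokens=34):
            super().__init__()
            self.lower = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.upper = nn.Linear(hidden, n_tokens)

        def forward(self, x):
            h, _ = self.lower(x)
            return self.upper(h)

    # 1. Pre-training: a model of the same architecture is trained on non-target speech data.
    pretrained_model = TinyASRModel()   # stands in for the model trained on the non-target language

    # 2. Transfer: the target model is initialized with the pre-trained parameters.
    model = TinyASRModel()
    model.load_state_dict(pretrained_model.state_dict())

    # 3. Optional freezing: the lower layer is not updated during fine-tuning.
    for param in model.lower.parameters():
        param.requires_grad = False

    # 4. Fine-tuning on target (Russian) data: only unfrozen parameters are optimized.
    optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)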

The aim of this research was to explore the transfer learning method for training an end-to-end Russian speech recognition system in low-resourced conditions. We tried several languages for pre-training the model and investigated the influence of freezing the lower layers on speech recognition results. The rest of the paper is organized as follows. In Sect. 2 we give a brief survey of research in which transfer learning was used for training ASR systems, in Sect. 3 we describe our end-to-end Russian speech recognition system with transfer learning, the experimental results are given in Sect. 4, and in Sect. 5 we draw conclusions.

2 Related Work

There is a large body of research on the application of transfer learning to training ASR systems. One of the earliest transfer learning methods in ASR is the tandem approach [4]. In the tandem approach, a DNN with a bottleneck layer is trained first, and then the bottleneck outputs are used as features in a standard Hidden Markov Model (HMM) based system or a hybrid DNN/HMM system [5].
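As an illustration of the tandem idea, the sketch below shows how bottleneck activations could be extracted and passed on to an HMM-based system; the network sizes and the number of training targets are assumptions, not values from [4, 5].

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        def __init__(self, feat_dim=40, bottleneck_dim=42, n_targets=2000):
            super().__init__()
            self.pre = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, 1024), nn.ReLU())
            self.bottleneck = nn.Linear(1024, bottleneck_dim)   # narrow layer
            self.post = nn.Sequential(nn.ReLU(), nn.Linear(bottleneck_dim, n_targets))

        def forward(self, x):
            # Used during training, e.g. to classify context-dependent phone states.
            return self.post(self.bottleneck(self.pre(x)))

        def extract(self, x):
            # Bottleneck outputs used as tandem features for the HMM-based system.
            return self.bottleneck(self.pre(x))

    frames = torch.randn(100, 40)                        # 100 frames of 40-dimensional acoustic features
    tandem_features = BottleneckDNN().extract(frames)    # 42-dimensional bottleneck features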

Recently, transfer learning has mostly been used for training hybrid DNN/HMM and end-to-end systems. For example, in [6] transfer learning was applied to train acoustic models for two Tibetan dialects, with Mandarin used as the non-target language. In [7], a German ASR system based on a convolutional neural network was trained using transfer learning from an English model trained on the LibriSpeech corpus, with the lower layers of the network being frozen. The influence of freezing lower-layer parameters on the results of end-to-end speech recognition was investigated in [8]. The authors performed experiments on German and Swiss German, with English being used for pre-training. The experiments showed that freezing the lower layers increases speech recognition accuracy and reduces training time. A significant improvement in accuracy was achieved when the first layer was frozen, whereas freezing the higher layers did not increase recognition accuracy.

Research on training a hybrid DNN/HMM children's speech recognition system using transfer learning was presented in [9]. An adult speech database was used for pre-training. The authors obtained a 16.5% relative reduction in WER compared to the baseline system with the Speaker Adaptive Training technique.

In [10], feature transfer learning was performed. First, the encoder's lower layers, which predict spectral features from the raw waveform, were trained. The trained parameters were then transferred to the attention-based encoder-decoder model.

In [11], the transfer learning method called teacher-student was used to initialize the parameters of an online speech recognition system with the parameters obtained by training a large offline end-to-end system. Teacher-student learning is an approach for transferring knowledge from a large deep ("teacher") network to a shallower ("student") model [12]. The student neural network is trained to minimize the difference between its own output distributions and those of the teacher network.
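A minimal sketch of such a teacher-student objective is given below; the KL-divergence formulation, the temperature value, and the toy tensor shapes are illustrative assumptions rather than the exact setup of [11, 12].

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """KL divergence between the teacher's and the student's output distributions."""
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    # Toy usage: logits over 34 output tokens for a batch of 8 frames.
    student_logits = torch.randn(8, 34, requires_grad=True)
    teacher_logits = torch.randn(8, 34)           # produced by the large offline model
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()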

It should also be noted that transfer learning can be realized as multi-task learning in a multilingual system. Such an approach was realized, for example, in [13, 14]. In [15], a technique was proposed for adapting a DNN-based acoustic model to a specific domain in a multilingual system. It adapts a low-resourced language system trained on one source domain to a target domain using adaptation data of a high-resourced language.

In the current paper, we consider the training of a monolingual ASR system in low-resourced conditions.

3 End-to-End Speech Recognition Model with Transfer Learning

3.1 Architecture of the End-to-End Speech Recognition Model

We used a joint CTC-attention based encoder-decoder model similar to the one proposed in [16]. Our model was described in detail in [17]. The encoder was a Bidirectional Long Short-Term Memory (BLSTM) network containing five layers of 512 cells each, with highway connections [18]. The decoder was a Long Short-Term Memory (LSTM) network containing two layers of 512 cells each. A location-aware attention mechanism [19] was used in the decoder. Before the encoder, there was a feature extraction block, a VGG [20] model with residual connections (ResNet). At the training stage, the CTC weight was equal to 0.3. Filter bank features were used as input.
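The following sketch shows how a joint CTC-attention objective combines the two losses with the CTC weight of 0.3 mentioned above; the toy tensors stand in for the real encoder and decoder outputs, and the token inventory size is an assumption.

    import torch
    import torch.nn as nn

    # Toy CTC branch: log-probabilities for 50 encoder frames, batch of 1, 34 output tokens.
    log_probs = torch.randn(50, 1, 34).log_softmax(dim=-1)
    targets = torch.randint(1, 34, (1, 10))       # 10 target tokens (index 0 is reserved for blank)
    ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets,
                                   input_lengths=torch.tensor([50]),
                                   target_lengths=torch.tensor([10]))

    # Toy attention branch: cross-entropy of the decoder outputs over the same targets.
    decoder_logits = torch.randn(1, 10, 34)
    att_loss = nn.CrossEntropyLoss()(decoder_logits.view(-1, 34), targets.view(-1))

    # Joint objective with the CTC weight of 0.3 used at the training stage.
    ctc_weight = 0.3
    loss = ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss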

At the decoding stage, we additionally used an LSTM-based language model (LM), which was trained on a text corpus of about 350M words collected from online Russian newspapers. The LSTM contained one layer with 512 cells. The vocabulary consisted of the 150K most frequent word forms from the training text corpus.
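A minimal sketch of an LSTM LM of the size described above (one layer, 512 cells, 150K-word vocabulary) is shown below; the embedding dimension and the toy input are assumptions.

    import torch
    import torch.nn as nn

    class LSTMLanguageModel(nn.Module):
        def __init__(self, vocab_size=150_000, embed_dim=512, hidden_dim=512):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
            self.output = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens, state=None):
            x = self.embedding(tokens)
            h, state = self.lstm(x, state)
            return self.output(h), state          # next-word logits and recurrent state

    # Toy usage: scoring a hypothesis of 5 word indices during decoding.
    logits, _ = LSTMLanguageModel()(torch.randint(0, 150_000, (1, 5)))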

For training and testing the end-to-end Russian speech recognition model, we used the ESPnet toolkit [21] with PyTorch as the back-end.

3.2 Application of Transfer Learning at Model’s Training

Transfer learning was carried out by pre-training the model on non-target speech data, transferring the trained parameters of the neural network to the target model, and then training the model on Russian speech data.

The first step was to choose speech corpora for pre-training. The main criteria for selecting the speech corpora were the following: (1) speech data duration of more than 100 hours; (2) sentence-level segmentation; (3) availability of transcripts. We chose five speech corpora of non-target languages, which are presented in Table 1. Among them is a corpus of Ukrainian speech that does not meet the duration requirement. However, we decided to use this corpus as well because Ukrainian is related to Russian, so we hypothesized that pre-training on these speech data may be useful.

Table 1. Characteristics of speech corpora used for pre-training.

The weights obtained from the model trained on non-target data were used to initialize the weights of the feature extraction block, the encoder, and the attention mechanism. Then, we conducted experiments on freezing the parameters of the lower layers during transfer learning. The architecture of our end-to-end model with transfer learning is presented in Fig. 2.

Fig. 2. Architecture of the end-to-end speech recognition model with transfer learning.
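The parameter transfer itself can be sketched as a selective copy of the pre-trained weights followed by optional freezing, as below. The parameter-name prefixes are hypothetical and depend on the concrete model implementation; ESPnet's internal naming is not reproduced here.

    # Blocks whose parameters are copied from the non-target model (names are illustrative).
    TRANSFER_PREFIXES = ("frontend.", "encoder.", "attention.")
    # Layers that may additionally be frozen during fine-tuning (names are illustrative).
    FREEZE_PREFIXES = ("frontend.layer0.",)

    def transfer_parameters(target_model, source_state_dict):
        # Copy only the feature extraction block, encoder, and attention parameters;
        # the decoder keeps its random initialization.
        transferred = {name: weight for name, weight in source_state_dict.items()
                       if name.startswith(TRANSFER_PREFIXES)}
        target_model.load_state_dict(transferred, strict=False)
        # Optionally freeze the lowest layer of the feature extraction block.
        for name, param in target_model.named_parameters():
            if name.startswith(FREEZE_PREFIXES):
                param.requires_grad = False
        return target_model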

The end-to-end model was trained on Russian speech data composed of the speech corpus collected at SPC RAS [17] as well as the free speech corpora VoxForge [26] and M-AILABS [25]. The corpus collected at SPC RAS consists of recordings of phonetically rich and meaningful phrases and texts; it also includes commands for the MIDAS information kiosk [27] and 7-digit telephone numbers. In total, we had 60.6 h of speech data. This dataset was split into training and validation parts with sizes of 95% and 5%, respectively.
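A minimal sketch of such a 95%/5% split, assuming the corpus is available as a list of utterance identifiers (the fixed seed and the identifier format are illustrative):

    import random

    def split_corpus(utterance_ids, valid_fraction=0.05, seed=0):
        ids = list(utterance_ids)
        random.Random(seed).shuffle(ids)
        n_valid = int(len(ids) * valid_fraction)
        return ids[n_valid:], ids[:n_valid]       # (training part, validation part)

    train_ids, valid_ids = split_corpus([f"utt{i:05d}" for i in range(1000)])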

4 Experiments

Experiments on continuous Russian speech recognition were performed on our test speech corpus consisting of 500 phrases pronounced by 5 speakers. The phrases were taken from an online newspaper that was not used for LM training. During the experiments, we used a beam search pruning method similar to the approach proposed in [28] and substituted softmax with Gumbel-Softmax [29]. The decoding setup is described in [30].
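The substitution of softmax with Gumbel-Softmax during decoding can be illustrated with the following sketch; the temperature, the toy logits, and the beam size are assumptions and do not reproduce the exact setup of [28-30].

    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 34)                   # decoder logits over the output tokens

    softmax_scores = F.softmax(logits, dim=-1)                              # standard scoring
    gumbel_scores = F.gumbel_softmax(logits, tau=1.0, hard=False, dim=-1)   # noisy alternative

    # Candidate tokens kept by the pruned beam search are ranked by these scores,
    # e.g. the top 5 hypotheses per step:
    topk_scores, topk_tokens = gumbel_scores.topk(k=5, dim=-1)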

Without transfer learning, we obtained a CER of 14.9% and a WER of 37.1%. The results obtained after applying transfer learning with different pre-training languages are presented in Table 2. In the table, "Init." means that the pre-trained parameters were used only to initialize the parameters of the corresponding block of the model, without freezing them.

Table 2. Experimental results on Russian speech recognition using different non-target languages for transfer learning (%).

We conducted a series of experiments on applying transfer learning in different parts of the model. Transferring neural network parameters from the non-target model to initialize only the feature extraction block slightly decreased the recognition error, and in some cases even slightly increased it, which can be attributed to statistical fluctuation. Freezing the first layer of the feature extraction block reduced CER and WER when Italian, Catalan, or German was used. Freezing higher layers did not reduce the recognition error. Applying transfer learning to initialize the encoder's parameters decreased the recognition error in all cases. We then conducted experiments on transferring the parameters of both the encoder and the feature extraction block, freezing the first layer of the feature extraction block for those languages for which freezing gave a better result than initialization alone. Freezing the first layer of the encoder increased the recognition error, so we did not perform experiments on freezing its higher layers. Finally, transfer learning was applied to the attention mechanism. In most cases (except English), using transfer learning to initialize the attention mechanism in addition to the encoder and feature extraction block gave a further improvement.

The best result (WER = 28.0%) was achieved when English was used as the non-target language and transfer learning was applied to initialize the parameters of both the feature extraction block and the encoder. This may be due to the fact that the English corpus was the largest one we used for pre-training. It should also be noted that the Ukrainian language gave a result comparable to the other non-target languages, although the size of the Ukrainian corpus was significantly smaller. This can be explained by the fact that Russian and Ukrainian are related languages. We can therefore conclude that in low-resourced conditions, using another low-resourced language related to the target language can improve speech recognition results.

5 Conclusions and Future Work

In this paper, we have investigated the use of speech data of different non-target languages for pre-training an end-to-end Russian speech recognition system. The best results were achieved when parameters were transferred from the model pre-trained on English speech to initialize the parameters of the feature extraction block and the encoder. In this case, the relative reduction of WER was 24.53% (from 37.1% to 28.0%). Further research will be connected with enlarging the training data and experimenting with other neural network architectures for Russian end-to-end speech recognition, for example, the Transformer.