
1 Introduction

Although the development of the first speech recognition systems began half a century ago, the last ten years have seen a significant increase in the accuracy of ASR systems and in the number of their applications, even for low-resource languages [17, 20].

This is mainly due to the widespread adoption of deep learning and the very effective performance of neural networks in hybrid (DNN-HMM) recognition systems. However, in the last few years there has been a trend toward changing the traditional ASR training paradigm. End-to-end training is gradually displacing the complex multistage learning process (training GMMs [9], clustering allophone states, aligning speech to clustered senones, training neural networks with the cross-entropy loss, and then retraining with a sequence-discriminative criterion). The new approach trains the system in one global step, working only with acoustic data and reference texts, and significantly simplifies, or in some cases completely eliminates, the decoding process. It also avoids the out-of-vocabulary (OOV) problem: an end-to-end system trained with parts of words as targets can construct new words itself from graphemes or subword units, while traditional DNN-HMM systems are limited to the language model vocabulary.

The whole variety of end-to-end systems can be divided into three main categories: Connectionist Temporal Classification (CTC) [14]; sequence-to-sequence models with an attention mechanism [8]; and RNN-Transducers [13].

The Connectionist Temporal Classification (CTC) approach uses a loss function that takes into account all possible alignments between the reference text and the audio data. Targets for a CTC-based system can be phonemes, graphemes, syllables, other subword units, or even whole words. However, such systems usually require considerably more data to train well than traditional hybrid systems.
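
For concreteness, here is a minimal sketch of CTC training in PyTorch (our illustration, not code from any of the cited systems; the shapes and the random tensors standing in for a real acoustic model and real transcripts are assumptions):

```python
import torch
import torch.nn as nn

NUM_SYMBOLS = 32          # e.g. graphemes plus the special "blank" at index 0
ctc_loss = nn.CTCLoss(blank=0)

# Per-frame log-posteriors from an acoustic model: (T, N, C).
T, N, C = 200, 4, NUM_SYMBOLS
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Reference transcripts as label indices (no blanks), one row per utterance.
targets = torch.randint(1, C, (N, 30), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 30, dtype=torch.long)

# The loss sums over all alignments; no explicit segmentation is needed.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```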

Sequence-to-sequence models map entire input sequences to output sequences without any assumptions about their alignment. The most popular sequence-to-sequence architecture is the encoder-decoder model with attention. The encoder and decoder are usually built from recurrent neural networks; the basic attention mechanism computes energy weights that reflect the importance of each encoder vector for the current decoding step, and then sums these vectors weighted by the energies. Encoder-decoder models with attention show results close to traditional DNN-HMM systems and in some cases surpass them, but for a number of reasons their usage is still rather limited. First of all, such systems perform best when the duration of real utterances is close to the duration of the utterances in the training data; as the duration difference increases, performance degrades significantly [8, Fig. 4 “Utterance Length vs. Error”].
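
As an illustration of the basic mechanism described above, here is a minimal additive (Bahdanau-style) attention step in PyTorch (a sketch; all layer names and sizes are our own):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """One decoding step: energy weights over encoder states, then a weighted sum."""
    def __init__(self, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, att_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (T, enc_dim); dec_state: (dec_dim,)
        energies = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state)))
        weights = torch.softmax(energies.squeeze(-1), dim=0)        # importance per frame
        context = (weights.unsqueeze(-1) * enc_states).sum(dim=0)   # weighted sum
        return context, weights
```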

Moreover, the entire utterance must be processed by the encoder before the decoder can start, which makes this approach hard to apply to long recordings or streaming audio. Segmenting long recordings into shorter utterances solves the duration issue, but breaks the context and ultimately degrades recognition accuracy. Secondly, the computational complexity of encoder-decoder models is high because of the recurrent networks, so these models are rather slow and hard to parallelize.

The RNN-Transducer is an extension of CTC that can model dependencies within and between the elements of both the input (audio frames) and output (phonemes or other subword units) sequences. Despite its mathematical elegance, such systems are complicated and hard to implement, so they are still rarely used, although several impressive results have been obtained with this technique.

The CTC-based approach is easier to implement, scales better, and has many “degrees of freedom”, which makes it possible to significantly improve baseline systems and achieve results close to the state of the art. Moreover, CTC-based systems are well compatible with traditional WFST decoders and can easily be integrated with conventional ASR systems.

Besides, as already mentioned, CTC systems are rather sensitive to the amount of training data, so it is highly relevant to study how to build an effective CTC-based recognition system from a small number of training samples. This is especially true for low-resource languages, where only a few dozen hours of speech are available. Building ASR systems for low-resource languages is one of the aims of the international Babel program, funded by the Intelligence Advanced Research Projects Activity (IARPA). Within this program, extensive research was carried out, resulting in a number of modern ASR systems for low-resource languages. Recently, end-to-end approaches have been applied to this task; they show expectedly worse results than traditional systems, although the difference is rather small.

In this paper we explore a number of ways to improve end-to-end CTC-based systems in low-resource scenarios using the Turkish dataset from the IARPA Babel collection. In the next section we describe different versions of CTC systems and their application to low-resource speech recognition in more detail. Section 3 describes the experiments and their results. Section 4 summarizes the results and discusses possible directions for further work.

2 Related Work

The development of CTC-based systems originates from the paper [14], where the CTC loss was introduced. This loss is the total probability of a label sequence given an observation sequence, which takes into account all possible alignments induced by the given word sequence.

Although the number of possible alignments grows exponentially with the sequence lengths, there is an efficient dynamic-programming algorithm for computing the CTC loss (known as the forward-backward algorithm). This algorithm operates on the posterior probabilities of observing each output element at a given time frame, and the CTC loss is differentiable with respect to these probabilities.
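
For reference, in the notation of [14]: let \(y^t_k\) be the network output for symbol k at frame t, and let \(\mathcal {B}\) be the mapping that collapses repeated labels and removes blanks. Then the CTC loss for a label sequence l given observations x is

$$\begin{aligned} L_{CTC}(l, x) = -\log \sum _{\pi \in \mathcal {B}^{-1}(l)} \prod _{t=1}^{T} y^t_{\pi _t}. \end{aligned}$$

The forward variables \(\alpha _t(s)\) are computed over the augmented sequence \(l'\) (blanks inserted between the labels and at both ends) by the recursion

$$\begin{aligned} \alpha _t(s) = \big (\alpha _{t-1}(s) + \alpha _{t-1}(s-1) + \lambda _s\, \alpha _{t-1}(s-2)\big )\, y^t_{l'_s}, \end{aligned}$$

where \(\lambda _s = 0\) if \(l'_s\) is blank or \(l'_s = l'_{s-2}\) and \(\lambda _s = 1\) otherwise, so that \(P(l|x) = \alpha _T(|l'|) + \alpha _T(|l'|-1)\).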

Therefore, if the acoustic model is a neural network that estimates these posteriors, it can be trained with conventional error back-propagation gradient descent [24]. Training an ASR system based on such a model does not require an explicit alignment of the input utterance to the elements of the output sequence and thus may be performed in an end-to-end fashion. It is also important that the CTC loss accumulates information about the whole output sequence, so its optimization is in some sense an alternative to the traditional fine-tuning of neural network acoustic models with sequence-discriminative criteria such as sMBR [18]. CTC implementations are conventionally based on RNN/LSTM networks (including bidirectional ones) as acoustic models, since these are known to model long context effectively.

An important component of CTC is the special “blank” symbol, which fills the gaps between the meaningful elements of the output sequence to equalize its length with the number of frames in the input sequence. It corresponds to a separate output neuron, and blank symbols are deleted from the recognized sequence to obtain the final result. In [10] a modification of the CTC loss was proposed, referred to as the Auto SeGmentation criterion (ASG loss), which does not use blank symbols. Instead of the “blank”, a simple transition probability model over output symbols is introduced. This leads to a significant simplification and speedup of the computations; moreover, it yields improved recognition results compared to the basic CTC loss.

DeepSpeech [15], developed by Baidu Inc., was one of the first systems to demonstrate the effectiveness of CTC-based speech recognition in LVCSR tasks. Trained on 2300 h of English conversational telephone speech, it demonstrated state-of-the-art results on the Hub5’00 evaluation set. Research in this direction continued and resulted in the DeepSpeech2 architecture [7], composed of both convolutional and recurrent layers, which improved recognition accuracy for both English and Mandarin speech. Another successful example of applying CTC to LVCSR tasks is the EESEN system [22], which integrates an RNN-based model trained with the CTC criterion into the conventional WFST-based decoder from the Kaldi toolkit [23]. The paper [21] shows that end-to-end systems may also be built successfully from convolutional layers only, instead of recurrent ones: using Gated Convolutional Units (GLU-CNNs) and training with the ASG loss leads to state-of-the-art results on the LibriSpeech database (960 h of training data).

Recently, a new modification of the DeepSpeech2 architecture was proposed in [25]: several lower convolutional layers were replaced with a deep residual network with depth-wise separable convolutions. Together with strong regularization and data augmentation, this modification achieves results close to DeepSpeech2 despite a significantly smaller amount of training data. Indeed, one of the models was trained on only 80 h of speech (augmented with noisy and speed-perturbed versions of the original data).

These results suggest that CTC can be successfully applied to training ASR systems for low-resource languages, in particular those included in the Babel research program (for which the amount of training data is normally 40 to 80 h of speech).

Currently, the Babel corpus contains data for more than 20 languages, and for most of them quite good traditional ASR systems have been built [6, 12, 16]. To improve speech recognition accuracy for a given language, data from other languages is widely used as well: either to train a multilingual system via multitask learning, or to obtain high-level multilingual representations, usually bottleneck features extracted from a pre-trained multilingual network.

One of the first attempts to build an ASR system for low-resource Babel languages using CTC-based end-to-end training was made recently [11]. Although the obtained results are somewhat worse than those of state-of-the-art traditional systems, they demonstrate that the CTC-based approach is viable for building low-resource ASR systems. The aim of our work is to investigate ways of improving these results.

3 Experiments

3.1 Basic Setup

For all experiments we used conversational speech from the IARPA Babel Turkish Language Pack (LDC2016S10). This corpus contains about 80 h of transcribed speech for training and 10 h for development. The dataset is rather small compared to widely used benchmarks for conversational speech: the English Switchboard corpus (300 h, LDC97S62) and the Fisher dataset (2000 h, LDC2004S13 and LDC2005S13).

Fig. 1. Architectures

As targets we use 32 symbols: the 29 lowercase characters of the Turkish alphabet [5], the apostrophe, the space, and a special \(\langle {\text {blank}}\rangle \) character that means “no output”. Thus we do not use any prior linguistic knowledge, and we also avoid the OOV problem, since the system can construct new words directly.

All models are trained with the CTC loss. Input features are 40 mel-scaled log filterbank energies (FBanks), computed every 10 ms over a 25 ms window and concatenated with deltas and delta-deltas (120 features per vector). We also tried spectrogram features and experimented with different normalization techniques.
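
A sketch of this feature pipeline using the python_speech_features package (the tooling choice is ours; the paper does not name its feature extraction library):

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import logfbank, delta

rate, signal = wav.read("utterance.wav")  # hypothetical input file

# 40 log mel filterbank energies, 25 ms window, 10 ms step.
fbanks = logfbank(signal, samplerate=rate, winlen=0.025, winstep=0.01, nfilt=40)

# Append deltas and delta-deltas: 3 * 40 = 120 features per frame.
d1 = delta(fbanks, 2)
d2 = delta(d1, 2)
features = np.concatenate([fbanks, d1, d2], axis=1)  # shape (T, 120)
```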

For decoding we used character-based beam search [1] with a 3-gram language model built with the SRILM package [4], finding the sequence of characters c that maximizes the following objective [15]:

$$\begin{aligned} Q(c) = \log {P(c|x)} +\alpha \log {P_{lm}(c)} +\beta \text {wordcount}(c), \end{aligned}$$

where \(\alpha \) is the language model weight and \(\beta \) is the word insertion penalty.

For all experiments we used \(\alpha = 0.8\) and \(\beta = 1\), and performed decoding with beam widths of 100 and 2000, which is not very large compared to the 7000 or more active hypotheses used in traditional WFST decoders (e.g., many Kaldi recipes decode with \(max\_active = 7000\)).
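
The scoring objective itself is straightforward; a minimal sketch in Python (the `lm.log_prob` interface to the character 3-gram model is an assumption):

```python
ALPHA, BETA = 0.8, 1.0  # LM weight and word insertion term used in our experiments

def hypothesis_score(log_p_acoustic, text, lm):
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * wordcount(c)."""
    # `lm.log_prob` is an assumed interface to a character-level 3-gram LM.
    return (log_p_acoustic
            + ALPHA * lm.log_prob(text)
            + BETA * len(text.split()))
```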

To compare with other published results [2, 11], we used the Sclite [3] scoring package to measure the results of decoding with beam width 2000; this scoring takes into account incomplete words and spoken noise in the reference texts and does not penalize the model if it recognizes these pieces incorrectly.

We also report the WER (word error rate) for a simple argmax decoder (taking the label with the maximum output at each time step and then applying the CTC decoding rule: collapse repeated labels and remove “blanks”).
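
The argmax decoder reduces to one pass over the frame-wise best labels; a minimal sketch (the symbol table is illustrative):

```python
import numpy as np

BLANK = 0  # index of the <blank> symbol

def greedy_ctc_decode(log_probs, id2char):
    """log_probs: (T, C) per-frame log-posteriors; returns the decoded string."""
    best = np.argmax(log_probs, axis=1)
    out, prev = [], BLANK
    for label in best:
        # CTC decoding rule: drop repeats first, then drop blanks.
        if label != prev and label != BLANK:
            out.append(id2char[label])
        prev = label
    return "".join(out)
```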

3.2 Experiments with Architecture

We explored the behavior of different neural network architectures in the case when rather little data is available. We used multi-layer bidirectional LSTM (bLSTM) networks, tried a fully convolutional architecture similar to Wav2Letter [10], and explored the DeepSpeech-like architecture developed by Salesforce (DS-SF) [25] (Fig. 1).

The convolutional model consists of 11 convolutional layers with batch normalization after each layer. The DeepSpeech-like architecture consists of a 5-layer residual network with depth-wise separable convolutions, followed by a 4-layer bidirectional Gated Recurrent Unit (GRU) network, as described in [25].

Our baseline bidirectional LSTM is a 6-layer network with 320 hidden units per direction, as in [11]. We also tried a bLSTM that labels every second frame (20 ms): each pair of consecutive outputs of the first layer is concatenated and used as input to the second layer.
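
A sketch of this frame-rate reduction in PyTorch, following our reading of the description (layer sizes match the baseline; the exact wiring is an assumption):

```python
import torch
import torch.nn as nn

class ReducedFrameRateBLSTM(nn.Module):
    """First bLSTM layer runs at the 10 ms frame rate; its outputs are
    concatenated in pairs so the remaining layers label every 20 ms."""
    def __init__(self, n_feats=120, hidden=320, n_symbols=32):
        super().__init__()
        self.lower = nn.LSTM(n_feats, hidden, bidirectional=True, batch_first=True)
        self.upper = nn.LSTM(4 * hidden, hidden, num_layers=5,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_symbols)

    def forward(self, x):                 # x: (N, T, n_feats)
        h, _ = self.lower(x)              # (N, T, 2 * hidden)
        if h.size(1) % 2:                 # make T even before pairing frames
            h = h[:, :-1]
        n, t, d = h.shape
        h = h.reshape(n, t // 2, 2 * d)   # concatenate each pair of frames
        h, _ = self.upper(h)
        return self.out(h).log_softmax(dim=-1)
```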

The performance of our baseline models is shown in Table 1.

Table 1. Baseline models trained with the CTC loss

3.3 Loss Modification: Segmenting During Training

It is known that the CTC loss is very unstable for long utterances [14], and shorter utterances are more useful for this task. Some techniques have been developed to help models converge faster, e.g., SortaGrad [7] (using shorter segments at the beginning of training).
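
SortaGrad amounts to ordering the training set by utterance length in the first epoch only; a trivial sketch (the utterance representation is an assumption):

```python
def batches_for_epoch(utterances, epoch, batch_size, rng):
    """SortaGrad [7]: shortest-first in epoch 0, random order afterwards.
    `utterances` is assumed to be a list of (features, transcript) pairs."""
    order = sorted(utterances, key=lambda u: len(u[0]))  # sort by frame count
    if epoch > 0:
        rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```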

To compute the CTC loss we use all possible alignments between the audio features and the reference text, but only some of these alignments make sense. Traditional DNN-HMM systems also use iterative training: the best alignment is found, and then the neural network is trained to approximate it. Therefore, we propose the following algorithm for using segmentation during training (a code sketch follows the list):

  • compute the CTC alignment (find the sequence of targets with minimal loss that can be mapped to the real targets by collapsing repeated characters and removing blanks);

  • perform greedy decoding (argmax at each time step);

  • find “well-recognized” words with \(length \ge T\) (T is a hyperparameter): the segment should start and end with a space, and a word is “well-recognized” when the argmax decoding is equal to the computed alignment;

  • if a word is “well-recognized”, divide the utterance into 5 segments: the left segment before the space, the left space, the word itself, the right space, and the right segment;

  • compute the CTC loss for all these segments separately and do back-propagation as usual.
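
A minimal sketch of how this criterion can be wired up in PyTorch. The helper `find_well_recognized_word`, which compares the greedy output with the forced alignment and returns the five (frame span, target span) segments, is hypothetical and not shown, as is the computation of the alignment itself; the fallback to the plain CTC loss when no reliable word is found is our choice:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)

def ctc_part(log_probs, targets):
    # Plain CTC loss on one slice of the utterance (batch of size 1).
    return ctc(log_probs.unsqueeze(1), targets.unsqueeze(0),
               torch.tensor([log_probs.size(0)]),
               torch.tensor([targets.numel()]))

def segmented_ctc_loss(log_probs, targets, alignment, t_min=5):
    """log_probs: (T, C) per-frame log-posteriors; targets: (S,) label ids;
    alignment: (T,) frame labels of the minimal-loss CTC path for `targets`."""
    greedy = log_probs.argmax(dim=1)
    # Hypothetical helper: returns five (frame span, target span) pairs --
    # left context, space, the well-recognized word, space, right context --
    # or None if no word of length >= t_min matches the alignment exactly.
    segments = find_well_recognized_word(greedy, alignment, targets, t_min)
    if segments is None:
        return ctc_part(log_probs, targets)  # fall back to plain CTC
    return sum(ctc_part(log_probs[t0:t1], targets[s0:s1])
               for (t0, t1), (s0, s1) in segments)
```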

The results of training with this criterion are shown in Table 2. The proposed criterion does not lead to a consistent improvement when decoding with a large beam width (2000), but shows a significant improvement when decoding with a smaller beam (100). We plan to further explore the use of alignment information during training.

Table 2. Models trained with CTC and the proposed CTC modification

3.4 Using Different Features

We explored different normalization techniques. FBanks with cepstral mean normalization (CMN) perform better than raw FBanks. We found variance normalization on top of mean normalization (CMVN) unnecessary for this task. Using deltas and delta-deltas improves the model, so we used them in the other experiments. Models trained on spectrogram features converge more slowly and to a worse minimum, but with CMN the difference from FBanks is not very big (Table 3).

Table 3. 6-layer bLSTM trained using different features and normalization
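
Per-utterance CMN and CMVN as used here are simple statistics over time; a sketch (the epsilon is our addition for numerical safety):

```python
import numpy as np

def cmn(features):
    """Cepstral mean normalization over time; features: (T, D)."""
    return features - features.mean(axis=0, keepdims=True)

def cmvn(features):
    """Mean and variance normalization (the variance part proved unnecessary)."""
    return cmn(features) / (features.std(axis=0, keepdims=True) + 1e-8)
```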

3.5 Varying Model Size and Number of Layers

Experiments with varying the number of hidden units in 6-layer bLSTM models are presented in Table 4. The models with 512 and 768 hidden units are worse than the one with 320, but the model with 1024 hidden units is significantly better than the others. We also observed that a model with 6 layers performs better than other depths.

Table 4. Comparison of bLSTM models with different numbers of hidden units

3.6 Training the Best Model

To train our best model we chose the best network from our experiments (the 6-layer bLSTM with 1024 hidden units), trained it with the Adam optimizer, and fine-tuned it with SGD with momentum using exponential learning rate decay. The best model, trained with speed and volume perturbation [19], achieved 45.8% WER (Table 5), which is the best published end-to-end result on the Babel Turkish dataset using in-domain data. For comparison, the WER of the model trained on in-domain data in [11] is 53.1%, and 48.7% when 4 additional languages (including the English Switchboard dataset) are used. Our result is also not far from the Kaldi DNN-HMM system [2] with 43.8% WER.

Table 5. Using data augmentation and fine-tuning with SGD
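
A sketch of this two-stage schedule in PyTorch (the learning rates and the decay factor are illustrative assumptions, not values from the paper):

```python
import torch

def make_optimizer(model, stage, lr_decay=0.95):
    """Stage 1: train with Adam; stage 2: fine-tune with SGD + momentum
    and exponential learning-rate decay (step the scheduler once per epoch)."""
    if stage == 1:
        return torch.optim.Adam(model.parameters(), lr=1e-3), None
    opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=lr_decay)
    return opt, sched
```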

4 Conclusions and Future Work

In this paper we explored different end-to-end architectures in a low-resource ASR task using the Babel Turkish dataset. We considered different ways to improve performance and proposed a promising CTC loss modification that uses segmentation during training. Our final system achieved 45.8% WER using in-domain data only, which is the best published result for Turkish end-to-end systems. Our work also shows that a well-tuned end-to-end system can achieve results very close to those of traditional DNN-HMM systems even for low-resource languages. In future work we plan to investigate other loss modifications (Gram-CTC, ASG) and to try RNN-Transducers and multi-task learning.