
1 Introduction

Recent results in many domains such as NLP and computer vision have benefited from self-supervised pretraining methods, i.e., learning robust universal representations from unlabeled datasets. In speech processing, this approach was implemented in the wav2vec2 model, which made it possible to obtain high-quality results for English with a minimal amount of labeled data (from 10 min of recordings) [5]. The idea is to use a large amount of unlabeled data to construct an acoustic representation of the speech signal. The wav2vec2 model is trained on a task that does not require manual annotation of the corpus: using the CPC (Contrastive Predictive Coding) criterion, the model has to distinguish the true speech representation from distractors uniformly sampled from other masked time steps of the same utterance [6, 9, 14]. In [10], it is shown that the features learned by the model while solving this task are robust to changes of domain and language. An illustration of the model from the original wav2vec2 article [5] is shown in Fig. 1.

Fig. 1. An illustration of the work of the wav2vec2 model, which learns contextual representations of audio fragments based on unlabeled data [5]
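
To make the criterion more concrete, the following is a minimal PyTorch sketch of such a contrastive objective, in which each context vector has to identify its true quantized latent among K distractors. The tensor shapes and the temperature are illustrative assumptions, and the auxiliary diversity loss of the full wav2vec2 objective is omitted.

import torch
import torch.nn.functional as F

def contrastive_loss(context, positives, negatives, temperature=0.1):
    # context:   (B, T, D)    outputs of the Transformer context network
    # positives: (B, T, D)    quantized latents at the same (masked) time steps
    # negatives: (B, T, K, D) K distractors sampled from other masked steps
    candidates = torch.cat([positives.unsqueeze(2), negatives], dim=2)  # (B, T, K+1, D)
    # cosine similarity between each context vector and every candidate
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
    # the true quantized latent is always the candidate at index 0
    targets = torch.zeros(logits.shape[:2], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# toy usage with random tensors (2 utterances, 50 masked steps, 100 distractors)
B, T, D, K = 2, 50, 256, 100
loss = contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, K, D))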

While a few years ago the vast majority of recognition systems followed the “classical” ASR architecture consisting of separate acoustic, pronunciation, and language models, end-to-end (E2E) systems have recently come to the fore. E2E ASR systems achieve better results, but they require a large amount of training data, which is not available for low-resource languages. One way to overcome the lack of training data is to pre-train the system on data from related languages or to reuse a model trained on a high-resource language with a lot of labeled data. The possible benefits of the wav2vec2 E2E approach are as follows: the systems become more robust to background noise, dialects, and pronunciation variability; moreover, for low-resource languages it is much easier to find a significant amount of unlabeled data.

In this paper, we describe the results of experiments on the creation of Tatar speech recognition systems. We compare different training scenarios; the full scenario consists of four training steps:

1. Base self-supervised pretraining (BaseSS).
2. Source self-supervised pretraining (SourceSS).
3. Target self-supervised pretraining (TargetSS).
4. Target fine-tuning (TargetFT).

All scenarios are shown in Fig. 2. In the following sections, we describe the training procedure, provide details of the data collection, and present a comparative analysis of the experimental results.

Fig. 2. Model training options

2 System Description

This article uses an approach with iterative self-supervised pretraining steps on audio data that gets increasingly closer to the target domain. We implement four main training stages: base self-supervised pretraining (BaseSS), source self-supervised pretraining (SourceSS), target self-supervised pretraining (TargetSS), and target supervised fine-tuning (TargetFT), and analyze the effect of each pretraining step on the resulting recognition quality of the ASR systems. The first stage is the BaseSS pretraining step: the initial training on a (very) large dataset, after which the model has learned acoustic representations for a wide variety of noise conditions and speaker variability. For our experiments we considered three alternatives for the base pre-trained model (a loading sketch is given after the list):

1. No pre-trained model.
2. Base Wav2Vec 2.0 Librispeech model (language: English, total duration: 1000 h).
3. Multilingual XLSR model (53 languages, total duration: 56k h).
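
Our experiments are run with fairseq, but purely as an illustration of these three alternatives, the sketch below loads what we assume to be the corresponding public checkpoints through the HuggingFace transformers API (the checkpoint names are our assumption, not necessarily the exact artifacts used here).

from transformers import Wav2Vec2Config, Wav2Vec2Model

def load_base_model(option: str) -> Wav2Vec2Model:
    if option == "None":
        # 1. no pre-trained model: a randomly initialized wav2vec 2.0 encoder
        return Wav2Vec2Model(Wav2Vec2Config())
    if option == "Base":
        # 2. base model pretrained on Librispeech English audio
        return Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    if option == "XLSR":
        # 3. multilingual XLSR model (53 languages)
        return Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
    raise ValueError(f"unknown base model option: {option}")

base = load_base_model("XLSR")  # starting point for the SourceSS pretraining step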

For the second training step, we use source datasets consisting of heterogeneous Tatar audio data. This data allows the model to start learning language-specific acoustic features from a diverse set of speakers, noise conditions, etc. The data for the SourceSS stage were collected from TV shows, radio transmissions, audiobooks, and YouTube videos. More details on the data collection procedure can be found in the Data Collection section.
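
As a minimal sketch of how such unlabeled audio is fed to the self-supervised steps, the snippet below builds a fairseq-style TSV manifest: the dataset root on the first line, then one relative path and length in samples per fragment. The directory names are placeholders, and fairseq also provides its own manifest preparation script.

from pathlib import Path
import soundfile as sf

def write_manifest(audio_root: str, out_tsv: str) -> None:
    root = Path(audio_root)
    with open(out_tsv, "w", encoding="utf-8") as out:
        out.write(f"{root.resolve()}\n")            # first line: dataset root
        for wav in sorted(root.rglob("*.wav")):
            n_samples = sf.info(str(wav)).frames    # fragment length in samples
            out.write(f"{wav.relative_to(root)}\t{n_samples}\n")

write_manifest("data/source_unlabeled", "manifests/sourcess/train.tsv")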

The TargetSS stage performs additional self-supervised training on the target Tatar datasets; these datasets have annotations, but the annotations are not used at this stage. We did not set any hard restrictions on the speaking style of the target datasets due to the small number of available annotated Tatar speech corpora. Therefore, we use all of the existing data, including both read speech recorded with close-distance microphones and spontaneous broadcast speech.

At the last stage, the annotated Tatar speech corpus is used to fine-tune the model obtained at the previous stages. This training is based on the CTC (Connectionist Temporal Classification) criterion [6, 7]: a randomly initialized layer with a dimension equal to the number of elements in the dictionary is added on top of the model. For the Tatar language, the dictionary consists of 39 elements: 38 letters and an additional character ‘—’ used as a word separator.
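
A minimal PyTorch sketch of this fine-tuning head is shown below: a randomly initialized linear projection onto the output dictionary, trained with the CTC criterion. The encoder dimension and the handling of the CTC blank symbol on top of the 39-element dictionary are our assumptions.

import torch
import torch.nn as nn

VOCAB_SIZE = 39               # 38 Tatar letters + word separator (from the text)
NUM_OUTPUTS = VOCAB_SIZE + 1  # + CTC blank symbol (our bookkeeping assumption)
HIDDEN_DIM = 768              # assumed wav2vec2 base context network dimension

ctc_head = nn.Linear(HIDDEN_DIM, NUM_OUTPUTS)   # the randomly initialized layer
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# encoder_out: (T, B, H) frames produced by the pretrained context network
encoder_out = torch.randn(200, 4, HIDDEN_DIM)
log_probs = ctc_head(encoder_out).log_softmax(dim=-1)    # (T, B, NUM_OUTPUTS)

targets = torch.randint(1, NUM_OUTPUTS, (4, 30))         # letter indices, 0 = blank
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)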

3 Data Collection

The multistage approach that we chose for training the ASR systems dictates the training data requirements: we need unlabeled datasets for the self-supervised pretraining steps and an annotated dataset for supervised fine-tuning. At the moment, there are two available datasets for the Tatar language: one from the CommonVoice project [1] and the TatarCorpus dataset [11]. Both contain read speech with good SNR, and all audio is manually annotated. To collect the unlabeled datasets, we obtained audio from several sources: a private dataset of audiobooks from a Tatar book publishing company, recordings of TV and radio broadcasts, and YouTube videos.

The resulting unlabeled corpus consists of 4 subcorpora:

1. Subcorpus of audiobooks: read speech recorded in studio conditions, 520 files with a total duration of 114 h.
2. Subcorpus of television broadcasting: spontaneous speech, variety of external noises and background music, 62 files - 733 h.
3. Subcorpus of two radio stations’ recordings: read and spontaneous speech, background music, 398 files - 215 h.
4. Subcorpus of scientific video lectures from the YouTube platform: mostly read speech, good recording quality, 100 files - 87 h.

We carried out basic preprocessing of the obtained video and audio files, which included extracting the audio track from video files and converting the audio to 16-bit, 16 kHz, mono PCM format. Taking into account the specifics of the initial data (long audiobooks, 12-h TV recordings, 40-min YouTube clips), the next task was to divide the audio files into shorter fragments containing speech. The goal was to convert all data into 5–30 s fragments, where each fragment contains the speech of only one speaker. To solve this problem, we used the Silero-VAD tool [4]. A selective analysis of the resulting fragments showed that the tool coped with filtering out the music content present in the radio and TV recordings while retaining speech segments with background music. However, the duration of the split fragments varied markedly. Based on the recommendations of the developers of the wav2vec2 model [3], short (less than 4.5 s) and long (longer than 30 s) audio files were filtered out. The summary statistics on the number of files and their duration for each subcorpus are presented in Table 1.
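
The snippet below sketches this preprocessing under the assumption that ffmpeg and the soundfile package are available: conversion to 16-bit, 16 kHz, mono PCM and the 4.5–30 s duration filter. The VAD segmentation itself (Silero-VAD [4]) is omitted, and all paths are placeholders.

import subprocess
from pathlib import Path
import soundfile as sf

MIN_DUR, MAX_DUR = 4.5, 30.0  # duration limits used for the pretraining fragments

def to_16k_mono_pcm(src: Path, dst: Path) -> None:
    # extract/convert the audio track to 16 kHz, mono, 16-bit PCM WAV
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ac", "1", "-ar", "16000", "-acodec", "pcm_s16le", str(dst)],
        check=True)

def keep_fragment(wav: Path) -> bool:
    # drop VAD fragments that are too short or too long for pretraining
    dur = sf.info(str(wav)).duration
    return MIN_DUR <= dur <= MAX_DUR

fragments = [p for p in Path("fragments").glob("*.wav") if keep_fragment(p)]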

The annotated corpus of Tatar speech, which was used for target self-supervised pretraining and target fine-tuning steps, consists of 3 parts:

1. Tatar speech corpus “TatarCorpus” [11]: close-microphone recordings, read speech - 99 h and 9 min, 500 speakers.
2. Subcorpus of television broadcasting: crowdsourced annotation using the web service [12] - 1 h and 33 min.
3. The Tatar part of the CommonVoice corpus [1] - 28 h and 47 min, 15 speakers.

To construct a test subcorpus, we chose the recordings of 10 random speakers (5 male, 5 female) from the “TatarCorpus” (1 h and 37 min); for the CommonVoice part, we used the original division into training and test sets proposed by the creators of the corpus (3 h and 33 min); for the subcorpus of TV broadcasts we do not have speaker-level annotation, so 110 test fragments were selected randomly throughout the corpus (5 min). In total, this gave us a test subcorpus of 5 h and 15 min.

Table 1. The characteristics of unlabeled speech corpus for the Tatar language

As a language model for the speech recognition system, a 4-gram statistical model was built using the KenLM tool [8]. The total amount of training data was 8,760,330 sentences containing 116 million words. We downloaded and processed Tatar texts from the Internet (archives of leading news agencies, newspapers, magazines, websites of state institutions and departments, forums) and used some parts of the Tatar national corpus “Tugan Tel” [15].
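
As a sketch, the 4-gram model can be built with KenLM’s lmplz and build_binary command-line tools roughly as follows; the file names are placeholders, and we assume the training text has already been normalized to one sentence per line.

import subprocess

# estimate a 4-gram LM in ARPA format from the normalized text corpus
with open("tatar_lm_corpus.txt", "rb") as text, open("lm_4gram.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "-o", "4"], stdin=text, stdout=arpa, check=True)

# convert the ARPA file to KenLM's binary format for faster loading during decoding
subprocess.run(["build_binary", "lm_4gram.arpa", "lm_4gram.bin"], check=True)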

4 Experiments

In total, we trained 8 different models. Taking into account the presence and type of the base model and the self-supervised training steps used, we name our models in the {None, Base, XLSR}_[SourceSS]_[TargetSS]_TargetFT format. The experiments were carried out on the fairseq platform [3]. Pretraining was carried out on 8 V100 32 GB GPUs.

The recognition quality values were calculated separately for all test subcorpora. Word error rates (WER) for all built systems are presented in Table 2.
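
For reference, WER here is the standard word error rate: the word-level Levenshtein distance between the reference and the hypothesis, divided by the number of reference words. A minimal implementation:

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# one fragment from the error analysis below: a single deleted interjection
print(wer("nu anda hal itep beterese", "anda hal itep beterese"))  # 0.2, i.e. 20%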

The best recognition quality on the test corpus was achieved by the Base_SourceSS_TargetSS_TargetFT model: 5.67 WER, even though using XLSR as the base model looked promising because of the amount of training data (56k h) and the variety of languages (53, including Tatar) used during its training. However, it is worth noting that on two of the three test subcorpora (CommonVoice and TV) the XLSR-based models outperform the Base ones. The better quality on these subcorpora can be partially explained by the fact that CommonVoice data and Babel (telephone conversational speech) were included in the XLSR training corpus, so the model learned the essential features right from the initial stage of training.

The previous best value shown by a “canonical” ASR system, built from separate acoustic, pronunciation, and language models, on the “TatarCorpus” test dataset is 12.89 WER [13]. The best model proposed in this work showed 4.65 WER on the same test subcorpus (Base_SourceSS_TargetSS_TargetFT). The WER values reported in [2] were taken as the baseline for comparing quality on the CommonVoice test dataset. The best value presented there is 26.76 WER, while our proposed system showed 5.37 WER (XLSR_SourceSS_TargetSS_TargetFT).

Table 2. Recognition quality of all trained models, WER

The much higher error rates for the TV test subcorpus can be explained by the complexity of spontaneous speech and partially by the fact that the annotations were collected through crowdsourcing and contain mistakes. An analysis of the test TV audio fragments revealed several aspects that we will keep in mind in our future work:

1. Poorly distinguishable words at the beginning or end of a fragment that were not manually annotated but were recognized by the ASR system. For instance, reference phrase ‘isemendage’, hypothesis ‘manova isemendage’, where ‘manova’ is the ending of a surname whose beginning is not audible due to background noise.
2. Short interjections, often borrowed from Russian. For instance, reference phrase ‘nu anda hal itep beterese’, hypothesis ‘anda hal itep beterese’, where the word ‘nu’ is a Russian interjection meaning ‘well’.
3. Other inaccuracies in the annotations. For example, reference phrase ‘president rostem minnehanov ta’, hypothesis ‘president rostam min’nehanov ta’, with a difference in the nn’ (Tatar n) letters; the annotator made a mistake in spelling the surname in Russian and Tatar.

The second type of mistake can be influenced by the language model and is not directly related to the training procedure of the acoustic models. We therefore also calculated WERs for the systems without the use of an LM. The results are presented in Table 3.

Table 3. Recognition quality of all trained models without language model, WER

With these “raw” acoustic WER values, we still see the same trend: both the SourceSS and TargetSS pretraining steps allow the models to perform better on the test datasets. The only two exceptions are the comparisons between Base_SourceSS_TargetFT and Base_SourceSS_TargetSS_TargetFT, and between XLSR_SourceSS_TargetFT and XLSR_SourceSS_TargetSS_TargetFT, on the TV test subcorpus. In these two cases, the additional TargetSS step leads to a WER increase of 3% and 5%, respectively. The improvement in recognition quality for each type of model is presented in Table 4.
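
The relative gains in Table 4 are presumably computed as relative WER reductions between a model without and with the given pretraining step; a small sketch of this arithmetic, checked against the comparison figures quoted in the Conclusion:

def relative_wer_reduction(wer_without: float, wer_with: float) -> float:
    return (wer_without - wer_with) / wer_without * 100.0

print(round(relative_wer_reduction(26.76, 5.37), 1))  # 79.9 (CommonVoice vs. [2])
print(round(relative_wer_reduction(12.89, 4.65), 1))  # 63.9 (TatarCorpus vs. [13])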

Table 4. Influence of self-supervised pre-training steps on recognition quality, % WER

5 Conclusion

This paper presents the results of experiments on building a Tatar speech recognition system using an iterative self-supervised pretraining procedure. We prepared a 128-h annotated and a 340-h unlabeled speech corpus. We propose two additional pretraining steps between the base pre-trained system and the target fine-tuning. The first step, which we call SourceSS, uses unlabeled data from various sources (TV and radio broadcasts, YouTube clips, audiobooks), while the second, TargetSS, uses only the audio part of the annotated target datasets. Testing of the proposed speech recognition systems confirmed good (SOTA) performance for different types of speech (read and spontaneous) and noise conditions. The SourceSS step gave a 24.3% WER improvement on average, the TargetSS step 12.5%, and both pretraining steps together 33.3%. These values were calculated for the models that do not use a language model. As for absolute numbers, the best models in our experiments showed 5.37% WER on the CommonVoice test dataset and 4.65% WER on TatarCorpus, which is 79.9% and 63.9% better, respectively, than the previously published best results on these datasets.