
1 Introduction

The traditional approach to multilingual phone recognition uses front-end language-identification (LID)-switched monolingual phone recognition (LID-Mono), with monolingual recognizers trained individually on each of the languages present in the multilingual dataset. This approach has several disadvantages: (i) a complex two-stage architecture, (ii) failure of the LID block leads to failure of the entire system, and (iii) developing a monolingual phone recogniser is not feasible for all languages. We propose to use a common multilingual phone-set approach to build a Multilingual Phone Recognition System (Multi-PRS).

We address the problem of developing efficient techniques for recognition of code-switched speech. In code-switching, two or more languages are mixed and spoken as if they were one language [15, 29]. Code-switching (or code-mixing or language-mixing) involves switching between multiple languages either inter-sententially or intra-sententially [17]. Bilingual code-switching is more common than mixing of more than two languages [26]. The reasons for code-switching include: (i) a better word or phrase exists in another language to express a particular idea, (ii) certain words or phrases are more readily available in the other language, and (iii) the desire to show expertise in multiple languages. Code-switching is a common practice across the world in multilingual societies, where a speaker has spoken proficiency in more than one language. In this study, we consider intra-sentential code-switching between two Indian languages, namely, Kannada (KN) and Urdu (UR), with Kannada as the primary language within which switching to Urdu words and phrases occurs.

The proposed Multi-PRS faces the specific difficulty of arriving at an appropriate phone set with which phonetic decoding can be performed on input speech from any of the languages of interest. Such a common phone set must cover all the phones occurring across the multiple languages while ensuring that each individual language's phones are accurately mapped to phones in the common phone set. We propose the use of the International Phonetic Alphabet (IPA) chart to derive the common multilingual phone-set. The IPA has a strict one-to-one correspondence between symbols and sounds, which allows it to accommodate all the world's diverse languages. A few notable works based on the common multilingual phone-set approach are reported in [32, 33, 37, 38]. Although there have been significant efforts to develop multilingual speech recognizers for Indian languages [2, 8, 34], few studies have explored the use of an IPA-based common multilingual phone-set to develop multilingual phone recognizers for Indian languages. Recent works on multilingual speech recognition using DNNs are reported in [14, 21, 22, 25, 42], and notable works on code-switched speech recognition using multilingual speech recognisers are reported in [1, 13, 16, 30, 36].

The focus of our work is to compare two approaches to multilingual speech recognition: (i) using a front-end language-identification stage to detect the language spoken in short intervals of speech and then using the recognized language's phone recogniser to decode the speech, which we refer to as the LID-switched monolingual approach (LID-Mono), and (ii) using a Multilingual Phone Recognition System based on a common multilingual phone-set (Multi-PRS). We further extend the comparison between LID-Mono and Multi-PRS to the code-switching scenario. The Multi-PRS can seamlessly decode code-switched speech without regard to the code-switched instances, since it is designed with a common phone-set shared across several languages (between any pair of which the code-switched speech could switch), and the corresponding common phone acoustic-models are trained on data pooled from the multiple languages. The rest of the paper is organized as follows: Sect. 2 describes our experimental setup. Section 3 describes and compares the two approaches to multilingual phone recognition. Section 4 extends the comparison to the code-switching scenario. Section 5 provides a summary of the paper.

2 Experimental Setup

2.1 Multilingual Speech Corpora

We describe here the details of the speech corpora of the six Indian languages used in this work: Kannada (KN), Telugu (TE), Bengali (BN), Odia (OD), Urdu (UR), and Assamese (AS). The speech corpora were collected as a part of the consortium project titled Prosodically guided phonetic engine for searching speech databases in Indian languages, supported by DIT, Govt. of India [12]. The corpora contain 16-bit, 16 kHz speech wave files along with their IPA transcriptions [39]. The wave files contain read speech sentences of duration between 3 and 10 s. A detailed description of the speech corpora is provided in [4, 19, 23, 28]. We have used an 80:20 split for train and test data, respectively, and 10% of the training data is held out and used as a development set. Table 1 shows the statistics of the speech corpora.
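As an illustration of this partitioning, the following is a minimal Python sketch that splits a list of utterance identifiers into train, development, and test sets following the 80:20 split with a 10% held-out development portion; the utterance IDs and random seed are hypothetical.

```python
import random

def split_corpus(utterance_ids, test_frac=0.20, dev_frac=0.10, seed=0):
    """Split utterance IDs into train/dev/test: 80:20 train/test split,
    with 10% of the training portion held out as a development set."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)

    n_test = int(len(ids) * test_frac)
    test, train_full = ids[:n_test], ids[n_test:]

    n_dev = int(len(train_full) * dev_frac)
    dev, train = train_full[:n_dev], train_full[n_dev:]
    return train, dev, test

# Hypothetical usage with placeholder utterance IDs.
train, dev, test = split_corpus([f"kn_{i:04d}" for i in range(1000)])
print(len(train), len(dev), len(test))  # 720 80 200
```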

Table 1. Statistics of multilingual speech corpora

2.2 Testing Set Speech Corpora for Code-Switching Scenario

We have selected 320 code-switched sentences having Kannada as the primary language and code-switching to Urdu words and phrases. The sentences are carefully chosen to cover all the phonetic units of Kannada and Urdu. Four male and four female speakers who are bilinguals of Kannada and Urdu are made to read 40 sentences each. The speakers are proficient in both spoken Urdu and Kannada, and tend to produce KN-UR utterances. These sentences were transcribed using IPA symbols and then mapped to the common multilingual phone-set to generate the ground-truth transcription for calculation of PER in the decoding.
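A minimal sketch of the mapping step from IPA transcriptions to the common multilingual phone-set is given below; the mapping table shown is purely hypothetical and only illustrates the procedure, not the actual table used in this work.

```python
# Hypothetical IPA -> common multilingual phone-set mapping; the actual table
# is derived from the IPA chart across the six languages and is not reproduced here.
IPA_TO_COMMON = {
    "q": "k",   # assumed example: low-count uvular stop folded into /k/
    "a": "a",
    "i": "i",
}

def to_common_phone_seq(ipa_tokens):
    """Map an IPA-transcribed utterance to the common phone-set,
    falling back to the token itself if no mapping is defined."""
    return [IPA_TO_COMMON.get(tok, tok) for tok in ipa_tokens]

print(to_common_phone_seq(["q", "a", "i"]))  # ['k', 'a', 'i']
```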

Figure 1 shows the distribution of durations of KN and UR segments in the code-switched KN/UR test set. The duration of each utterance ranges from 3.5 s to 11 s. Within an utterance, the Kannada segments are the longer ones, interspersed with Urdu segments of relatively shorter durations ranging from 0–0.5 s to 3–3.5 s, which typically correspond to short words (\({<}500\) ms) and multi-word phrases (of the order of 1–3.5 s). The durations of these Urdu segments play an important role in determining how well the LID-Mono approach works, particularly in the choice of the speech interval size on which the front-end LID has to operate.
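As an illustration, the following sketch bins per-language segment durations into 0.5 s bins as in Fig. 1, assuming the ground-truth segmentation is available as (language, start, end) tuples; the segment values shown are hypothetical.

```python
from collections import Counter

def duration_histogram(segments, bin_width=0.5):
    """Bin per-language segment durations (in seconds) into 0.5 s bins,
    as in the distribution of Fig. 1. `segments` is assumed to be a list
    of (language, start_sec, end_sec) tuples."""
    hist = {"KN": Counter(), "UR": Counter()}
    for lang, start, end in segments:
        bin_start = int((end - start) // bin_width) * bin_width
        hist[lang][bin_start] += 1
    return hist

# Hypothetical segmentation of one code-switched utterance.
segs = [("KN", 0.0, 2.8), ("UR", 2.8, 3.4), ("KN", 3.4, 6.9), ("UR", 6.9, 8.2)]
print(duration_histogram(segs))
```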

Fig. 1. Distribution of durations of KN and UR languages in code-switched KN/UR test sets.

2.3 Training DNNs

Context-dependent DNNs with tanh non-linearity at the hidden layers and softmax activation at the output layer are used. The DNNs are trained using greedy layer-by-layer supervised training. The initial learning rate was chosen to be 0.015 and was decreased exponentially for the first 15 epochs; a constant learning rate of 0.002 was used for the last 5 epochs. Once all the hidden layers are added to the network, shrinking is performed after every 3 iterations, so as to separately scale the parameters of each layer. Mixing up was carried out halfway between the completion of the addition of all the hidden layers and the end of training. Stability of the training is maintained through preconditioned affine components. Once the final iteration of training completes, the models from the last 10 iterations are combined into a single model. Each input to the DNNs uses a temporal context of 9 frames (4 frames on either side). The number of hidden layers and their width are tuned for the DNNs used in developing the Phone Recognition Systems (PRSs); DNNs with 5 hidden layers are found suitable for building the PRSs. A bi-phone (phoneme bi-gram) language model is used for decoding. The language-model weighting factor and acoustic scaling factor used for decoding the lattice are optimally determined on the development set to minimize the PER. The DNN training used in this study is similar to the one presented in [41]. All the experiments are conducted using the open-source speech recognition toolkit Kaldi [11].
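For concreteness, the following is a minimal PyTorch sketch of the per-frame acoustic-model topology described above (5 tanh hidden layers, softmax output over context-dependent states, 9-frame spliced input); the feature dimension, hidden width, and number of output states are placeholder assumptions, and the actual systems were trained with Kaldi rather than with this code.

```python
import torch
import torch.nn as nn

class FramewiseDNN(nn.Module):
    """Sketch of the acoustic-model topology described above: 5 tanh hidden
    layers and a softmax output over context-dependent states. The feature
    dimension, hidden width, and number of output states are placeholders."""
    def __init__(self, feat_dim=40, context=4, hidden_dim=1024,
                 num_hidden=5, num_states=3000):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)   # 9-frame temporal context
        layers = []
        for i in range(num_hidden):
            layers += [nn.Linear(in_dim if i == 0 else hidden_dim, hidden_dim),
                       nn.Tanh()]
        layers.append(nn.Linear(hidden_dim, num_states))
        self.net = nn.Sequential(*layers)

    def forward(self, spliced_frames):
        # Log-probabilities over context-dependent states for each frame.
        return torch.log_softmax(self.net(spliced_frames), dim=-1)

model = FramewiseDNN()
dummy = torch.randn(8, 40 * 9)          # batch of 8 spliced frames
print(model(dummy).shape)               # torch.Size([8, 3000])
```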

2.4 Extraction of i-vectors

The i-vectors are among the most widely used features for language recognition. They are fixed-dimension feature vectors derived from a variable-length sequence of front-end features [24]. A DNN is trained for automatic speech recognition using the labelled speech data from the Switchboard (SWB1) and Fisher corpora (about 2000 h). Training uses hidden layers with ReLU activation and layer-wise batch normalization. Mel-frequency cepstral coefficients (MFCCs) are extracted from each input utterance and fed to the DNN. The bottleneck features (80-dimensional) are extracted from the bottleneck layer of the trained DNN [5, 6] and serve as the front-end features. A Gaussian Mixture Model-Universal Background Model (GMM-UBM) is obtained by pooling the front-end features from all the utterances in the training dataset. The means of the GMMs are adapted to each utterance using the Baum-Welch statistics of the front-end features. The i-vectors (400-dimensional) are computed from each adapted GMM mean supervector. Since the SWB1 and Fisher corpora used for training the DNN have a sampling rate of 8 kHz, we have down-sampled the multilingual speech corpora and the testing datasets (see Sects. 2.1 and 2.2) from 16 kHz to 8 kHz for extracting the i-vectors. A detailed description of the extraction of i-vectors is given in [7].
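The following simplified NumPy/scikit-learn sketch illustrates the data flow of the extraction described above (front-end features, GMM-UBM Baum-Welch statistics, projection to an i-vector); it omits the EM training of the total-variability matrix and all precision weighting, and the feature values, mixture count, and projection matrix are random placeholders, so it illustrates the pipeline rather than the extractor actually used in [7].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
FEAT_DIM, N_MIX, IVEC_DIM = 80, 64, 400   # 80-dim bottleneck feats, 400-dim i-vectors

# 1. GMM-UBM trained on pooled front-end (bottleneck) features.
pooled = rng.standard_normal((5000, FEAT_DIM))           # placeholder features
ubm = GaussianMixture(n_components=N_MIX, covariance_type="diag").fit(pooled)

# 2. Baum-Welch statistics for one utterance.
utt = rng.standard_normal((300, FEAT_DIM))               # placeholder utterance
post = ubm.predict_proba(utt)                            # frame posteriors (300, 64)
N = post.sum(axis=0)                                     # zeroth-order stats
F = post.T @ utt - N[:, None] * ubm.means_               # centred first-order stats

# 3. Project the supervector of first-order stats to a 400-dim i-vector.
#    In a real system T is trained with EM; here it is random, for illustration.
T = rng.standard_normal((N_MIX * FEAT_DIM, IVEC_DIM)) * 0.01
ivector = T.T @ F.reshape(-1)
print(ivector.shape)                                     # (400,)
```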

3 Approaches for Multilingual Phone Recognition

The following subsections describe the development and comparison of multilingual phone recognizers using two approaches: (i) the LID-switched monolingual phone recognition (LID-Mono) approach, and (ii) the multilingual phone recognition approach using a common multilingual phone-set (Multi-PRS).

3.1 LID-switched Monolingual Phone Recognition (LID-Mono) Approach

LID-Mono is the traditional approach to multilingual phone recognition and is shown in Fig. 2. It consists of two stages. In the first stage, the language of the input speech is determined using a language identification block. In the second stage, the input speech utterance is routed to the monolingual phone recognizer of the language identified in stage 1, and the phones present in the input speech are determined. A monolingual phone recognizer is a conventional PRS developed using the data of a single language.
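A minimal sketch of this two-stage decoding logic is shown below; the LID classifier and per-language recognizers are placeholder callables, not the actual components used in this work.

```python
def lid_mono_decode(speech, lid_classifier, mono_recognizers):
    """Two-stage LID-Mono decoding: stage 1 identifies the language,
    stage 2 routes the utterance to that language's monolingual phone
    recognizer. Both stages are placeholder callables here."""
    language = lid_classifier(speech)            # stage 1: e.g. "KN", "UR", ...
    phones = mono_recognizers[language](speech)  # stage 2: monolingual PRS
    return language, phones

# Hypothetical usage with dummy components.
recognizers = {"KN": lambda s: ["k", "a", "n"], "UR": lambda s: ["u", "r"]}
print(lid_mono_decode("dummy-waveform", lambda s: "KN", recognizers))
```

Note that if stage 1 misidentifies the language, stage 2 decodes with the wrong language's acoustic models, which is the failure mode analysed in Sect. 3.3.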

Fig. 2. Multilingual phone recognition using the LID-Mono approach.

We briefly outline the LID system here. There are two approaches to LID, namely, implicit LID and explicit LID. Explicit LID requires phonetic transcriptions and language models for each language [3, 27], whereas implicit LID needs neither [10, 35]. Since we do not have language models for the languages considered in this study, we use implicit LID. Support Vector Machines (SVMs) [20, 40] are used to train the LID classifiers. The multi-class SVM is constructed using the one-against-one approach (max-wins voting), with a radial basis function kernel. The LIBSVM library is used for building the SVM models [9]. We have explored both MFCCs and i-vectors as features for building the LID systems. The 13-dimensional MFCCs [18], along with their first and second order derivatives, are computed using a frame length of 25 ms with a frame shift of 10 ms. The i-vectors are extracted using the procedure described in Sect. 2.4. Table 2 shows the LID accuracy (%) for various language sets using SVMs. Since the i-vector based LID outperforms the MFCC based LID, we have considered only the i-vector based LID systems in all our experiments. LID accuracy decreases as the number of languages increases.
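As an illustration, the following sketch builds an i-vector based LID classifier with scikit-learn; scikit-learn's SVC wraps LIBSVM and uses an RBF kernel and a one-against-one multi-class strategy by default, matching the setup described above, but the i-vectors and labels here are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
languages = ["KN", "TE", "BN", "OD", "UR", "AS"]

# Placeholder training data: 100 utterance-level 400-dim i-vectors per language.
X_train = rng.standard_normal((600, 400))
y_train = np.repeat(languages, 100)

# RBF-kernel SVM; SVC uses one-against-one for multi-class problems.
lid = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

X_test = rng.standard_normal((10, 400))
print(lid.predict(X_test))   # predicted language labels for 10 test i-vectors
```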

Table 2. LID accuracy for various language sets using MFCCs and i-vectors.

3.2 Multilingual Phone Recognition Using Common Multilingual Phone-Set Approach (Multi-PRS)

Figure 3 shows the schematic representation of the Multi-PRS. Unlike the LID-Mono system of Fig. 2, which has two stages, the Multi-PRS in Fig. 3 has a single stage: irrespective of the language, it accepts speech input and decodes it into a sequence of phonetic units [31] using a common phone-set.

Fig. 3. Multilingual phone recognition using common multilingual phone set approach.

We have developed Multi-PRSs using six Indian languages: KN, TE, BN, OD, UR, and AS. The common multilingual phone-set is derived by grouping the acoustically similar IPA symbols across the languages and selecting the phonetic units that have a sufficient number of occurrences to train a separate model for each of them. The IPA symbols that do not have a sufficient number of occurrences are mapped to the closest linguistically similar phonetic units present in the common multilingual phone-set. The common multilingual phone-set thus derived contains 44, 46, and 46 phones for 4, 5, and 6 languages, respectively. We have also developed monolingual Phone Recognition Systems (Mono-PRSs) for the KN, TE, BN, OD, UR, and AS languages using 36, 35, 34, 36, 35, and 32 phones, respectively. The Mono-PRSs are used in the second stage of the LID-Mono systems, as shown in Fig. 2. Both the Mono-PRSs and Multi-PRSs are trained using context-dependent DNNs.
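A minimal sketch of how such a common phone-set could be derived from per-language IPA counts is given below; the occurrence threshold and the fallback mapping are illustrative assumptions, not the values actually used in this work.

```python
from collections import Counter

def derive_common_phone_set(per_language_ipa, min_count=500, fallback_map=None):
    """Pool IPA counts across languages; phones with enough occurrences get
    their own model, the rest are mapped to a linguistically close phone.
    The threshold and fallback map are illustrative assumptions."""
    pooled = Counter()
    for counts in per_language_ipa.values():
        pooled.update(counts)

    common = {p for p, c in pooled.items() if c >= min_count}
    mapping = {}
    for phone in pooled:
        if phone in common:
            mapping[phone] = phone
        else:
            mapping[phone] = (fallback_map or {}).get(phone, phone)
    return common, mapping

# Hypothetical IPA counts for two languages.
counts = {"KN": Counter({"a": 9000, "q": 20}), "UR": Counter({"a": 7000, "q": 300})}
common, mapping = derive_common_phone_set(counts, fallback_map={"q": "k"})
print(sorted(common), mapping)   # ['a'] {'a': 'a', 'q': 'k'}
```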

3.3 Comparison of LID-Mono and Multi-PRS

The PER is determined by comparing the decoded phone labels with the reference transcriptions, using optimal string matching based on dynamic programming.
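A minimal sketch of this computation is given below, assuming the reference and hypothesis are lists of phone labels.

```python
def phone_error_rate(reference, hypothesis):
    """PER via dynamic-programming string matching (Levenshtein distance):
    (substitutions + deletions + insertions) / number of reference phones."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return 100.0 * d[n][m] / n

ref = ["k", "a", "n", "n", "a", "d", "a"]
hyp = ["k", "a", "n", "a", "t", "a"]
print(f"{phone_error_rate(ref, hyp):.1f}%")   # 28.6%
```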

Table 3. Phone error rates of multilingual phone recognition systems.

Table 3 shows the PERs of the LID-Mono and Multi-PRS approaches. The Multi-PRS systems based on the common phone-set approach are found to outperform the traditional LID-Mono systems, and the benefit of Multi-PRS over LID-Mono grows as the number of languages increases. The Multi-PRS has the additional advantage of decoding a larger number of phones than the LID-Mono system, which would help language models recognise words more accurately.

Fig. 4. LID accuracy (%) and PERs (%) of LID-Mono and Multi-PRS approaches.

We show in Fig. 4 the performance of the LID-Mono and Multi-PRS systems on test data drawn from 4, 5, and 6 languages in terms of % LID accuracy and % PER. The LID-Mono approach has an inherently poor performance, marked by decreasing % LID accuracy as the number of language classes increases from 4 to 6, which in turn causes the % PER to increase in going from 4 to 6 languages. When the LID system makes an error, the LID-switched monolingual phone recognition chooses the wrong language's phone acoustic models to decode the input speech and naturally incurs a higher % PER. The % PERs of the Multi-PRS system, in contrast, remain robust and nearly constant across the multiple languages, clearly because it does not depend on a front-end LID decision.

4 LID-Mono and Multi-PRS Approaches in Code-Switching Application

For comparison of results, in addition to the multilingual phone recognisers based on 4, 5, and 6 languages, we have also developed bilingual phone recognisers for the KN and UR languages with both the LID-Mono and Multi-PRS approaches. Further, we have analysed how the duration of the windows (in seconds) used for performing LID affects the performance of the various multilingual phone recognisers. Figure 5 shows a composite display of the performance of LID-Mono and Multi-PRS for different numbers of test languages and different sizes of the interval over which the LID makes a decision (called the LID-interval, on the x-axis, from 500 ms to 5 s and the full utterance), with the resulting % LID accuracy and the overall % PER on the two y-axes.
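A minimal sketch of how the LID-Mono system operates on fixed-length LID-intervals of a code-switched utterance is given below; all components are placeholder callables, and the interval handling is an assumption made purely for illustration.

```python
def decode_with_lid_intervals(samples, sample_rate, lid_interval_sec,
                              lid_classifier, mono_recognizers):
    """Sketch of LID-Mono decoding of a code-switched utterance: the signal is
    cut into fixed-length LID intervals, each interval receives a single
    language decision and is decoded by that language's monolingual phone
    recognizer. The classifier and recognizers are placeholder callables."""
    hop = int(lid_interval_sec * sample_rate)
    phones = []
    for start in range(0, len(samples), hop):
        segment = samples[start:start + hop]
        language = lid_classifier(segment)            # one decision per interval
        phones += mono_recognizers[language](segment) # wrong models if interval is mixed
    return phones

# Hypothetical usage: 8 s of dummy samples at 8 kHz, 2 s LID intervals.
dummy = [0.0] * (8 * 8000)
recognizers = {"KN": lambda s: ["k", "a"], "UR": lambda s: ["u", "r"]}
print(decode_with_lid_intervals(dummy, 8000, 2.0, lambda s: "KN", recognizers))
```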

Fig. 5. Comparison of LID-Mono and Multi-PRS systems in code-switching scenario.

Considering the LID-Mono system curves, for small LID-intervals the LID accuracy is low (45% to 85% as the number of languages varies from 6 to 4), and the LID-Mono system then uses a wrong language acoustic-model for decoding the corresponding interval. This naturally leads to a very high % PER (of the order of 60%). As the LID-interval size increases, the % LID accuracy improves, reaching 80–90% for durations of 3 s and 85–95% for longer durations of 5 s (or the full utterance). The corresponding % PER also shows a marked decrease, since the LID-Mono system decodes the given intervals with the correct language acoustic models with increasing accuracy, reaching 40% at 2 s and falling to 35% at 5 s and beyond. What is important to note is that the LID-Mono performance is dictated by the LID-interval size: smaller intervals are good for detecting code-switching instances at high time-resolution, but have inherently poor LID accuracies and correspondingly poor PERs; larger intervals yield high LID accuracy, with correspondingly lower PER, but code-switching instances are missed due to the poor time resolution of the LID decision intervals. For instance, at a 2 s LID-interval, a good proportion of Urdu segments would occur 'within' the 2 s interval (as evident from the code-switching duration distribution in Fig. 1). Such an interval is then decoded according to a single LID decision that cannot be 'correct', since the interval is 'mixed' in its ground truth while the LID has to yield a single language decision; this results in a mismatch between the language(s) present in the 2 s interval and the single monolingual phone recognizer brought in to decode the speech in that interval. This problem becomes more acute as the LID-interval size increases.

In contrast, the KN-UR Multi-PRS based on the common multilingual phone-set approach offers a consistent PER of 32.7% without any LID-interval in its pipeline, and hence is robust to arbitrary code-switching durational distributions (as in Fig. 1). This makes the Multi-PRS the natural choice for recognizing code-switched speech, with practically no merit in choosing the LID-Mono system, which suffers from the trade-off discussed above, the higher design complexity of having to build a LID system (to recognize multiple language classes), and the need to build multiple monolingual phone recognition systems.

5 Conclusions

We have developed and compared the LID-Mono and Multi-PRS approaches to multilingual phone recognition and extended the study to a code-switched speech recognition scenario using code-switched utterances of two Indian languages (Kannada and Urdu). We have studied the performance characteristics of the LID-Mono and Multi-PRS approaches with respect to several underlying parameters, such as the interval over which the LID makes a decision and the number of languages for which the LID is designed in the LID-Mono systems, and the means of arriving at a common phone-set for the Multi-PRS. We have shown that while the LID-Mono system suffers from an inherent trade-off between LID-interval sizes, the Multi-PRS offers robust performance for arbitrary code-switching data.