1 Introduction

Spoken language identification (LID) or spoken language recognition (LR) is defined as the process of identifying or recognizing the language from a speech utterance [20]. So far, human beings are considered to be the most accurate language recognizers.Footnote 1 LID/LR has been an active research problem for many years and has attracted increasing attention over the past two decades [18]. It differs from the traditional speech recognition and speaker identification tasks, in which either the identity of the speaker or the content of the utterance is unknown. In spoken LID/LR, both the identity of the speaker and the content of the utterance are unknown, which poses an added challenge [2]. It plays a vital role in numerous multilingual speech processing applications, as described in [1, 2, 15, 20, 21, 30].

Spoken LID/LR systems can be broadly classified into two types [25], namely explicit and implicit systems. Explicit spoken LID/LR systems use phoneme sequences derived from the speech signal to recognize the language, whereas implicit systems use language-specific features derived directly from the speech signal. Explicit spoken LID/LR systems perform better than their implicit counterparts, but at the cost of increased complexity. Implicit systems are therefore the favored choice of many researchers for developing less complex and efficient spoken LID/LR systems.

The motivation for this work comes from the following facts: (1) In the context of Indian languages, few attempts have been reported in the field of spoken language recognition. One of the main reasons is the non-availability of standard native speech corpora covering the majority of Indian languages. (2) India is a multilingual nation with 22 official languages and 1650 unofficial languages [18]. These languages can be broadly classified into four major linguistic families [15], namely (a) Indo-Aryan, (b) Dravidian, (c) Austroasiatic, and (d) Tibeto-Burman. The languages within a linguistic family are known to share a common set of scripts and phonemes, and therefore exhibit similarities among themselves. Moreover, it is believed that Sanskrit (an ancient Indian language) is the main root from which many (though not all) other Indian languages have evolved. The similarity between different languages poses significant challenges to developing spoken language recognition models for Indian languages. In particular, the implementation of explicit spoken LID/LR systems is practically not feasible due to these similarity issues, so implicit systems become the only viable choice for developing spoken LID/LR systems for Indian languages.

The main objectives of this paper are: (1) to design an implicit, less complex acoustic system for spoken language recognition in Indian languages using spectral features, and (2) to design spoken LID/LR systems that perform well even on shorter (10-s) speech utterances, in addition to longer (30-s or 45-s) ones.

In this paper, harmonic sequences, named Fourier parameter (FP) features, are proposed to identify the language from the perceptual content of speech signals, instead of using the traditional spectral features. To the best of the authors' knowledge, this is an early attempt to apply this new set of FP features, along with their associated first-order and second-order differences, to the task of speaker-independent spoken language recognition. The FP features are evaluated on the Indian Institute of Technology Kharagpur Multilingual Indian Language Speech Corpus (IITKGP-MLILSC). For comparison, the state-of-the-art legacy features, the mel-frequency cepstral coefficients (MFCC), are also extracted from the speech signals of this corpus.Footnote 2 Spoken LID/LR models are developed using support vector machines (SVM), artificial neural networks (ANN), and deep neural networks (DNN). Experimental results show that the proposed FP features are effective in recognizing different Indian languages and improve the recognition performance compared to MFCC features. The recognition performance is further improved by combining MFCC and FP features. The performance of the proposed FP features is also evaluated on the Oriental Language Recognition Speech Corpus (AP18-OLR). Significant improvements in the performance of the spoken LID/LR systems using FP features, and using the combination of MFCC and FP features, are also observed on this database.

The rest of the paper is organized as follows: Sect. 2 presents the literature survey. Section 3 provides brief details of the two multilingual speech corpora used in this paper for the experimental study. Section 4 discusses the theoretical aspects of conventional block (frame-level) processing of speech signals, the FP and MFCC speech features for spoken LID/LR tasks, feature normalization, feature scaling, and finally feature selection for dimensionality reduction. Section 5 presents the details of the SVM, ANN, and DNN classifier architectures used in this paper to develop spoken LID/LR systems. Section 6 presents and discusses the experimental results. Section 7 concludes with insights toward future extensions of this work.

2 Literature Review

A detailed literature survey on state-of-the-art language identification, focusing on speech features and models, is given by Ambikairajah et al. [2]. Most approaches employ spectral and prosodic features for spoken language recognition [6]. With respect to Indian languages, few attempts have been reported in the area of spoken language recognition. An early attempt at recognizing Indian languages was made by Balleda et al. [5], using 17-dimensional MFCC feature vectors with vector quantization (VQ) for four Indian languages. Rao et al. [28] explored prosodic features to develop language recognition models for four Indian languages. Leena et al. [19] explored spectral features with auto-associative neural networks (AANN) for language recognition with varying durations of test speech samples for three Indian languages.

Maity et al. [21] explored two spectral features, namely MFCC and linear predictive cepstral coefficients (LPCC), with Gaussian mixture models (GMM) to develop speaker-dependent and speaker-independent language recognition models for 27 Indian languages using the IITKGP-MLILSC database. The corresponding recognition accuracies for the two cases are reported as 96% and 45%, respectively. Reddy et al. [30] explored multilevel spectral and prosodic features with GMM to develop language recognition models for 27 Indian languages using the IITKGP-MLILSC database; the recognition accuracies reported using MFCC features and the combination of spectral and prosodic features are 51.42% and 62.13%, respectively. Nandi et al. [26] explored the magnitude and phase information of the excitation source (represented by the Hilbert envelope (HE) and residual phase (RP), respectively) present in the linear prediction (LP) residual signal with GMM to develop language recognition models for 27 Indian languages using the IITKGP-MLILSC database. The evidence of HE and RP from sub-segmental, segmental, and supra-segmental levels is combined in different ways to capture language-specific excitation source information, and the maximum recognition accuracy reported with these features is \(63.70\%\).

Jothilakshmi et al. [15] explored spectral features, namely MFCC and shifted delta cepstral (SDC) coefficients, with hidden Markov models (HMM), GMM, and ANN to develop language recognition models for nine Indian languages. Koolagudi et al. [18] explored two spectral features (MFCC and SDC) and a set of prosodic features (pitch contour, energy contour, zero-crossing rate, and duration) with ANN to develop language recognition models for four Indian languages from the Dravidian linguistic family. Mounika et al. [24] explored MFCC features with DNN and DNN with attention (DNN-WA) to develop language recognition models for 12 Indian languages. Veera et al. [41] explored residual cepstral coefficients (RCC), MFCC, and SDC with DNN, DNN-WA, and the state-of-the-art i-vector systems to develop language recognition models for 13 Indian languages; improvement in the recognition performance is observed using the DNN-WA model with combined RCC and MFCC features. Vuddagiri et al. [42] explored MFCC features with i-vectors, DNN, and DNN-WA to develop language recognition models for 23 Indian languages using the International Institute of Information Technology Hyderabad—Indian Language Speech Corpus (IIITH-ILSC); the DNN-WA architecture is observed to perform better than the i-vector and DNN models. Bhanja et al. [7] proposed new parameters to model the prosodic characteristics of the speech signal; the extracted prosodic features are combined with MFCC features to develop a two-stage LID system for seven northeast Indian languages, using three classifiers, namely ANN, GMM with a universal background model (UBM), and i-vector-based SVM.

From the state-of-the-art literature on spoken language recognition in Indian languages, it is observed that most works focus on traditional spectral and prosodic features for capturing the language-specific information. To the best of the authors' knowledge, none of these studies have analyzed the use of new spectral features for spoken language recognition in Indian languages. In this paper, new FP spectral features, together with their associated first-order and second-order differences, are introduced. These new spectral features are used to develop spoken LID/LR models for 15 Indian languages using the IITKGP-MLILSC database. The spoken LID/LR models are developed using the state-of-the-art classifiers, namely SVM, ANN, and DNN. Similar LID/LR models are also developed for ten oriental languages using the AP18-OLR database to evaluate the performance of the proposed FP spectral features for the task of spoken language recognition.

3 Multilingual Speech Corpora/Databases

In this paper, two multilingual speech databases, namely the Indian Institute of Technology Kharagpur Multilingual Indian Language Speech Corpus (IITKGP-MLILSC) [21] and the Oriental Language Recognition Speech Corpus (AP18-OLR), are used to develop and validate the spoken LID/LR systems for Indian and oriental languages, respectively, using MFCC, FP, and combined MFCC \(+\) FP features. The details of these databases are summarized in the following sub-sections.

3.1 IITKGP-MLILSC Database

The IITKGP-MLILSC database was developed by the Indian Institute of Technology Kharagpur.Footnote 3 It comprises recorded speech data in 27 major Indian languages, out of which 15 languages from 3 major Indian linguistic families, namely Indo-Aryan, Dravidian, and Tibeto-Burman, are considered in this paper for the task of speaker-independent spoken language recognition. Of the 15 chosen languages, 9 (Bengali, Chhattisgarhi, Gujarati, Hindi, Kashmiri, Punjabi, Rajasthani, Sanskrit, and Sindhi) are from the Indo-Aryan family [30], 3 (Konkani, Tamil, and Telugu) are from the Dravidian family [30], and 3 (Manipuri, Mizo, and Nagamese) are from the Tibeto-Burman family [30]. On average, each language in the database has around 5–10 min of speech recordings from at least ten speakers. More details of this database are provided in [21, 30]. The database is freely available upon request for non-commercial and academic research purposes.

3.2 AP18-OLR (AP16-OL7 \(+\) AP17-OL3) Multilingual Database

The AP18-OLR database [39] was developed to support the oriental language recognition challengeFootnote 4 (AP18-OLR) organized by the Center for Speech and Language Technologies (CSLT) at Tsinghua University and SpeechOcean. It provides recorded speech data in 10 oriental languages belonging to 5 Asian linguistic families, namely Altaic, Austroasiatic, Austronesian, Indo-European, and Sino-Tibetan. Of the 10 languages, 4 (Japanese, Kazakh, Korean, and Uyghur) belong to the Altaic family, and 3 (Cantonese, Mandarin, and Tibetan) belong to the Sino-Tibetan family. The remaining 3 languages, namely Indonesian, Russian, and Vietnamese, belong to the Austronesian, Indo-European, and Austroasiatic families, respectively. More details of this database are provided in [39] and can also be found on the challenge Web site.Footnote 5 The database is freely available upon request for non-commercial and academic research purposes.

The AP18-OLR database is a combination of two multilingual databases, namely AP16-OL7 and AP17-OL3. The AP16-OL7 [43] multilingual database was developed by SpeechOcean.Footnote 6 It provides recorded speech data in 7 oriental languages (Cantonese, Indonesian, Japanese, Korean, Mandarin, Russian, and Vietnamese). On average, each language in this database has around 10 h of speech recordings from 24 speakers (12 male and 12 female). The data set of each language is divided into independent training and testing sets, containing recorded speech data of 18 and 6 independent speakers, respectively. The recording environment varies across languages. More details of this database are provided in [43] and can also be found on the challenge Web site.Footnote 7

The AP17-OL3 [38] multilingual database was developed under the NSFCFootnote 8-M2ASRFootnote 9 project. It provides recorded speech data in 3 oriental languages (Kazakh, Tibetan, and Uyghur). On average, each language in this database has around 10 h of speech recordings. Unlike AP16-OL7, this database has much more variation in terms of the recording environment and the number of speakers. More details of this database are provided in [38] and can also be found on the challenge Web site.Footnote 10

4 Frame-Level Acoustic Speech Features for Spoken Language Recognition

4.1 Conventional Block Processing of Speech Signal

The method of conventional block processing (CBP) is used to extract intrinsic segmental (frame-level) and supra-segmental (across frames) acoustic features from the speech signal. Prior to CBP, the speech signal is first pre-processed by low-pass filtering followed by pre-emphasis [36]. In CBP, the continuous speech signal is divided into a consecutive sequence of individual framesFootnote 11 (in either an overlapping or a non-overlapping format) of short duration, and the segmental and supra-segmental features are then extracted from them. The method of CBP can be mathematically described as follows:

Consider a discrete-time speech signal x(m) of finite duration t s, sampled at a sampling frequency \(F_\mathrm{s}\). Let C, Q, and R represent the type of channel, bit resolution, and bit rate of the recorded speech signal, respectively. This speech signal is passed through a digital low-pass FIR filter with cutoff frequency \(F_\mathrm{c}~\left( F_\mathrm{c}<\frac{F_\mathrm{s}}{2}\right) \) [36]. The corresponding output is the low-pass-filtered speech signal \(x_\mathrm{f}(m)\), given by,

$$\begin{aligned} x_\mathrm{f}(m) = \sum _{n=0}^{P}h(n)x(m-n), \end{aligned}$$
(1)

where P is the order of the low-pass filter, and h is the vector containing the filter coefficients.

The low-pass-filtered speech signal \(x_\mathrm{f}(m)\) is then passed through a first-order digital pre-emphasis (high-pass) filter to reduce the differences in the power levels of the different frequency components present in the speech signal. The corresponding output is the pre-emphasized speech signal \(x_\mathrm{pe}(m)\), given by,

$$\begin{aligned} x_\mathrm{pe}(m) = x_\mathrm{f}(m) - \alpha x_\mathrm{f}(m-1), \end{aligned}$$
(2)

where \(\alpha \) is the pre-emphasis constant.

The pre-emphasized speech signal \(x_\mathrm{pe}(m)\) is then used in CBP, where it is segmented into l finite consecutive overlapping frames of short duration \(T_\mathrm{f}\), each having \(N_\mathrm{f}\) samples. The corresponding segmented speech is represented in the form of a matrix \(x_\mathrm{s}\), given by,

$$\begin{aligned}{}[x_\mathrm{s}] = [s_{1},~s_{2},\ldots ,s_{l}], \end{aligned}$$
(3)

where \(s_{1},~s_{2},\ldots , s_{l}\) denote l vectors, each of dimension \(N_\mathrm{f}\times 1\), containing the samples of the respective speech segments. The matrix \(x_\mathrm{s}\) has dimension \(N_\mathrm{f}\times l\): the speech segments are arranged in columns, and the rows correspond to the individual frame samples.

The number of overlapping frames into which the given speech signal can be segmented is computed using,

$$\begin{aligned} l = \Biggl \lceil \left( 1 + \left( \frac{N_\mathrm{s}-N_\mathrm{f}}{N_\mathrm{of}}\right) \right) \Biggr \rceil ; ~~ \ni ~~N_\mathrm{f}>N_\mathrm{of} ~~\hbox {or}~~T_\mathrm{f}>T_\mathrm{of}, \end{aligned}$$
(4)

where \(N_\mathrm{s}\) is the number of samples in the speech signal, \(N_\mathrm{f}\) is the number of samples in individual frames, \(N_\mathrm{of}\) is the number of new samples in individual frames after frame shift or overlap, \(T_\mathrm{f}\) is the duration of the individual frame in ms, and \(T_\mathrm{of}\) is the duration of the frame shift in ms. Here, \(N_\mathrm{s}\) is the speech signal-dependent parameter, whereas \(N_\mathrm{f}\) and \(N_\mathrm{of}\) (or \(T_\mathrm{f}\) and \(T_\mathrm{of}\)) are tunable parameters, whose values are defined during the development phase of the spoken language recognition system. Equation (4) is expressed in terms of sample domain parameters for computing the number of overlapping frames.

The number of samples \(N_\mathrm{s}\) available in the recorded speech signal having sampling frequency \(F_\mathrm{s}\) and time duration t is given by,

$$\begin{aligned} N_\mathrm{s} = t \times F_\mathrm{s}. \end{aligned}$$
(5)

The number of samples \(N_\mathrm{f}\) in each individual frame can be computed using,

$$\begin{aligned} N_\mathrm{f} = \frac{T_\mathrm{f}}{T_\mathrm{s}}; ~~ \ni ~~T_\mathrm{f}>T_\mathrm{s}, \end{aligned}$$
(6)

where \(T_\mathrm{s}\) is a signal-dependent parameter denoting the time duration of a single speech sample in ms, given by \(T_\mathrm{s} = \frac{1}{F_\mathrm{s}}\).

Shifting the frame by a duration of \(T_\mathrm{of}\) is equivalent to shifting it by \(N_\mathrm{of}\) samples, given by,

$$\begin{aligned} N_\mathrm{of} = \frac{T_\mathrm{of}}{T_\mathrm{s}}; ~~ \ni ~~T_\mathrm{of}>T_\mathrm{s}. \end{aligned}$$
(7)

Equation (4) can also be expressed in terms of time domain parameters by substituting (5), (6), and (7) for computing the number of overlapping frames.

The percentage of frame overlap \(F_{ol}\) can be computed using,

$$\begin{aligned} F_{ol} = \left( \frac{N_\mathrm{f}-N_\mathrm{of}}{N_\mathrm{f}}\right) \times 100~\%. \end{aligned}$$
(8)

It is evident from (8) that if \(F_{ol}=50\%\), then \(N_\mathrm{of}=\frac{N_\mathrm{f}}{2}\). Similarly, if \(F_{ol}<50\%\), then \(N_\mathrm{of}>\frac{N_\mathrm{f}}{2}\), and if \(F_{ol}>50\%\), then \(N_\mathrm{of}<\frac{N_\mathrm{f}}{2}\).

In speech signal segmentation, apart from the overlapping frame format, a special case with \(F_{ol}=0\%\) also exists and is termed the non-overlapping frame format. In this case, \(N_\mathrm{of}=N_\mathrm{f}\) (or \(T_\mathrm{of}=T_\mathrm{f}\)), and the number of frames into which the given speech signal can be segmented is computed using the reduced form of (4), given by,

$$\begin{aligned} l = \Biggl \lceil \frac{N_\mathrm{s}}{N_\mathrm{f}}\Biggr \rceil . \end{aligned}$$
(9)

The non-overlapping frame format is rarely used in speech signal analysis.

It is usual practice to multiply the individual speech frames by a windowing function (whose window length \(N_{w}\) is equal to the frame length \(N_\mathrm{f}\)) while segmenting the speech signal into individual frames. The windowing operation helps reduce edge effects when taking the discrete Fourier transform (DFT) of the speech segments [22]. The windowed speech segments are represented in the form of a matrix \(x_\mathrm{ws}\), given by,

$$\begin{aligned}{}[x_\mathrm{ws}] = [x_\mathrm{s}]\times w, \end{aligned}$$
(10)

where w denotes a vector of windowing function coefficients of dimension \(N_{w} \times 1\) (here, \(N_{w}~=~N_\mathrm{f}\)), and \(\times \) denotes element-wise (array) multiplication applied to each column. The matrix \(x_\mathrm{ws}\) has dimension \(N_\mathrm{f}\times l\) (the same as \(x_\mathrm{s}\)) and contains the speech segments obtained after multiplication by the windowing function coefficients. For speech applications in general, the Hamming window is widely used as the windowing function. It is defined as [27],

$$ \begin{aligned} w(m)&= 0.54 - 0.46 \cos \left( \frac{2\pi m}{K}\right) ;~~for~~0\le m \le K, \nonumber \\ K&= N_{w}-1~~ \& ~~N_{w}=N_\mathrm{f}, \end{aligned}$$
(11)

where K is the window order, and \(N_{w}\) is the Hamming window length.

The matrix \(x_\mathrm{ws}\) thus contains the samples of the windowed speech segments. The segments with speech activity carry information about the language traits, as opposed to the leading and trailing segments of silence or non-speech activity. Therefore, it becomes necessary to discard the unwanted leading and trailing silence or non-speech segments from \(x_\mathrm{ws}\). This is achieved by performing a simple voice activity detection (VAD) using segment energy estimation (SEE), which identifies the starting and ending boundaries of the speech utterance in the given signal. The process of VAD by SEE is briefly summarized as follows [2]:

First, the energy \(E_{l}\) of every segment in \(x_\mathrm{ws}\) is computed as,

$$\begin{aligned} E_{l} = 10 ~ \log _{10}\left( \sum _{m=0}^{N_\mathrm{f}-1}\left| x_\mathrm{ws}^{l}(m)\right| ^{2}\right) , \end{aligned}$$
(12)

where \(E_{l}\) is the energy of lth frame in dB.

From energy estimates of \(E_{l}\), the maximum energy \(E_{\max }\) of the entire speech utterance is determined. Using \(E_{\max }\), a threshold energy level \(E_\mathrm{th}\) is computed as,

$$\begin{aligned} E_\mathrm{th} = \left( E_{\max } - E_\mathrm{c}\right) , \end{aligned}$$
(13)

where \(E_\mathrm{c}\) is a tunable parameter representing a constant energy offset in dB, used to adjust the level of \(E_\mathrm{th}\); its value is defined during the development phase of the spoken language recognition system.

From (13), it can be noted that \(E_\mathrm{th}\) is fixed at \(E_\mathrm{c}\) dB below \(E_{\max }\). Finally, all leading and trailing speech segments whose energy falls below \(E_\mathrm{th}\) are considered silence or non-speech segments and are therefore discarded from \(x_\mathrm{ws}\). The resulting matrix is denoted \({{\widehat{x}}}_\mathrm{ws}\) and has \(l'\) segments, where \(l'<l\). The speech segments in \({{\widehat{x}}}_\mathrm{ws}\) are then processed with the chosen feature extraction techniques to obtain salient speech features for developing spoken LID/LR systems. The values of the various CBP parameters used in this paper are presented in Table 1.

Table 1 CBP parameters chosen with respect to multilingual speech databases
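As an illustration of the CBP steps described above, the following minimal NumPy sketch performs pre-emphasis, overlapping frame segmentation, Hamming windowing, and the energy-based trimming of (12)–(13). It is not the authors' code: the function name and default parameter values (25-ms frames, 10-ms shift, \(\alpha = 0.97\), \(E_\mathrm{c} = 30\) dB) are illustrative placeholders rather than the values in Table 1, and the low-pass FIR filtering of (1) is omitted for brevity.

```python
import numpy as np

def cbp_frames(x, fs, frame_ms=25.0, shift_ms=10.0, alpha=0.97, e_c=30.0):
    """Pre-emphasize, segment into overlapping Hamming-windowed frames,
    and drop leading/trailing low-energy (non-speech) frames."""
    # Pre-emphasis, Eq. (2): x_pe(m) = x_f(m) - alpha * x_f(m-1)
    x_pe = np.append(x[0], x[1:] - alpha * x[:-1])

    n_f = int(round(frame_ms * 1e-3 * fs))           # samples per frame, Eq. (6)
    n_of = int(round(shift_ms * 1e-3 * fs))          # samples per frame shift, Eq. (7)
    l = int(np.ceil(1 + (len(x_pe) - n_f) / n_of))   # number of frames, Eq. (4)

    # Segment into the columns of an N_f x l matrix (zero-pad the last frame)
    pad = np.pad(x_pe, (0, max(0, (l - 1) * n_of + n_f - len(x_pe))))
    x_s = np.stack([pad[i * n_of:i * n_of + n_f] for i in range(l)], axis=1)

    # Hamming-window each frame, Eqs. (10)-(11)
    x_ws = x_s * np.hamming(n_f)[:, None]

    # Energy-based VAD, Eqs. (12)-(13): keep frames within E_c dB of the maximum
    energy = 10 * np.log10(np.sum(x_ws ** 2, axis=0) + 1e-12)
    keep = np.where(energy >= energy.max() - e_c)[0]
    return x_ws[:, keep[0]:keep[-1] + 1]             # trim leading/trailing frames
```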

4.2 Fourier Parameter Features

A speech signal \(x\left( m\right) \) divided into l consecutive overlapping frames can be represented by the FP (harmonic) model given by [44],

$$\begin{aligned} x(m) = \sum _{k=1}^{M}H_{k}^{l}(m)\cos \left( 2\pi \frac{f_{k}^{l}}{F_\mathrm{s}} m + \phi _{k}^{l}\right) , \end{aligned}$$
(14)

where \(F_\mathrm{s}\) is the sampling frequency of x(m), \(H_{k}^{l}\), \(f_{k}^{l}\), and \(\phi _{k}^{l}\) are the amplitude, frequency, and phase of the kth harmonic component, respectively, l is the index of the frame, and M is the number of speech harmonic components.

The harmonic part of the model corresponds to the Fourier representation of the periodic components of the speech signal. Since acoustic speech is non-periodic in nature, the Fourier transform of its sampled (discrete-time) representation is a periodic, continuous function of frequency.

For a finite duration discrete-time speech signal x(m) of length N samples, the DFT is defined as [27],

$$\begin{aligned} H(k) = \sum _{m=0}^{N-1} x(m) \exp \left( -j \frac{2\pi }{N}mk\right) ;~~k=0,~1,~2,\ldots ,N-1, \end{aligned}$$
(15)

where the H(k) are the Fourier parameters [44].

Harmonics are generally characterized by their amplitude, frequency, and phase. In this paper, only the harmonic amplitudes are used as features. The harmonic amplitude FPs are estimated from each frame of the speech signal, as shown in (14), in which \(H_{k}^{l}\) is referred to as the FP of the lth frame. Intrinsic segmental and supra-segmental FPs are extracted from the speech segments (in \({\widehat{x}}_\mathrm{ws}\)) for use as features.
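The paper does not spell out the extraction procedure for the harmonic amplitudes, but a common reading of (14)–(15) is to take the magnitudes of the first M DFT bins of each windowed frame as \(H_{1}\) to \(H_{M}\). The sketch below follows that assumption; the function name and the choice to skip the DC bin are ours, not from [44].

```python
import numpy as np

def frame_fp_amplitudes(x_ws, num_fp=120):
    """Return an (M x l) matrix of harmonic amplitude FPs, one column per frame.

    x_ws : (N_f x l) matrix of windowed speech frames (e.g. from cbp_frames).
    The first `num_fp` DFT magnitude coefficients of each frame are taken
    as the harmonic amplitudes H_1..H_M (an assumption, not stated in the paper).
    """
    spectrum = np.fft.fft(x_ws, axis=0)          # DFT of each frame, Eq. (15)
    return np.abs(spectrum[1:num_fp + 1, :])     # skip the DC bin (k = 0)
```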

Initially, the characteristics of the harmonic amplitude FPs are studied using the mean statistical parameter. First, one particular harmonic amplitude FP is extracted from the frames of the speech signals corresponding to a single speaker from each language of both databases (IITKGP-MLILSC and AP18-OLR), and its mean is computed across frames for each speech signal. The obtained results for the two databases are presented in Figs. 1 and 2, which show the mean \(H_{3}\) plots for the languages in the IITKGP-MLILSC and AP18-OLR databases, respectively. It is observed that the amplitudes vary across languages, and a similar kind of variation is observed for the other harmonics. An adequate number of harmonic amplitude FPs must therefore be extracted from the speech signals for pattern classification/recognition, since it is difficult to recognize a language from the features of a single harmonic.

Fig. 1 Mean of \(H_{3}\) for multilingual speech signals in the IITKGP-MLILSC database corresponding to a single speaker per language. The y-axis scale differs across the sub-plots arranged in rows 1, 2, and 3. a Pun. \(=\) Punjabi. b Tel. \(=\) Telugu. c Raj. \(=\) Rajasthani. d Kon. \(=\) Konkani. e Nag. \(=\) Nagamese. f Guj. \(=\) Gujarati. g Miz. \(=\) Mizo. h Man. \(=\) Manipuri. i Sin. \(=\) Sindhi. j Hin. \(=\) Hindi. k Kas. \(=\) Kashmiri. l Ben. \(=\) Bengali. m Chh. \(=\) Chhattisgarhi. n Tam. \(=\) Tamil. o San. \(=\) Sanskrit

Fig. 2 Mean of \(H_{3}\) for multilingual speech signals in the AP18-OLR database corresponding to a single speaker per language. The y-axis scale differs across the sub-plots arranged in rows 1 and 2. a Ind. \(=\) Indonesian. b Rus. \(=\) Russian. c Jap. \(=\) Japanese. d Kor. \(=\) Korean. e Tib. \(=\) Tibetan. f Vie. \(=\) Vietnamese. g Can. \(=\) Cantonese. h Kaz. \(=\) Kazakh. i Uyg. \(=\) Uyghur. j Mand. \(=\) Mandarin

Fig. 3 Means of \(H_{1}\) to \(H_{120}\) for multilingual speech signals in the IITKGP-MLILSC database. a Ben. \(=\) Bengali. b Chh. \(=\) Chhattisgarhi. c Guj. \(=\) Gujarati. d Hin. \(=\) Hindi. e Kas. \(=\) Kashmiri. f Kon. \(=\) Konkani. g Man. \(=\) Manipuri. h Miz. \(=\) Mizo. i Nag. \(=\) Nagamese. j Pun. \(=\) Punjabi. k Raj. \(=\) Rajasthani. l San. \(=\) Sanskrit. m Sin. \(=\) Sindhi. n Tam. \(=\) Tamil. o Tel. \(=\) Telugu

Fig. 4 Means of \(H_{1}\) to \(H_{120}\) for multilingual speech signals in the AP18-OLR database. The y-axis scale differs across the sub-plots arranged in rows 1 and 2. a Can. \(=\) Cantonese. b Ind. \(=\) Indonesian. c Jap. \(=\) Japanese. d Kor. \(=\) Korean. e Rus. \(=\) Russian. f Vie. \(=\) Vietnamese. g Mand. \(=\) Mandarin. h Kaz. \(=\) Kazakh. i Tib. \(=\) Tibetan. j Uyg. \(=\) Uyghur

Fig. 5 Highest peaks corresponding to the means of \(H_{1}\) to \(H_{120}\) for multilingual speech signals in the IITKGP-MLILSC database. The plot shows the observations for six trials (a–f), each performed by picking a random speech signal from every language. Ben. \(=\) Bengali, Chh. \(=\) Chhattisgarhi, Guj. \(=\) Gujarati, Hin. \(=\) Hindi, Kas. \(=\) Kashmiri, Kon. \(=\) Konkani, Man. \(=\) Manipuri, Miz. \(=\) Mizo, Nag. \(=\) Nagamese, Pun. \(=\) Punjabi, Raj. \(=\) Rajasthani, San. \(=\) Sanskrit, Sin. \(=\) Sindhi, Tam. \(=\) Tamil, and Tel. \(=\) Telugu. IAF. \(=\) Indo-Aryan Family, DF. \(=\) Dravidian Family, and TBF. \(=\) Tibeto-Burman Family

Fig. 6 Highest peaks corresponding to the means of \(H_{1}\) to \(H_{120}\) for multilingual speech signals in the AP18-OLR database. The plot shows the observations for six trials (a–f), each performed by picking a random speech signal from every language. The x-axis scale differs across the sub-plots arranged in rows 1 and 2. Can. \(=\) Cantonese, Ind. \(=\) Indonesian, Jap. \(=\) Japanese, Kor. \(=\) Korean, Rus. \(=\) Russian, Vie. \(=\) Vietnamese, Mand. \(=\) Mandarin, Kaz. \(=\) Kazakh, Tib. \(=\) Tibetan, and Uyg. \(=\) Uyghur. Alt-F. \(=\) Altaic Family, and Sino-F. \(=\) Sino-Tibetan Family

To investigate further in this direction, the first 120 harmonic amplitude FPs are extracted from a randomly chosen speech signal of each language in both databases (IITKGP-MLILSC and AP18-OLR), and their means are computed across frames. The obtained results for the two databases are presented in Figs. 3 and 4. Interesting characteristics can be observed from the maximum peaks of the mean harmonic amplitudes. For better peak visualization, scatter plots are provided separately in Figs. 5 and 6 for the IITKGP-MLILSC and AP18-OLR databases, respectively. Figures 5 and 6 present the results of six random observation trials. Each observation trial (subplot) consists of the maximum peaks of the mean harmonic amplitudes corresponding to a randomly chosen speech signal from each language of the respective databases. From Fig. 5, it is observed that:

  • For all observation trials, the majority of the Indian languages have the maximum peaks of the mean harmonic amplitudes at the lower harmonics.

  • In each observation trial, for any particular language, the maximum peak of the mean harmonic amplitude occurs at a different, seemingly random harmonic. This is expected, since for each observation trial one speech signal is randomly picked from every language.

  • To investigate this random nature of the peak formation, the maximum peaks of the mean harmonic amplitudes at adjacent harmonics, as well as distant maximum peaks occurring at the same harmonic, are grouped into clusters (indicated by circles and ellipses in Fig. 5).

  • The majority of peaks within a cluster belong to languages from one of the three major Indian linguistic families. In Fig. 5, different cluster groups are denoted by different line styles: clusters drawn with solid, dashed, and dotted lines denote the Indo-Aryan, Dravidian, and Tibeto-Burman families, respectively.

  • For example, in Fig. 5c and 5f, three Dravidian languages, namely Konkani, Tamil, and Telugu, together formed a single cluster. Similarly, in Fig. 5d, e, two Dravidian languages, namely Konkani and Telugu, together formed a single cluster.

  • For example, in Fig. 5a, b, d, e, two Tibeto-Burman languages, namely Manipuri and Mizo, together formed a single cluster. Similarly, in Fig. 5c, f, two Tibeto-Burman languages, namely Manipuri and Nagamese, together formed a single cluster.

  • A similar analysis can be made for the Indo-Aryan languages as well, and the corresponding clusters can be observed in Fig. 5.

Similarly, from Fig. 6, it is observed that:

  • For all observation trials, the majority of the oriental languages have the maximum peaks of the mean harmonic amplitudes at the lower harmonics.

  • The maximum peaks of the mean harmonic amplitudes are grouped into clusters using the same procedure as in Fig. 5. The majority of peaks within a cluster belong to languages from one of the two oriental linguistic families. In Fig. 6, different cluster groups are denoted by different line styles: clusters drawn with solid and dashed lines denote the Altaic and Sino-Tibetan families, respectively.

  • For example, in Fig. 6a, b, c, two Altaic languages, namely Japanese and Korean, together formed a single cluster. Similarly, in Fig. 6d, three Altaic languages, namely Japanese, Korean, and Kazakh, together formed a single cluster.

  • For example, in Fig. 6a, d, e, three Sino-Tibetan languages, namely Cantonese, Mandarin, and Tibetan, together formed a single cluster. Similarly, in Fig. 6c, f, two Sino-Tibetan languages, namely Cantonese and Tibetan, together formed a single cluster.

The characteristics described above are observed across multiple observation trials, where in each trial a speech signal is randomly chosen from each language of the respective databases to generate similar plots. The distinct characteristics exhibited by the FPs show that there is a relationship between the FPs and the language traits, and this relationship is exploited for the task of spoken language recognition. The characteristics of the harmonic amplitude FPs studied in Figs. 1, 2, 3, 4, 5, and 6 are with respect to the mean statistical parameter; a similar investigation can be performed using other statistical parameters as well.

Global features usually provide superior performance in terms of computational efficiency and classification accuracy [4]. Therefore, statistical parameters, namely the mean, median, standard deviation, minimum, and maximum of the FP features across all l frames, are calculated to derive global FP features. The computed global FP features are used to construct global FP feature vectors, which are in turn used to develop the spoken LID/LR systems.

The global FP feature vector is constructed for each speech signal following the procedure described in [44], which can be briefly summarized as follows. First, M FP features are extracted from every frame of the speech signal as described in (14); from each frame, the first 120 harmonic amplitude coefficients \(\left( H\right) \) are considered. Dynamic coefficients, namely the 120 first-order differences \(\left( \Delta H\right) \) and 120 second-order differences \(\left( \Delta \Delta H\right) \), are then computed. Finally, the mean, median, standard deviation, minimum, and maximum of the 120 harmonic amplitudes corresponding to \(\left( H_{1{-}120}\right) \), \(\left( \Delta H_{1{-}120}\right) \), and \(\left( \Delta \Delta H_{1{-}120} \right) \) are computed across all l frames and concatenated to form an 1800-dimensional global FP feature vector. The resulting structure of the feature vector is depicted in Table 2. The global FP feature vectors of all speech signals, each having 1800 features, are finally used for the task of speaker-independent spoken language recognition.

Table 2 Structure of global FP feature vector
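A sketch of this construction is given below, where H is the (120 x l) matrix of per-frame harmonic amplitudes (e.g., as returned by the earlier sketch). The exact delta formula is not stated in the paper, so simple frame-to-frame differences are assumed here.

```python
import numpy as np

def global_fp_vector(H):
    """Build the 1800-dimensional global FP feature vector from an
    (M x l) matrix H of per-frame harmonic amplitudes (M = 120)."""
    # Dynamic coefficients: assumed to be simple frame-to-frame differences
    dH = np.diff(H, n=1, axis=1, prepend=H[:, :1])      # first-order difference
    ddH = np.diff(dH, n=1, axis=1, prepend=dH[:, :1])   # second-order difference

    def stats(F):
        # mean, median, std, min, max across frames -> 5 * M values
        return np.concatenate([F.mean(axis=1), np.median(F, axis=1),
                               F.std(axis=1), F.min(axis=1), F.max(axis=1)])

    # 3 blocks (H, dH, ddH) * 5 statistics * 120 harmonics = 1800 features
    return np.concatenate([stats(H), stats(dH), stats(ddH)])
```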

4.3 Mel-Frequency Cepstral Coefficient Features

MFCC is considered the benchmark feature set, widely employed in diverse fields of speech signal processing since it was first introduced in [10]. These features are popularly used for spoken language recognition. In this paper, MFCC features are used for performance comparison with the proposed FP features. The mel-filter bank used for extracting the MFCC features is based on the MFCC-FB40 configuration: it comprises 40 triangular filters of equal height with logarithmically spaced center frequencies, spanning the desired frequency range of \(\left( 0, {F_\mathrm{s}}/{2}\right] \) Hz. Further details of the mel-filter bank design equations and the procedure for MFCC feature extraction are provided in [29, 37].

In this paper, global MFCC feature vectors are used for spoken language recognition. Each vector comprises the mean, median, standard deviation, minimum, and maximum of the traditional MFCC feature vector. First, the 13 MFCCs along with their associated first-order differences (\(\Delta \)-MFCC) and second-order differences (\(\Delta \Delta \)-MFCC) are extracted from the individual frames of the speech signal to form a 39-dimensional feature vector [29]. Then, its mean, median, standard deviation, minimum, and maximum are computed across all frames to form a 195-dimensional global MFCC feature vector. The resulting structure of the feature vector is depicted in Table 3. The global MFCC feature vectors of all speech signals, each having 195 features, are finally used for the task of speaker-independent spoken language recognition.

Table 3 Structure of global MFCC feature vector
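A corresponding sketch of the 195-dimensional global MFCC vector is shown below, using librosa as a stand-in extractor; the paper's exact frame, filter-bank, and liftering settings may differ, so the extraction call should be read as an approximation rather than the authors' setup.

```python
import numpy as np
import librosa

def global_mfcc_vector(x, fs):
    """Build the 195-dimensional global MFCC feature vector for one utterance."""
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=13, n_mels=40)  # 13 x frames
    d1 = librosa.feature.delta(mfcc, order=1)                      # delta-MFCC
    d2 = librosa.feature.delta(mfcc, order=2)                      # delta-delta-MFCC
    F = np.vstack([mfcc, d1, d2])                                  # 39 x frames

    # mean, median, std, min, max across frames -> 5 * 39 = 195 features
    return np.concatenate([F.mean(axis=1), np.median(F, axis=1),
                           F.std(axis=1), F.min(axis=1), F.max(axis=1)])
```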

4.4 Normalization and Scaling of Feature Vectors

Normalization and scaling are two important data pre-processing techniques, widely used in data science and machine learning to standardize data. They are used in this paper to pre-process the feature vectors and assist in developing robust spoken LID/LR systems.

Feature normalization eliminates recording and speaker variability [8, 44], thereby preserving the effectiveness of language discrimination. In this paper, a simple mean-variance normalization is used for normalizing the feature vectors, given by,

$$\begin{aligned} f_{j}^{i} = \left( \frac{{\hat{f}}_{j}^{i}-\mu _{j}}{\sigma _{j}}\right) . \end{aligned}$$
(16)

Here, (16) normalizes each feature j of feature vector i (from a given set of \(i=1,~2,\ldots ,r\) feature vectors) using the mean \(\mu _{j}\) and standard deviation \(\sigma _{j}\) of feature j computed over all feature vectors. \({\hat{f}}_{j}^{i}\) is feature j of the current feature vector i. Normalization does not bound the feature values to any specific range; it only gives each feature zero mean and unit variance. It is therefore desirable to perform scaling after normalization, since most machine learning algorithms work better with scaled data (features).
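A minimal sketch of the mean-variance normalization of (16), applied column-wise to a matrix whose rows are feature vectors; the small constant added to \(\sigma \) is our guard against constant features, not part of the paper.

```python
import numpy as np

def mean_variance_normalize(F):
    """Eq. (16): zero-mean, unit-variance normalization of each feature (column)
    over all r feature vectors (rows of the r x d matrix F)."""
    mu = F.mean(axis=0)
    sigma = F.std(axis=0) + 1e-12   # guard against constant features
    return (F - mu) / sigma
```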

Feature scaling makes the feature values fall within a specified range (say, [0, 1] or \([-1,~1]\)), which helps improve the overall training (learning) efficiency of the classifier. The scaled feature vectors are fed as inputs to the classifier during both training and testing. In this paper, a simple min–max scaling is employed, defined as [11],

$$\begin{aligned} f_{j}^{^\prime i}= & {} \left( \frac{(s_{\max } - s_{\min })\times (f_{j}^{i} - \min (f_{j}))}{\left( \max (f_{j}) - \min (f_{j})\right) }\right) + s_{\min },\nonumber \\ where~f_{j}^{^\prime i}= & {} {\left\{ \begin{array}{ll} 0~or~-1, &{} \text {if}\ f_{j}^{i} = \min (f_{j}), \\ 1, &{} \text {if}\ f_{j}^{i} = \max (f_{j}), \\ (0,1)~or~(-1,1) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(17)

Here, (17) scales each feature j of feature vector i (from a given set of \(i=1,~2,\ldots ,r\) feature vectors) to the range specified by \([s_{\min },~s_{\max }]\), chosen as either [0, 1] or \([-1,~1]\). \(\min (f_{j})\) and \(\max (f_{j})\) are the minimum and maximum values of feature j over all feature vectors, respectively, and \(f_{j}^{i}\) is feature j of the current feature vector i.

The feature scaling transform is fit only on the training data and not on the entire data set (including the test data) [11]. The values \(\min (f_{j})\) and \(\max (f_{j})\) must be preserved so that they can be used to scale: (1) future inputs applied to the classifier for additional training, and (2) new inputs applied to the classifier for testing. Therefore, \(\min (f_{j})\) and \(\max (f_{j})\) effectively become an integral part of the classifier model, similar to its weights and biases.
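A sketch of the min–max scaling of (17) fitted on the training data only, using scikit-learn's MinMaxScaler as one possible implementation; F_train and F_test are hypothetical feature matrices (rows are normalized feature vectors), not objects defined in the paper.

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training feature matrix only, as described above;
# the per-feature min(f_j) and max(f_j) learned here are stored in the scaler
# and reused for any future training or test data.
scaler = MinMaxScaler(feature_range=(-1, 1))     # or (0, 1)
F_train_scaled = scaler.fit_transform(F_train)   # learns per-feature min/max
F_test_scaled = scaler.transform(F_test)         # applies the stored min/max
```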

Scaling the targetFootnote 12 values is usually not necessary. If the original targets are scaled, the classifier will be trained to produce outputs in the scaled range, so post-processing using \(\min (t_{j})\) and \(\max (t_{j})\) (the minimum and maximum values of target j over all target vectors, respectively) is needed to convert the classifier outputs back to the original targets. In such cases, \(\min (t_{j})\) and \(\max (t_{j})\) also become an integral part of the classifier model, similar to \(\min (f_{j})\) and \(\max (f_{j})\). This is typically the case for ANN classifiers, since by default they operate with numeric targets.

4.5 Feature Selection

Processing high-dimensional feature vectors requires substantial computational resources and time [35]. Moreover, not all available features are pertinent: the performance of a classification algorithm degrades in the presence of irrelevant, noisy, and redundant features. Therefore, it is necessary to reduce the dimensionality of the feature vectors to improve the efficiency and effectiveness of the classifiers. Thus, the method of feature selection is employed in this paper.

Feature selection is a traditional, state-of-the-art dimensionality reduction method that aims at finding a subset of useful features from the original feature vectors. It provides many benefits in terms of improving the understandability, scalability, generalization, and recognition capability of classifiers [3]. In this paper, the ReliefF algorithmFootnote 13 is employed to perform feature selection; its details are briefly discussed in the following sub-section.

4.5.1 ReliefF Feature Selection

The ReliefF algorithm is a supervised, filter-method feature selection approach belonging to the family of Relief-based feature selection algorithms (RFAs). It is capable of detecting conditional dependencies between feature vector attributesFootnote 14 and provides a unified view of attribute estimation for classification problems [32]. Attribute estimation involves computing attribute scores (weights), which are used to rank and select the top-scoring features.

The locality of the estimates (predictor ranks and weights) is controlled by the user-defined parameter k.Footnote 15 If k is set to 1, the computed estimates can become unreliable, especially for noisy data. If k is set to a value comparable to the total number of observations (instances), the ReliefF algorithm fails to find the important predictors. For most applications, the value of k can safely be set to 10 [32].

In this paper, the ReliefF algorithm is used to reduce the dimensionality of the global FP and MFCC features and thereby improve the recognition accuracies of the spoken LID/LR systems. The results of ReliefF feature selection for the global FP and MFCC features of the IITKGP-MLILSC and AP18-OLR databases are shown in Figs. 7, 8, and 9. The plots of computed weights versus feature attributes for the global FP and MFCC features are shown in Figs. 7 and 8, respectively. The plots of computed weights (arranged in descending order of magnitude) versus assigned predictor ranks for both global FP and MFCC features are shown in Fig. 9. Important features can be selected either in terms of the estimated feature weights or the assigned predictor ranks.

Fig. 7 ReliefF feature selection performed on the global FP feature vectors of a the IITKGP-MLILSC database and b the AP18-OLR database. For illustration, the figure shows the selection of the top 50 features having low rank and high weight, as estimated by the ReliefF feature selection algorithm. In the same manner, the top 900 features are selected from the 1800 global FP features for the task of speaker-independent spoken language recognition

Fig. 8 ReliefF feature selection performed on the global MFCC feature vectors of a the IITKGP-MLILSC database and b the AP18-OLR database. For illustration, the figure shows the selection of the top 50 features having low rank and high weight, as estimated by the ReliefF feature selection algorithm. In the same manner, the top 100 features are selected from the 195 global MFCC features for the task of speaker-independent spoken language recognition

Figures 7 and 8 show the selection of important features in terms of the estimated feature weights. It is observed that, for both global FP and MFCC features, the estimated weights of the attributes vary throughout the length of the feature vector. Attributes with relatively higher weights are considered significant, while those with relatively lower weights are considered insignificant. Significant features contribute more toward enhancing the recognition performance of the classifiers, so only the top attributes with relatively higher weights are selected and the rest are ignored. For illustration, Figs. 7 and 8 show the selection of the top 50 features with relatively higher weights. From Fig. 7, it is observed that the ReliefF algorithm selects the top 50 features from different sub-parts (the features corresponding to H, \(\Delta H\), and \(\Delta \Delta H\)) of the feature vector. Figure 7 also underlines the importance of incorporating both first-order and second-order dynamic coefficients in the global FP feature vectors, since a significant number of important features are selected from them.

Figure 9 shows the selection of important features in terms of the assigned predictor ranks. Ranks are assigned to the attributes based on their estimated weights: the attribute with the highest weight gets the lowest rank and vice versa. It is observed that, as the feature rank increases, the corresponding feature weight decreases. Hence, only the top attributes with relatively lower ranks are selected, and the rest are ignored. In either of the two approaches, the selected attributes always have relatively lower ranks and higher weights. Feature selection thereby achieves the goal of dimensionality reduction for the global FP and MFCC feature vectors.

Fig. 9 Plot of ReliefF computed weights versus assigned ranks. a For global FP features. b For global MFCC features

For the task of speaker-independent spoken LID/LR, the ReliefF feature selection algorithm is used to select the top 900 and top 100 discriminative features from the 1800 global FP and 195 global MFCC features, respectively.
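As a rough sketch of this step, the ReliefF implementation in the skrebate package could be used (an assumption; the paper does not name its implementation) to rank the scaled features and keep the top 900 (FP) or top 100 (MFCC). F_train_scaled, F_test_scaled, and y_train are the hypothetical scaled feature matrices and integer language labels from the previous sketches.

```python
import numpy as np
from skrebate import ReliefF   # assumption: ReliefF from the skrebate package

# Estimate ReliefF weights with k = 10 nearest neighbors (as suggested in [32])
# and keep the top-weighted features, e.g. 900 of the 1800 global FP features.
selector = ReliefF(n_neighbors=10, n_features_to_select=900)
selector.fit(F_train_scaled, y_train)           # y_train: integer language labels

top_idx = np.argsort(selector.feature_importances_)[::-1][:900]
F_train_sel = F_train_scaled[:, top_idx]        # reduced training features
F_test_sel = F_test_scaled[:, top_idx]          # same selected columns for test data
```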

5 Machine Learning Classification

This section discusses the architectures of the SVM, ANN, and DNN classifiers used in this paper to develop spoken LID/LR models for Indian and oriental languages.

5.1 Support Vector Machine Classification

In the literature, SVMs gained popularity by demonstrating good performance on classical problems in diverse fields of pattern recognition. They are widely used because they rely on convex quadratic optimization, which yields a globally optimal solution. They are discriminative in nature, and their performance is independent of the number of feature vectors [28]. They provide good generalization on classification problems by implementing the principle of structural risk minimization [40]. SVMs were originally designed for binary classification problems [9]; different methods have been proposed in the literature for constructing a multiclass classifier by combining several binary classifiers [14].

Two such methods, namely 'one-versus-one' (OVOFootnote 16) and 'one-versus-all' (OVAFootnote 17), are considered in this paper to develop the SVM-based spoken LID/LR models. A linear kernel function is chosen for the SVM, since it is found to give good performance for the spoken language recognition task [34]. The iterative single data algorithm (ISDA) is used as the solver (for training the kernel machines); ISDA is well suited to learning from very large training sets, since it requires minimal computational resources [16].

In total, 2 different SVM architectures are considered to develop spoken LID/LR systems in Indian and oriental languages. The SVM models are separately trained and tested with global MFCC, FP, and the combination of MFCC \(+\) FP features. The obtained results are presented, analyzed, and discussed in Sect. 6.1.
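A sketch of the two multiclass configurations with a linear kernel is shown below, using scikit-learn (which offers no ISDA solver, so LinearSVC's default solver stands in for the paper's setup); F_train_sel, F_test_sel, y_train, and y_test are the hypothetical selected feature matrices and labels from the earlier sketches.

```python
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Linear-kernel SVMs in the OVO and OVA multiclass configurations.
svm_ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000))
svm_ova = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000))

svm_ova.fit(F_train_sel, y_train)                  # train the OVA system
accuracy_ova = svm_ova.score(F_test_sel, y_test)   # fraction of test utterances recognized
```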

5.2 Artificial Neural Network Classification

ANNs are capable of learning highly complex, nonlinear mappings between inputs and outputs. They are especially useful in applications where the underlying statistics of the task are not well known. A wide variety of neural network architectures is available in the literature, of which the multilayer feed-forward back-propagation architecture is the one most commonly used in pattern classification/recognition problems; it is the architecture considered in this paper to develop the spoken language recognition models.

Three variants of the feed-forward back-propagation ANN architecture, with one, two, and three hidden layers, respectively, are considered. For each variant, three further sub-variants are obtained by choosing three different neuron activation functions, namely tan sigmoid,Footnote 18 log sigmoid,Footnote 19 and elliot sigmoid,Footnote 20 in the hidden layers. Thus, a total of 9 different ANN architectures are used to develop spoken LID/LR systems. The neuron activation function in the output layer of all ANN models is fixed to the softmaxFootnote 21 function. The number of passive neurons in the input layer and of active neurons in the output layer equals the size of the feature vector fed to the network and the number of output targets, respectively. The number of active neurons in each hidden layer is set to 2/3 of the number of neurons in the input layer (or the preceding hidden layer) [33] plus the number of active neurons in the output layer.

The feature vectors are divided into two sets, namely a training set and a testing set; a small part of the training set is reserved as the validation set. The ANN models are trained with the scaled conjugate gradient (SCG) back-propagation algorithm described in [23]. The SCG algorithm is attractive for pattern recognition problems: it is very efficient and computationally faster than other training algorithms for relatively large networks with a huge number of weights [13]. During the training (learning) phase, the SCG algorithm uses the training set to calculate the gradient and accordingly updates the network weights and biases. The training process is controlled by a validation parameter, the cross-entropy error, which measures the network generalization and halts training when the generalization stops improving (i.e., before over-fitting, indicated by an increase in the cross-entropy error). The cross-entropy error is evaluated on the validation set and monitored during training. The testing set has no effect on the training and validation process; it provides an independent measure of the network performance during and after training.

In total, 9 different ANN architectures are considered to develop spoken LID/LR systems in Indian and oriental languages. The ANN models are separately trained and tested with global MFCC, FP, and the combination of MFCC \(+\) FP features. The obtained results are presented, analyzed, and discussed in Sect. 6.2.
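A sketch of one such ANN variant is given below using scikit-learn's MLPClassifier; scikit-learn provides neither the SCG solver nor the elliot sigmoid, so its default Adam solver and the tanh activation stand in, and only the hidden-layer sizing rule and the validation-based early stopping from the text are reproduced. The feature and target counts are illustrative placeholders.

```python
from sklearn.neural_network import MLPClassifier

def hidden_sizes(n_inputs, n_outputs, n_layers):
    """Hidden-layer sizing rule from the text: 2/3 of the preceding layer
    plus the number of output neurons."""
    sizes, prev = [], n_inputs
    for _ in range(n_layers):
        prev = int(round(2 * prev / 3)) + n_outputs
        sizes.append(prev)
    return tuple(sizes)

# e.g. 900 selected FP features, 15 target languages, 2 hidden layers
layer_sizes = hidden_sizes(900, 15, 2)            # -> (615, 425)
ann = MLPClassifier(hidden_layer_sizes=layer_sizes, activation='tanh',
                    early_stopping=True, validation_fraction=0.1,
                    max_iter=500)
ann.fit(F_train_sel, y_train)                     # validation split handled internally
```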

5.3 Deep Neural Network Classification

Recently, the impressive gains achieved by deep neural networks on classification problems have motivated the use of DNNs to develop spoken LID/LR systems [31]. A wide variety of DNN architectures is available in the literature. Traditional end-to-end DNNs have a drawback for modeling spoken language recognition: the decision is usually taken at every frame over a fixed context, while the language label applies to the whole speech utterance [24]. This paper overcomes this drawback by using long short-term memory recurrent neural networks (LSTM-RNN) as the DNN (similar to the one used in [38]) to develop spoken LID/LR systems. The LSTM units perform utterance-wise classification by effectively capturing and memorizing the long temporal context [12].

In this paper, the LSTM-RNN network architecture is used to develop spoken LID/LR systems. The network is made up of 5 types of layers, in order: the sequence input layer, LSTM layer, fully connected (FC) layer, softmax layer, and classification output layer. The network starts with the sequence input layer, followed by the LSTM layer, and ends with the FC, softmax, and classification output layers, which make the predictions about the language (targets or classes). The sequence input layer feeds the feature sequences into the network. The LSTM layer learns long-term dependencies from the feature sequences and performs additive interactions that improve gradient flow during training. In the LSTM layers, the hyperbolic tangentFootnote 22 and sigmoidFootnote 23 functions are used as the state and gate activation functions, respectively. The FC layer is similar to the hidden layers of an ANN: every hidden unit (neuron) in the FC layer connects to all hidden units in the previous layer. The network is made deeper by inserting additional (more than one) LSTM and FC layers. The number of hidden units in the last FC layer equals the number of targets, and an optimal number of hidden units is used in the LSTM and FC layers. The softmax layer applies a softmax function (also called the normalized exponential) to its input; it is the output-unit activation function after the last FC layer. Finally, the classification layer assigns each input to one of the k mutually exclusive classes using the cross-entropy function.

The LSTM-RNN networks are trained using the adaptive moment estimation (ADAM) algorithm [17]. The number of epochs is set between 500 and 2000, and three different mini-batch sizes, namely 32, 1024, and 2048, are considered. The training data are shuffled before each training epoch. The hyperparameters of the layers (LSTM and FC) and of the network (apart from those mentioned in this paper) are set to their default values.Footnote 24\(^{,}\)Footnote 25\(^{,}\)Footnote 26

In total, 6 different LSTM-RNN network architectures, namely (1) a 5-layer network with 1 LSTM and 1 FC layer, (2) a 6-layer network with 1 LSTM and 2 FC layers, (3) a 6-layer network with 2 LSTM and 1 FC layers, (4) a 7-layer network with 1 LSTM and 3 FC layers, (5) a 7-layer network with 2 LSTM and 2 FC layers, and (6) an 8-layer network with 2 LSTM and 3 FC layers, are considered to develop spoken LID/LR systems for Indian and oriental languages. The LSTM-RNN networks are separately trained and tested with the global MFCC, FP, and combined MFCC \(+\) FP features. The obtained results are presented, analyzed, and discussed in Sect. 6.3.
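A sketch of the smallest (5-layer) variant is given below in Keras notation; the hidden-unit count (256) is a placeholder, since the paper only states that an optimal number of units is used, and the deeper variants would be obtained by stacking further LSTM (with return_sequences=True) and Dense layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_features, num_languages = 900, 15             # e.g. selected FP features, targets

# 5-layer variant: sequence input, 1 LSTM layer, 1 FC layer, softmax, classification.
model = models.Sequential([
    layers.LSTM(256, input_shape=(None, num_features)),  # sequence input + LSTM layer
    layers.Dense(num_languages),                          # fully connected (FC) layer
    layers.Softmax(),                                     # softmax (normalized exponential)
])
model.compile(optimizer=tf.keras.optimizers.Adam(),       # ADAM training [17]
              loss='sparse_categorical_crossentropy',     # cross-entropy classification
              metrics=['accuracy'])
# model.fit(X_train_seq, y_train, epochs=500, batch_size=1024, shuffle=True)
```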

6 Experimental Results and Discussion

The experimental results presented in this section are obtained by evaluating the spoken LID/LR models developed on the IITKGP-MLILSC and AP18-OLR databases. For the IITKGP-MLILSC database, the training and testing speech utterances each have a fixed duration of 10 s. On average, 23 min of speech data per language are used for training and \(10~\min \) for testing, which corresponds to an average of 4 min of speech data per speaker for training and \(2~\min \) for testing. For the AP18-OLR database, the entire training and development data sets are used for training and testing, respectively. The speakers, and their corresponding feature vectors, used for training and testing the models are completely independent and mutually exclusive in all experiments presented in this paper. SVM-, ANN-, and DNN (LSTM-RNN)-based spoken LID/LR models are developed for Indian and oriental languages using the global FP, MFCC, and combined MFCC \(+\) FP features.

Initially, the performance of the FP features is evaluated by choosing 10, 20, 30, and 40 FP features (along with their associated first-order and second-order differences) for speaker-independent spoken LID/LR. The respective global FP feature vectors are constructed for different combinations of features, namely (H), \((H+\Delta H)\), and \((H+\Delta H+\Delta \Delta H)\), in the same manner as described in Sect. 4.2 and Table 2 (for the case of 120 FP features); the only difference is the total number of features in the respective feature vectors. SVM and ANN classifiers are trained and tested with the resulting global FP feature vectors, and the obtained results are presented in Fig. 10. From this figure, it is observed that the recognition accuracy increases with each increment of 10 FP features. Moreover, the recognition accuracy increases when the dynamic features, namely the first-order and second-order differences, are incorporated. Third-order and fourth-order differences were also evaluated and found to be ineffective in improving the recognition accuracy. The figure clearly shows the significance of the dynamic features in enhancing the recognition performance of the spoken LID/LR systems, which is why they are taken into account while constructing the global FP feature vectors, as described in Sect. 4.2 and Table 2.

Fig. 10

Spoken language recognition using H, \(\Delta H\), and \(\Delta \Delta H\). Here, the x-axis represents the number of Fourier parameters. a Results of the SVM classifier on the IITKGP-MLILSC database. b Results of the SVM classifier on the AP18-OLR database. c Results of the ANN classifier on the IITKGP-MLILSC database. d Results of the ANN classifier on the AP18-OLR database

Finally, the performance of 120 FP features is evaluated by developing spoken LID/LR systems using SVM, ANN, and DNN (LSTM-RNN) classifiers. The performance of the global FP features is then compared with that of the global MFCC features. The net effect on the performance of the spoken LID/LR systems when using the combination of global MFCC \(+\) FP features is also evaluated. The obtained results are independently analyzed with respect to the SVM, ANN, and DNN (LSTM-RNN) classifiers in Sects. 6.1, 6.2, and 6.3, respectively. The best results achieved by the SVM, ANN, and DNN (LSTM-RNN) classifiers using global MFCC, FP, and the combination of MFCC \(+\) FP features are compared in Sect. 6.4.

6.1 SVM-Based Spoken LID/LR Systems

Table 4 presents the recognition accuracies achieved by SVM classifiers using global MFCC, FP, and MFCC \(+\) FP features on the IITKGP-MLILSC and AP18-OLR databases. The maximum recognition accuracies achieved with respect to the different feature sets on both databases are marked in bold. As can be seen from Table 4, the models with the OVA configuration perform better than those with the OVO configuration. For example, in the case of MFCC \(+\) FP features on the IITKGP-MLILSC database, the OVA configuration achieves a recognition accuracy of 86.40%, whereas the OVO configuration achieves 78.40% for the same feature set, a significant reduction in performance. In addition, the models with the OVA configuration are less complex and computationally more efficient than the models with the OVO configuration in terms of the number of binary learners.
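The two configurations differ only in how the multi-class problem is decomposed into binary SVMs, as the hedged scikit-learn sketch below illustrates; the kernel and other SVM hyperparameters shown are placeholders rather than the paper's settings.

```python
# Sketch: one-vs-all (OVA) vs. one-vs-one (OVO) multi-class SVM configurations.
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

def build_svm(strategy="ova"):
    base = SVC(kernel="rbf")              # kernel choice is a placeholder
    if strategy == "ova":
        return OneVsRestClassifier(base)  # N binary learners for N languages
    return OneVsOneClassifier(base)       # N * (N - 1) / 2 binary learners

# For the 15 IITKGP-MLILSC languages: OVA trains 15 binary SVMs, while OVO
# trains 15 * 14 / 2 = 105, which is why OVA is the less complex option here.
```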

Table 4 Recognition performance of SVM-based spoken LID/LR systems
Table 5 Confusion matrix of SVM classifier for speaker-independent spoken language recognition using 120 FP features on IITKGP-MLILSC database (%)

In the case of IITKGP-MLILSC database, with respect to the proposed FP features, the SVM model with OVA configuration achieves the maximum recognition accuracy of \(73.40\%\). The corresponding confusion matrix is shown in Table 5.

As can be seen from Table 5, the languages Bengali, Hindi, Manipuri, Tamil, and Telugu are frequently misclassified as other languages. With respect to MFCC features, the SVM model with the OVO configuration achieves the maximum recognition accuracy of \(67.70\%\). The use of FP features thus shows an improvementFootnote 27 of \(8.42\%\) in recognition accuracy when compared to MFCC features. With respect to MFCC \(+\) FP features, the SVM model with the OVA configuration achieves the maximum recognition accuracy of \(86.40\%\), a significant improvement of \(27.62\%\) and \(17.71\%\) over MFCC and FP features, respectively. Finally, the recognition accuracies achieved for individual languages by the best SVM models trained with global MFCC, FP, and MFCC \(+\) FP features on the IITKGP-MLILSC database are compared in the bar graph of Fig. 11. It is clear from Fig. 11 that the combined MFCC \(+\) FP features outperform the MFCC and FP features for most of the languages.
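These improvement figures are consistent with relative (rather than absolute) percentage gains; for instance, for the SVM models on the IITKGP-MLILSC database:

\[
\text{improvement} = \frac{73.40 - 67.70}{67.70} \times 100\% \approx 8.42\%, \qquad
\frac{86.40 - 67.70}{67.70} \times 100\% \approx 27.62\%.
\]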

Fig. 11

Comparison of individual language recognition accuracies of best SVM classifiers with respect to global MFCC, FP, and MFCC \(+\) FP features on IITKGP-MLILSC database. The results depicted in this plot correspond to the models presented in Table 4, whose recognition accuracies are marked in bold. Ben. \(=\) Bengali, Chh. \(=\) Chhattisgarhi, Guj. \(=\) Gujarati, Hin. \(=\) Hindi, Kas. \(=\) Kashmiri, Kon. \(=\) Konkani, Man. \(=\) Manipuri, Miz. \(=\) Mizo, Nag. \(=\) Nagamese, Pun. \(=\) Punjabi, Raj. \(=\) Rajasthani, San. \(=\) Sanskrit, Sin. \(=\) Sindhi, Tam. \(=\) Tamil, and Tel. \(=\) Telugu

Table 6 Confusion matrix of SVM classifier for speaker-independent spoken language recognition using 120 FP features on AP18-OLR database (%)

In the case of the AP18-OLR database, with respect to the proposed FP features, the SVM model with the OVA configuration achieves the maximum recognition accuracy of \(62.83\%\). The corresponding confusion matrix is shown in Table 6. As can be seen from Table 6, the languages Indonesian, Korean, Mandarin, and Kazakh are frequently misclassified as other languages. With respect to MFCC features, the SVM model with the OVA configuration achieves the maximum recognition accuracy of \(59.16\%\). The use of FP features thus shows an improvement of \(6.20\%\) in recognition accuracy when compared to MFCC features. With respect to MFCC \(+\) FP features, the SVM model with the OVA configuration achieves the maximum recognition accuracy of \(70.73\%\), a significant improvement of \(19.56\%\) and \(12.57\%\) over MFCC and FP features, respectively. Finally, the recognition accuracies achieved for individual languages by the best SVM models trained with global MFCC, FP, and MFCC \(+\) FP features on the AP18-OLR database are compared in the bar graph of Fig. 12. It is clear from Fig. 12 that the combined MFCC \(+\) FP features outperform the MFCC and FP features for most of the languages.

Fig. 12

Comparison of individual language recognition accuracies of best SVM classifiers with respect to global MFCC, FP, and MFCC \(+\) FP features on AP18-OLR database. The results depicted in this plot correspond to the models presented in Table 4, whose recognition accuracies are marked in bold. Can. \(=\) Cantonese, Ind. \(=\) Indonesian, Jap. \(=\) Japanese, Kor. \(=\) Korean, Rus. \(=\) Russian, Vie. \(=\) Vietnamese, Mand. \(=\) Mandarin, Kaz. \(=\) Kazakh, Tib. \(=\) Tibetan, and Uyg. \(=\) Uyghur

6.2 ANN-Based Spoken LID/LR Systems

Table 7 presents the recognition accuracies achieved by ANN classifiers using global MFCC, FP, and MFCC \(+\) FP features on the IITKGP-MLILSC and AP18-OLR databases. The maximum recognition accuracies achieved with respect to the different feature sets on both databases are marked in bold. As can be seen from Table 7, the recognition performance drops as the number of hidden layers increases, for most of the cases. For example, in the case of MFCC \(+\) FP features on the IITKGP-MLILSC database, the ANN model with 1 hidden layer using the log-sigmoid activation function achieves a recognition accuracy of \(89.40\%\), whereas the corresponding ANN models with 2 and 3 hidden layers achieve \(88.60\%\) and \(88.30\%\), respectively, showing a nominal drop in performance. The use of a single hidden layer is therefore found to be sufficient for achieving reasonably good recognition performance; such models are also less complex and computationally more efficient than the models with two or three hidden layers. It is further observed that most of the models with the tan-sigmoid activation function in the hidden layers perform better than the rest.
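As an illustration, these configurations can be expressed with scikit-learn's MLPClassifier, where tan-sigmoid corresponds to the tanh activation and log-sigmoid to the logistic activation; the hidden-layer width and the remaining training settings below are assumptions.

```python
# Sketch of the ANN configurations: 1-3 hidden layers with tan-sigmoid (tanh)
# or log-sigmoid (logistic) activations. The layer width is a placeholder.
from sklearn.neural_network import MLPClassifier

def build_ann(num_hidden_layers=1, units=256, activation="tanh"):
    # activation: "tanh" (tan-sigmoid) or "logistic" (log-sigmoid)
    return MLPClassifier(hidden_layer_sizes=(units,) * num_hidden_layers,
                         activation=activation)

# e.g. the best IITKGP-MLILSC configuration reported above: one hidden layer
# with the log-sigmoid activation, trained on the global MFCC + FP features
ann = build_ann(num_hidden_layers=1, activation="logistic")
```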

Table 7 Recognition performance of ANN-based spoken LID/LR systems
Table 8 Confusion matrix of ANN classifier for speaker-independent spoken language recognition using 120 FP features on IITKGP-MLILSC database (%)

In the case of the IITKGP-MLILSC database, with respect to the proposed FP features, the ANN model with 1 hidden layer using the tan-sigmoid activation function achieves the maximum recognition accuracy of \(74.20\%\). The corresponding confusion matrix is shown in Table 8. As can be seen from Table 8, the languages Bengali, Hindi, and Punjabi are frequently misclassified as other languages. With respect to MFCC features, the ANN models with 2 and 3 hidden layers using the tan-sigmoid activation function both achieve the maximum recognition accuracy of \(71.10\%\). The use of FP features thus shows an improvement of \(4.36\%\) in recognition accuracy when compared to MFCC features. With respect to MFCC \(+\) FP features, the ANN model with 1 hidden layer using the log-sigmoid activation function achieves the maximum recognition accuracy of \(89.40\%\), a significant improvement of \(25.74\%\) and \(20.49\%\) over MFCC and FP features, respectively. Finally, the recognition accuracies achieved for individual languages by the best ANN models trained with global MFCC, FP, and MFCC \(+\) FP features on the IITKGP-MLILSC database are compared in the bar graph of Fig. 13. It is clear from Fig. 13 that the combined MFCC \(+\) FP features outperform the MFCC and FP features for most of the languages.

Fig. 13

Comparison of individual language recognition accuracies of best ANN classifiers with respect to global MFCC, FP, and MFCC \(+\) FP features on IITKGP-MLILSC database. The results depicted in this plot correspond to the models presented in Table 7, whose recognition accuracies are marked in bold. Ben. \(=\) Bengali, Chh. \(=\) Chhattisgarhi, Guj. \(=\) Gujarati, Hin. \(=\) Hindi, Kas. \(=\) Kashmiri, Kon. \(=\) Konkani, Man. \(=\) Manipuri, Miz. \(=\) Mizo, Nag. \(=\) Nagamese, Pun. \(=\) Punjabi, Raj. \(=\) Rajasthani, San. \(=\) Sanskrit, Sin. \(=\) Sindhi, Tam. \(=\) Tamil, and Tel. \(=\) Telugu

Table 9 Confusion matrix of ANN classifier for speaker-independent spoken language recognition using 120 FP features on AP18-OLR database (%)

In the case of the AP18-OLR database, with respect to the proposed FP features, the ANN model with 1 hidden layer using the tan-sigmoid activation function and the ANN model with 3 hidden layers using the log-sigmoid activation function both achieve the maximum recognition accuracy of \(64.70\%\). The corresponding confusion matrix is shown in Table 9. As can be seen from Table 9, the languages Russian, Mandarin, and Uyghur are frequently misclassified as other languages. With respect to MFCC features, the ANN model with 1 hidden layer using the log-sigmoid activation function achieves the maximum recognition accuracy of \(61.30\%\). The use of FP features thus shows an improvement of \(5.55\%\) in recognition accuracy when compared to MFCC features. With respect to MFCC \(+\) FP features, the ANN models with 2 and 3 hidden layers using the log-sigmoid activation function both achieve the maximum recognition accuracy of \(70.80\%\), a significant improvement of \(15.50\%\) and \(9.43\%\) over MFCC and FP features, respectively. Finally, the recognition accuracies achieved for individual languages by the best ANN models trained with global MFCC, FP, and MFCC \(+\) FP features on the AP18-OLR database are compared in the bar graph of Fig. 14. It is clear from Fig. 14 that the combined MFCC \(+\) FP features outperform the MFCC and FP features for most of the languages.

Fig. 14

Comparison of individual language recognition accuracies of best ANN classifiers with respect to global MFCC, FP, and MFCC \(+\) FP features on AP18-OLR database. The results depicted in this plot correspond to the models presented in Table 7, whose recognition accuracies are marked in bold. Can. \(=\) Cantonese, Ind. \(=\) Indonesian, Jap. \(=\) Japanese, Kor. \(=\) Korean, Rus. \(=\) Russian, Vie. \(=\) Vietnamese, Mand. \(=\) Mandarin, Kaz. \(=\) Kazakh, Tib. \(=\) Tibetan, and Uyg. \(=\) Uyghur

Table 10 Recognition performance of DNN (LSTM-RNN)-based spoken LID/LR systems

6.3 DNN-Based Spoken LID/LR Systems

Table 10 presents the recognition accuracies achieved by LSTM-RNN classifiers using global MFCC, FP, and MFCC \(+\) FP features on the IITKGP-MLILSC and AP18-OLR databases. The maximum recognition accuracies achieved with respect to the different feature sets on both databases are marked in bold. As can be seen from Table 10, there is a nominal drop in the recognition performance as the number of layers (LSTM, FC, or both) increases, for most of the cases. The LSTM-RNN networks with 5 layers (1 LSTM and 1 FC layer) and 6 layers (1 LSTM and 2 FC layers) are found to be sufficient for achieving reasonably good recognition performance. These networks with fewer layers are also less complex and computationally more efficient than the networks with more layers.

Table 11 Confusion matrix of DNN (LSTM-RNN) classifier for speaker-independent spoken language recognition using 120 FP features on IITKGP-MLILSC database (%)

In the case of the IITKGP-MLILSC database, with respect to the proposed FP features, the LSTM-RNN network with 7 layers (2 LSTM and 2 FC layers) achieves the maximum recognition accuracy of \(64.87\%\). The corresponding confusion matrix is shown in Table 11. As can be seen from Table 11, the languages Bengali, Hindi, Konkani, Punjabi, and Telugu are frequently misclassified as other languages. With respect to MFCC features, the LSTM-RNN network with 5 layers (1 LSTM and 1 FC layer) achieves the maximum recognition accuracy of \(60.43\%\). The use of FP features thus shows an improvement of \(7.35\%\) in recognition accuracy when compared to MFCC features. With respect to MFCC \(+\) FP features, the LSTM-RNN network with 5 layers (1 LSTM and 1 FC layer) achieves the maximum recognition accuracy of \(84.52\%\), a significant improvement of \(39.86\%\) and \(30.29\%\) over MFCC and FP features, respectively. Finally, the recognition accuracies achieved for individual languages by the best LSTM-RNN networks trained with global MFCC, FP, and MFCC \(+\) FP features on the IITKGP-MLILSC database are compared in the bar graph of Fig. 15. It is clear from Fig. 15 that the combined MFCC \(+\) FP features outperform the MFCC and FP features for most of the languages.

In the case of the AP18-OLR database, with respect to the proposed FP features, the LSTM-RNN network with 6 layers (1 LSTM and 2 FC layers) achieves the maximum recognition accuracy of \(60.52\%\). The corresponding confusion matrix is shown in Table 12. As can be seen from Table 12, the languages Indonesian, Korean, Russian, and Mandarin are frequently misclassified as other languages. With respect to MFCC features, the LSTM-RNN network with 6 layers (1 LSTM and 2 FC layers) achieves the maximum recognition accuracy of \(54.83\%\). The use of FP features thus shows an improvement of \(10.38\%\) in recognition accuracy when compared to MFCC features. With respect to MFCC \(+\) FP features, the LSTM-RNN network with 5 layers (1 LSTM and 1 FC layer) achieves the maximum recognition accuracy of \(68.54\%\), a significant improvement of \(25.00\%\) and \(13.25\%\) over MFCC and FP features, respectively. Finally, the recognition accuracies achieved for individual languages by the best LSTM-RNN networks trained with global MFCC, FP, and MFCC \(+\) FP features on the AP18-OLR database are compared in the bar graph of Fig. 16. It is clear from Fig. 16 that the combined MFCC \(+\) FP features outperform the MFCC and FP features for most of the languages.

Fig. 15

Comparison of individual language recognition accuracies of best DNN (LSTM-RNN) classifiers with respect to global MFCC, FP, and MFCC \(+\) FP features on IITKGP-MLILSC database. The results depicted in this plot correspond to the models presented in Table 10, whose recognition accuracies are marked in bold. Ben. \(=\) Bengali, Chh. \(=\) Chhattisgarhi, Guj. \(=\) Gujarati, Hin. \(=\) Hindi, Kas. \(=\) Kashmiri, Kon. \(=\) Konkani, Man. \(=\) Manipuri, Miz. \(=\) Mizo, Nag. \(=\) Nagamese, Pun. \(=\) Punjabi, Raj. \(=\) Rajasthani, San. \(=\) Sanskrit, Sin. \(=\) Sindhi, Tam. \(=\) Tamil, and Tel. \(=\) Telugu

Table 12 Confusion matrix of DNN (LSTM-RNN) classifier for speaker-independent spoken language recognition using 120 FP features on AP18-OLR database (%)
Fig. 16

Comparison of individual language recognition accuracies of best DNN (LSTM-RNN) classifiers with respect to global MFCC, FP, and MFCC \(+\) FP features on AP18-OLR database. The results depicted in this plot correspond to the models presented in Table 10, whose recognition accuracies are marked in bold. Can. \(=\) Cantonese, Ind. \(=\) Indonesian, Jap. \(=\) Japanese, Kor. \(=\) Korean, Rus. \(=\) Russian, Vie. \(=\) Vietnamese, Mand. \(=\) Mandarin, Kaz. \(=\) Kazakh, Tib. \(=\) Tibetan, and Uyg. \(=\) Uyghur

6.4 Performance Comparison of Different Spoken LID/LR Models

Table 13 compares the recognition performance of the best spoken LID/LR systems in terms of databases, features, and classifiers, summarizing the results presented in Tables 4, 7, and 10. The maximum recognition accuracies are marked in bold. As can be seen from Table 13, for both databases (IITKGP-MLILSC and AP18-OLR) and all three classifiers (SVM, ANN, and DNN), the FP features outperform the MFCC features, and the combined MFCC \(+\) FP features outperform both the MFCC and FP features. For the IITKGP-MLILSC and AP18-OLR databases, the ANN-based spoken LID/LR systems trained with the combined MFCC \(+\) FP features achieve the maximum recognition accuracies of \(89.40\%\) and \(70.80\%\), respectively.

Table 13 Comparison of recognition performance achieved by the best spoken LID/LR systems in terms of databases, features, and classifiers
Table 14 Best features among MFCC, FP, and MFCC \(+\) FP for improved recognition with respect to individual languages of IITKGP-MLILSC database
Table 15 Best features among MFCC, FP, and MFCC \(+\) FP for improved recognition with respect to individual languages of AP18-OLR database
Table 16 Comparison of obtained results with few benchmark works in the literature using IITKGP-MLILSC database

Tables 14 and 15 show the best feature set among MFCC, FP, and MFCC \(+\) FP for the individual languages of the IITKGP-MLILSC and AP18-OLR databases, respectively. For the SVM, ANN, and DNN classifiers, the MFCC \(+\) FP features show a significant improvement in individual language recognition accuracies when compared to MFCC and FP features. From the results presented in Tables 13, 14, and 15, it can be summarized that, compared to MFCC features, the FP features improve the recognition performance of the speaker-independent spoken LID/LR systems, and the performance is further enhanced when MFCC and FP features are combined. The results obtained on the IITKGP-MLILSC database can be compared with existing results in the literature obtained using the same database; Table 16 compares the results of this paper with those presented in [21, 26, 30].

It is clear from Table 16 that the proposed FP and the combination of MFCC \(+\) FP features improve the recognition performance of the spoken LID/LR systems when compared to the existing systems.

It can therefore be concluded that FP features are effective in discriminating between languages and can be employed to develop robust spoken LID/LR systems.

7 Conclusion and Future Work

In previous studies, different frame-based spectral features have been used for spoken language recognition. In this paper, a new Fourier parameter (FP) model is proposed to extract salient features from acoustic speech signals, for capturing the language-specific information. The proposed features are evaluated on two multilingual speech databases, namely IITKGP-MLILSC and AP18-OLR databases, in Indian and oriental languages, respectively.

Harmonic amplitude FPs are estimated from every frame of the speech signal. Initially, the characteristics of the harmonic amplitude FPs are studied using the mean as a statistical parameter. First, a single harmonic amplitude FP is extracted from the frames of speech signals corresponding to one speaker from each language of both databases, and its mean is computed across frames for each speech signal. It is observed that the amplitudes vary across languages, and a similar kind of variation is observed for the other harmonics. An adequate number of harmonic amplitude FPs is necessary, since it is difficult to classify or recognize signals based on features obtained from a single harmonic. Therefore, the first 120 harmonic amplitude FPs are extracted from a randomly chosen speech signal of each language for both databases, and their means are computed across frames. Interesting characteristics are observed from the maximum peaks of the mean harmonic amplitudes: the peaks corresponding to speech signals of different languages that occur at the same or adjacent harmonics group together into clusters, and the clusters so formed contain the majority of the peaks corresponding to a distinct linguistic family. This phenomenon is observed over a number of trials performed with a randomly chosen speech signal from each language of both databases. The distinct characteristics exhibited by the FPs show that there is a relationship between FPs and language traits, and this relationship is exploited to develop robust spoken LID/LR models.

Global features are known to provide superior performance. Therefore, statistical parameters, namely the mean, median, standard deviation, minimum, and maximum, of the 120 FP features (along with their associated first-order and second-order differences) are computed across frames to construct an 1800-dimensional global FP feature vector. MFCC features are also extracted alongside the FP features for performance comparison: the 39-dimensional MFCC features are transformed into 195-dimensional global MFCC features using the same statistical parameters as for the global FP features. The ReliefF feature selection algorithm is then used to reduce the dimensions of the global MFCC and FP features by discarding irrelevant, noisy, and redundant features. Based on the ReliefF estimates, the top 100 and the top 900 discriminative features (those with higher weights and lower ranks) are selected from the 195 global MFCC features and the 1800 global FP features, respectively. The global MFCC and FP feature sets with reduced dimensions are used to develop spoken LID/LR models for 15 Indian and 10 oriental languages, respectively.
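A minimal sketch of this global feature construction is given below, assuming the five statistics are applied to the frame-level features and their differences; the ReliefF step is indicated with the skrebate library as one possible implementation, and both the library choice and the ordering of the delta computation are assumptions rather than details taken from this paper.

```python
# Sketch: derive a global feature vector from frame-level features by stacking
# five statistics (mean, median, std, min, max) computed across frames.
import numpy as np

def global_stats(frame_features):
    """frame_features: (num_frames, D) -> (5 * D,) global feature vector."""
    stats = (np.mean, np.median, np.std, np.min, np.max)
    return np.concatenate([f(frame_features, axis=0) for f in stats])

# 120 FP features + deltas + delta-deltas = 360 per frame -> 5 * 360 = 1800 global
g_fp = global_stats(np.random.rand(250, 360))    # shape (1800,)
# 39-dimensional MFCC per frame -> 5 * 39 = 195 global
g_mfcc = global_stats(np.random.rand(250, 39))   # shape (195,)

# ReliefF-based selection (illustrative; requires the skrebate package):
# from skrebate import ReliefF
# selector = ReliefF(n_features_to_select=900)
# X_fp_reduced = selector.fit_transform(X_fp_global, y_labels)
```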

Spoken LID/LR models are developed using SVM-, ANN-, and DNN (LSTM-RNN)-based classifiers. The models are independently trained and tested using global MFCC, FP, and MFCC \(+\) FP features. In the case of IITKGP-MLILSC database, the experimental results show that the proposed FP features improve the recognition accuracies over MFCC features by \(8.42\%\), \(4.36\%\), and \(7.35\%\) (with respect to SVM, ANN, and DNN classifiers). The use of combined MFCC \(+\) FP features improves the recognition accuracies over MFCC features by \(27.62\%\), \(25.74\%\), and \(39.86\%\) (with respect to SVM, ANN, and DNN classifiers) and over FP features by \(17.71\%\), \(20.49\%\), and \(30.29\%\) (with respect to SVM, ANN, and DNN classifiers).

Similarly, in the case of AP18-OLR database, the experimental results show that the proposed FP features improve the recognition accuracies over MFCC features by \(6.20\%\), \(5.55\%\), and \(10.38\%\) (with respect to SVM, ANN, and DNN classifiers). The use of combined MFCC \(+\) FP features improves the recognition accuracies over MFCC features by \(19.56\%\), \(15.50\%\), and \(25.00\%\) (with respect to SVM, ANN, and DNN classifiers), and over FP features by \(12.57\%\), \(9.43\%\), and \(13.25\%\) (with respect to SVM, ANN, and DNN classifiers).

The experimental results show that the proposed FP features are highly effective in characterizing and recognizing languages from speech signals when compared to MFCC features. Moreover, the FP features further improve language recognition performance when combined with MFCC features. The obtained results establish that the proposed FP features are useful for spoken language recognition.

This paper analyzes the performance of combining MFCC and FP features to develop robust spoken LID/LR systems in Indian and oriental languages. The present study can be extended to develop and analyze spoken LID/LR systems that combine FP features with other sophisticated language-specific spectral and prosodic features, using different classification models. The robustness and the degree of contribution of each type of feature set toward improving recognition performance can also be investigated, and the development of robust LID/LR models can be explored for noisy environments.