1 Introduction

The main objective of an automatic LID system is to identify the language of a given speech sample correctly [4]. An ideal LID system should exploit the different aspects of speech information that are useful for distinguishing languages from a large number of target languages. In practice, the performance of an LID system depends largely on the number of target languages. To obtain higher accuracy for a system involving a large number of target languages, the languages can be pre-classified into sub-language families or other categories. Moreover, a highly accurate pre-classification module is required to identify closely related languages or languages of the same origin.

In order to address this aspect, Wang et al. [44] outlined a novel system for pre-classifying languages into tonal and non-tonal categories at the utterance level, using different parameters of the pitch contour and duration as features and an ANN as the classifier. They extended this work to show the impact of the pre-classification task on system performance [45], reporting that performance improves by 4–5% when languages are pre-classified into tonal and non-tonal categories before individual language classification. They also reported a reduction in CPU time for prosody-based two-level language identification systems. However, this system has several disadvantages. Its main drawback is the use of phonetically labeled data, which makes the system unusable where neither a linguistic expert nor phonetically labeled data is available. Extending such a system to a new language would also be nontrivial. In [44, 45], the researchers studied the effectiveness of a pre-classification module in distinguishing the world's distinct languages; however, no work has so far studied the usefulness of such a system in distinguishing closely related Indian languages. Moreover, in [44, 45], feature parameters are first extracted from each of the voiced segments constituting an utterance, and the feature representation of the utterance is then estimated. The literature confirms that for tonal languages, tonal events are aligned with segmental events [5]. The peak and valley of the pitch contour are aligned to the onset and offset of a segment [46], and therefore pitch can be utilized to segment continuous speech into smaller analysis units that closely correspond to syllable-like units [24]. Accordingly, either open or sonorant closed syllables can be considered the tone bearing units of tonal languages [51]. Thus, for tonal/non-tonal classification of languages, syllable-level analysis may lead to a more discriminative feature representation. Besides, the NE Indian languages are known to be syllable centric [39]; that is, the language-specific cues are most evident at the syllable level itself. This paper therefore proposes a syllable-level tonal/non-tonal pre-classification based LID system for NE Indian languages that does not depend on a phonetic engine.

Attributes like pitch, duration, and energy, which render the naturalness of speech and are collectively called prosody, are less affected by noise. Prosodic features cannot be derived from the phoneme structure of an utterance and are also very difficult to replicate. Even in speech recognition, human beings make use of prosodic information to discern the distinctness of perceived sounds [33]. In several LID tasks [24, 32], prosodic features have also been used as information complementary to vocal tract information. The literature reveals that a vast share (almost half) of the world's languages is tonal [6, 13, 22]. For tonal languages, pitch is an important phonological cue, and it changes in a regular manner within a tone bearing unit. Moreover, tone correlates effectively with other prosodic features like the energy profile and duration [31]. However, the prosodic feature parameters proposed in [24, 32] are not sufficient for the tonal/non-tonal discrimination task. Effective parameterization of prosody can be a viable way to improve the performance of the pre-classification system, even though prosody-only LID systems still lag behind state-of-the-art cepstral feature-based LID systems.

On the other hand, spectral features, namely MFCCs, persist as the de facto features for any language identification system. They have also been identified as quite useful carriers of tone information [21, 37], and in two-stage language identification systems, MFCCs have remained the most useful features, probably due to their admissible performance. They have proven quite useful for identification of Indian languages [42]. In [17], Jothilakshmi et al. reported a hierarchical LID system for nine Indian languages using MFCC features, MFCCs with delta and double delta coefficients (∆ and ∆ − ∆), and shifted delta coefficient (SDC) features, and noticed that a GMM–UBM model with MFCC + ∆ + ∆ − ∆ features provides the highest accuracy among these features. They also noted the usefulness of a two-level LID system for identifying languages of the same origin. In [2], the authors used MFCC and SDC features to identify four under-resourced and closely related South-Asian languages, reporting good accuracy on 3 s test utterances. In another approach, Yin et al. [50] proposed a hierarchical LID (HLID) system in which a tree structure is followed to identify languages with higher accuracy. Instead of a two-level identification system, a test utterance is classified level by level, depending on the most distinguishing information at each level. They showed that, because of the hierarchy, system performance improves not only over the baseline system but also over a likelihood score fusion-based system. However, the authors did not study the impact of the hierarchy-based approach on identification of closely related NE Indian languages. Moreover, in [2, 17, 50], the researchers extract MFCC features from the frames constituting an utterance. This type of representation may not be the most suitable way to represent tonal characteristics, which generally lie at the syllable level. To overcome these difficulties, this work proposes a syllable-level representation of MFCC features, obtained by fitting each coefficient of the MFCC vectors across all the frames of a syllable using Legendre polynomials. The existing literature [2, 17, 21, 37, 42, 50] has not studied the effectiveness of MFCC features for tonal/non-tonal language classification or for the pre-classification-based LID task. This paper also studies the complementarity of MFCC features with prosody at the syllable level for pre-classification and proposes the use of MFCCs with their ∆ and ∆ − ∆ coefficients in combination with prosodic features for the pre-classification based LID task.

The systems [25, 32] process the language-specific information lying at different levels (sub-segmental, segmental, and supra-segmental) using individual models and then combine the scores to generate the final decision. In contrast, the proposed syllabic-level MFCC representation enables us to explore feature-level combination of spectral and prosodic features.

As per the literature, both generative and discriminative modeling approaches have been used for the LID task [23, 24, 32]. In this study, ANN, a discriminative classifier; GMM–UBM, a generative one; and i-vector based SVM, which exploits the strengths of both approaches, have been explored. In [23], researchers reported a system in which the whole utterance is divided into fixed-length segments, and the i-vector for the utterance is then obtained from spectral and prosodic features extracted from those segments. This approach to segmentation does not consider the actual syllable boundaries and may lead to inaccurate representation of acoustic events within a segment. As tonal events are prominently characterized at the syllable level [28], features should preferably be extracted from syllables or syllable-like units. This work thus explores a syllable-level framework with all three classifiers, namely ANN, GMM–UBM, and i-vector based SVM.

In this paper, the focus is laid on closely related languages of NE India. The ethnic mix of this region affects the languages its people share to communicate with each other, and language diversity is one of the interesting phenomena of the NE states of India. The influence of one language on another, as well as of the languages of bordering countries, is very high in NE India; therefore, distinguishing among these languages with high accuracy is more difficult than for other, more distinct languages. Available language resources in India hardly include the NE Indian languages. This necessitates the preparation of a database of NE Indian languages for building a good LID system, which is quite a challenging task.

The contributions of this paper are as follows:

  • An automatic tonal/non-tonal language pre-classification based LID system has been proposed for closely related NE Indian languages without using any phonetic information.

  • A more effective parameterization of prosodic features has been proposed that helps boost performance at the pre-classification stage as well as the individual language identification stage. A syllable-level representation of MFCC features using Legendre coefficients has been proposed. The complementarity of MFCC and prosodic features extracted at the syllable level, as well as their combination, has been explored for the pre-classification based LID task.

  • The syllables are known to be the most appropriate tone bearers for tonal languages. This work therefore explores syllable-level feature representation for the tonal/non-tonal pre-classification based LID system.

  • NIT Silchar language database (NITS-LD) has been prepared, covering seven NE Indian languages to carry out our experiment. The seven languages are Assamese, Bengali, Indian English, Hindi, Manipuri, Mizo, and Nagamese. The data have been collected from All India Radio news, and a total of 4 h of data for each of the languages is considered. These languages are closely related, and speakers from the regions are usually multilingual.

  • A comparative performance analysis has been done for pre-classification and individual language identification among three different classifiers, namely GMM–UBM, ANN, and i-vector based SVM using syllable-level features. Experiments have been carried out for the combination of ANN model in pre-classification stage and each one of the three models in the individual language identification stage to obtain the best possible performance of the system.

The rest of the paper is organized as follows: Section 2 describes the proposed system for language identification. Section 3 discusses the development of the language identification system used to perform the experiments. Experimental results and analysis of the proposed system are given in Sect. 4, and Sect. 5 concludes the work and mentions future directions.

2 Proposed System for Language Identification

This section describes the workings of the proposed pre-classification-based LID system. It consists of a tonal/non-tonal language pre-classification stage, followed by two parallel modules in the second stage: one for identification of tonal languages and the other for non-tonal languages. For performance analysis, experiments have been carried out considering three different cases, namely Case I, Case II, and Case III. Case I represents the baseline system, where language identification is done as in a conventional LID system. In the training stage, either a separate model (L1, L2, …, LM) is built for each of the M languages, or a single discriminative model, such as a neural network, is trained to distinguish among the languages. At testing, identification is done by comparing the likelihood scores of the trial utterance against all the models, or simply by the decision of the discriminative model. The front-end features considered for this system are prosody + MFCC with its ∆ and ∆ − ∆ coefficients.

In Case II and Case III, languages are first pre-classified into tonal and non-tonal categories, and the individual languages are then identified in the next stage. In Case II, every test trial, whether correctly categorized or not at the pre-classification stage, is passed to the next stage of classification. In Case III, only the correctly categorized test trials (separated manually) are processed by the individual language identification stage.

In Case II and Case III, the combination of prosody and MFCC is used as the front-end feature at the pre-classification stage, and the combination of prosody and MFCC with its ∆ and ∆ − ∆ coefficients is used at the individual language identification stage. The block diagram representations of Case II and Case III are shown in Fig. 1.

Fig. 1
figure 1

Block diagram representation of the proposed system

Figure 2 illustrates the two-stage LID system, highlighting the distinct features used at the different stages. Here, L1 (Assamese), L2 (Bengali), L3 (Indian English), L4 (Nagamese), L5 (Hindi), L6 (Mizo), and L7 (Manipuri) are the languages involved in the experiment. The details of the pre-classification module are in Sect. 2.1.

Fig. 2
figure 2

Use of different features at different stages of the proposed system

2.1 Tonal and Non-tonal Language Pre-classification System

The block diagram representation of the language pre-classification system, using prosodic and spectral features extracted from the syllables of the speech signal, is shown in Fig. 3. Syllables can be treated as context dependent units, and they also capture some co-articulation, which is useful for language discrimination [20]. Syllables generally follow common structures like vowel (V), vowel consonant (VC), consonant vowel consonant consonant (CVCC), and vowel consonant consonant (VCC). In the case of Indian languages, most syllables are of the CV type [18]. Sometimes, tonal properties can be associated with the onset or/and offset of syllables [5, 47]. To obtain prosodic features corresponding to each syllable-like unit, the pitch and energy contours of the whole utterance are obtained first. In this study, pitch is calculated via the autocorrelation method using the robust algorithm for pitch tracking (RAPT) [43], which also detects the unvoiced frames of an utterance. Energy values calculated from each 10 ms frame of an utterance constitute its energy contour. The contours are then smoothed using a fifth-order median filter, after which the detected vowel onset points (VOPs) [30] are associated with the smoothed pitch and energy contours. The pitch and energy contours between every pair of consecutive VOPs are obtained and then parameterized to obtain the feature vectors. Contours shorter than 50 ms are not considered.
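To make the segmentation procedure concrete, the following minimal Python sketch splits a smoothed contour into syllable-like segments between consecutive VOPs. The pitch contour and VOP indices are placeholder inputs here (an actual system would obtain them from a RAPT implementation and the VOP detector of [30]); the 10 ms frame shift and the 50 ms minimum duration follow the text.

```python
import numpy as np
from scipy.signal import medfilt

FRAME_MS = 10       # frame shift assumed in the text
MIN_DUR_MS = 50     # contours shorter than 50 ms are discarded

def syllable_contours(contour, vops):
    """Split a median-smoothed contour into segments between consecutive VOPs."""
    smoothed = medfilt(np.asarray(contour, dtype=float), kernel_size=5)
    segments = []
    for start, end in zip(vops[:-1], vops[1:]):
        seg = smoothed[start:end]
        if len(seg) * FRAME_MS >= MIN_DUR_MS:
            segments.append(seg)
    return segments

# Toy example: synthetic pitch contour with hypothetical VOP frame indices
f0 = 120 + 20 * np.sin(np.linspace(0, 6 * np.pi, 300))
vops = [0, 40, 95, 160, 230, 299]
print([len(s) for s in syllable_contours(f0, vops)])
```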

Fig. 3
figure 3

Block diagram representation of the tonal and non-tonal language pre-classification module

Here, the duration of each syllable has been calculated as the number of frames between two consecutive VOPs. Duration is then parameterized by the rhythm parameter and taken as another feature. Spectral features are extracted from the overlapping frames of each syllable, and the feature vectors for all frames corresponding to a syllable are stacked together. In the next step, a voiced/unvoiced detection algorithm is used to identify the frames where speech is present, and only the features corresponding to voiced frames are retained. In this experiment, the contour corresponding to each dimension of the spectral features of a syllable is parameterized. The parameters of the prosodic and spectral information thereby obtained are concatenated to form the final feature vector of the syllable. These combined feature vectors are then fed to the classifiers.

3 Development of the Language Identification System

In this work, a pre-classification based language identification system has been proposed for Northeast Indian languages. Generally, a language identification system consists of two important components: feature set and classifiers. This section describes the features and classifiers considered in this work.

3.1 Extraction and Parameterization of Different Features for Language Identification

Here, pitch contour, energy contour, duration of the speech segments, and MFCC are used as front-end features for the two-stage LID task. Different parameters of these features are discussed in this section.

3.1.1 Parameterization of Prosody for Language Identification

Existing parameters for tonal and non-tonal language classification:

The existing parameters of prosody, namely A1: mean pitch [44], A2: pitch changing speed [44], and A3: pitch changing level [44], are calculated from each syllable of the speech signal. For utterance-level analysis, these parameters are obtained using the same process as discussed in [44].

Proposed prosody parameters for tonal/non-tonal pre-classification based LID system

  • Parameterization of pitch contour

In this work, following parameters are used to parameterize the pitch contour.

  • A4: Amplitude tilt for pitch contour (\( FA_{\text{t}} \)).

  • A5: Duration tilt for pitch contour (FDt).

Level tones, namely high (H) and low (L) tones, and contour tones, such as rise, fall, fall–rise, or rise–fall tones, dictate lexical meaning in tonal languages. In non-tonal languages, lexical meaning does not change with changes in the pitch contour. Besides, different tonal languages have their own fixed sets of tones: for example, Mizo is known to have four tones, Manipuri has two, Mandarin has four, and Vietnamese has six. These contours can therefore help characterize different languages. To represent the dynamics of these contours, the amplitude tilt (A4) and duration tilt (A5) parameters are generally used [1]. These quantities are defined as:

$$ FA_{\text{t}} = \frac{{\left| {A_{\text{r}} } \right| - \left| {A_{\text{f}} } \right|}}{{\left| {A_{\text{r}} } \right| + \left| {A_{\text{f}} } \right|}} $$
(1)
$$ FD_{\text{t}} = \frac{{\left| {D_{\text{r}} } \right| - \left| {D_{\text{f}} } \right|}}{{\left| {D_{\text{r}} } \right| + \left| {D_{\text{f}} } \right|}} $$
(2)

where \( A_{\text{r}} \) and \( A_{\text{f}} \) are the rise and fall amplitudes of the pitch contour, respectively, with respect to the peak of the contour. Similarly, Dr and Df are the durations corresponding to the rise and fall, respectively.
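A hedged numerical sketch of Eqs. (1) and (2) follows; it computes the two tilt parameters of one per-syllable pitch contour around its peak. The contour values are illustrative, and a small epsilon guards against flat contours (an assumption of this sketch, not part of the original definition).

```python
import numpy as np

def tilt_parameters(contour, eps=1e-9):
    """Amplitude tilt (A4) and duration tilt (A5) of a contour, per Eqs. (1)-(2)."""
    p = int(np.argmax(contour))
    A_r = contour[p] - contour[0]        # rise amplitude w.r.t. the peak
    A_f = contour[p] - contour[-1]       # fall amplitude w.r.t. the peak
    D_r = p                              # rise duration (frames)
    D_f = len(contour) - 1 - p           # fall duration (frames)
    FA_t = (abs(A_r) - abs(A_f)) / (abs(A_r) + abs(A_f) + eps)
    FD_t = (abs(D_r) - abs(D_f)) / (abs(D_r) + abs(D_f) + eps)
    return FA_t, FD_t

contour = np.array([100.0, 110.0, 130.0, 125.0, 115.0, 105.0])
print(tilt_parameters(contour))
```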

  • A6: Change in pitch (ΔF0)

Several researchers have investigated the relation of tone height (the height of the peak of the pitch contour) to lingual articulation and jaw movement, and its role in expressing different degrees of emphasis. In non-tonal languages, pitch can be varied freely, while in a tonal language, pitch is phonemically contrastive. Tone height will therefore differ between tonal and non-tonal languages and hence can be used as a feature for this system. It is estimated as the difference between the pitch values at the peak (F0p) and the valley point (F0v):

$$ \Delta F_{0} = F_{{0{\text{p}}}} - F_{{0{\text{v}}}} $$
(3)
  • A7: Distance of peak of pitch contour with respect to VOP (Dr)

The literature shows [32] that the alignment of the peak of the pitch contour has a bearing on perceptual prominence [31]. For non-tonal languages, like English and Greek, the peak is consistently aligned with the onset of the accented syllable, while for tonal languages, like Mandarin, it is aligned to the offset of the tone bearing syllable. Therefore, the peak location of the pitch contour with respect to the VOP (Dr) may help characterize different languages.

  • A8: Distance of 60% of the peak value of the pitch contour with respect to VOP

Researchers have observed a significant effect of the place features and manner of articulation of consonants on the tonal onset for languages of the Tibeto-Burman family, like Dimasa and Mizo [38], and this effect permeates to a great extent into the contour of the following tone. In another study, it was noticed that the tonal onset can shift due to the interaction between tones and segments (syllables) [28]. This characteristic behavior can help distinguish one language from another. It has been found experimentally that the extent to which these effects on the tonal onset propagate into the segment can be roughly approximated by the location of 60% of the peak value of the pitch contour. This work therefore proposes the distance of the location with 60% of the peak value, with respect to the VOP, as a feature for language classification.
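The following sketch illustrates the three pitch-peak parameters A6–A8 for one syllable, under the assumption that the contour starts at the syllable's VOP (so distances are frame counts from that point) and that the 60%-of-peak location is taken on the rising side; the paper does not spell out these details, so they are assumptions of this illustration.

```python
import numpy as np

def peak_parameters(f0_syll):
    """Delta-F0 (A6), peak distance from VOP (A7), 60%-of-peak distance (A8)."""
    p = int(np.argmax(f0_syll))
    delta_f0 = f0_syll[p] - f0_syll.min()      # Eq. (3): peak minus valley
    dist_peak = p                              # frames from the VOP to the peak
    rising = np.where(f0_syll[:p + 1] >= 0.6 * f0_syll[p])[0]
    dist_60 = int(rising[0]) if rising.size else p
    return delta_f0, dist_peak, dist_60

f0_syll = np.array([95.0, 105.0, 140.0, 150.0, 130.0, 110.0])
print(peak_parameters(f0_syll))
```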

  • Parameterization of energy contour

Stress is assumed to be present to some degree in all languages. Some syllables are considered stressed because they are in some sense perceptually more prominent than others. Stress is parasitic; it can be produced by the phonetic correlates of other phenomena, like pitch and duration. In most cases, syllables with higher pitch variation and longer duration are considered stressed. The way stress arises in the speech signal is highly language dependent, and it is quantified here by energy parameters. In tonal languages, especially where register tones occur, a direct correlation between tone and stress exists [31]; however, for most tonal languages, stress is much less obvious [11]. On the other hand, for non-tonal languages, like English, stress is obvious. Stress is thus yet another language-dependent trait and can be used to complement the pitch contour cues. Stress is calculated from the energy values of all the voiced frames present within a syllable. Six parameters have been used to quantify the stress characteristics in this work.

  • A9: Mean energy

Mean energy is calculated by averaging the energy values of the energy contour corresponding to the syllable.

  • A10: Change in log energy

Following the quantitative measure of stress characteristics described in [24], log energy has also been considered in this work, as it is more akin to human perception of stress variation.

  • A11: Energy changing speed

Similar to the pitch contour, the energy contour of a syllable is also found to characterize languages [35]. The literature reveals that there is an interaction between tone and stress for register tone languages, and for languages containing both level and contour tones [31], like Standard Chinese. Also, in tonal languages, stress does not necessarily coincide with high tone. Hence, like pitch changing speed, energy changing speed may also be used to characterize languages. It is estimated according to the equation:

$$ {\text{EV}}_{j} = \mathop \sum \limits_{i = 1}^{N - 1} \left| {E_{i + 1} - E_{i} } \right| $$
(4)

Here j represents the index of each segment; N represents the number of frames present in the segment, and E1, E2, …, EN represent the energy values of the frames within the segment. The normalized energy changing speed is given by:

$$ \widetilde{{\text{EV}}}_{j} = \frac{{{\text{EV}}_{j} }}{{{\text{mean energy}} \times {\text{number of voiced frames}}}} $$
(5)
  • A12: Energy changing level

The energy changing speed is a local parameter of the energy contour and does not capture the gross change across a syllable. Therefore, to model the global nature of the change, another parameter, called the energy changing level, is introduced; it is given by

$$ \left(\check{\sigma_{\text{e}}}\right)_{j} = \frac{{\sigma_{{{\text{e}}_{j}}}}}{{{\text{mean energy}} \times {\text{number of vowels}}}} $$
(6)

where \( \sigma_{{{\text{e}}_{j} }} \) is the standard deviation of the energy of the jth segment (syllable) and \( \left(\check{\sigma_{\text{e}}}\right)_{j} \) is the normalized energy changing level. Being a global parameter, the energy changing level can be used to discriminate tonal languages from non-tonal ones.
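As a quick check of Eqs. (4)–(6), the sketch below computes the energy changing speed and level for one syllable from illustrative per-frame energies; it assumes all frames passed in are voiced and, per the text, one vowel per syllable.

```python
import numpy as np

def energy_dynamics(E, n_vowels=1):
    """Energy changing speed (A11, Eqs. 4-5) and level (A12, Eq. 6)."""
    ev = np.abs(np.diff(E)).sum()              # Eq. (4): summed frame-to-frame change
    ev_norm = ev / (E.mean() * len(E))         # Eq. (5): len(E) = voiced frame count
    level = E.std() / (E.mean() * n_vowels)    # Eq. (6): normalized std. deviation
    return ev_norm, level

E = np.array([0.2, 0.5, 0.9, 0.7, 0.4, 0.3])
print(energy_dynamics(E))
```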

  • A13: Amplitude tilt for energy contour (\( EA_{\text{t}} \))

The dynamics of the energy contour are usually defined using tilt parameters. Amplitude tilt for energy contour is calculated as follows:

$$ EA_{\text{t}} = \frac{{\left| {A_{\text{er}} } \right| - \left| {A_{\text{ef}} } \right|}}{{\left| {A_{\text{er}} } \right| + \left| {A_{\text{ef}} } \right|}} $$
(7)

Here, \( A_{\text{er}} \) and \( A_{\text{ef}} \) are the rise and fall amplitudes of the energy contour, respectively, with respect to the peak value of the contour.

  • A14: Duration tilt for energy contour (EDt)

Duration tilt can also be used for quantitative representation of the energy contour dynamics. It can be expressed as per the equation:

$$ ED_{\text{t}} = \frac{{\left| {D_{\text{er}} } \right| - \left| {D_{\text{ef}} } \right|}}{{\left| {D_{\text{er}} } \right| + \left| {D_{\text{ef}} } \right|}} $$
(8)

Here, Der and Def are the durations of the rise and fall, respectively.

  • A15: Distance of peak of energy contour with respect to VOP

Similar to the case of pitch, the distance of the peak of the energy contour with respect to the VOP is also used in this work as a possible cue for language classification. Such a parameter may be useful given that stress and pitch are correlated in some languages and uncorrelated in others.

  • A16: Distance of 60% of the peak value of the energy contour with respect to VOP

By the same reasoning of language-dependent correlation between stress and pitch, yet another parameter is introduced, defined in a way similar to the A8 parameter of the pitch contour. Here, the distance of the location with 60% of the peak value of the energy contour with respect to the VOP location is calculated.

  • Parameterization of duration

In this work, two parameters have been used to parameterize the duration characteristics.

  • A17: Syllable duration

Tone is the phonologically contrastive use of pitch within a segment or a syllable. Tonal contrasts are realized not only by differences in pitch contour but also by systematic differences in duration [19]. Studies show that dynamic tones tend to be confined to phonetically long sonorous segments [3]. Also, vowels on low tones are longer than those on high tones, and conversely, vowels on rising tones are longer than those on falling tones [14]. Thus, syllable duration carries characteristic information about tonal languages and can be used as a discriminating cue for the tonal/non-tonal classification task. In this experiment, syllable duration is calculated by counting the total number of frames present in a syllable.

  • A18: Ratio of voiced region duration to total segment duration (Rhythm)

Here, the rhythm of each syllable is represented as the ratio of the voiced region duration within the syllable to the total syllable duration; it is approximated by the ratio of the duration of a vowel to the duration between two consecutive vowels. The mean numbers of vowel qualities [22] differ between tonal and non-tonal languages. Vowels can be classified as high/low or close/open, and the duration of each type of vowel will differ between tonal and non-tonal languages. Hence, rhythm can be used as a distinguishing parameter for this system.

  • A19: Vowel counts

The vowel inventories of tonal languages differ significantly in size from those of non-tonal languages, and hence vowel count can be an important parameter in this classification task [22]. Vowel counts are obtained by counting the number of VOPs present in the analysis unit (utterance). For a syllable, the vowel count is always equal to 1, so this parameter is insignificant at the syllable level. VOPs can be obtained for spontaneous speech using the VOP detection algorithm of [30].
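A small sketch of the two duration parameters A17 and A18 above follows, assuming a per-frame voiced/unvoiced mask is available for the syllable; the mask values are made up for illustration.

```python
import numpy as np

def duration_parameters(voiced_mask):
    """Syllable duration in frames (A17) and voiced-to-total ratio (A18)."""
    duration = len(voiced_mask)
    rhythm = voiced_mask.sum() / max(duration, 1)
    return duration, rhythm

mask = np.array([True, True, True, False, False, True])
print(duration_parameters(mask))
```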

Though some of the above-mentioned feature parameters (A4, A5, A6, A7, and A10) have previously been used in language identification tasks [24], the effect of these parameters in discriminating tonal and non-tonal languages has not been studied so far. This work analyzes their effect for the tonal/non-tonal pre-classification-based LID task.

3.1.2 Contour Modeling/Parameterization of Spectral Features

MFCC features represent human auditory perception and are predominantly used for LID tasks. They are known to represent vocal tract information. Researchers have observed that the vocal tract changes associated with different tones of languages, like Mandarin and Vietnamese, correlate strongly with MFCCs [12]. The literature [48] also suggests that proper recognition of tones depends not only on the tone production process but also on human perception ability; as MFCCs model human auditory perception, they serve as a suitable feature for the system. Besides, MFCCs carry information complementary to pitch [21], which is known to be a robust feature for language identification.

Extraction of MFCC features is done using the standard algorithm explained in [40]. In addition to MFCC features, the ∆ and ∆ − ∆ coefficients are also explored. In this experiment, the feature vectors for all the frames of a syllable are stacked together, and the contour corresponding to each cepstral coefficient is modeled as a linear combination of Legendre polynomials according to Eq. (9):

$$ f\left( t \right) = \mathop \sum \limits_{i = 0}^{M} a_{i} P_{i} \left( t \right) $$
(9)

where f(t) is the contour being modeled, Pi(t) is the ith Legendre polynomial, and coefficient ai represents a characteristic of the contour shape [23]: a0 corresponds to the mean, a1 to the slope, a2 to the curvature, and higher orders represent finer details of the contour. Here, Legendre polynomials of order four lead to a 35-dimensional MFCC feature and a 105-dimensional MFCC + ∆ + ∆ − ∆ feature for a syllable. In this study, the 35-dimensional MFCC features are used at the pre-classification stage, and the 105-dimensional MFCC + ∆ + ∆ − ∆ features are used at the individual language identification stage. Fitting of the Legendre polynomial to the 1st MFCC coefficient is shown in Fig. 4. In the present study, the parameters representing prosody and MFCC are concatenated in a row to obtain the combined feature vector of a syllable, and likewise for the other analysis units.
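A minimal sketch of the fitting in Eq. (9) is given below: one cepstral-coefficient contour is approximated by a fourth-order Legendre expansion using numpy, with the frame axis mapped onto [−1, 1] before fitting. The synthetic contour stands in for a real per-syllable MFCC trajectory.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_params(contour, order=4):
    """Fit Eq. (9): return a0..a_order for one cepstral-coefficient contour."""
    n = len(contour)
    t = 2 * np.arange(n) / (n - 1) - 1      # map frame indices onto [-1, 1]
    return legendre.legfit(t, contour, order)

# e.g., the 1st MFCC coefficient over a 20-frame syllable (synthetic)
c1 = np.cos(np.linspace(0, np.pi, 20))
print(legendre_params(c1))  # 5 parameters; 7 coefficients x 5 = 35 dimensions
```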

Fig. 4
figure 4

Fitting of Legendre polynomial to the first coefficient of MFCC of a syllable

Here, each of the parameters represents a dimension of the feature vector. Dimensions of different feature vectors are shown in Table 1.

Table 1 Dimensions of different feature vectors extracted from different analysis units

3.2 Database Used in Language Identification

The OGI Multi-Language Telephone Speech (OGI-MLTS) corpus [26] contains spontaneous and fixed-vocabulary utterances of 11 languages: Hindi, Farsi, French, English, German, Korean, Japanese, Spanish, Mandarin Chinese, Tamil, and Vietnamese. The utterances were spoken by individual speakers of each language over a telephone line, and the speech was sampled at 8 kHz. The set includes two tonal languages (Mandarin and Vietnamese) and nine non-tonal languages. In this experiment, 10 of the languages (all except Japanese) have been used to evaluate system performance. Since the OGI-MLTS database includes only two Indian languages (Hindi and Tamil), NITS-LD has been prepared to study identification of Indian languages. Table 2 shows the details of the NITS database. In this database, two languages (Manipuri and Mizo) are tonal and the remaining five are non-tonal. The data were collected from AIR news archives. The speakers of AIR news channels are highly professional and mature; hence, the speech samples collected from these archives are well articulated and standard in terms of pronunciation and speaking rate.

Table 2 Different features of OGI-MLTS database and NITS-LD

The database collected from the AIR news archives has some inherent problems: (i) the number of speakers for individual languages is small, and for some languages like Nagamese, noticeably so; (ii) there could be instances of overlapping speech from different speakers; and (iii) news headlines may have background music. Hence, proper care has been taken while preparing the database. A comparison between the OGI-MLTS database and NITS-LD is given in Table 2.

3.3 Feature Modeling for Language Identification

Feature modeling has been done using different approaches: generative, discriminative, and their combination. Specifically, GMM–UBM [34, 36, 41], ANN [10, 49], and i-vector based SVM [7, 8] have been used in this study. Here, the i-vectors of the training data are normalized by the within-class covariance normalization (WCCN) [16] technique to generalize the linear kernel of the SVM classifier.

3.4 Data Normalization

Features extracted from different utterances of different speakers need to be normalized to mitigate speaker variation, channel variation, etc. In this study, the data are normalized through z-normalization [27] for the GMM–UBM and i-vector based SVM classifiers. For the ANN classifier, each parameter of the feature vectors of the training and testing data is normalized to the range − 1 to + 1.
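The two normalization schemes mentioned above amount to the following per-dimension operations, shown here as a brief sketch on random stand-in data:

```python
import numpy as np

def z_norm(X):
    """Z-normalization per feature dimension (GMM-UBM / i-vector SVM inputs)."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)

def minmax_pm1(X):
    """Min-max scaling of each dimension to [-1, +1] (ANN inputs)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2 * (X - lo) / (hi - lo + 1e-9) - 1

X = 5 * np.random.randn(100, 18) + 3    # e.g., 18-dimensional prosody vectors
print(z_norm(X).mean(axis=0)[:3].round(3), minmax_pm1(X).min(), minmax_pm1(X).max())
```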

4 Experimental Results

4.1 Experimental Setup

All the experiments described in this paper have been performed on the NITS-LD and OGI-MLTS databases. The training data in the case of NITS-LD consist of speech from seven languages, each contributing 2–3 h. In all, there are 14 h of data: 8 h from non-tonal languages and 6 h from tonal languages. The OGI-MLTS training set is constituted by 10 h of data: 6 h from non-tonal languages and 4 h from the tonal category. A syllable-level approach has been adopted in this work; the 14 h of NITS-LD training data amount to 134,400 syllables, while the 10 h of OGI data yield 95,760 syllables. Since the performance of a language identification system varies greatly with the duration of the test utterances, performance has been analyzed for utterances of three durations: 30 s, 10 s, and 3 s. Here, 200 test utterances (100 from tonal languages and 100 from non-tonal languages) from each database are used to analyze the performance of the system. For all experiments presented in this paper, the training and testing data have been kept mutually exclusive.

The UBM for the GMM–UBM and i-vector based SVM systems is built using data drawn in part from the NITS-LD and OGI-MLTS databases. Data from 17 languages, each of 1 h duration, are used for this purpose; these data do not overlap with either the training or the testing data. An utterance is constituted of a variable number of syllables. Given a speech utterance with N syllables, i-vectors are computed with a context size of L syllables: the Baum–Welch statistics are computed on the sequence of syllables from Q − L to Q + L to obtain the i-vector for the Qth syllable. The corresponding sequence of i-vectors may be denoted by w = [w1, w2, …, wN]. The i-vector extractor follows the total variability space model, which is given by [9]

$$ s = m + Tw $$
(10)

where s is the supervector obtained for the speech segment with respect to the UBM, m is the mean supervector, T is the total variability subspace, and w is the compact i-vector representation. A context size of L = 3 syllables, i.e., a sliding window of 7 syllables with a shift step of 1 syllable, has been used. As i-vectors are extracted from short sequences of feature vectors, the total variability subspace, too, is trained on similarly short segments.
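The syllable-context windowing described above can be pictured with the sketch below, which gathers the frames of syllables Q − L to Q + L for each syllable Q (L = 3, a 7-syllable window, shifted by one syllable). The per-syllable feature matrices are random placeholders; the actual Baum–Welch statistics and T-matrix training are outside this sketch.

```python
import numpy as np

def context_windows(syllable_feats, L=3):
    """Stack the frames of syllables Q-L .. Q+L for each syllable Q."""
    N = len(syllable_feats)
    windows = []
    for q in range(N):
        lo, hi = max(0, q - L), min(N, q + L + 1)   # clip at utterance edges
        windows.append(np.vstack(syllable_feats[lo:hi]))
    return windows

# 10 syllables, each with 8-19 frames of (hypothetical) 39-dim features
feats = [np.random.randn(np.random.randint(8, 20), 39) for _ in range(10)]
print([w.shape[0] for w in context_windows(feats)])
```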

Figure 5 illustrates the pre-classification based LID system framework. It has a pre-classification stage for tonal/non-tonal classification based on score comparison. It is followed by the second stage, where individual language identification is done by finding the top-scored language.

Fig. 5
figure 5

Pre-classification-based LID system framework showing all intermediate stages of processing

Four key aspects are addressed in this paper through systematic experiments. Performance analysis of the system for different features has been carried out to study their discriminative power, along with the effectiveness of the proposed prosody parameters. Here, the existing parameters of prosody are denoted by F1; the existing + proposed parameters of prosody by F2; MFCCs by F3; the existing + proposed parameters of prosody + MFCCs by F4; and the existing + proposed parameters of prosody + MFCC + ∆ + ∆ − ∆ by F5. Performance analysis of different models has been done to find the most suitable model for the pre-classification stage, and also the most suitable combination of two models for the pre-classification-based language identification system. Experiments with features extracted from different analysis units of the speech sample have been performed to analyze the impact of considering syllables as basic units. Several experiments have also been carried out to show the importance of the pre-classification stage in an LID system.

4.2 Experimental Results of Pre-classification Stage

4.2.1 Syllable-Level Performance

In this section, the performances of all three classifiers are analyzed using the F1, F2, F3, and F4 features, extracted from syllabic units of the speech signal. In the case of the GMM–UBM modeling technique, system performance is greatly affected by the number of Gaussian mixtures. Therefore, experiments have been performed for different numbers of mixtures (2, 4, 8, 16, 32, 64, 128, and 256), and it is observed that 2 mixtures for F1, 32 for F2, 128 for F3, and 256 for F4 result in the highest individual accuracies. Here, the likelihood scores obtained for all the syllables of a test utterance are averaged to compute the score for that utterance, and the decision is taken in favor of the top-scored language. The final accuracy for a language is calculated as the percentage of correctly identified trials.
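The utterance-level decision rule reduces to averaging the per-syllable scores and picking the top class, as in this minimal sketch with made-up scores:

```python
import numpy as np

def classify_utterance(syllable_scores, labels=("tonal", "non-tonal")):
    """Average per-syllable class scores and decide for the top-scored class."""
    avg = np.asarray(syllable_scores).mean(axis=0)
    return labels[int(np.argmax(avg))], avg

scores = [[0.2, 0.8], [0.4, 0.6], [0.1, 0.9]]   # per-syllable likelihoods (toy)
print(classify_utterance(scores))
```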

Similarly, the performance of an ANN-based system varies with the network structure. Several experiments have therefore been carried out with different network structures, and it is observed that 5L-8N-1L for F1, 18L-29N-8N-1L for F2, 35L-50N-12N-1L for F3, and 53L-82N-35N-1L for F4 provide the highest individual accuracies for NITS-LD. The maximum number of epochs has been set at 500, and the tan-sigmoid transfer function is used. For the ANN, again, the output scores obtained for all the syllables of a test utterance are averaged to obtain the score for that utterance.

When the system is modeled using i-vector based SVM, performance depends on two major parameters: the number of Gaussian mixtures and the total variability (TV) matrix dimension. Therefore, a comparative analysis has been made using different values for these two parameters. Experimental results show that the highest individual accuracies are obtained with a linear SVM kernel and, for F1, 16 Gaussian mixtures with a 100-dimensional TV matrix; for F2, 32 mixtures with a 100-dimensional TV matrix; for F3, 128 mixtures with a 100-dimensional TV matrix; and for F4, 256 mixtures with a 200-dimensional TV matrix.

Here, the scores of the SVM model are transformed into posterior probabilities using an optimal sigmoid transformation, and the scores of all the syllables constituting a test utterance are then averaged to obtain the score of that utterance. The accuracies of the GMM–UBM, ANN, and i-vector based SVM models for the pre-classification task are given in Tables 3, 4, and 5, respectively. The average accuracy corresponding to each feature is calculated using the formula given in Table 3, where N is the total number of languages in the database (NITS-LD, N = 7; OGI-MLTS, N = 10) and Lj is the accuracy of the jth language. Some notable observations from Tables 3, 4, and 5 are given below:

Table 3 Accuracies of different languages of NITS-LD at pre-classification stage for GMM–UBM classifier
Table 4 Accuracies of different languages of NITS-LD at pre-classification stage for ANN classifier
Table 5 Accuracies of different languages of NITS-LD at pre-classification stage for i-vector based SVM classifier
  • The proposed parameters of prosody show improvements over the existing parameters for all three classifiers: 3%, 3.8%, and 6.2% for GMM–UBM; 3.4%, 3.7%, and 10% for ANN; and 7.4%, 11.9%, and 9.1% for i-vector based SVM, for 30 s, 10 s, and 3 s test data, respectively. It may therefore be concluded that the better parametric representation of the prosodic characteristics may have helped improve the performance of the pre-classification module.

  • F3 performs better than F2.

  • F4 provides the highest accuracy.

  • In the case of GMM–UBM and ANN, performance for the languages of the non-tonal category is better than for the tonal category; the reverse holds for the i-vector based SVM classifier.

  • The 30 s test utterances provide the highest accuracy, followed by 10 s and then 3 s. This can be explained as follows: in a syllable-level implementation, not every syllable of an utterance yields a score appropriate for the right classification decision, owing to anomalies that can creep in at any stage of the identification process for various reasons, such as spurious VOP detections, missed VOPs, and imperfect removal of silence frames. When utterances are short and syllables are few, the proportional presence of such syllables becomes more influential, leading to substantial degradation in performance; their influence diminishes as the number of syllables increases with the duration of the test utterance.

  • ANN outperforms all other classifiers considered in this study. GMM–UBM gives the next best accuracy, followed by i-vector based SVM.

Receiver operating characteristic (ROC) curves for the three classifiers using F4 features are shown in Fig. 6a–c. Each represents the (1 − specificity) versus sensitivity relation across the range of test trial scores. It can be observed that the 30 s utterances score high in terms of sensitivity and specificity across all three classifiers; the 10 s utterances give the next best readings, followed by the 3 s utterances. It may also be observed that, for the i-vector model, the effect of test data duration is more prominent, as the plots are farther apart from one another. This may be explained by the fact that i-vectors extracted from shorter speech segments tend to be noisy [29] and hence adversely affect performance.

Fig. 6
figure 6

ROC curves for a GMM–UBM, b ANN, c i-vector based SVM classifiers at the pre-classification stage

The same experiments have been performed on the OGI-MLTS database. The accuracies obtained for the GMM–UBM, ANN, and i-vector based SVM classifiers on OGI-MLTS using prosody, MFCC, and their combination are shown in Fig. 7a–c. For ANN, it is observed that 18L-29N-10N-1L for F2, 35L-50N-12N-1L for F3, and 53L-82N-35N-1L for F4 lead to the respective best performances. For the proposed parameters of prosody, GMM–UBM shows 4%, 6.5%, and 7.1% improvements; ANN shows 3.8%, 2.8%, and 2.8%; and i-vector based SVM shows 6.4%, 6.6%, and 9%, for 30 s, 10 s, and 3 s test data, respectively. Here, ANN performs best, followed by the i-vector based SVM classifier and then the GMM–UBM classifier. The F4 features and 30 s test data result in the highest accuracies.

Fig. 7
figure 7

Performance of a GMM–UBM classifier, b ANN classifier, c i-vector based SVM classifier at the pre-classification stage for different features obtained from the speech samples of the OGI-MLTS database

The specific observations made from Tables 3, 4, 5 and Fig. 7a–c are:

  • Proposed parameters of prosody show significant improvement over the existing parameters for all the classifiers. Improvement is observed for both NITS-LD and OGI-MLTS databases.

  • The F4 features provide the highest accuracy for pre-classifying the languages of both the NITS-LD and OGI-MLTS databases.

  • The ANN model outperforms the other models for both the NITS-LD and OGI-MLTS databases. Thus, it can be inferred that ANN is able to model more language-specific information than the other two models.

  • All the classifiers perform either better on NITS-LD than on the OGI-MLTS database or comparably. This may be attributed to the facts that the number of target languages is smaller in NITS-LD than in OGI-MLTS, and that the NITS-LD speech samples are noise free (whereas the OGI-MLTS data are noisy).

Another experiment has been performed using the F5 features for the pre-classification task. It is observed that the performance improvement obtained in the pre-classification module after inclusion of the Δ and Δ − Δ features is not significant, considering the cost of the increased feature dimension. Therefore, the Δ and Δ − Δ features have not been considered in the pre-classification module.

4.2.2 Comparison Among the Performances for the Features Extracted from Different Analysis Units of the Speech Signal

An attempt has been made to analyze the performance of the system built using F1, F2, F3, and F4 extracted from different analysis units, viz., disyllables, words, and the whole utterance. To obtain utterance-level performance, features extracted from the whole utterance are used; the dimensions of these feature vectors are given in Table 1. Since the calculation of the tilt parameters may give erroneous results for 30 s or 10 s utterances, only the existing parameters (F1) are explored at the utterance level. The performance of the system is also analyzed for features extracted from disyllable and word units of the speech samples; these features are fed to the three above-mentioned classifiers, and the corresponding feature vector dimensions are given in Table 1. Using F1, F2, F3, and F4 from the different analysis units, the performances of all three classifiers on NITS-LD are shown in Fig. 8a–c. Table 6 presents the classifier configurations that yielded the highest accuracies.

Fig. 8
figure 8

Comparative analysis of the performances for the features extracted from different analysis units of NITS-LD using a GMM–UBM classifier, b ANN classifier, c i-vector based SVM classifier

Table 6 Different parameters of the three classifiers that provide the maximum accuracies for utterance-, disyllable-, and word-level analysis at the pre-classification stage

Table 7 shows the performances of all the classifiers in pre-classifying the languages of the OGI-MLTS database when features are extracted from utterance, syllable, disyllable, and word units. It can be observed that features extracted from syllables are the most useful cues for discriminating the tonal and non-tonal languages of both databases. The tones of a language are coded at the syllabic level rather than the utterance level; therefore, tone parameters pertaining to individual syllables perform best. Figure 8a–c and Table 7 also show that the performances of all the classifiers are poorer for disyllables and words than for syllables on both databases.

Table 7 Performances of the three classifiers for the OGI-MLTS database when features are extracted from different segments

It may be reasoned that since most words of the tonal languages of the OGI-MLTS database (Mandarin Chinese and Vietnamese) [15] and of NITS-LD (Manipuri and Mizo) are monosyllabic in nature, the syllable-level features give the best results for all the classifiers. Therefore, syllables are adopted as the basic units for the subsequent experiments. In the pre-classification stage, ANN outperforms the other models; therefore, ANN is adopted for the pre-classification stage of the two-stage language identification.

4.3 Experimental Results for Individual Language Identification

In order to show the effectiveness of the pre-classification module on language identification, three different approaches have been adopted. In Case I, individual languages are identified without pre-classifying the languages into categories. In Case II and Case III, pre-classification of languages is done before identifying individual languages. Scoring and decision making are done in the same way as in the pre-classification stage.

4.3.1 Individual Language Identification Without Pre-classification

In this case, the pre-classification module is absent, and individual languages are identified as in a conventional LID system. In the training stage, separate models for each language are trained using the front-end feature vectors. The experimental results obtained for this case are given in Table 8. Analyzing the performances of the GMM–UBM, ANN, and i-vector based SVM classifiers, it is observed that GMM–UBM outperforms the other models when there is no pre-classification stage. Table 8 also shows the effectiveness of the proposed parameters of prosody for individual language identification: with the proposed parameters, GMM–UBM shows 8.4%, 10%, and 7.1% improvements; ANN shows 8.8%, 7.9%, and 9.8%; and i-vector based SVM shows 9.4%, 8%, and 7.4%, for 30 s, 10 s, and 3 s data, respectively, over the existing parameters of prosody. Table 8 further shows the effectiveness of the ∆ and ∆ − ∆ coefficients for the individual language identification task.

Table 8 Performance of Case I for NITS-LD

4.3.2 Individual Language Identification with Pre-classification

Once a speech sample is classified as tonal or non-tonal, it is fed to the next stage for final language identification. The second-stage identification has been performed in two different ways. In the first, denoted Case II, both correctly and wrongly detected tonal and non-tonal samples are passed to the next stage for final identification. In the second, denoted Case III, only correctly detected tonal and non-tonal samples are fed to the next stage. The second stage consists of two modules: one classifies the individual tonal languages and the other the individual non-tonal languages. Several experiments have been performed with ANN at the pre-classification stage (Case II and Case III) and one of the three classifiers in the second stage. Table 8 shows the accuracy values of Case II and Case III for different features and classifiers. The observations made from Tables 8 and 9 are:

Table 9 Accuracy comparison between Case II and Case III for NITS-LD
  • The fact that Case III has the best accuracy clearly shows that an accurate pre-classification stage can boost the performance of the system manifold. Even with the present pre-classification module (Case II), the performance of the system is still better than in Case I, where no such module is used.

  • With respect to Case I, each of the individual features and their combinations shows significant performance improvements in Case II and Case III. The combination of the ANN model in pre-classification and the GMM–UBM model in the second stage with F5 features provides the highest accuracy in the pre-classification-based language identification task. This combination shows 3.5%, 2.2%, and 2.1% improvements in Case II over Case I for 30 s, 10 s, and 3 s test data, respectively. Case III reflects the performance of the LID system when a 100% accurate pre-classification module is available; in this case, the same combination provides 7.2%, 6%, and 4.9% improvements with respect to Case I for 30 s, 10 s, and 3 s test data. It can therefore be inferred that this combination captures the most language-discriminating cues for this system.

Tables 10 and 11 show the confusion matrices for Case I and Case II, and Tables 12 and 13 show the confusion matrices for Case III. Results are given for 30 s test data and the F4 feature. From Table 10, it can be observed that even though the performance for Bengali is good, most of the languages are confused with it, and the accuracy for Manipuri is lower than for the other languages.

Table 10 Confusion matrix for Case I using GMM–UBM
Table 11 Confusion matrix for Case II when ANN is in pre-classification stage and GMM–UBM in second stage
Table 12 Confusion matrix for non-tonal languages when ANN classifier used in pre-classification stage and GMM–UBM in second stage (Case III)
Table 13 Confusion matrix for tonal languages when ANN classifier used in pre-classification stage and GMM–UBM in second stage (Case III)

Observations from Tables 12 and 13 are given below:

  • Even though the highest accuracy is achieved for the Bengali language, all other languages of the non-tonal category are confused with it.

  • Bengali is least confused with the other languages. Except for Hindi (one instance), no other language has been confused with Indian English.

  • The performance of the system in identifying Mizo is better than for Manipuri. This is possibly because Mizo has four tones, whereas Manipuri has just two; the features could consequently capture more information for Mizo than for Manipuri.

From Table 10, it can be observed that, in Case I, the confusion of each language with the others is reasonably high, and Table 14 confirms this, as implied by the low sensitivity and specificity values. It can be observed from Tables 10, 11, 14, and 15 that the pre-classification module helps reduce the confusion among the languages. However, it does not necessarily boost the performance for every language: errors made at the pre-classification stage are carried over to the next stage, decreasing the accuracies of certain languages. The correctly pre-classified languages, however, are identified with significantly improved accuracy, so that the overall performance of the system improves. In Case III, there is no possibility of confusing the languages of the non-tonal category with those of the tonal category, and therefore the sensitivity and specificity values for all the languages improve significantly (Tables 16 and 17).

Table 14 Sensitivity and specificity (in %) for Case I using GMM–UBM, 30-s duration test data and F4 feature
Table 15 Sensitivity and specificity (in %) for Case II when ANN is in pre-classification stage and GMM–UBM in second stage, 30-s duration test data and F4 feature
Table 16 Sensitivity and specificity (in %) for non-tonal languages when ANN classifier used in pre-classification stage and GMM–UBM in second stage, 30-s duration test data and F4 feature (Case III)
Table 17 Sensitivity and specificity (in %) for tonal languages when ANN classifier used in pre-classification stage and GMM–UBM in second stage, 30-s duration test data and F4 feature (Case III)

As a result, Case III reports the highest overall accuracy, followed by Case II and then Case I. It may thus be inferred that improving the accuracy of the pre-classification system helps reduce the confusion and thereby enhances the performance of the individual language identification module.

From Fig. 9, it can be observed that the Case III combination of ANN in the first stage and GMM–UBM in the second fares best among all combinations. However, for this combination, the accuracy for the OGI-MLTS database is lower than for NITS-LD. This result reveals two factors: (i) system performance depends largely on the number of target languages; since NITS-LD has fewer target languages than OGI-MLTS, the system performs better on NITS-LD; and (ii) NITS-LD consists of well-articulated, noise-free data, hence the better performance.

Fig. 9
figure 9

Comparative analysis between the performances of NITS-LD and OGI-MLTS database

However, from Fig. 9, it can also be observed that the improvements due to the pre-classification module are more significant for the OGI-MLTS database (8.1%, 7.4%, and 5.7%) than for NITS-LD (7.2%, 7%, and 4.9%). This could be because the OGI-MLTS database comprises the world's distinct languages, while NITS-LD includes closely related languages of the same origin.

5 Conclusions and Future Scopes

This work proposes a system that provides improved performance over existing language identification systems. The proposed system has a pre-classification stage to distinguish tonal and non-tonal languages. The performance of the system is analyzed for three cases, namely Case I, where individual languages are identified without any pre-classification module; Case II, where individual languages are identified from all detected tonal and non-tonal languages of the pre-classification stage; and Case III, where individual languages are identified from only the correctly detected tonal and non-tonal languages of the pre-classification stage. The method also eliminates the need for an automatic speech recognizer or any phonetic information about the languages. A comparison has been made among the performances of prosody, MFCC with ∆ and ∆ − ∆, and their combination at the syllabic level. The effectiveness of the proposed parameters of prosody on the pre-classification as well as the individual language identification has been examined; it is observed that with the proposed parameters, system performance at both stages improves significantly. This paper also demonstrates that, at the pre-classification stage, MFCC performs better than prosody, and their combination leads to further improvement; MFCC with ∆ and ∆ − ∆ proves to be the most effective among the features used for the individual language identification task. Seven languages of NITS-LD and 10 languages of the OGI-MLTS database have been used in the experiments, with features based on different analysis units of the speech signal. It can be inferred from the experiments that syllables are the most appropriate analysis segments for this pre-classification based language identification system. The performance of the proposed system has been analyzed for three test data durations using three classifiers (GMM–UBM, ANN, and i-vector based SVM). The experiments show that for OGI-MLTS, all three classifiers perform the identification tasks with slightly lower accuracy than for NITS-LD. In the pre-classification stage, the ANN classifier outperforms the other two for NITS-LD as well as OGI-MLTS, and the combination of the ANN classifier in the pre-classification stage and the GMM–UBM classifier in the second stage provides the highest accuracies for both databases.

However, this accuracy might still not be satisfactory for a practical system. Syllables are used here as the basic units because they are the most effective tone bearers; however, inaccurate syllable boundaries may cause errors in the identification system. As syllables are identified from the locations of VOPs, the accuracy of the VOP detection algorithm may affect system performance; this aspect can be explored in future work. Also, the tonal languages used in this experiment are all monosyllabic in nature. In monosyllabic languages, most syllables bear tone, but in disyllabic or polysyllabic tonal languages, tones are not carried by all syllables, which would demand further processing. In future, an extra module can be added to detect tone bearing syllables to improve the performance of the overall system.