1 Introduction

The main purpose of an LID system is to automatically recognize the spoken language from a given portion of speech. One application of LID systems is the routing of an incoming phone call to an appropriate human switchboard operator who is well versed in the particular language. In multilingual countries like India, a multilingual spoken-dialog system that can serve in multiple languages finds application in various fields (Mary 2006). In this type of multilingual operation, the machine should be capable of distinguishing among different languages. Several approaches and computationally advanced methods with state-of-the-art performance have been proposed in the literature for the language-distinguishing task.

The number of target languages has a direct bearing on the performance of an LID system. Moreover, in countries like India, where the languages share common phoneme sets, distinguishing among languages becomes more challenging. Several researchers have attempted to identify the closely related Indian languages. In one such case, Jothilakshmi et al. (2012) presented a hierarchy-based LID system for 9 Indian languages using spectral features, namely MFCC, delta/double-delta and SDCs of MFCC. In the first level, the languages were divided into two language families, namely Indo-Aryan and Dravidian, and then individual languages were identified within the corresponding family. They studied the efficacy of the two-level LID system in discriminating languages having the same origin. The authors also studied the effectiveness of MFCC features and reported an accuracy of 80.56% for 9 target languages with the GMM-UBM model and MFCC + delta-double-delta features. However, they did not study the complementarity of prosodic features with MFCC for a hierarchical Indian language identification system.

Reddy et al. (2013) proposed another LID system for 27 languages of Indian origin using spectral (MFCC) and prosodic features. Prosodic features extracted at different levels, namely syllable, word and phrase, were used, and the final score was obtained by combining the scores of the different levels. The complementary nature of prosodic and spectral features at the utterance level was exploited, and the evidences from spectral and prosodic features were fused to obtain better language recognition accuracy. The average accuracy for the identification of the 27 Indian languages was 62.13%. To build a more accurate system with a larger number of target languages, a module may be added to initially pre-classify the languages into different categories or sub-language families. A reliable pre-classification module is also required to accurately identify languages which are closely related or share the same origin.

In one such case, Wang et al. (2007) outlined a tonal/non-tonal pre-classification-based LID system for 16 distinct world languages using prosodic features only and reported accuracies of 77.9% and 49.2% for 30 s and 10 s test data respectively. However, the system depends on phonetically labelled data, which is not always available and requires expertise in linguistics. In countries like India, where language diversity is very high, it is even more difficult to obtain phonetically labelled data for all the languages. Also, Wang et al. (2007) examined the efficacy of the pre-classification module only on distinct world languages. No study, however, has been reported on the pre-classification of closely related Indian languages.

Additionally, in Wang et al. (2007) only a few prosodic parameters, like pitch and duration, have been used for both the pre-classification and the pre-classification-based LID task, and the features are extracted considering the whole utterance as a unit. However, the literature confirms the alignment of tonal events with syllables for tonal languages (Atterer and Ladd 2004; Zhang 2014). Also, most Indian languages are syllable-centric (Singh 2006), so language-specific information is manifested at the syllable level itself. For tonal languages, the pitch change within a syllable follows a regular pattern (Maddieson et al. 2013). In one of our recent works (China Bhanja et al. 2018), we observed the usefulness of syllable-level features for a pre-classification-based LID system; some new prosodic parameters were also proposed to boost the performance of the tonal/non-tonal language classification task. However, the system performance was analysed only for seven Northeast (NE) Indian languages. Moreover, the tonal languages included in that database mostly have monosyllabic words. For di-syllabic or poly-syllabic tonal languages, all the syllables may not carry tone information; for such languages, features obtained from other levels, like words (three consecutive syllables) or phrases (multi-word), may provide better tone information. Reddy et al. (2013) observed the complementarity among different levels of prosodic features when identifying individual languages. However, no study has analysed either the complementarity of syllable-, word- and phrase-level prosody or the combining effects of the different levels of prosody for tonal/non-tonal classification or a tonal/non-tonal pre-classification-based LID system. The literature reveals that MFCC carries tone information (Le et al. 2009; Ryant et al. 2014). However, for a tonal/non-tonal pre-classification-based LID system, frame-level analysis of MFCC has not been explored so far. Since tones in tonal languages lie within a syllable (China Bhanja et al. 2018), MFCC frames corresponding to a syllable are further modelled using Legendre coefficients to obtain a syllable-level characterization. However, for di-syllabic or poly-syllabic tonal languages, neither utterance- nor syllable-level MFCC modelling alone may be the most suitable for capturing the tone information. To study this aspect, MFCC feature modelling can be analysed at multiple levels so as to explore both the local and global characteristics of the speech signal. Its combination with multi-level prosodic features is also studied.

MFCCs, though the most extensively used features for language identification (Burgos 2012), are sensitive to background noise, mismatched acoustic training and testing environments, room reverberation, etc. In another study, researchers showed that the performance of MFCC features reduces significantly as noise power increases (Li and Huang 2011). Several researchers (Li and Narayanan 2014; Sadjadi and Hansen 2015) have worked, in recent times, towards developing front-end systems robust to noise and mismatched acoustic training and testing environments. In Sadjadi and Hansen (2015), the authors proposed a noise-robust LID system which works well on noisy data of the DARPA-RATS database, utilizing MHEC features extracted from the frames of an utterance. However, no study discusses whether MHEC carries tone information that could be useful for discriminating tonal languages from non-tonal ones. Since MHEC carries finer details of human auditory perception, which may be useful for identifying different tones, it may discriminate tonal from non-tonal languages with higher accuracy. A syllable-level representation of the MHEC feature may provide better tonal/non-tonal language-discriminating information. Multi-level MHECs may provide complementary information, and they may be used as complementary features with prosody to improve the system performance at the pre-classification stage, which would further improve the overall performance. Further, this paper studies the system performance for two different datasets that have been collected over two different channels. These experiments may thus help study the effectiveness of the acoustic features under two different channel conditions.

In the back-end, significant advancement can be observed in the context of the LID task. Several research efforts have been made in the form of Joint Factor Analysis (JFA), the i-vector based approach (Dehak et al. 2011), etc. The i-vector based approach uses a GMM-UBM to model the acoustic features, along with various scoring methods, namely probabilistic linear discriminant analysis (PLDA) (Prince and Edler 2007), SVM (Dehak et al. 2011) and cosine distance (CD). In recent studies, the effectiveness of DNNs (Richardson et al. 2015a, b; Mounika et al. 2016) has also been examined for the LID task. In the existing literature (Dehak et al. 2011; Prince and Edler 2007; Richardson et al. 2015a, b; Mounika et al. 2016), i-vector based SVM and DNN are used to model frame-level features. Also, in Martinez et al. (2013), researchers presented a method whereby an utterance is first divided into fixed-length segments and the segment-level features are then used to compute the i-vector of that utterance. However, for tonal languages, features should preferably be extracted syllable by syllable. The use of multi-level features and subsequent score combination can also be helpful. Syllable-level or multi-level features have not been used with DNN or i-vector SVM frameworks so far.

This paper particularly focusses on the identification of closely related Indian languages. The influence of one language on another is very high in India. Also, there are several under-resourced and/or well-resourced languages in India. Very few databases (ciil-spokencorpus 2009; Maity et al. 2012) have been prepared for Indian languages. Moreover, the existing Indian-language databases either include a small number of target languages or are not commercially available. This makes it important to prepare a database which covers many more Indian languages.

2 Motivation

It can be observed from the literature that the existing tonal/non-tonal pre-classification-based LID system (Wang et al. 2007) for distinct world languages depends on phonetically labelled data. Also, only syllable-level MFCCs and prosody have been explored in a similar system for NE Indian languages (China Bhanja et al. 2018). No study of such a system is available using multi-level MFCCs, which carry useful information for di-syllabic or poly-syllabic tonal languages, and no work studies the effectiveness of MHEC features or their complementarity with prosody in any pre-classification-based LID task. No pre-classification-based LID system has so far explored multi-level analysis of MHECs and prosody. Additionally, most of the LID systems reported so far (Sadjadi and Hansen 2015; Dehak et al. 2011; Prince and Edler 2007; Richardson et al. 2015a, b; Mounika et al. 2016) have emphasized the modelling of utterance-level features extracted from the frames of an utterance. Modelling of multi-level features using i-vector based SVM or DNN has not been explored. Commercially available databases for Indian languages are very few in number. This paper tries to address the above-mentioned issues. The main contributions of this paper may be summarized as follows:

  • A tonal/non-tonal pre-classification-based LID system has been developed for languages of Indian origin using multi-level prosodic and spectral features. This system does not use phonetically labelled data.

  • A comparative study of frame-level and syllable-level spectral features has been carried out, along with a performance analysis of MHEC and MFCC features. The complementarity of MHEC and MHEC + SDC with prosody has also been studied.

  • A comparative performance analysis of systems based on multi-level prosody and multi-level (prosody + spectral) features with respect to systems based on syllable-level features has been carried out in this work.

  • A comparative analysis of the various modelling techniques, namely GMM-UBM, i-vector based SVM, ANN and DNN has been done for a pre-classification-based LID system using multi-level features.

  • NITS-LD (studio quality) has been prepared in-house; it covers twelve closely related Indian languages of different language families, namely Bengali, Assamese, Indian English, Hindi, Nagamese, Odia, Tamil, Mizo, Punjabi, Manipuri, Bodo, and Gojri. The data has been acquired from the news archives of AIR (All India Radio). Moreover, the systems are also evaluated on the OGI-MLTS (telephonic) database, which consists of distinct world languages of different families.

The rest of the paper is organized as follows: Sect. 3 describes the databases; Sect. 4 describes the proposed language identification system, including its architecture, features and language modelling techniques; Sect. 5 presents the experimental results and analysis; and Sect. 6 concludes the work and mentions future directions.

3 Dataset details

In this work, two databases, namely the OGI-MLTS database and NITS-LD, have been used to validate the systems.

3.1 OGI-MLTS database

The OGI-MLTS speech corpus (Muthusamy et al. 1992) is made up of spontaneously spoken fixed-vocabulary utterances in 11 languages: Spanish, Farsi, Mandarin Chinese, French, English, German, Vietnamese, Korean, Japanese, Tamil and Hindi. Japanese has not been considered in the experimentation because of the uncertainty of its tonal/non-tonal nature (Beckman and Pierrehumbert 1986). Recordings from 90 speakers of each language have been used to prepare the database. It was collected over a telephone line at a sampling frequency of 8 kHz. It covers two tonal languages (Vietnamese and Mandarin) and nine non-tonal languages, and the systems have therefore been evaluated on 10 languages (after omitting Japanese). Only two Indian languages are covered in the OGI-MLTS database, which is why NITS-LD, covering 12 Indian languages, has been prepared.

3.2 NITS-LD

NITS-LD includes 12 Indian languages, namely Bengali (Be), Assamese (As), Hindi (Hi), Indian English (En), Nagamese (Na), Odia (Od), Tamil (Ta), Manipuri (Ma), Mizo (Mi), Bodo (Bo), Gojri (Go) and Punjabi (Pu).

In this database, five languages (Manipuri, Mizo, Bodo, Gojri and Punjabi) are tonal and the remaining seven are non-tonal. AIR news archives have been used for data preparation, which involve mature, highly professional speakers. Thus, the speech extracts are all well articulated and spoken with standard speaking rate and pronunciation. Table 1 compares the OGI-MLTS and NITS-LD databases. The database prepared from the speech samples of AIR news archives has some limitations, such as a smaller number of speakers for some of the languages and less vocabulary variability across sessions. In short, data variability across speakers and words is limited. In order to have a sufficiently large training set, a subset of the Indic database is used in addition to NITS-LD. From the Indic database, around 5 h of data for each of 5 languages, namely Hindi, Bodo, Odia, Tamil and Manipuri, have been used in this experiment. The details of the Indic database are given in Baby et al. (2016).

Table 1 Comparison between OGI-MLTS and NITS-LD databases

4 Language identification system

A pre-classification-based LID system with a tonal/non-tonal pre-classification module has been developed in this work to achieve better performance than the baseline single-stage LID system. This section describes the proposed pre-classification and pre-classification-based LID systems prepared using multi-level prosody and spectral features. It also describes the extraction and parameterization of the different features and the language modelling techniques used in this experiment.

4.1 Language pre-classification

The proposed tonal/non-tonal language pre-classification system is described in this section. In one of our recent works (China Bhanja et al. 2018), such a system was prepared using syllable-level prosody and MFCC features. In this paper, a systematic analysis of system performance has been carried out for syllable-, word- and phrase-level features. Score combination using models based on multi-level features is also explored. Two systems have been developed; both use multi-level prosody but differ in the level at which spectral features are extracted. System-I uses prosody of the syllable, word and phrase levels, together with spectral features, particularly MFCC, MFCC + SDC, MHEC and MHEC + SDC, extracted from all the speech frames constituting an utterance (utterance level). The second system, system-II, uses multi-level spectral features along with multi-level prosody.

4.1.1 Description of system-I

In system-I, we propose multi-level analysis of prosodic features and utterance-level analysis of spectral features for tonal/non-tonal pre-classification of different languages. The working of system-I is shown in Fig. 1. The first step of this system is to detect the vowel onset point (VOP) locations (Prasanna et al. 2009), which correspond to the time instants at which vowel regions start in a speech signal. The speech samples between two consecutive VOPs are taken to constitute a syllable. Pitch is calculated (Talkin 1995) from the spontaneous speech signal to obtain the pitch contour of the whole utterance. The energy contour is formed by the energy values calculated from 10 ms frames of the utterance. A 5th-order median filter is used to smoothen the pitch and energy contours. The identified VOPs are then mapped onto the smoothened contours, and the contour portions between consecutive VOPs are considered as syllables. A duration feature is calculated for every syllable, based on the number of frames between its two VOPs. The pitch and energy contours and the duration feature pertaining to every syllable are all parameterized, and the parameters are concatenated row-wise to form the final prosodic feature vector for that syllable. As discussed in Mary (2006) and Reddy et al. (2013), word-level features, obtained from the preceding and succeeding syllables along with the present syllable, effectively represent temporal dynamics. Thus, a word-level feature vector is extracted for every syllable "M", as shown in Fig. 1. System performance has also been analysed for phrase-level prosody, obtained by considering a sequence of 12 syllables of the speech signal. The reason for choosing 12 syllables to represent a phrase is that a gross observation of the languages of NITS-LD suggests that a 3 s utterance may contain 9-15 syllables, and the mean of the phrase-length distribution is around 12 syllables for most of the languages.
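
For concreteness, a minimal Python sketch of this segmentation step is given below. It assumes the per-frame pitch and energy contours and the VOP frame indices are already available from the detectors cited above (Prasanna et al. 2009; Talkin 1995); the function name and frame rate are illustrative only.

```python
import numpy as np
from scipy.signal import medfilt

def syllable_contours(pitch, energy, vops, frame_rate=100):
    """Split smoothed pitch/energy contours into syllable segments.

    pitch, energy : per-frame contours of the utterance (10 ms frames)
    vops          : frame indices of the detected vowel onset points
    Returns one (pitch segment, energy segment, duration) triple per syllable.
    """
    pitch_s = medfilt(pitch, kernel_size=5)    # 5th-order median smoothing
    energy_s = medfilt(energy, kernel_size=5)
    syllables = []
    for start, end in zip(vops[:-1], vops[1:]):
        dur = (end - start) / frame_rate       # duration from the frame count
        syllables.append((pitch_s[start:end], energy_s[start:end], dur))
    return syllables
```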

Fig. 1

Working of proposed system-I

For 10 s and 30 s data, the number of syllables will always be higher than 12. Thus, a sequence of 12 syllables may be taken as the phrase length for all three durations of data. For the phrase-level analysis, the combination of three parameters, namely the Δ pitch contour, Δ energy contour and duration contour, has been used because, in tonal languages, the change in pitch within a syllable is an important phonological cue, and changes in energy and duration correlate with the pitch change. A shift of one syllable at a time is considered, so that the number of phrase-level feature vectors equals the number of syllables in the utterance. In case the number of syllables is fewer than 12, as an approximation, the last syllable's feature vector is replicated to meet the required count. In this manner, a phrase-level feature vector is obtained for every syllable "M". The spectral features (MFCC, MFCC + SDC, MHEC and MHEC + SDC) are extracted from the utterance frames; only the voiced frames of the utterance are considered for further processing. After normalizing the different features, seven separate GMM-UBM models are trained: for syllable-level prosody, word-level prosody, phrase-level prosody, utterance-level MFCC, MFCC + SDC, MHEC and MHEC + SDC. In the testing phase, the average prosody scores corresponding to the syllables, words and phrases of an utterance are obtained separately. Thus, three different scores are obtained for an utterance, for syllable-, word- and phrase-level prosody, and are used in different combinations to form score-feature vectors. Similarly, for the utterance-level spectral features, the scores obtained from all the frames of an utterance are averaged to obtain the final score for that utterance. At a time, only one of the spectral features, i.e. MFCC, MFCC + SDC, MHEC or MHEC + SDC, is used along with multi-level prosody. The dotted boxes in Fig. 1 indicate the features besides MFCCs which have been explored in this work. Thus, four different scores, one for the spectral feature and three for prosody, are obtained for every test utterance. Based on the classification scores, the individual performances of the different features are calculated. To analyse the effect of score combination, experiments have been carried out in three phases. In the first phase, the scores obtained for syllable- and word-level prosody are stacked together to form a two-dimensional feature vector, and a classifier is trained on these feature vectors to study this score combination. In the second phase, a third score, from the phrase-level prosody model, is concatenated with the two-dimensional score vector of the first phase, yielding a three-dimensional feature vector, and another classifier is trained on these features to study the complementarity of the phrase-level score. Lastly, the score feature vector is augmented with one of the utterance-level spectral model scores, and a third classifier is built on the four-dimensional features. Three types of classifiers are explored with the score-based features: logistic regression (LR) (Lee et al. 2011), SVM (Campbell et al. 2006) and random forest (Casale et al. 2008), as sketched below. The scores are used as inputs to one of the classifiers at a time. The test-set data has been split into two parts, one to train the classifier models as described above and the other to test them.
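
The three-phase score combination can be sketched as follows with scikit-learn; the score arrays here are random stand-ins for the per-utterance GMM-UBM scores described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
syl, wrd, phr, spec = rng.normal(size=(4, n))   # stand-in model scores
labels = rng.integers(0, 2, size=n)             # 1 = tonal, 0 = non-tonal

# Phase 1: syllable + word prosody; phase 2 adds phrase prosody;
# phase 3 further appends one utterance-level spectral score.
phases = {
    "phase-1": np.column_stack([syl, wrd]),
    "phase-2": np.column_stack([syl, wrd, phr]),
    "phase-3": np.column_stack([syl, wrd, phr, spec]),
}
for name, X in phases.items():
    for clf in (LogisticRegression(),
                SVC(kernel="poly"),
                RandomForestClassifier(n_estimators=100)):
        clf.fit(X, labels)                      # train on the score features
```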

4.1.2 Description of system-II

Unlike the use of only utterance-level spectral features as in system-I, spectral features are analysed at multiple levels in system-II. Figure 2 shows the working of system-II.

Fig. 2

Working of proposed system-II

Here, similar to system-I, multi-level prosody is used. Spectral features, namely MFCC, MFCC + SDC, MHEC and MHEC + SDC, are extracted at the syllable, word and phrase levels. A speech activity detection algorithm is used to find the speech frames. For syllable-level features, the MFCCs of all speech frames of a syllable are further parameterized coefficient-wise using Legendre polynomials; that is, the contours of the different dimensions of the MFCC features of a syllable are parameterized. The other spectral features are parameterized using Legendre polynomials in the same way. The final feature vector of a syllable or word is obtained by concatenating the parameters of prosody with those of MFCC, MFCC + SDC, MHEC or MHEC + SDC. In the case of phrase-level analysis, Δ MFCC or Δ MHEC is used along with the phrase-level prosody.

Separate GMM-UBM models are trained for syllable-level MFCC, syllable-level (MFCC + SDC), syllable-level MHEC, syllable-level (MHEC + SDC), syllable-level (prosody + MFCC), syllable-level (prosody + MHEC), word-level (prosody + MFCC), word-level (prosody + MHEC), phrase-level (prosody + MFCC) and phrase-level (prosody + MHEC). Besides this, three other modelling techniques have been explored using the same set of features, as described in Sect. 4.6. As in system-I, individual performances are first calculated for the various feature levels; subsequently, classifiers are trained on score-based features in two different phases.

4.2 Pre-classification-based LID system (system-III)

In the pre-classification-based LID system, languages are first pre-classified into tonal and non-tonal categories, and individual languages are then identified within their respective categories. Depending on the decision of the pre-classification stage, an utterance is processed by either the tonal or the non-tonal language identification module in the second stage. Figure 3 depicts the pre-classification-based LID system. The first stage is constituted by a tonal/non-tonal classification module (i.e. system-II), followed by individual language models in two modules corresponding to the tonal and non-tonal categories. Multi-level prosody and spectral features are used in the second stage of the system. The language modelling technique (described in Sect. 4.6) that provides the best performance for the pre-classification task is used for the individual language identification task as well. Score combination of the models developed using features of the various levels is done in a way similar to that in system-II. Final scores from the individual language models are then obtained, and the decision is taken based on the top-scoring language.

Fig. 3

Working of proposed pre-classification-based LID system (system-III)

4.3 Feature extraction

In one of our recent works (China Bhanja et al. 2018), some prosodic parameters were proposed and found to be effective for tonal/non-tonal discrimination of seven NE Indian languages. In this work, their effectiveness in pre-classifying Indian languages has been investigated for a larger set of 12 languages.

4.3.1 Syllable-level prosody

The parameters used to represent syllable-level prosody are (China Bhanja et al. 2018): F1: mean pitch; F2: pitch changing level; F3: amplitude tilt of the pitch contour; F4: duration tilt of the pitch contour; F5: change in pitch; F6: distance of the pitch contour's peak from the VOP; F7: distance from the VOP of the pitch contour point reading 60% of the peak value; F8: mean energy; F9: change in log energy; F10: normalized energy changing level; F11: distance of the energy contour's peak from the VOP; F12: amplitude tilt of the energy contour; F13: duration tilt of the energy contour; F14: distance from the VOP of the energy contour point reading 60% of the peak value; F15: syllable duration; F16: ratio of the duration of the voiced segment to the total segment duration (rhythm).

Tonal languages have a definite set of tones; for example, Manipuri has two tones, Mizo has four, Mandarin has four, and so on. Generally, in a level-tone system, tones are distinguished by pitch level, like high (H) or low (L), relative to each other. In a contour-tone system, tones are distinguished by their pitch contours, like fall, rise, fall-rise or rise-fall, relative to each other. These contours help in the characterization of the different languages. Amplitude tilt (F3) and duration tilt (F4) capture the contour dynamics (Adami et al. 2003), as illustrated in the sketch below. Some work suggests a relation between tone height, jaw movement and lingual articulation, and has studied their roles in effecting different degrees of emphasis. Tone height, represented by the change-in-pitch parameter (F5) (Dediu and Ladd 2007), can be used to differentiate tonal and non-tonal languages. Also, the pitch contour peak and the onset point of the accented syllable are consistently aligned for some non-tonal languages, namely Greek and English (Qu and Goad 2012), whereas in Mandarin, a tonal language, the peak is aligned with the tone-bearing syllable's offset point (Reddy et al. 2013). Therefore, the distance of the peak location from the VOP may provide information for discriminating tonal from non-tonal languages. On the other hand, for some Tibeto-Burman languages, like Dimasa and Mizo (Sarmah and Wiltshire 2010), the influence of the place features of consonants penetrates substantially into the contour of the next tone. Also, the interaction between tones and segments (syllables) may cause a shift in the tonal onset point (Gandour 1977). It has been experimentally found that this propagation of the tonal onset point into the segment can be roughly parameterized by the distance from the VOP of the pitch contour point reading 60% of the peak value. This parameter, denoted F7, may be used for language pre-classification.
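
A sketch of the tilt computation for a syllable contour is given below; this follows one common formulation of the tilt parameters, and the exact definitions used in the paper (Adami et al. 2003) may differ in detail.

```python
import numpy as np

def tilt_params(contour):
    """Amplitude (F3/F12) and duration (F4/F13) tilt of a syllable contour."""
    p = int(np.argmax(contour))                   # location of the contour peak
    a_rise = abs(contour[p] - contour[0])         # rise amplitude
    a_fall = abs(contour[p] - contour[-1])        # fall amplitude
    d_rise, d_fall = p, len(contour) - 1 - p      # rise and fall durations
    amp_tilt = (a_rise - a_fall) / (a_rise + a_fall + 1e-9)
    dur_tilt = (d_rise - d_fall) / (d_rise + d_fall + 1e-9)
    return amp_tilt, dur_tilt
```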

Stress may be present to a certain extent in any language. In an utterance, some syllables are perceptually more prominent than others; these are termed stressed syllables. Phonetic correlates of other features, like duration and pitch, also contribute to stress. Stress manifestation in the speech signal is often language-dependent, and the energy parameter is commonly used to quantify it. Tonal languages which have register tones manifest a correlation between tone and stress (Dusan and Deng 1998); otherwise, most tonal languages do not have obvious stress (Qu and Goad 2012). Non-tonal languages, like English, on the other hand, have definite stress. Thus, stress is a language-specific characteristic and can be used as a feature complementary to the pitch contour. Stress is computed using the energies of all the voiced speech frames of the syllable. In this work, seven parameters (F8-F14) have been used for stress quantization and two parameters (F15 and F16) parameterize the duration characteristics. Each individual parameter denotes a single dimension, so a 16-dimensional prosodic feature vector is used for each syllable, and each word unit is represented by 48-dimensional prosody.

4.3.2 Phrase-level prosody

In order to obtain the pitch feature at the phrase level (Δ pitch contour), the average of the successive differences of the pitch values within each syllable is calculated. These average values for the sequence of 12 syllables represent the Δ pitch contour of that phrase. The phrase-level energy contour (Δ energy contour) is obtained in a similar way. The phrase-level duration contour is formed from the duration values of the 12 syllables in sequence. These parameters are then concatenated in a row, resulting in 36-dimensional phrase-level prosody.
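
A small sketch of this construction (for the Δ pitch component) is shown below; the syllable contours are assumed to come from the segmentation of Sect. 4.1.1.

```python
import numpy as np

def phrase_delta_contour(syllable_contours):
    """One Δ value per syllable: the mean of the successive differences
    of the contour values within that syllable (12 syllables -> 12-dim)."""
    return np.array([np.mean(np.diff(c)) for c in syllable_contours])

# Example with dummy pitch contours for 12 consecutive syllables:
rng = np.random.default_rng(1)
pitch_syls = [rng.normal(200, 10, size=rng.integers(8, 20)) for _ in range(12)]
print(phrase_delta_contour(pitch_syls).shape)    # (12,)
# The 36-dim phrase vector concatenates Δ pitch, Δ energy and durations.
```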

4.3.3 MFCC and MFCC + SDC features

MFCCs are the most widely used features for any LID task. They model the vocal tract information. Studies have revealed that the vocal tract changes associated with the different tones of languages like Mandarin and Vietnamese have a strong correlation with MFCC features (Le et al. 2009). MFCC also reflects human auditory perception. Besides, MFCCs have been found to carry information complementary to pitch (Le et al. 2009), which is a robust feature widely used for the LID task. MFCC feature extraction is performed according to the standard algorithm described in Steven and Mermelstein (1980). In this experiment, the first 7 MFCC coefficients have been considered, as in Yin et al. (2006).

SDC features are used to exploit the temporal context information of the speech signal. They are computed using the standard 7-1-3-7 (N-d-P-k) configuration (Torres-Carrasquillo et al. 2007). The static cepstral coefficients are augmented with these SDC features, resulting in 56-dimensional feature vectors (7 static + 7 × 7 shifted deltas).
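
The SDC computation with the 7-1-3-7 configuration can be sketched as follows (edge frames are padded by repetition, an implementation choice not specified in the text):

```python
import numpy as np

def sdc(cep, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra: delta_i(t) = c(t + i*P + d) - c(t + i*P - d)
    for i = 0..k-1, giving N*k = 49 values per frame; stacking the N static
    coefficients on top yields the 56-dimensional vectors used here."""
    assert cep.shape[1] == N
    T = cep.shape[0]
    padded = np.pad(cep, ((d, (k - 1) * P + d), (0, 0)), mode="edge")
    blocks = [padded[i * P + 2 * d : i * P + 2 * d + T]
              - padded[i * P : i * P + T] for i in range(k)]
    return np.hstack(blocks)                     # (T, 49)
```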

4.3.4 MHEC and MHEC + SDC features

The other spectral feature used in this experiment is MHEC (Sadjadi and Hansen 2015), where, instead of the conventional triangular filter banks, Gammatone filter banks are employed to replicate the response of the human cochlea (Patterson et al. 1987). The amplitude modulation spectrum of a subband is calculated from the Hilbert envelope of the output of the Gammatone filter bank. Here, 32-channel Gammatone filter banks are used, with centre frequencies uniformly spaced on the equivalent rectangular bandwidth (ERB) scale between 200 and 3400 Hz. The Hilbert envelope is divided into frames of 20 ms duration with a 50% shift. Each frame is subjected to a Hamming window to minimize the abrupt discontinuities at the edges due to truncation of the signal; the windowing also reduces the correlation between adjacent frames. In order to compress the dynamic range of the envelope, root compression by a factor of 1/15 is applied. The first 7 coefficients (including c0) of the MHEC are used in our experiment. The SDC of MHEC is calculated in the same way as explained in Sect. 4.3.3, giving a 56-dimensional feature vector for the SDC features of the 7 MHEC coefficients.
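
A rough sketch of this pipeline is given below. It assumes SciPy >= 1.6 for `scipy.signal.gammatone`, and it applies the root compression to the frame average of the windowed envelope; Sadjadi and Hansen's exact recipe may differ in such details.

```python
import numpy as np
from scipy.signal import gammatone, lfilter, hilbert, get_window
from scipy.fft import dct

def erb_space(lo=200.0, hi=3400.0, n=32):
    """Centre frequencies uniformly spaced on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inv(np.linspace(erb(lo), erb(hi), n))

def mhec(signal, fs=8000, n_ch=32, n_ceps=7, frame=0.02, shift=0.01):
    flen, fstep = int(frame * fs), int(shift * fs)
    win = get_window("hamming", flen)
    n_frames = max(0, 1 + (len(signal) - flen) // fstep)
    feats = np.empty((n_frames, n_ch))
    for ch, fc in enumerate(erb_space(n=n_ch)):
        b, a = gammatone(fc, "iir", fs=fs)            # 4th-order gammatone
        env = np.abs(hilbert(lfilter(b, a, signal)))  # Hilbert envelope
        for t in range(n_frames):
            seg = env[t * fstep : t * fstep + flen] * win
            feats[t, ch] = np.mean(seg) ** (1.0 / 15.0)  # root compression
    return dct(feats, type=2, norm="ortho", axis=1)[:, :n_ceps]  # c0..c6
```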

4.4 Contour modelling of spectral features for system-II

After extracting the spectral features for all frames of a syllable, a contour of every cepstral coefficient is obtained from its values across all the frames of the syllable. It is modelled as a linear combination of Legendre polynomials, as given in Eq. (1).

$$ f(t) = \sum_{i=0}^{M} a_i P_i(t) $$
(1)

where f(t) represents the contour, \(P_i(t)\) is the ith Legendre polynomial and \(a_i\) encodes a characteristic feature of the contour shape (Martinez et al. 2013); \(a_0\) is the mean, \(a_1\) represents the slope, \(a_2\) defines the curvature, and the finer details of the contour are encoded by the higher-order coefficients. With fourth-order Legendre polynomials, this method gives 35-dimensional MFCCs, 280-dimensional MFCC + SDC, 35-dimensional MHECs and 280-dimensional MHEC + SDC for a syllable. To represent word-level MFCC/MHEC, 105-dimensional feature vectors (three concatenated syllables) are used. Δ MFCCs or Δ MHECs at the phrase level are obtained in a way similar to phrase-level prosody, with each syllable represented by 12-dimensional Δ MFCCs or Δ MHECs.
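
The fit of Eq. (1) can be sketched with NumPy's Legendre utilities; with order M = 4, each of the 7 coefficient contours contributes 5 parameters, giving the 35-dimensional syllable vector.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_params(cep_frames, order=4):
    """Fit Eq. (1) to each cepstral-coefficient contour of a syllable.

    cep_frames : (T, 7) array of MFCC or MHEC frames for one syllable.
    Returns (order + 1) * 7 = 35 Legendre coefficients a_0..a_4 per contour.
    """
    T = cep_frames.shape[0]
    t = np.linspace(-1, 1, T)                    # natural Legendre domain
    coeffs = [legendre.legfit(t, cep_frames[:, j], deg=order)
              for j in range(cep_frames.shape[1])]
    return np.concatenate(coeffs)                # a0 ~ mean, a1 ~ slope, ...
```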

4.5 Data normalization

The features need to be normalized to account for speaker variation, channel variation, etc. GMM-UBM and i-vector based SVM classifiers have been trained on z-normalized (Ng et al. 2009) features, while for training the ANN classifier, features are normalized to the range of -1 to +1.
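
Both normalizations amount to the following two-liners (applied per feature dimension over the training data):

```python
import numpy as np

def z_norm(X):
    """Zero mean, unit variance per dimension (GMM-UBM, i-vector SVM)."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)

def minmax_pm1(X):
    """Rescale each dimension to [-1, +1] (ANN front-end)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2 * (X - lo) / (hi - lo + 1e-9) - 1
```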

4.6 Language modelling

In this work, four different modelling techniques, namely GMM-UBM (Reynolds 2015), i-vector based SVM (Dehak et al. 2011), ANN (Dorofki et al. 2012) and DNN (Mounika et al. 2016), are used. GMMs are used to model the language-specific characteristics of the given feature set. The UBM is trained using the EM algorithm, and a specific GMM for each language is adapted from it using maximum a posteriori (MAP) adaptation. GMM models are trained on the prosodic and spectral feature vectors obtained from each syllable. An ANN has also been used for identifying the different languages; a shallow ANN architecture with one or two hidden layers is one of the approaches used for language identification. Generally, for an ANN, the number of hidden neurons is chosen by trial and error so as to obtain the best possible result from the system. In this experiment, feed-forward neural networks trained with the gradient descent algorithm are used. However, they have a poor convergence rate, and no definite rules are available for choosing the optimal parameters of the training stage.
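
A mean-only MAP adaptation of a scikit-learn UBM can be sketched as below; note that copying parameters onto a fresh `GaussianMixture` in this way is a pragmatic shortcut rather than a documented API, and the relevance factor is an assumed value.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, X, r=16.0):
    """Adapt the UBM means to class data X (relevance factor r)."""
    post = ubm.predict_proba(X)                  # (T, C) responsibilities
    n_c = post.sum(axis=0)                       # zeroth-order statistics
    ex_c = (post.T @ X) / (n_c[:, None] + 1e-9)  # first-order statistics
    alpha = (n_c / (n_c + r))[:, None]           # data-dependent weights
    adapted = GaussianMixture(n_components=ubm.n_components,
                              covariance_type=ubm.covariance_type)
    adapted.weights_ = ubm.weights_              # weights/covariances kept
    adapted.covariances_ = ubm.covariances_
    adapted.precisions_cholesky_ = ubm.precisions_cholesky_
    adapted.means_ = alpha * ex_c + (1 - alpha) * ubm.means_
    return adapted
```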

4.6.1 i-vector based SVM

Using the same UBM (GMM) as explained in Sect. 4.6, the i-vector extractor is trained. An i-vector is a compact representation of the acoustic features. It is based on the factor analysis model given by Dehak et al. (2011):

$$ s = m + Tw $$
(2)

where \(s\) represents the GMM supervector of the speech segment with respect to the UBM, \(m\) is the supervector mean, \(T\) is the total variability matrix and \(w\) represents the i-vector.

An utterance can be considered as a sequence of syllables. In order to improve the system performance, in one study (Dey et al. 2017), the i-vector for a particular frame of an utterance was computed with a left and a right context of \(L\) frames. Motivated by their study, in this work, i-vectors are computed for each of the M syllables in the utterance with a left and a right context of \(L\) syllables each. That is, the Baum-Welch statistics for the \(N\)th syllable are calculated using the syllables from \(N - L\) to \(N + L\). The resulting i-vector sequence may be denoted by \(w = [w_1, w_2, \ldots, w_M]\).
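
The pooling of Baum-Welch statistics over the syllable context can be sketched as follows; `ubm_post` stands for any routine returning per-frame UBM component posteriors (e.g., `GaussianMixture.predict_proba`).

```python
import numpy as np

def windowed_bw_stats(syllable_feats, ubm_post, L=3):
    """Zeroth/first-order statistics for each syllable, pooled over the
    window N-L .. N+L (truncated at the utterance boundaries)."""
    M = len(syllable_feats)
    stats = []
    for n in range(M):
        lo, hi = max(0, n - L), min(M, n + L + 1)
        frames = np.vstack(syllable_feats[lo:hi])  # frames in the window
        post = ubm_post(frames)                    # (T, C) posteriors
        N_c = post.sum(axis=0)                     # zeroth-order stats
        F_c = post.T @ frames                      # first-order stats
        stats.append((N_c, F_c))                   # fed to the T-matrix model
    return stats
```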

The total variability matrix is trained on similarly short segments of speech obtained by dividing the training utterances. For word-level and phrase-level i-vectors, a similar procedure is followed. i-vectors contain the language, channel and speaker variabilities together. Variability compensation techniques, like linear discriminant analysis (LDA), WCCN or nuisance attribute projection (NAP), are required to reduce the unwanted variability in the i-vector space before the SVM classifier. In this case, within-class covariance normalization (WCCN) (Hatch et al. 2006) of the data has been done to generalize the linear kernel of the SVM classifier, as sketched below.
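
A minimal WCCN sketch, assuming enough i-vectors per class for the within-class covariance to be invertible:

```python
import numpy as np

def wccn(ivecs, labels):
    """Project i-vectors so the average within-class covariance whitens
    the linear SVM kernel (Hatch et al. 2006)."""
    classes = np.unique(labels)
    W = np.mean([np.cov(ivecs[labels == c].T, bias=True) for c in classes],
                axis=0)                          # mean within-class covariance
    B = np.linalg.cholesky(np.linalg.inv(W))     # W^-1 = B B^T
    return ivecs @ B                             # rows become B^T w
```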

4.6.2 DNN

For an ANN, it is problematic to train networks with more than 2 hidden layers. In recent studies, advanced optimization software and fast computing hardware have made it possible to train much deeper networks. A DNN can have 5 or more hidden layers, and its training uses the stochastic gradient descent (SGD) algorithm with mini-batches for updating the DNN; the backpropagation algorithm is used to estimate the gradients of the DNN parameters for each mini-batch. In this case, the input to the DNN is a stacked set of features obtained from the syllables. Here, ±L syllables/words/phrases of context around the current unit have been stacked together to obtain input feature vectors of different dimensions for the different features, as sketched below.
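
The context stacking itself is a simple windowing operation, sketched below with boundary units repeated at the edges (an assumed padding choice):

```python
import numpy as np

def stack_context(units, L=3):
    """Stack each unit with its +/-L neighbours to form the DNN input.

    units : (M, D) syllable/word/phrase feature vectors.
    Returns (M, (2L+1)*D); e.g. 16-dim prosody with L=3 gives 112-dim inputs.
    """
    padded = np.pad(units, ((L, L), (0, 0)), mode="edge")
    return np.hstack([padded[i : i + len(units)] for i in range(2 * L + 1)])
```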

5 Experiments, results and discussions

Several experiments have been carried out to evaluate the performance of the proposed systems implemented using multi-level prosody and spectral features. An Equal Error Rate (EER) based performance analysis is presented in this paper, along with Detection Error Trade-off (DET) curves. The systems have been tested on 30 s, 10 s and 3 s data.
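
EER, the operating point at which the false-acceptance and false-rejection rates coincide, can be computed from the detection scores as follows:

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer(labels, scores):
    """Equal Error Rate from binary labels and detection scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # point where FNR ~ FPR
    return (fpr[idx] + fnr[idx]) / 2
```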

5.1 Experimental setup

In this experiment, the whole NITS-LD dataset (described in Table 1) has been split into three parts: NITS-train, NITS-development and NITS-test. Around 7-10 h of data for each of the 12 languages, totalling around 100 h, forms the NITS-train set. Out of the 100 h NITS-train set, 39 h of data come from the five tonal languages and the remaining 61 h from the seven non-tonal languages. Also, 1 h of data from each of the 12 languages, totalling around 12 h, makes up the NITS-development set, and another 1 h per language, again totalling around 12 h, makes up the NITS-test set. The NITS-train, NITS-development and NITS-test sets are mutually exclusive. Since system performance depends on the duration of the test utterances, experiments have been conducted with three different test durations, namely 30 s, 10 s and 3 s.

In the case of GMM-UBM, the UBM has been built using the NITS-development set, and a GMM has been adapted using the NITS-train set for each of the language categories (39 h of tonal-language data and 61 h of non-tonal-language data). The models have been tested on the three different durations of test utterances. The i-vector extractor is trained using the same NITS-development set. For the i-vector based SVM, a context size of L = 3, leading to a sliding window of 7 syllables/words/phrases, is used with a shift step of 1. In the case of ANN, the NITS-train set has been used to train the models and the NITS-development set to validate them after each epoch. A fivefold cross-validation approach is used in this experiment.

Feed-forward DNNs with fully connected layers have been used in these experiments. The network has three hidden layers; each hidden layer contains 512 hidden units with ReLU activation functions for the prosody model and 1024 such units for the rest of the DNN models. The learning rate and the L2 regularization parameter have been set to 0.001 and 10^-7 respectively. The output is configured as a single neuron with sigmoid activation to produce the class probabilities at the pre-classification stage, and as a softmax layer with a dimension equal to the number of target languages for the second stage. The model minimizes the cross-entropy loss using the Adadelta optimizer. The models have been trained for 100 epochs with a mini-batch size of 256. The NITS-train set has been used to train the models and the NITS-development set to validate them after each epoch. A fivefold cross-validation approach has been used, and the model that gives the best validation accuracy has been used as the final model. Finally, NITS-test has been used for evaluating the developed models and reporting results.
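
A Keras sketch of the pre-classification DNN with the stated configuration is given below (hidden = 1024 for the non-prosody models; the softmax variant for the second stage follows analogously):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_preclassifier(input_dim, hidden=512):
    """3 hidden ReLU layers, L2 = 1e-7, Adadelta(lr=0.001), sigmoid output."""
    reg = regularizers.l2(1e-7)
    inputs = tf.keras.Input(shape=(input_dim,))
    x = inputs
    for _ in range(3):
        x = layers.Dense(hidden, activation="relu", kernel_regularizer=reg)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # tonal vs non-tonal
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model.fit(X_train, y_train, validation_data=(X_dev, y_dev),
#           batch_size=256, epochs=100)
```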

To analyse performance on the OGI-MLTS database, the same procedure has been followed for all the modelling techniques. 15 h of training data (9 h from the non-tonal category and 6 h from the tonal category), 5 h of development data and 5 h of test data (30 min from each language) from this database are used in the experiments.

5.2 Results of the tonal and non-tonal language pre-classification system

5.2.1 Results of system-I

In the case of GMM-UBM, experiments have been conducted with different numbers of Gaussian components, particularly 2, 4, 8, 16, 32, 64, 128, 256, 512, etc., and it is observed that 16 components for syllable-level prosody, 32 for word-level prosody, 16 for phrase-level prosody, 256 for utterance-level MFCC or MHEC, and 512 for utterance-level (MFCC + SDC) or (MHEC + SDC) result in the lowest individual EERs. For syllable/word/phrase level analysis, the likelihoods of all the syllables/words/phrases of a test utterance are averaged to obtain the score for the utterance, and the top-scoring class is considered the identified one. EERs are calculated for the individual languages, and the average EER of the pre-classification stage is also calculated.

From Table 2, it can be observed that among syllable-, word- and phrase-level prosody, syllable-level prosody provides the lowest EERs, followed by word-level and then phrase-level prosody. It can therefore be inferred that tones in tonal languages are coded more distinctly in syllables than in words or phrases. Moreover, some of the tonal languages in this experiment are monosyllabic, which is possibly why word-level prosody provides the second lowest EER readings. Here, the average EERs for tonal languages are lower for MHECs than for MFCCs, whereas the average EERs for non-tonal languages are lower for MFCCs than for MHECs. This is possibly because MHECs are based on the Hilbert envelope of the Gammatone filterbank outputs, which is a time-frequency representation of the cochleagram; the human cochlea, a part of the inner ear, perceives pitch variation more accurately than the outer ear. Moreover, in the low-frequency range, the ERB scale involved in the extraction of MHECs has finer resolution than the Mel scale (Zhao and Wang 2013), hence the better performance for tonal languages. It can also be observed that MFCC provides the lowest EERs among all individual features, and the EERs reduce further when the SDC coefficients are included. The SDC coefficients of both MFCCs and MHECs are found to carry useful information for the pre-classification task, and for NITS-LD, the SDCs of MFCC features are more effective than those of MHEC features.

Table 2 Feature-wise performance of pre-classification stage (system-I) on NITS-LD database

Table 3 shows the performance of system-I on NITS-LD when combining the scores obtained using GMM-UBM for the different levels of prosody, and also for the combination of the scores of spectral and multi-level prosodic features. From Tables 2 and 3 it can be observed that combining the GMM-UBM scores obtained for syllable- and word-level prosody improves the system performance by 2.7%, 1.8% and 1.7% over syllable-level prosody alone, with further improvements of 4.8%, 4.6% and 4.3% (for 30 s, 10 s and 3 s test data respectively) after combining the scores from syllable-, word- and phrase-level prosody. This may be due to the presence of some non-overlapping tonal/non-tonal language-discriminating information in the individual features. Moreover, some of the languages, like Manipuri, Mizo and Bodo, are monosyllabic or have mostly monosyllabic words, while languages like Assamese and Indian English are polysyllabic. Therefore, the individual EERs of the monosyllabic languages are lower for syllable-level prosody than for word-level prosody, while for some of the polysyllabic languages word-level prosody provides lower EERs than syllable-level prosody; hence the better performance obtained on combining the scores of the different levels of prosody.

Table 3 Performance of pre-classification stage (system-I) on NITS-LD database for different score feature combination

It can also be observed that both utterance-level (MFCC + SDC) and (MHEC + SDC) perform better than multi-level prosody, and the EER values reduce on combining the scores of GMM-UBM models developed using either utterance-level (MFCC + SDC) or (MHEC + SDC) together with multi-level prosody. Additionally, for system-I, the lowest EERs have been obtained for the combination of multi-level prosody and utterance-level (MFCC + SDC); for this combination, the EERs reduce by 11.7%, 12.5% and 11.8% with respect to multi-level prosody for 30 s, 10 s and 3 s test data respectively. Of the classifiers trained on scores, SVM with a polynomial kernel outperforms the rest with slightly better EER readings.

Similar experiments have been conducted on the OGI-MLTS database, and the experimental results are shown in Fig. 4. The numbers of Gaussians in the GMM-UBM models for the different features are the same as those used for NITS-LD. From Fig. 4 it can be observed that syllable-level prosody carries more tonal/non-tonal language-discriminating information than word- or phrase-level prosody. Relative to syllable-level prosody, EER values reduce by 1.7%, 2.7% and 2.3% (30 s, 10 s and 3 s test data) when the scores of syllable- and word-level prosody are combined, and by 5.1%, 4.9% and 4.7% when the scores of syllable-, word- and phrase-level prosody are combined. As with NITS-LD, both utterance-level MHECs and MFCCs carry information complementary to multi-level prosody for the OGI-MLTS database. The system provides the lowest EERs, of 19.8%, 21.4% and 23.8% for the three respective test durations, for the combination of the scores of multi-level prosody and utterance-level (MHEC + SDC) features, which are 11.6%, 12.8% and 12% lower than those obtained with multi-level prosodic features alone. Score combination has been done using SVM with a polynomial kernel. The following observations can be made from Tables 2 and 3 and Fig. 4:

  • Both utterance-level MHECs and MFCCs are useful for tonal/non-tonal discrimination, and both carry information complementary to multi-level prosody.

  • Prosody obtained at the syllable, word and phrase levels carries complementary information.

  • MFCC performs better than MHEC for NITS-LD; however, MHEC performs better than MFCC for the OGI-MLTS database. This may be due to the fact that MHEC extraction uses a nonlinear rectification step prior to the DCT, whereas MFCC uses a log scale. The log operation on the Mel power spectrum makes MFCCs scale-invariant, whereas MHECs are not scale-invariant because of the power-law compression with a factor of 1/15. This could make MHECs more noise-robust, giving better performance on the noisy OGI-MLTS database.

  • System-I provides EERs of 19.8%, 21.4% and 23.8% for the OGI-MLTS database using 30 s, 10 s and 3 s test data respectively, which are slightly higher than the lowest EERs obtained for NITS-LD (19.3%, 21% and 23.2%). This is possibly because the speech samples of the OGI-MLTS database are noisy, whereas those of NITS-LD are noise-free.

Fig. 4

Performance of system-I for OGI-MLTS database

5.2.2 Results of system-II

This section presents the performances of the different models, namely ANN, GMM-UBM, i-vector based SVM and DNN, for multi-level (spectral + prosodic) features. It also analyses the system performance for the score combination of the different models developed using multi-level features.

In the case of GMM-UBM, 256 Gaussian components for syllable-level MFCCs or MHECs; 512 for syllable-level (MFCC + SDC) or (MHEC + SDC); 512 for syllable-level (prosody + MFCC) or (prosody + MHEC); 512 for word-level (prosody + MFCC) or (prosody + MHEC); and 256 for phrase-level (prosody + MFCC) or (prosody + MHEC) result in the lowest individual EERs. Score combination is performed in a way similar to system-I. From Tables 2, 3 and 4 it can be observed that syllable-level MFCC or MHEC features provide better performance than utterance-level MFCCs or MHECs; the improvements are 3.1%, 4.6% and 4% for MFCCs and 3.9%, 3.4% and 4.8% for MHECs using 30 s, 10 s and 3 s test data respectively. The SDCs of both MFCC and MHEC extracted from syllables prove insignificant for syllable-level analysis of the pre-classification task. The system provides the lowest EER readings on combining the scores corresponding to multi-level (prosody + MFCC) features, which are 3.5%, 2% and 1.2% (for the three respective test durations) lower than the respective lowest EER readings of system-I. As explained in Sect. 5.2.1, the score for a particular utterance is calculated by averaging the likelihood scores of all the syllables/words/phrases of that utterance, and the score combination among the different levels of features is performed in a way similar to system-I.

Table 4 Performance of system-II for NITS-LD using GMM-UBM

System performance has also been analysed using the ANN classifier. Several experiments have been carried out with different network structures; 16L-29N-8N-2L for syllable-level prosody, 35L-50N-12N-2L for syllable-level MFCC or MHEC, 280L-150N-40N-2L for syllable-level (MFCC + SDC) or (MHEC + SDC), 51L-82N-35N-2L for syllable-level (prosody + MFCC) or (prosody + MHEC), 153L-182N-63N-2L for word-level (prosody + MFCC) or (prosody + MHEC), and 48L-70N-16N-2L for phrase-level (prosody + MFCC) or (prosody + MHEC) prove to be the most effective network structures for NITS-LD. Here, L represents linear units, N represents non-linear units, and the most effective ANN architecture is selected by trial and error. For instance, in the network structure 16L-29N-8N-2L, the number of nodes in the input layer is 16, the number of nodes at the output layer is 2, and the numbers of hidden units in the two hidden layers are 29 and 8 respectively; the same convention is followed for all the network structures. In this experiment, the epoch limit is set to 500 and tan-sigmoid is used as the activation function. As with GMM-UBM, the output scores corresponding to all the syllables/words/phrases of a test utterance are averaged to calculate the final score for that utterance. Syllable-level prosody provides EERs of 29.8%, 32.3% and 34.6% for 30 s, 10 s and 3 s test data respectively. The performances of the ANN classifier for system-II using the different features are given in Table 5.

Table 5 Performance of system-II for NITS-LD using ANN

To obtain i-vectors for the different features, the same numbers of Gaussian mixtures have been used as for GMM-UBM in system-II. A linear SVM kernel and TV matrix dimensions of 100, 200, 400, 200, 250 and 200 for syllable-level prosody, syllable-level MFCC or MHEC, syllable-level (MFCC + SDC) or (MHEC + SDC), syllable-level (prosody + MFCC) or (prosody + MHEC), word-level (prosody + MFCC) or (prosody + MHEC) and phrase-level (prosody + MFCC) or (prosody + MHEC) respectively lead to the lowest individual EERs. The SVM scores are converted into posterior probabilities by an optimal sigmoid transformation, and then, as with GMM-UBM, the scores of all the syllables/words/phrases are averaged to calculate the score of the utterance. In this case, syllable-level prosody provides EERs of 32.3%, 34.6% and 37.4% for 30 s, 10 s and 3 s test data respectively. Score combination has been performed in a way similar to the ANN models. Table 6 shows the performances of system-II using i-vector based SVM with the different features.

Table 6 Performance of system-II for NITS-LD using i-vector based SVM

In the case of syllable-level analysis, the input to the DNN is a stacked set of features obtained from the syllables. Here, ±3 syllables around the current syllable have been stacked together to obtain input feature vectors of dimensions 16 × 7 = 112 for prosody, 35 × 7 = 245 for MHEC or MFCC, 280 × 7 = 1960 for (MHEC + SDC) or (MFCC + SDC) and 51 × 7 = 357 for (MHEC + prosody) or (MFCC + prosody). Similarly, for words and phrases, a context size of ±3 leads to input feature vectors of dimensions 153 × 7 = 1071 and 48 × 7 = 336 respectively, for both (prosody + MHEC) and (prosody + MFCC). Again, the scores corresponding to all the syllables/words/phrases of a test utterance are averaged to calculate the score for that utterance. The EERs of the DNN models for the different features are given in Table 7. Here, syllable-level prosody provides EERs of 27.2%, 29.6% and 31.2% for 30 s, 10 s and 3 s test data respectively.

Table 7 Performance of system-II for NITS-LD using DNN

From Tables 4, 5, 6 and 7, the following observations can be made:

  • Both syllable-level MFCC and MHEC perform better than their utterance-level counterparts.

  • At the syllable level also, both MFCC and MHEC carry complementary information with prosody for the tonal/non-tonal discrimination task.

  • (Prosody + spectral) features obtained from the syllable, word and phrase levels carry non-overlapping tonal/non-tonal language-discriminating information, and score combination of the different levels of features provides the best performance.

  • The SDC coefficients of both MFCC and MHEC do not carry any tonal/non-tonal language-discriminating information at the syllable level. This is possibly because the performance of SDC features is affected by the short time span of syllable units, which often proves too short to capture any significant spectral-transition cues between frames.

  • DNN outperforms the other classifiers, followed by ANN, GMM-UBM and then i-vector based SVM. Modelling with DNN reduces the EERs of system-II relative to system-I by 9.7%, 5.9% and 4% for 30 s, 10 s and 3 s test data respectively.

The same experiments have been conducted on the OGI-MLTS database. The performances on the OGI-MLTS database with GMM-UBM and DNN models are given in Tables 8 and 9 respectively; the modelling parameters remain the same as those used for NITS-LD. As with NITS-LD, for the OGI-MLTS database both MFCC and MHEC perform better when extracted at the syllable level than at the utterance level, and the SDCs of both MFCC and MHEC do not carry any significant information at the syllable level. The combined scores (using the polynomial kernel of SVM) of the GMM-UBM models developed using multi-level (prosody + MHEC) features provide the lowest EER values, which are 3.5%, 2.2% and 1% (for the three respective test durations) lower than the lowest EERs of system-I. The EERs reduce further when the features are modelled using DNN, by 9.6%, 5.6% and 3.8% for the three respective test durations.

Table 8 Performance of system-II for OGI-MLTS database using GMM-UBM
Table 9 Performance of system-II for OGI-MLTS database using DNN

5.3 Results of the tonal and non-tonal pre-classification-based LID system

Two case studies have been considered in this paper to study the effectiveness of the pre-classification module in language identification.

5.3.1 Case study I (system-III)

In this case, we implement a pre-classification-based LID system for the OGI-MLTS and NITS-LD databases (system-III). The tonal/non-tonal language pre-classification module described as system-II (DNN-based) is used as the front-end of the individual language identification system. A baseline system has been prepared for identifying the individual languages without any pre-classification module, using the combined scores of multi-level (prosody + MFCC) features. The performance of the proposed pre-classification-based LID system is compared with this baseline. For the baseline system, a single discriminative model for each individual feature set is trained using DNN to classify the participating languages. Thus, five separate DNN models have been prepared for the features, namely syllable-level prosody, syllable-level MFCC, syllable-level (prosody + MFCC), word-level (prosody + MFCC) and phrase-level (prosody + MFCC). In the testing phase, identification is based on the output of the DNN model. For NITS-LD, prosody, MFCC and their combinations are used as features, extracted at the different levels (syllable, word and phrase) in a way similar to that in the pre-classification stage (system-II). Score combination and decision making are also done as explained earlier. For the OGI-MLTS database, prosody, MHEC and their combination are used to analyse the system performance. The features have been chosen based on their performance measures on the respective databases. The performances of the baseline system and the pre-classification-based language identification system are given in Table 10.

Table 10 Performance of baseline system and system-III

From Table 10, it can be observed that languages like Hindi, Nagamese, Odia, Manipuri, Mizo etc. are identified better using MFCC features, whereas languages like Assamese, Gojri, Punjabi etc. are better distinguished by prosodic features. This confirms that prosody and MFCC are complementary, and that their combination helps to improve language identification performance. The system also performs better with word-level (prosody + MFCC) features than with syllable-level (prosody + MFCC) features; the overall EERs reduce by 1.7%, 3.2% and 5.6% for 30 s, 10 s and 3 s test data respectively. This is possibly due to the dynamic information captured at the word level, where three consecutive syllables are concatenated. System performance improves further when the scores of the DNN models for syllable- and word-level (prosody + MFCC) features, or for syllable-, word- and phrase-level (prosody + MFCC) features, are combined. The baseline system provides its lowest EERs of 14.4%, 15.8% and 19% for 30 s, 10 s and 3 s test data after combining the scores from syllable-, word- and phrase-level (prosody + MFCC).
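As a small illustration of the word-level construction mentioned above, the sketch below concatenates three consecutive syllable-level vectors into one word-level vector. The sliding (overlapping) window and the feature dimension are assumptions on our part:

```python
# Word-level vectors from three consecutive syllable-level vectors.
import numpy as np

def word_level_features(syllable_feats):
    """Concatenate each run of three consecutive syllable vectors."""
    n_syl = len(syllable_feats)
    return np.stack([syllable_feats[i:i + 3].ravel()
                     for i in range(n_syl - 2)])

syllables = np.random.default_rng(0).normal(size=(10, 25))  # 10 syllables
words = word_level_features(syllables)
print(words.shape)   # (8, 75): each word-level vector spans 3 syllables
```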

From Table 10 it can also be observed that system performance improves with the addition of a pre-classification module. The pre-classification-based LID system provides EERs of 11.2%, 13% and 16.6% for the three respective test data, an absolute improvement of 3.2%, 2.8% and 2.4% over the baseline system. It is also evident that the pre-classification module does not necessarily improve the performance for every language: languages like English, Nagamese, Odia, Mizo etc. show higher EERs when pre-classification precedes individual language identification, because decision errors in the pre-classification module are carried over to the second stage. Nevertheless, the languages that are correctly pre-classified are identified with significantly reduced EERs, and as a result the overall EERs of the system reduce by a good margin.

The system performance has been analysed further by assuming a hundred percent accurate pre-classification module at the first stage. In that case the system achieves its lowest EER values of 9%, 10.8% and 14.4% for the three test-data durations after combining the scores from syllable-, word- and phrase-level (prosody + MFCC), an absolute improvement of 5.4%, 5% and 4.6% over the baseline system.
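A minimal sketch of this two-stage decision, including the oracle analysis in which the true tonal/non-tonal labels replace the stage-1 predictions, is given below. All models are small scikit-learn stand-ins and every name and dimension is illustrative:

```python
# Two-stage LID: pre-classify tonal/non-tonal, then identify the language.
import numpy as np
from sklearn.linear_model import LogisticRegression

def two_stage_identify(feats, pre_clf, tonal_lid, nontonal_lid,
                       oracle_tonal=None):
    """Route each utterance to the tonal or non-tonal LID back-end."""
    preds = []
    for i, x in enumerate(feats):
        if oracle_tonal is not None:             # simulated perfect stage 1
            is_tonal = bool(oracle_tonal[i])
        else:                                    # stage 1: pre-classifier
            is_tonal = pre_clf.predict(x[None])[0] == 1
        model = tonal_lid if is_tonal else nontonal_lid
        preds.append(model.predict(x[None])[0])  # stage 2: language label
    return np.array(preds)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
tonal = rng.integers(0, 2, size=60)              # 1 = tonal, 0 = non-tonal
lang = rng.integers(0, 6, size=60)               # toy language labels
pre = LogisticRegression().fit(X, tonal)
lid_t = LogisticRegression(max_iter=500).fit(X[tonal == 1], lang[tonal == 1])
lid_n = LogisticRegression(max_iter=500).fit(X[tonal == 0], lang[tonal == 0])
print(two_stage_identify(X, pre, lid_t, lid_n, oracle_tonal=tonal)[:10])
```

Passing `oracle_tonal` reproduces the hundred-percent-accurate analysis, while omitting it shows how stage-1 errors propagate into stage 2.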

Experimental results given in Table 11 show that pre-classification also helps to boost the identification performance for the world's distinct languages of the OGI-MLTS database; in this case the improvements are 4.2%, 4.1% and 3.2% for the three respective test data. The system performance improves further when a hundred percent accurate pre-classification module is assumed at the first stage: after combining the scores of the syllable-, word- and phrase-level (prosody + MHEC) features, the system provides EERs of 10.1%, 13.7% and 15.4% for 30 s, 10 s and 3 s test data, an improvement of 6.1%, 5.6% and 5.2% over the baseline system. From Tables 10 and 11 it is observed that the pre-classification module yields a more significant performance improvement for the OGI-MLTS database than for NITS-LD, and the same holds when a hundred percent accurate pre-classification module is considered at the first stage. This is possibly because the OGI-MLTS database involves the world's distinct languages, while the NITS-LD languages are of the same origin and closely related.

Table 11 Performance of baseline system and system-III for OGI-MLTS database

Table 12 shows the confusion matrix and Table 13 the FPR and FNR values of the system when the languages are identified without pre-classification. Experimental results are given for the DNN model (fivefold cross-validation) based on multi-level (prosody + MFCC) features after score combination. Tables 12 and 13 show that most of the languages are confused with Hindi, and hence the system reports the highest FPR for it. The system shows the lowest FNR for Nagamese, which also has the lowest EER among all the languages. The lowest FPR is reported for Tamil, indicating that other languages are least confused with it.

Table 12 Confusion matrix obtained for NITS-LD when no pre-classification module is present. (Rows list the actual class and columns represent the assigned class)
Table 13 FNR and FPR analysis (in %) for the LID system without pre-classification, using the combination of prosody and MFCC features and DNN
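For clarity, the per-language FPR and FNR values reported in these tables can be derived from a confusion matrix in the row = actual, column = assigned convention used above. The following sketch, with a toy 3-language matrix, shows the computation:

```python
# Per-class FPR and FNR (in %) from a confusion matrix.
import numpy as np

def fpr_fnr(confusion):
    """Rows are actual classes, columns are assigned classes."""
    total = confusion.sum()
    tp = np.diag(confusion)
    fn = confusion.sum(axis=1) - tp          # actual class, wrongly assigned
    fp = confusion.sum(axis=0) - tp          # other classes assigned here
    tn = total - tp - fn - fp
    return 100 * fp / (fp + tn), 100 * fn / (fn + tp)

# Toy 3-language example; the real matrices are 12x12 for NITS-LD.
cm = np.array([[50, 3, 2],
               [4, 45, 6],
               [1, 7, 42]])
fpr, fnr = fpr_fnr(cm)
print(fpr.round(1), fnr.round(1))
```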

Table 14 shows the confusion matrix and Table 15 the FPR and FNR values of the system for the pre-classification-based LID task on NITS-LD. The following observations can be made from Tables 14 and 15:

  • Confusion of other languages with Manipuri is very low, as the system reports the lowest FPR for this language; only Mizo is confused with it. This may be because both Manipuri and Mizo belong to the Sino-Tibetan family.

  • The system provides the lowest FNR for Bodo. Bodo is confused only with Gojri and Hindi, and other languages are rarely confused with it; consequently, it has the lowest EER in this system.

  • The system reports the highest FPR for Hindi, although the pre-classification module reduces this value to some extent. Since Hindi is considered the parent language of most of the participating languages, its FPR turns out to be the highest.

Table 14 Confusion matrix for pre-classification-based LID system for NITS-LD using DNN in both stages (fivefold cross-validation), (prosody + MFCC) features and 30 s test data. (Rows list the actual class and columns represent the assigned class)
Table 15 FNR and FPR analysis (in %) for the pre-classification-based LID system using the combination of prosody and MFCC features and DNN

From Tables 12, 13, 14 and 15 it can be clearly observed that an accurate pre-classification stage helps to boost system performance. FPR and FNR values are considerably higher when no pre-classification module is present. With pre-classification, the FPR values of languages like Bengali, English, Hindi, Tamil, Manipuri, Mizo etc. reduce, and for languages like Hindi, Bengali, English, Tamil, Manipuri, Bodo, Gojri etc. both FPR and FNR reduce. Misclassification of some languages at the pre-classification stage, however, increases the overall FNR for those languages. It can therefore be inferred that with a perfectly accurate pre-classification module, tonal languages would not be confused with non-tonal languages, causing the FPR and FNR values of all the languages to reduce.
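Since all of these comparisons are stated in terms of EER, the following sketch recalls how the metric is computed: a detection threshold is swept over the scores until the FPR and FNR curves cross. The scores here are synthetic:

```python
# Equal error rate (EER) from binary detection scores.
import numpy as np

def equal_error_rate(scores, labels):
    """EER (in %); labels are 1 for target-language trials, 0 otherwise."""
    order = np.argsort(scores)[::-1]         # sweep threshold high -> low
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    fnr = 1 - np.cumsum(labels) / n_pos      # misses at each threshold
    fpr = np.cumsum(1 - labels) / n_neg      # false alarms at each threshold
    idx = np.argmin(np.abs(fnr - fpr))       # crossing point of the curves
    return 100 * (fnr[idx] + fpr[idx]) / 2

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500),    # target trials
                         rng.normal(-1.0, 1.0, 500)])  # non-target trials
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(round(equal_error_rate(scores, labels), 1))  # ~16% for this overlap
```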

5.3.2 Case study II

A comparative study of the pre-classification-based LID system against the system proposed in Reddy et al. (2013) is presented in this section. In the first experiment, the performance for 12 individual languages of NITS-LD is obtained using the system described in Reddy et al. (2013). In the second experiment, an LID system that uses the pre-classification module of system-I, based on the same set of features as in Reddy et al. (2013), is evaluated.

The average EERs obtained in this case are given in Table 16. EERs for the prosodic features are obtained after combining the scores of syllable-, word- and phrase-level prosody. It can be observed that here too pre-classification helps to boost system performance, with improvements of 2.8%, 3.2% and 3.4% for the three respective test data. The performance improves further when a hundred percent accurate pre-classification module is assumed, showing improvements of 5.6%, 5.2% and 4.8% over the system without a pre-classification module.

Table 16 Performance of Case study II

6 Conclusions

This paper presents an automatic pre-classification-based LID system for Indian languages of the same origin, based on multi-level prosodic and spectral features. Features are extracted at the syllable, word and phrase levels of the speech signal, and syllable-level features are found to be the most appropriate for the tonal/non-tonal pre-classification-based LID task. Experimental results also suggest that multi-level features carry mutually complementary information, and system performance improves when the scores obtained from models trained on the different levels are combined. At the pre-classification stage, fusing the scores of prosodic features derived from multiple levels of the speech signal reduces the EERs relative to the syllable-level prosody system by 4.8%, 4.6% and 4.3% (30 s, 10 s and 3 s) for NITS-LD and by 5.1%, 4.9% and 4.7% for the OGI-MLTS database. Score combination of multi-level (spectral + prosody) features reduces the EERs by 3.8%, 5.7% and 4.8% for NITS-LD and by 4%, 1.3% and 2.6% for OGI-MLTS compared to the system developed using syllable-level (spectral + prosody) features. The effectiveness of MHEC and MHEC + SDC features, and their complementarity with prosody for the pre-classification-based LID task, has also been analysed: MHEC performs better than MFCC under noisy conditions, and SDCs of both MHEC and MFCC are not effective at the syllable level, although they carry complementary information at the utterance level.

Experiments have been conducted for twelve languages of NITS-LD and ten languages of the OGI-MLTS database. Four different models, namely ANN, GMM-UBM, i-vector based SVM and DNN, have been explored along with the different types of features, and 30 s, 10 s and 3 s test data are used for validation of the proposed systems. At the pre-classification stage, 30 s test data with DNN models provide the lowest EERs of 9.6% and 10.2% for NITS-LD and OGI-MLTS respectively. Two case studies have been carried out, and the pre-classification module proves effective in both. The proposed pre-classification-based LID system provides EERs of 11.2%, 13.0% and 16.6% for NITS-LD and 12%, 15.2% and 17.4% for the OGI-MLTS database. The experimental results also indicate that the efficacy of the pre-classification module is more prominent for the world's distinct languages of the OGI-MLTS database than for the closely related Indian languages of NITS-LD.

In the future, additional features may be explored to improve language classification performance for short-duration test data. Since syllable boundary detection is an important part of the techniques presented in this paper, any error therein can degrade performance; more accurate syllable boundary detection is therefore an important area of research. Also, to improve performance on the closely related Indian languages, a hierarchical LID system could be prepared instead of a two-level one. Finally, session, channel and speaker variability may be dealt with more extensively to develop a more effective language classification system.