1 Introduction

There are 6.5 crore people spread across the different dialect regions of Karnataka state, and they deal with different commodities every day Karnataka Raitha Mitra (2008). The agricultural marketing network (AGMARKNET) website, maintained by the Ministry of Agriculture, Government of India, provides agricultural commodity price information in Indian languages Agricultural Marketing Information Network (2011). This website is updated frequently and provides the minimum, modal and maximum prices of a particular commodity in different Indian languages. Many farmers in Karnataka state are uneducated and not computer savvy, but almost all of them use mobile phones for their daily interactions India Telecom Online (2013). Therefore, it is cost-effective to combine the mobile network with an automatic speech recognition (ASR) system. Integrating the mobile network with ASR models to build a spoken query system addresses the problem statement demonstrated in Kotkar et al. (2008), Rabiner (1997). An end-to-end spoken dialogue system consists of three main components: the interactive voice response system (IVRS) call flow, the ASR models developed with the Kaldi speech recognition toolkit, and the AGMARKNET commodity price information database management system. The IVRS call flow structure is used for task-specific speech data collection. The PHP programming language is used to develop the call flow for speech data collection and the speech recognition system. The ASR acoustic models are created using Kaldi. In Shahnawazuddin et al. (2013), an Assamese spoken dialogue system was demonstrated to disseminate the price information of agricultural commodities in the Assamese language/dialects. The acoustic ASR models were developed using speech data gathered from real farmers of Assam. A constrained-data unseen speaker adaptation method was derived and was reported to give a significant improvement of 8% over the initial evaluation. In Ahmed et al. (2014), an Arabic ASR system was developed, addressing limited Arabic language resources and data sparseness. Basic modeling techniques such as the Gaussian mixture model (GMM) and hidden Markov model (HMM) were used to build the acoustic models of the Arabic speech recognition system.

The 36 phonetic symbols and 200 h of speech corpus were used. A Russian ASR system was developed using a syntactico-statistical modeling algorithm with a large Russian dictionary Karpov et al. (2014). Standard IPA phonemes were used as the phoneme set to build the ASR models: the dictionary includes 55 phonetic symbols, 38 consonants and 17 vowels. The Russian speech corpus was recorded in a clean environment. A 16 kHz sampling rate and 26 h of speech were used for Kaldi system training and decoding, and the obtained word error rate (WER) was 26.90%. Improvements to the Assamese spoken dialogue system were implemented in Dey et al. (2016). A foreground speech segmentation enhancement algorithm was used to suppress different background noises; the noise elimination algorithm was introduced before the Mel frequency cepstral coefficients (MFCC) feature extraction stage. Recently introduced modeling techniques such as the subspace Gaussian mixture model (SGMM) and deep neural networks (DNN) were used to develop the acoustic models. The developed spoken dialogue system was verified by the farmers of Assam under degraded conditions, and it enables the farmers/users to obtain on-time price information of agricultural commodities in the Assamese language/dialects.

An ASR system accessed through an IVRS is one of the significant applications of speech processing Rabiner (1994). The combination of IVRS and ASR systems, called a spoken query system, is used to decode the user input Glass (1999), Trihandoyo et al. (1995) and deliver the requested information. A recent advancement in the speech recognition domain is that the touch tones used in earlier IVRS-based systems have been completely removed. A spoken query system has recently been developed to access the prices of agricultural commodities and weather information in the Kannada language/dialects Thimmaraja and Jayanna (2017). This work is part of an ongoing project sponsored by DeitY, Government of India, targeted at developing a user-friendly spoken dialogue system that addresses the needs of Karnataka farmers. The developed system lets the user make his/her own query over the mobile/land-line telephone network. The query uttered by the user/farmer is recorded, the price/weather information is looked up in the database through the ASR models, and the on-time price/weather information of the particular commodity in the particular district is communicated through saved messages (voice prompts). The earlier spoken query system Thimmaraja and Jayanna (2017) was developed using GMM-HMM modeling techniques. In this work, we demonstrate enhancements to that recently implemented ASR system. A noise elimination algorithm is proposed to reduce the noise in speech data collected under uncontrolled environments. We have also investigated two different acoustic modeling approaches reported recently in the context of the spoken query system. The training and testing speech data used in Thimmaraja and Jayanna (2017) for the creation of ASR models was collected from farmers in a real-time environment. Therefore, the collected speech data was contaminated by different types of background noise, and the queries made by users/farmers also contain high levels of background noise. This degrades the performance of the entire spoken query system. To overcome this problem, we have introduced the noise elimination algorithm before the feature extraction stage. This algorithm eliminates the different types of noise in both the training and testing speech data. The removal of background noise leads to better modeling of phonetic contexts. Therefore, an improvement in the online and offline speech recognition accuracy is achieved compared to the earlier spoken query system.

The process of enhancing degraded speech data using various noise elimination techniques is called speech enhancement Loizou (2007). A modified spectral subtraction algorithm was proposed for speech enhancement in Bing-yin et al. (2009); it was implemented using the MCRA method Cohen and Berdugo (2002), and the experimental results were evaluated and compared with existing methods. In Liu et al. (2012), to suppress the musical noise in corrupted speech data, a modified spectral subtraction algorithm was proposed and compared with the traditional spectral subtraction algorithm.

A few years ago, two advanced acoustic modeling approaches, namely the SGMM and DNN, were described in Povey et al. (2011), Dahl et al. (2012), Hinton et al. (2012) and Hinton et al. (2006). These two techniques provide better speech recognition performance than the GMM-HMM based approach. In the SGMM, the GMM parameters of each state are constrained to a shared subspace of the acoustic parameter space, which makes the SGMM well suited to moderate amounts of training speech data. Furthermore, a DNN uses many hidden layers in a multilayer perceptron to capture the nonlinearities of the training set. This gives a good improvement in modeling acoustic variation, leading to better speech recognition performance. The process of converting human speech into its equivalent text format is called speech recognition, and the recognition output can be used in various applications. Nowadays many artificial intelligence techniques are available to build robust ASR models for the development of end-to-end ASR systems Derbali et al. (2012). The authors used the Sphinx toolkit to model the sequential structure and its pattern classification. A more recently developed speech recognition toolkit is Sphinx-4, an added value to the CMU repository, which was jointly implemented by various universities and laboratories. Sphinx-4 has several advantages over earlier Sphinx versions in terms of flexibility, reliability, modularity and algorithmic aspects. It supports different languages/dialects and uses different search strategies. The entire Sphinx-4 package is developed in the Java programming language; it is user friendly, portable, flexible and easy to use with multi-threading concepts Lamere et al. (2003).

The design and implementation of a natural Arabic continuous speech recognition system was presented in Abushariah et al. (2010) using the Sphinx tool. The authors explored the effectiveness of the Sphinx models and developed a robust new continuous Arabic ASR system, which was compared with a baseline Arabic ASR system. The implemented ASR system used the Sphinx and HTK tools to build the language and acoustic models. The speech signal feature vectors were extracted using the MFCC technique. The system used five-state HMMs and 16 Gaussian mixtures, with 500 senones, and the models were trained on 7 h of transcribed and validated speech data. One hour of speech data was used for decoding. The achieved speech recognition accuracy was 92.67%.

An ASR system was designed and developed in Al-Qatab and Ainon (2010) using the HTK tool for both isolated and continuous speech recognition. The lexicon and phoneme set were first created for the Arabic language, the MFCC technique was used for feature extraction, and the speech database was collected from native Arabic speakers. The achieved speech recognition accuracy was 97.99%. Three components play an important role in the development of an ASR system: the lexicon or dictionary, the language model and the acoustic model. The authors of Harwath and Glass (2014) developed a robust ASR system for English without using a lexicon. The impact of subspace based modeling techniques was investigated in Rose et al. (2011). The SGMM is a newer modeling technique for building acoustic models for robust ASR systems; it was used to develop a continuous speech recognition system, and the achieved WER was 18%. An ASR system was developed for the Odia language in Karan et al. (2015). The ASR models were built using Kaldi, with monophone and triphone training and decoding, for the district names of Odisha state; the speech database was collected from farmers of Odisha in a real-time environment, and an Asterisk server was used to build the system. The enhancements to the IITG spoken query system are described in Abhishek et al. (2017). The models were built using Kaldi, and the end-to-end spoken dialogue system consists of the IVRS call flow, the IMD and AGMARKNET databases and the ASR models. The earlier IITG spoken dialogue system suffered from a high WER because of the different noises present in the collected speech data. To overcome this low accuracy, the authors developed a robust noise elimination algorithm and introduced it before the MFCC feature extraction stage. The earlier system was modeled using the GMM-HMM based technique; to improve the accuracy, the authors built the ASR models using SGMM and DNN based modeling techniques, and also compared the online and offline speech recognition accuracies. An improved version of Sphinx, the Sphinx-4 framework, was developed in Walker et al. (2004). Sphinx-4 is very robust for the development of language and acoustic models; it is modular, extensible, portable and flexible, and the framework and its additional packages are freely available on the Internet. A continuous large-vocabulary ASR system based on DNNs was presented in Popovic (2015). The ASR models were built using Kaldi. The DNNs were implemented based on the principle of restricted Boltzmann machines, and GMMs were used to represent the HMM states' emission densities. 90 h of continuous Serbian speech data was used for training the system, and the performances of the GMM-HMM based models and the DNN models were compared so that the best model could be used in the continuous ASR system. Improvements to an Arabic ASR system were presented in Nahar and Squeir (2016). The authors introduced a novel hybrid approach, a combination of learning vector quantization and HMM, to increase the speech recognition accuracy; the algorithm was mainly intended to recognize the phonemes in continuous speech, and a TV news speech corpus was used for system training and testing.
The achieved speech recognition accuracies were 98.49% and 90% for independent and dependent training, respectively. A new modeling technique was developed in Ansari and Seyyedsalehi (2016) for building robust ASR models: a modular deep neural network (MDNN) was introduced for speech recognition, whose pre-training mainly depends on its structure, and two speaker adaptation techniques were also implemented in that work. For system training and testing, two speech databases were used, and the achieved WERs were 7.3% and 10.6% for the MDNN and HMM, respectively. ASR models built using recurrent neural networks for the Russian language were presented in detail in Kipyatkova and Karpov (2017). These ANNs were used to develop a robust continuous ASR system for Russian, with the neural network models constructed according to the number of elements in the hidden layer. Unsupervised learning models based on RBMs were introduced in Sailor and Patil (2016). The authors experimented with both MFCC features and filterbank features on a large vocabulary of continuous speech data, using the AURORA-4 speech dataset, and the proposed filterbank provided better performance than the traditional MFCC feature extraction technique. Traditional frame- or segment-based speech enhancement methods require considerable knowledge about the different noises. A new algorithm was introduced for eliminating different types of noise in corrupted speech data Ming and Crookes (2017). The authors used the zero-mean normalized correlation coefficient as a comparison measure; the algorithm avoids the need for prior knowledge about the noises present in the speech data and outperforms conventional speech enhancement methods. The performances of the conventional and proposed methods were compared using objective measures as well as ASR. A speech enhancement algorithm based on missing feature theory was developed in Van Segbroeck and Van Hamme (2011) to improve the accuracy of an ASR system. The missing feature theory was applied in the log-spectral domain to static and dynamic features, and a maximum likelihood technique was integrated with it to estimate the channel; the Aurora-4 speech database was used for the experiments. For structured classification, discriminative models were used for speech recognition in Zhang et al. (2010). The authors developed a set of structured log-linear models to build a robust ASR system; the main advantage of log-linear models lies in their features. The proposed method combined kernels, efficient lattice margin training, discriminative models and a noise compensation technique, and the Aurora-2 speech database was used for the experiments.

The main contributions of this work are as follows:

  • Deriving and studying the effectiveness of existing and newly proposed noise elimination techniques for a practically deployable ASR system.

  • Increasing the size of the speech database by collecting 50 h of farmers' speech data and creating a complete dictionary and phoneme set for the Kannada language.

  • Exploring the efficacy of the SGMM and DNN for a moderate-vocabulary ASR task.

  • Improving the online and offline (word error rates (WERs) of the ASR models) speech recognition accuracy of the Kannada spoken dialogue system.

  • Testing the newly developed spoken dialogue system with farmers of Karnataka under uncontrolled environments.

The rest of the work is organized as follows: Sect. 2 describes the speech database collection and its preparation. The background noise elimination by fusing the SS-VAD and the MMSE-SPZC estimator is described in Sect. 3. The creation of ASR models using Kaldi is described in Sect. 4. The effectiveness of the SGMM and DNN is described in Sect. 5. The experimental results and analysis are discussed in Sect. 6. Section 7 gives the conclusions.

2 Speech database collection from farmers

The basic block diagram of the different steps involved in the development of the improved Kannada spoken dialogue system is given in Fig. 1.

Fig. 1

Block diagram of the different steps involved in the development of the Kannada spoken query system

An Asterisk server is used in this work as an interface between the user/farmer and the IVRS call flow, as shown in Fig. 2.

Fig. 2

An integration of ASR system and Asterisk

To increase the speech database, speech data from another 500 farmers is collected in addition to the earlier speech database. The training and testing data sets consist of 70757 and 2180 isolated word utterances, respectively. They include the names of districts, mandis and different types of commodities as per the AGMARKNET list for the Karnataka section. The performance of the entire ASR system is estimated on the overall speech data.

3 Combination of SS-VAD and MMSE-SPZC estimator for speech enhancement

A noise reduction technique is proposed for speech enhancement which is a combination of the SS-VAD and the MMSE-SPZC estimator. Consider an original speech signal s(t) which is degraded by additive background noise d(t). The degraded speech c(t) can be represented as follows:

$$\begin{aligned} c(t)=s(t)+d(t) \end{aligned}$$
(1)

The resultant corrupted speech signal, after sampling, can be written in discrete time as follows:

$$\begin{aligned} c(n)=c(nT_s) \end{aligned}$$
(2)

where \({T_s}\) is the sampling interval, related to the sampling frequency \(f_s\) by

$$\begin{aligned} f_s=\frac{1}{T_s} \end{aligned}$$
(3)

The STFT of c(n) can be written as

$$\begin{aligned} C(w_k)= S(w_k)+D(w_k) \end{aligned}$$
(4)

The polar form of the above equation can be written as follows:

$$\begin{aligned} C_ke^{j\theta _c(k)}=S_ke^{j\theta _s(k)}+D_ke^{j\theta _d(k)} \end{aligned}$$
(5)

where \(\{ C_k, S_k, D_k \}\) are the magnitudes and \(\{\theta _c(k), \theta _s(k), \theta _d(k)\}\) are the phases. The power spectrum of the degraded speech signal can be written as follows:

$$\begin{aligned} P_c(w)= P_s(w)+P_d(w) \end{aligned}$$
(6)

Equation (6) can be written as follows:

$$\begin{aligned} C_k^2 \approx S_k^2+D_k^2 \end{aligned}$$
(7)

The pdfs of \({S_k^2}\) and \({D_k^2}\) are given as follows:

$$\begin{aligned} f_{S_k^2}= & {} \frac{1}{\sigma _s^2(k)} \; e^{-\frac{S_k^2}{\sigma _s^2(k)}} \end{aligned}$$
(8)
$$\begin{aligned} f_{D_k^2}= & {} \frac{1}{\sigma _d^2(k)} \; e^{-\frac{D_k^2}{\sigma _d^2(k)}} \end{aligned}$$
(9)

where \(\sigma _s^2(k)\) and \(\sigma _d^2(k)\) can be written as follows:

$$\begin{aligned} \sigma _s^2(k) \equiv E\{S^2_k\}, \; \; \sigma _d^2(k) \equiv E\{D^2_k\} \end{aligned}$$
(10)

The posterior probability of the speech magnitude-squared spectrum is evaluated using Bayes' theorem as shown below.

$$\begin{aligned} f_{S_k^2}\big (S_k^2\mid C_k^2\big )= & {} \frac{f_{C_k^2}(C_k^2\mid S_k^2) f_{S_k^2}(S_k^2) }{f_{C_k^2}(C_k^2)} \end{aligned}$$
(11)
$$\begin{aligned} f_{S_k^2}\big (S_k^2\mid C_k^2\big )= & {} \left\{ \begin{array}{cl} \Psi _k \, e^{-\frac{S_k^2}{\lambda (k)}} &{} if\; \sigma _s^2(k) \ne \sigma _d^2(k) \\ \frac{1}{C_k^2} &{} if\; \sigma _s^2(k) = \sigma _d^2(k) \end{array}\right. \end{aligned}$$
(12)

where \(S_k^2 \in \big [0, C_k^2\big ]\), and \(\lambda (k)\) is defined in the equation below.

$$\begin{aligned} \frac{1}{\lambda (k)} \equiv \frac{1}{\sigma _s^2(k)}- \frac{1}{\sigma _d^2(k)} \; if \; \sigma _s^2(k) \ne \sigma _d^2(k) \end{aligned}$$
(13)

and

$$\begin{aligned} \Psi _k \equiv \frac{1}{\lambda (k)\Bigg \{ 1-exp\bigg [-\frac{C_k^2}{\lambda (k)}\bigg ]\Bigg \}} \end{aligned}$$
(14)

Note: if \(\sigma _s^2(k) > \sigma _d^2(k)\), then \(\frac{1}{\lambda (k)}\) is less than 0, and vice versa. Hence, \(\Psi _k\) in Eq. (12) is positive.

3.1 Spectral subtraction method with VAD

The block diagram of the spectral subtraction algorithm is given in Fig. 3. The spectral subtraction algorithm is coupled with a VAD which is used to find the voiced regions in the speech signal. Consider s(n), d(n) and c(n) to be the original speech, the additive noise and the corrupted speech signal, respectively. The steps shown in Fig. 3 are followed to enhance the corrupted speech data. The linear prediction error, denoted L, is combined with the energy of the speech signal E and the zero crossing rate Z. The VAD term Y can then be represented as follows:

$$\begin{aligned} Y= & {} E(1-Z)(1-E) \quad \text {for a single frame} \end{aligned}$$
(15)
$$\begin{aligned} Y_{max}= & {} \max (Y) \quad \text {over all frames} \end{aligned}$$
(16)

The output of spectral subtraction with VAD algorithm can be represented as follows:

$$\begin{aligned} |X_i(w)|=|C_i(w)|-|\mu _i(w)| \end{aligned}$$
(17)
Fig. 3

Schematic representation of SS-VAD method

Negative values in the output speech spectrum are set to zero using half-wave rectification. To further attenuate the signal during non-speech activity, a residual noise reduction process is used, which improves the quality of the enhanced speech signal.
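To make these steps concrete, the following minimal Python sketch implements magnitude spectral subtraction driven by a simple energy/zero-crossing VAD with half-wave rectification (Eq. 17). The frame length, VAD thresholds and noise-update rule are illustrative assumptions, not the exact settings used in this work.

```python
import numpy as np

def ss_vad_enhance(c, frame_len=320, n_fft=512):
    """Illustrative spectral subtraction with a simple energy/ZCR VAD.

    c : 1-D numpy array holding the noisy speech signal.
    Frames flagged as non-speech update the noise magnitude estimate;
    the estimate is then subtracted from every frame (Eq. 17) and the
    result is half-wave rectified.
    """
    n_frames = len(c) // frame_len
    frames = c[:n_frames * frame_len].reshape(n_frames, frame_len)

    energy = np.sum(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Simple VAD decision: low energy and high zero-crossing rate -> noise frame.
    is_noise = (energy < 0.5 * np.median(energy)) & (zcr > np.median(zcr))

    spec = np.fft.rfft(frames * np.hamming(frame_len), n=n_fft, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Noise magnitude estimate mu_i(w) from the noise-only frames (or a floor).
    mu = mag[is_noise].mean(axis=0) if np.any(is_noise) else mag.min(axis=0)

    # |X_i(w)| = |C_i(w)| - |mu_i(w)|, half-wave rectified to avoid negatives.
    x_mag = np.maximum(mag - mu, 0.0)

    enhanced = np.fft.irfft(x_mag * np.exp(1j * phase), n=n_fft, axis=1)
    return enhanced[:, :frame_len].reshape(-1)

# Example: enhance one second of synthetic noisy speech at 8 kHz.
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000) + 0.3 * rng.standard_normal(8000)
clean_est = ss_vad_enhance(noisy)
```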

3.2 MSS estimators

The following three types of magnitude squared spectrum estimators (MSSE) are studied, implemented and their performances compared.

3.2.1 MMSE-SP estimator

In Wolfe and Godsill (2001), the authors proposed a technique called the MMSE-SP estimator. The estimate of the clean speech magnitude-squared spectrum can be written as follows:

$$\begin{aligned} \hat{S^2_K}= & {} E\big \{{S^2_k|C(w_k)}\big \} \end{aligned}$$
(18)
$$\begin{aligned} \hat{S^2_K}= & {} \int _0^\infty \!\! S^2_k f_{S_k}(S_k|C(w_k)) \; dS_k \end{aligned}$$
(19)
$$\begin{aligned} \hat{S^2_K}= & {} \frac{\xi _k}{1+ \xi _k}\bigg (\frac{1}{\gamma _k}+ \frac{\xi _k}{1+ \xi _k}\bigg ) C^2_k \end{aligned}$$
(20)

where \(\xi _k\) and \(\gamma _k\) represent the a priori and a posteriori SNRs, respectively.

$$\begin{aligned} \xi _k\equiv \frac{\sigma ^2_s(k)}{\sigma ^2_d(k)}, \; \gamma _k \equiv \frac{C^2_k}{\sigma ^2_d(k)} \end{aligned}$$
(21)

The function \(f_{S_k}(S_k|C(w_k))\) can be written as follows:

$$\begin{aligned} f_{S_k}(S_k|C(w_k)) = \frac{S_k}{\sigma ^2_k} exp \Bigg ( -\frac{S^2_k+u^2_k}{2\sigma ^2_k}\Bigg ) I_0 \Bigg ( \frac{S_ku_k}{\sigma ^2_k}\Bigg ) \end{aligned}$$
(22)

where

$$\begin{aligned}&\frac{1}{\lambda '(k)}\equiv \frac{1}{\sigma ^2_s(k)} + \frac{1}{\sigma ^2_d(k)} \end{aligned}$$
(23)
$$\begin{aligned}&v_k\equiv \frac{\xi _k}{1+\xi _k} \gamma _k \end{aligned}$$
(24)
$$\begin{aligned}&\sigma ^2_k \equiv \frac{\lambda '(k)}{2}\; and \; u^2_k \equiv v_k \lambda '(k) \end{aligned}$$
(25)

3.2.2 MMSE-SPZC estimator

Another important magnitude-squared spectrum estimator is the MMSE-SPZC. Building on the work presented in Jounghoon and Hanseok (2003) and Cole et al. (2008), the MMSE-SPZC estimator was derived in Lu and Loizou (2011).

$$\begin{aligned} \hat{S^2_k}= & {} E\big \{S^2_K|C^2_k\big \} \end{aligned}$$
(26)
$$\begin{aligned} \hat{S^2_K}= & {} \int _0^{C^2_k} \!\! S^2_k f_{S^2_k}\big (S^2_k|C^2_k\big ) dS^2_k \end{aligned}$$
(27)
$$\begin{aligned} \hat{S^2_K}= & {} \left\{ \begin{array}{cl} \Big ( \frac{1}{v_k} -\frac{1}{e^{v_k}-1}\Big )C^2_k, &{}\quad if\; \sigma ^2_s(k) \ne \sigma ^2_d(k)\\ \frac{1}{2} {C^2_k}, &{}\quad if \; \sigma ^2_s(k) = \sigma ^2_d(k) \end{array}\right. \end{aligned}$$
(28)

where \(v_k\) can be written as

$$\begin{aligned} v_k\equiv \frac{1-\xi _k}{\xi _k} \gamma _k \end{aligned}$$
(29)

The gain equation of the above estimator can be represented mathematically as follows:

$$\begin{aligned} G_{MMSE} (\xi _k, \gamma _k)= \left\{ \begin{array}{cl} \Big ( \frac{1}{v_k}- \frac{1}{e^{v_k}-1}\Big )^ \frac{1}{2} &{}\quad if \; \sigma ^2_s(k) \ne \sigma ^2_d(k)\\ \big (\frac{1}{2}\big )^ \frac{1}{2} &{}\quad if \; \sigma ^2_s(k) = \sigma ^2_d(k) \end{array}\right. \end{aligned}$$
(30)
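A small numerical sketch of the MMSE-SPZC gain of Eq. (30) is given below, assuming the a priori and a posteriori SNRs \(\xi _k\) and \(\gamma _k\) are already available (e.g., from a decision-directed estimate); the example values are only illustrative.

```python
import numpy as np

def mmse_spzc_gain(xi, gamma, eps=1e-8):
    """MMSE-SPZC gain per frequency bin (Eq. 30).

    xi    : a priori SNR estimates, one per bin.
    gamma : a posteriori SNR estimates, one per bin.
    When sigma_s^2 == sigma_d^2 (i.e. xi == 1) the gain falls back to sqrt(1/2).
    """
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    # v_k as defined in Eq. (29).
    v = (1.0 - xi) / np.maximum(xi, eps) * gamma
    equal = np.isclose(xi, 1.0)
    v = np.where(equal, 1.0, v)                 # placeholder where xi == 1, replaced below
    gain_sq = 1.0 / v - 1.0 / (np.expm1(v) + eps)
    gain_sq = np.where(equal, 0.5, gain_sq)
    return np.sqrt(np.clip(gain_sq, 0.0, None))

# The estimated magnitude-squared spectrum is then S_hat^2 = G^2 * C^2.
xi = np.array([0.5, 1.0, 4.0])
gamma = np.array([1.2, 1.0, 5.0])
print(mmse_spzc_gain(xi, gamma))
```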

3.2.3 MAP estimator

The MAP estimator can be written as shown below:

$$\begin{aligned} \hat{S^2_k}= arg\,max \; f_{S^2_k} \big (S^2_k|C^2_k\big ) \end{aligned}$$
(31)

where the maximization is carried out with respect to \(S^2_k\):

$$\begin{aligned} \hat{S^2_k}= & {} \left\{ \begin{array}{cl} C^2_k &{}\quad if\; \frac{1}{\lambda (k)}<0\\ 0 &{}\quad if \; \frac{1}{\lambda (k)}>0 \end{array}\right. \end{aligned}$$
(32)
$$\begin{aligned} \hat{S^2_k}= & {} \left\{ \begin{array}{cl} C^2_k &{}\quad if\; \sigma ^2_s(k) \ge \sigma ^2_d(k)\\ 0 &{}\quad if \; \sigma ^2_s(k) < \sigma ^2_d(k) \end{array}\right. \end{aligned}$$
(33)

The MAP estimator gain function can be represented as follows:

$$\begin{aligned} G_{MAP}(k)= \left\{ \begin{array}{cl} 1 &{}\quad if \; \sigma ^2_s(k) \ge \sigma ^2_d(k)\\ 0 &{}\quad if \; \sigma ^2_s(k) < \sigma ^2_d(k) \end{array}\right. \end{aligned}$$
(34)

By using Eq. (21), the above MAP gain function can also be represented as:

$$\begin{aligned} G_{MAP}(\xi _k)= \left\{ \begin{array}{cl} 1 &{}\quad if \; \xi _k \ge 1\\ 0 &{}\quad if \; \xi _k < 1 \end{array}\right. \end{aligned}$$
(35)

3.3 Measures of performance and analysis

Standard measures are used to evaluate the performance of the proposed and existing speech enhancement methods: composite measures and the perceptual evaluation of speech quality (PESQ).

The rating scales of the composite measures are shown in Table 1. The three important composite measures Hu and Loizou (2007) are:

  • Speech signal distortion (s).

  • Background noise distortion (b).

  • Overall speech signal quality (o).
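Assuming the open-source Python packages pesq (an implementation of ITU-T P.862) and soundfile are available, PESQ scores such as those reported in Tables 2-13 can be obtained as sketched below; the file names are placeholders.

```python
# Minimal PESQ scoring sketch; file names below are placeholders.
import soundfile as sf
from pesq import pesq

clean, fs = sf.read("clean_utterance.wav")       # reference signal
enhanced, _ = sf.read("enhanced_utterance.wav")  # output of the enhancement stage

# 'nb' selects narrow-band PESQ (8 kHz); use 'wb' for 16 kHz wide-band speech.
score = pesq(fs, clean, enhanced, "nb")
print(f"PESQ = {score:.2f}")
```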

3.4 Performance evaluation of existing methods

The TIMIT and Kannada speech databases are used to conduct the experiments. The speech sentences are corrupted by musical, car, babble and street noises. The performance evaluation of the existing and proposed methods is presented in this section.

3.4.1 SS-VAD method performance evaluation

Tables 2 and 3 show the performance evaluation of the SS-VAD technique in terms of PESQ for both databases. Musical noise is only weakly suppressed for both databases using the SS-VAD technique. The performance evaluation of the same method using composite measures for both databases is shown in Tables 4 and 5. It can be inferred from the tables that there is little improvement for the speech data degraded by musical noise.

3.4.2 Performance evaluation of MSS estimators

Tables 6 and 7 show the performance of the MSS estimators in terms of PESQ for both databases. The speech quality is lowest for the speech data degraded by babble noise compared to the other noise types, as shown in Tables 8 and 9. Among the three MSS estimators, the MMSE-SPZC estimator gives the best performance for the degraded speech data, as shown in Tables 6, 7, 8 and 9. From the experimental analysis, it can therefore be inferred that the SS-VAD method gives poor results for speech degraded by musical noise, while the MMSE-SPZC technique gives poor speech quality for speech degraded by babble noise. Both the musical and babble noises can be suppressed effectively by combining the SS-VAD and the MMSE-SPZC estimator.

3.5 Combined SS-VAD and MMSE-SPZC estimator for speech enhancement

The flowchart of the proposed method is shown in Fig. 4. The output of the SS-VAD is given as input to the MMSE-SPZC estimator because the SS-VAD alone provides little improvement against musical noise.

Fig. 4

Flow chart of proposed combined SS-VAD and MMSE-SPZC estimator for speech enhancement

The output of SS-VAD can be represented as follows:

$$\begin{aligned} |X_i(w)|=|C_i(w)|-|\mu _i(w)| \end{aligned}$$
(36)

The MMSE-SPZC estimator receives the output of the SS-VAD, and it can be derived by considering the corresponding pdf, as shown in the equations below.

$$\begin{aligned} \hat{X}^2_{k}= & {} E\big \{X^2_K|Y^2_k\big \} \end{aligned}$$
(37)
$$\begin{aligned} \hat{X}^2_{K}= & {} \int _0^{Y^2_k} \!\! X^2_k f_{X^2_k}(X^2_k|Y^2_k) dX^2_k \end{aligned}$$
(38)
$$\begin{aligned} \hat{X}^2_{K}= & {} \left\{ \begin{array}{cl} \Big ( \frac{1}{v_k} -\frac{1}{e^{v_k}-1}\Big )Y^2_k &{} if \; \sigma ^2_x(k) \ne \sigma ^2_d(k)\\ \frac{1}{2} Y^2_k &{} if \; \sigma ^2_x(k) = \sigma ^2_d(k) \end{array}\right. \end{aligned}$$
(39)

The gain function of proposed estimator can be represented as follows:

$$\begin{aligned} G_{MMSE} (\xi _k, \gamma _k)= \left\{ \begin{array}{cl} \Big ( \frac{1}{v_k}- \frac{1}{e^{v_k}-1}\Big )^ \frac{1}{2} &{}\quad if\;\; \sigma ^2_x(k) \ne \sigma ^2_d(k)\\ \big (\frac{1}{2}\big )^ \frac{1}{2} &{}\quad if\;\; \sigma ^2_x(k) = \sigma ^2_d(k) \end{array}\right. \end{aligned}$$
(40)

Tables 10 and 11 show the experimental results of the proposed method in terms of PESQ for both databases. The improvement in speech quality and intelligibility after enhancement with the proposed method, in terms of the composite measures, is shown in Tables 12 and 13. The experimental results show that the SS-VAD algorithm yields better results for speech data corrupted by background, vocal, car, street and babble noise, but not for musical noise. The MMSE-SPZC estimator gives better results for speech data corrupted by musical, background and street noise, but not for babble noise. Since the collected speech data was heavily degraded by both musical and babble noise, the two algorithms are combined to improve the quality of speech degraded by babble, musical and other types of noise. From the tables, it can be inferred that the proposed method significantly reduces the babble, musical and other types of noise in both speech databases compared to the individual methods. Hence, the proposed noise elimination algorithm can be used for speech enhancement in the Kannada spoken query system to improve the speech recognition accuracy under uncontrolled environments.
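The following sketch chains the two stages of Fig. 4, reusing the illustrative ss_vad_enhance() and mmse_spzc_gain() helpers from the earlier sketches; the decision-directed a priori SNR estimate and the assumption that the first few frames are noise-only are simplifications for illustration.

```python
import numpy as np

def enhance_combined(noisy, frame_len=320, n_fft=512, alpha=0.98):
    """Chain of Fig. 4: SS-VAD first, then MMSE-SPZC on its output.

    Assumes ss_vad_enhance() and mmse_spzc_gain() from the sketches above
    are in scope. A decision-directed rule estimates the a priori SNR.
    """
    stage1 = ss_vad_enhance(noisy, frame_len, n_fft)

    n_frames = len(stage1) // frame_len
    frames = stage1[:n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.fft.rfft(frames * np.hamming(frame_len), n=n_fft, axis=1)
    mag2, phase = np.abs(spec) ** 2, np.angle(spec)

    noise_psd = mag2[:5].mean(axis=0) + 1e-10   # first frames assumed noise-only
    prev_s2 = np.zeros(mag2.shape[1])
    out = np.zeros_like(frames)
    for i in range(n_frames):
        gamma = mag2[i] / noise_psd
        xi = alpha * prev_s2 / noise_psd + (1 - alpha) * np.maximum(gamma - 1, 0)
        gain = mmse_spzc_gain(np.maximum(xi, 1e-6), gamma)
        s2 = (gain ** 2) * mag2[i]               # estimated clean magnitude-squared spectrum
        prev_s2 = s2
        frame = np.fft.irfft(np.sqrt(s2) * np.exp(1j * phase[i]), n=n_fft)
        out[i] = frame[:frame_len]
    return out.reshape(-1)
```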

Table 1 The scales of ratings of composite measures
Table 2 Evaluation of performance of SS-VAD technique in terms of PESQ for TIMIT database
Table 3 Evaluation of performance of SS-VAD technique in terms of PESQ for Kannada database
Table 4 Evaluation of performance of SS-VAD method in terms of composite measure for TIMIT database
Table 5 Evaluation of performance of SS-VAD method in terms of composite measure for Kannada database
Table 6 Evaluation of performance of MSS estimators in terms of PESQ for TIMIT database
Table 7 Evaluation of performance of MSS estimators in terms of PESQ for Kannada database
Table 8 Evaluation of performance of MSS estimators in terms of composite measure for TIMIT database
Table 9 Evaluation of performance of MSS estimators in terms of composite measure for Kannada database
Table 10 Evaluation of performance of combined SS-VAD and MMSE-SPZC estimator in terms of PESQ for TIMIT database
Table 11 Evaluation of performance of combined SS-VAD and MMSE-SPZC estimator in terms of PESQ for Kannada database
Table 12 Evaluation of performance of combined SS-VAD and MMSE-SPZC method in terms of composite measure for TIMIT database
Table 13 Evaluation of performance of combined SS-VAD and MMSE-SPZC method in terms of composite measure for Kannada database

4 Creation of ASR models using Kaldi for noisy and enhanced speech data

The development of ASR models includes several steps. They are as follows:

  • Transcription and validation of enhanced speech data.

  • Development of lexicon and Kannada phoneme set.

  • Extraction of MFCC features.

4.1 Transcription and validation

Transcription is the process of converting the content of a speech file into its equivalent word format, also called word-level transcription. The schematic representation of various speech wave files and their equivalent transcriptions is shown in Fig. 5.

Fig. 5

Transcription of speech data

As observed in the figure, the tags \(<\hbox {s}>\) and \(</\hbox {s}>\) indicate the start and end of a speech sentence/utterance. The following types of tags are used in the transcription:

  • \(<\hbox {music}>\): Used only when the speech file is degraded by music noise.

  • \(<\hbox {babble}>\): Used only when the speech file is degraded by babble noise.

  • \(<\hbox {bn}>\): Used only when the speech file is degraded by background noise.

  • \(<\hbox {street}>\): Used only when the speech file is degraded by street noise.

If the transcription of a particular speech file is done wrongly, it is corrected using the validation tool shown in Fig. 6. For example, the speech file dhanya is degraded by background noise, babble noise, horn noise and musical noise, but the transcriber unknowingly transcribed it as \(<\mathbf{s}> <\mathbf{babble}> <\mathbf{horn}> \mathbf{dhanya}<\mathbf{bn}> </\mathbf{s}>\) only. While cross-checking the transcribed speech data, the validator listened to the speech file again and found that it is also degraded by musical noise. Therefore, it is validated as \(<\mathbf{s}> <\mathbf{babble}> <\mathbf{horn}> \mathbf{dhanya}<\mathbf{bn}> <\mathbf{music}> </\mathbf{s}>\), as shown in Fig. 6.
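As an illustration of this validation step, the hypothetical helper below checks that a transcription line is wrapped in <s>...</s> and lists the noise tags it contains; it is only a sketch of the convention described above, not the actual validation tool.

```python
ALLOWED_TAGS = {"<music>", "<babble>", "<bn>", "<street>", "<horn>"}

def check_transcription(line):
    """Return (is_valid, tags_found, words) for one transcription line."""
    tokens = line.split()
    is_valid = len(tokens) >= 2 and tokens[0] == "<s>" and tokens[-1] == "</s>"
    body = tokens[1:-1] if is_valid else tokens
    tags = [t for t in body if t.startswith("<")]
    words = [t for t in body if not t.startswith("<")]
    # Any tag outside the agreed set indicates a transcription error.
    is_valid = is_valid and all(t in ALLOWED_TAGS for t in tags)
    return is_valid, tags, words

# The example from the text, after the validator adds the missing <music> tag.
print(check_transcription("<s> <babble> <horn> dhanya <bn> <music> </s>"))
```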

Fig. 6

Validation tool to validate the transcribed speech data

4.2 Kannada phoneme set and corresponding dictionary creation

Karnataka is one of the states of India; about 6 crore people live in the state and speak Kannada fluently. The Kannada language has 46 phonetic symbols and is one of the most widely spoken Dravidian languages. The Kannada phoneme set, the Indian language speech sound label 12 (ILSL12) set used for Kannada phonemes, and the corresponding dictionary are described in Tables 14, 15 and 16, respectively.

Table 14 The labels used from the Indic Language Transliteration Tool (IT3 to UTF-8) for Kannada phonemes
Table 15 The labels used from ILSL12 for Kannada phonemes
Table 16 Dictionary/lexicon for some of districts, mandis and commodities enlisted in AGMARKNET

4.3 MFCC features extraction

Once the noise elimination algorithm is applied on the training and testing data sets, the next step is to extract the MFCC features from the noisy and enhanced speech data. The basic block diagram of MFCC feature extraction is shown in Fig. 7. The Kannada speech data used for the experiments is large, and most of the speech information is present in the first 13 coefficients. Therefore, to minimize the complexity of the algorithm, 13-dimensional MFCC coefficients have been used in this work; an illustrative extraction sketch follows the parameter list below.

Fig. 7

Block diagram of MFCC features extraction

The parameters used for MFCC features extraction are as follows:

  • Window used: Hamming window.

  • Window length: 20 ms.

  • Pre-emphasis factor: 0.97.

  • MFCC coefficients: 13 dimensional.

  • Filter bank used: 21 Channel filter bank.
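The sketch below illustrates this configuration using the open-source python_speech_features package (20 ms Hamming window, 0.97 pre-emphasis, 21-channel filter bank, 13 coefficients); the actual features are extracted inside Kaldi, and the 10 ms frame shift and the file name used here are assumptions.

```python
import numpy as np
import soundfile as sf
from python_speech_features import mfcc

# Placeholder file name; any Kannada utterance at 8 or 16 kHz would work here.
signal, rate = sf.read("farmer_utterance.wav")

features = mfcc(
    signal,
    samplerate=rate,
    winlen=0.020,          # 20 ms analysis window
    winstep=0.010,         # 10 ms hop (an assumption; not stated above)
    numcep=13,             # 13-dimensional MFCCs
    nfilt=21,              # 21-channel Mel filter bank
    preemph=0.97,          # pre-emphasis factor
    winfunc=np.hamming,    # Hamming window
)
print(features.shape)      # (num_frames, 13)
```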

5 SGMM and DNN

The SGMM and DNN ASR modeling techniques are described in this section.

5.1 SGMM

ASR systems based on the GMM-HMM structure usually involve training an individual GMM in every HMM state. A newer modeling technique introduced to the speech recognition domain is the SGMM Povey et al. (2011). In the conventional GMM-HMM acoustic modeling technique, dedicated multivariate Gaussian mixtures are used for state-level modeling, so no parameters are shared between states. In the SGMM modeling technique, the states are still represented by Gaussian mixtures, but these parameters share a common structure across states. The SGMM contains a GMM inside every context-dependent state, and a vector \(I_{i} \in V^r\) is specified for every state instead of defining the parameters directly.

An elementary form of the SGMM can be described by the following equations:

$$\begin{aligned} p(y|i)= & {} \sum _{k=1}^{L}w_{ik}N(y;\mu _{ik},\Sigma _k) \end{aligned}$$
(41)
$$\begin{aligned} \mu _{ik}= & {} M_k I_i \end{aligned}$$
(42)
$$\begin{aligned} w_{ik}= & {} \frac{\exp (w_k^T I_i)}{\sum _{k'=1}^{L}\exp (w_{k'}^T I_i)} \end{aligned}$$
(43)

where y \(\in R^D\) is a feature vector and i \(\in \{1, \ldots , I\}\) is the context-dependent state of the speech signal. The model of state i is a GMM with L Gaussians (L is typically between 200 and 2000), with covariance matrices \(\Sigma _k\) that are shared among states, mixture weights \(w_{ik}\) and means \(\mu _{ik}\). The parameters \(\mu _{ik}\) and \(w_{ik}\) are derived from \(I_i\) together with \(M_k\), \(w_k\) and \(\Sigma _k\). A detailed description of the SGMM parameterization and its impact is given in Rose et al. (2011). ASR models are developed using this modeling technique for the Kannada speech database, and the models with the least word error rate (WER) can be used in the spoken query system.
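A toy NumPy sketch of Eqs. (41)-(43) is given below, assuming the shared parameters \(M_k\), \(w_k\), \(\Sigma _k\) and the state vector \(I_i\) have already been estimated; the dimensions are deliberately tiny and randomly initialized for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, r, L = 13, 5, 4          # feature dim, subspace dim, shared Gaussians

# Globally shared SGMM parameters (Eqs. 42-43).
M = rng.standard_normal((L, D, r))      # mean-projection matrices M_k
w = rng.standard_normal((L, r))         # weight-projection vectors w_k
Sigma = np.stack([np.eye(D)] * L)       # shared covariances Sigma_k

def sgmm_state_likelihood(y, I_i):
    """p(y | i) for one context-dependent state with vector I_i (Eq. 41)."""
    mu = M @ I_i                                  # state means mu_ik = M_k I_i (Eq. 42)
    logits = w @ I_i
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax weights w_ik (Eq. 43)
    return sum(weights[k] * multivariate_normal.pdf(y, mu[k], Sigma[k])
               for k in range(L))

y = rng.standard_normal(D)              # one MFCC frame
I_i = rng.standard_normal(r)            # state-specific vector
print(sgmm_state_likelihood(y, I_i))
```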

5.2 DNN

The GMM-HMM based acoustic modeling approach is statistically inefficient for modeling speech data that lie on or near a nonlinear manifold in the data space. The major drawbacks of the GMM-HMM based acoustic modeling approach are discussed in Hinton et al. (2012). Artificial neural networks (ANN) are capable of modeling such data, but for a long time it was infeasible to train an ANN with many hidden layers using the back-propagation algorithm on large amounts of speech data, and an ANN with a single hidden layer failed to give good improvements over the GMM-HMM based acoustic modeling technique. Both of these limitations have been overcome by developments in the past few years, and various approaches are now available to train neural networks with many hidden layers.

A DNN consists of an input layer, many hidden layers and an output layer to model the speech data for building ASR systems. The posterior probabilities of the tied states are modeled by training the DNN, which yields better recognition performance than the conventional GMM-HMM acoustic modeling approach. The DNN is created by stacking layers of restricted Boltzmann machines. The restricted Boltzmann machine is an undirected model, as shown in Fig. 8.

Fig. 8

Block diagram of restricted Boltzmann machine

The model uses a single set of parameters W to define the joint probability of the visible variable vector v and the hidden variables h through an energy function E, as follows:

$$\begin{aligned} p(v,h;W)=\frac{1}{Z}e^{-E(v,h;W)} \end{aligned}$$
(44)

where Z is a function of partition which can be written as

$$\begin{aligned} Z=\sum _{v^{'},h^{'}}e^{-E(v^{'},h^{'};W)} \end{aligned}$$
(45)

where \(v^{'}\) and \(h^{'}\) are auxiliary variables used for the summation over the ranges of v and h. The unsupervised technique for modeling the connection weights in deep belief networks, which is approximately equivalent to training successive pairs of restricted Boltzmann machine layers, is described in detail in Hinton et al. (2006). The schematic representation of the context-dependent DNN-HMM hybrid architecture is shown in Fig. 9. The tied states (senones) are modeled by the context-dependent DNN-HMM. The MFCC features are given as input to the DNN input layer, and the output of the DNN is used with the HMM, which models the sequential properties of the speech.
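A small sketch of the RBM joint probability of Eqs. (44) and (45) for binary units is shown below; bias terms are omitted for brevity, and the partition function Z is computed by brute force, which is feasible only for the toy sizes used here.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                       # tiny visible/hidden layer sizes
W = rng.standard_normal((n_v, n_h))   # single parameter set W (biases omitted)

def energy(v, h, W):
    """E(v, h; W) = -v^T W h for a binary RBM without bias terms."""
    return -v @ W @ h

# Brute-force partition function Z over all binary configurations (Eq. 45).
states = lambda n: [np.array(s) for s in product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(v, h, W)) for v in states(n_v) for h in states(n_h))

def joint_prob(v, h):
    """p(v, h; W) = exp(-E(v, h; W)) / Z (Eq. 44)."""
    return np.exp(-energy(v, h, W)) / Z

print(joint_prob(np.array([1, 0, 1]), np.array([1, 0])))
```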

Fig. 9

Block diagram of hybrid DNN and HMM

6 Experimental results and analysis

The numbers of speech files used in the previous and present work for system training and testing are shown in Table 17.

Table 17 The speech files used for system training and testing

The ASR models are created at the following phoneme levels:

  • Monophone, Triphone1 (delta, delta-delta) training and decoding, Triphone2 (linear discriminant analysis (LDA) + maximum likelihood linear transform (MLLT)) training and decoding, and Triphone3 (LDA + MLLT + speaker adaptive training (SAT)) training and decoding.

  • SGMM training and decoding.

  • DNN hybrid training and decoding.

2000 senones and 4, 8 and 16 Gaussian mixtures are used in this work to build the ASR models, along with the two recently introduced modeling techniques, SGMM and DNN. Table 18 shows the WERs at different phoneme levels for the overall noisy speech data. WERs of 12.78% and 10.80% are achieved for SGMM and hybrid DNN training and decoding, respectively, for the overall noisy speech data.
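For reference, the WERs in Tables 18 and 19 follow the standard edit-distance definition sketched below (Kaldi computes this internally, e.g., with its compute-wer tool); the example strings are hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate = (substitutions + deletions + insertions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("bengaluru mandi ragi", "bengaluru mandya ragi"))  # 33.33
```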

Table 18 The WERs for overall noisy speech data

WERs of 11.77% and 9.60% are obtained for the overall enhanced speech data using SGMM training and decoding and hybrid DNN training and decoding, respectively, as shown in Table 19.

Table 19 The WERs for overall enhanced speech data

6.1 Call flow structure of spoken dialogue system

The ASR models in the earlier work Thimmaraja and Jayanna (2017) were developed at the monophone, triphone1 and triphone2 levels with 600 senones and 4, 8 and 16 GMMs. The achieved least WERs were 10.05%, 11.90%, 18.40% and 17.28% for districts, mandis, commodities and overall speech data, respectively. This led to poor recognition of commodities and mandis. To overcome this problem, in this work speech data from another 500 farmers is collected, the training data set is increased, and the noise elimination algorithm is applied on both the training and testing data sets to improve the accuracy of the ASR models. In the earlier work, separate spoken query systems were developed Thimmaraja and Jayanna (2017). In this work, the two call flow structures have been integrated into a single call flow to avoid the complexity of dialing multiple call flows. Therefore, the user/farmer can access both kinds of information in a single call by dialing the toll-free number. The earlier call flow structures of the spoken query systems are shown in Figs. 10 and 11, respectively.

Fig. 10

Call flow structure of commodity price information spoken query system (reproduced from Thimmaraja and Jayanna (2017) for continuity of this work)

Fig. 11

Call flow structure of weather information spoken query system (reproduced from Thimmaraja and Jayanna (2017) for completeness)

The schematic representation of new integrated call flow structure is shown in Fig. 12.

Fig. 12

Call flow structure of spoken query system to access both agricultural commodity prices and weather information

In this work, the ASR models are developed using monophone, tri1, tri2, tri3 (LDA + MLLT + SAT), SGMM and DNN for the overall noisy and enhanced speech data. From the tables, it can be observed that there is a clear improvement in speech recognition accuracy for the enhanced speech data: approximately 1.2% of accuracy is gained after noise reduction. The ASR models are developed on the overall speech data to reduce the complexity of the call flow structure of the spoken dialogue system, and the models with the least WER (those built on the overall enhanced speech data) can be used in the spoken dialogue system to improve its online recognition performance. The Kannada spoken query system needs to recognize 250 commodities including all their varieties; 28 districts, 95 mandis and 98 commodities are included in this work. The developed spoken query system enables the user/farmer to call the system. In the first step, the farmer prompts the district name. If the district is recognized, the system asks for the mandi name. If the mandi name is also recognized, the system asks the farmer to prompt the commodity name. If the commodity name is recognized, the system plays out the current price information of the requested commodity from the price information database. Similarly, to get weather information, the farmer prompts the district name in the first step; if the district is recognized, the system gives the current weather information through prerecorded prompts from the weather information database. If the district, mandi or commodity is not recognized, the system gives two more chances to prompt it again. If it is still not recognized, the system plays the prompt "Sorry! Try once again!".
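The hypothetical sketch below illustrates the dialogue logic just described (district, then mandi, then commodity, with up to three attempts per item); recognize(), play_prompt() and lookup_price() are placeholders for the Kaldi decoder, the Asterisk voice prompts and the AGMARKNET price database used in the deployed system.

```python
MAX_ATTEMPTS = 3

def ask(slot, recognize, play_prompt):
    """Ask for one slot (district/mandi/commodity) with up to three attempts."""
    for _ in range(MAX_ATTEMPTS):
        play_prompt(f"please_say_{slot}")
        value = recognize(slot)            # placeholder for the ASR decoder
        if value is not None:
            return value
    return None

def commodity_price_flow(recognize, play_prompt, lookup_price):
    """Simplified sketch of the integrated call flow of Fig. 12."""
    district = ask("district", recognize, play_prompt)
    mandi = ask("mandi", recognize, play_prompt) if district else None
    commodity = ask("commodity", recognize, play_prompt) if mandi else None
    if None in (district, mandi, commodity):
        play_prompt("sorry_try_once_again")
        return
    # Play the stored price prompt for the recognized district/mandi/commodity.
    play_prompt(lookup_price(district, mandi, commodity))

# Example wiring with trivial stand-ins for the real components.
commodity_price_flow(
    recognize=lambda slot: {"district": "bengaluru", "mandi": "yeshwanthpur",
                            "commodity": "ragi"}[slot],
    play_prompt=print,
    lookup_price=lambda d, m, c: f"price_prompt_{d}_{m}_{c}",
)
```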

The following error recovery mechanisms have been incorporated in the spoken query system:

  • Voice activity detection is used to detect poor responses from the users/speakers; in these cases, the speakers/farmers are asked to repeat the query or response more loudly.

  • The speakers/farmers are prompted to confirm with yes or no after every recognized speech utterance.

  • To maximize the confidence in the recognized output, three parallel decoders are used to recognize the user response, and their outputs are combined to generate the final response.

6.2 Testing of developed spoken query system from farmers in the field

The spoken dialogue system is rebuilt with the new ASR models. To check the online speech recognition accuracy of the newly developed spoken dialogue system, 300 farmers were asked to test the system under uncontrolled environments. Table 20 shows the performance evaluation of the newly developed spoken query system by the farmers. It can be observed from the table that there is a considerable improvement in online speech recognition accuracy, with fewer failures in recognizing the speech utterances compared to the earlier spoken query system. Therefore, it can be inferred that the online and offline (WERs of the models) recognition rates are almost the same, as shown in Tables 19 and 20.

Table 20 Performance evaluation of online speech recognition accuracy testing by farmers in field conditions

7 Conclusions

Improvements to the recently developed Kannada ASR system are demonstrated in this work. A robust noise elimination technique is proposed for canceling the different types of noise in the collected task-specific speech data. The proposed method gives better performance for the Kannada and TIMIT speech databases than the individual methods on the same databases. Alternative ASR models (SGMM and DNN modeling techniques) were developed using the Kaldi toolkit. The SGMM and DNN based ASR models are found to outperform the conventional GMM-HMM based acoustic models: there is an improvement in speech recognition accuracy of 7.68% compared to the earlier GMM-HMM based models. The achieved WERs are 12.78% and 10.80% for the noisy speech data using the SGMM and hybrid DNN based modeling techniques, and 11.77% and 9.60% for the enhanced speech data using the same modeling techniques. Therefore, it can be inferred that there is an accuracy improvement of about 1.2% for the enhanced speech data compared to the noisy speech data. The models with the least WER (the SGMM and DNN based models) can be used in the newly designed spoken query system. The earlier two spoken query systems are integrated into a single spoken query system, so the user/farmer can access the agricultural commodity prices and weather information in a single call flow. Testing of the Kannada ASR system by real farmers of Karnataka state in a real-time environment is also presented in this work.