Abstract
Automatic speech recognition is generally analyzed for two types of word utterances: isolated and continuous-word speech. Continuous-word speech is closest to the natural way of speaking but is difficult for machines (speech recognizers) to recognize. It is also highly sensitive to environmental variations. Various parameters directly affect the performance of automatic speech recognition, such as the size of the dataset/corpus, the type of dataset (isolated, spontaneous or continuous) and environmental variations (noisy/clean). The performance of speech recognizers is generally good in clean environments for isolated words, but it degrades in noisy environments, especially for continuous words/sentences, and remains a challenge. In this paper, a hybrid feature extraction technique is proposed by joining core blocks of perceptual linear prediction (PLP) and Mel frequency cepstral coefficients (MFCC) to improve the performance of speech recognizers under such circumstances. A voice activity detection (VAD)-based frame-dropping scheme has been used solely within the training part of the automatic speech recognition (ASR) procedure, obviating its need in actual deployments. The motivation for this scheme is the removal of pauses and distorted segments of speech, which further improves phoneme modeling. The proposed method shows an average performance improvement of 12.88% on standard datasets.
1 Introduction
Automatic speech recognition (ASR) is the process of converting a sequence of words spoken by a human being into text by means of machines. Automatic speech recognition can be classified into several forms depending on the type of utterance: isolated words, connected words, continuous speech and spontaneous speech. Continuous-word speech recognition is closest to the natural way of speaking. Several studies have therefore been conducted to improve the recognition accuracy for this speech format, but designing such recognizers is a very difficult task due to the absence of efficient techniques for detecting the start and end points of such speech [1].
The selection of language is a priority while designing a speech recognizer suitable for a country/region. For wide acceptance, a speech recognition system needs to be designed in the local language. In a country like India, the Hindi language is widely accepted and connects people across the country.
Most speech recognition systems have been designed for foreign languages such as English and Japanese. In India, the majority of people live in villages and depend on farming for their livelihoods. The Indian government runs many schemes to benefit villagers for the overall development of the country, but most people are not aware of them due to lack of education and of computer knowledge in English. Hence, automatic speech recognition in Hindi has the potential to solve their problems [2].
The objective of ASR is to act as a medium between man and machine, and it is expected to remain robust in varying environments [3]. This task becomes far more complex when continuous speech recognizers are designed for noisy environments. Defining speech boundaries in noisy environments is a serious problem due to the occurrence of non-speech events [4]. Speech recognition systems operating in noisy environments are often affected by ambient acoustic noise, which reduces their performance. This degradation is generally due to the mismatch between clean acoustic models and noisy speech data. Significant research efforts have been made to lessen this mismatch and recover recognition accuracy in noisy conditions [5].
In automatic speech recognition (ASR), noise robustness is ensured by several methods. The first method is to train the system directly on the noise that interferes during the recognition phase. Such a system is known as a matched system and generally performs far better than several noise compensation methods. However, adapting the system to new types of noise is a complex and time-consuming task, since re-training needs a lot of time. A more practical alternative to matched training is multi-condition training, in which the system is trained directly on noisy speech from the most common noise environments.
The necessity of re-training the system each time the background noise changes can therefore be avoided [3]. In [6], the authors explained the effects of noise in communication systems according to the sources of noise, the number and types of talkers and the listener's hearing ability, and provided research guidance for effective recognition in noisy environments. The authors in [7] outlined a set of challenges where optimization formulations and algorithms play an important role, and described various approaches to speech recognition and their optimization. The authors of [8] proposed an optimization technique based on the stochastic gradient descent algorithm to improve the performance of speech recognizers in noisy environments.
In [9], the authors developed an algorithm based on binary masking to separate speech from noise; this binary masking differs from the ideal binary mask, which needs a priori information about the premixed signals. The authors in [10] addressed the problem of distant speech recognition in noisy environments and proposed a non-negative matrix factorization (NMF) enhancement method to improve the robustness of ASR systems. A modified K-NN based algorithm was suggested in [11] for classification of large databases, improving detection accuracy by reducing the impact of noise. The author of [12] designed real-time educational software for signal and speech processing applications developed in MATLAB. In [13], the authors compared different feature extraction methods in noisy environments for isolated words; a Kalman filter was used to remove the background noise and enhance the speech signal. The authors in [14] developed a noise-robust distributed speech recognizer for real-world applications using cepstral mean normalization (CMN) for robust feature extraction. In [15], a modified framework using the Support Vector Machine algorithm was presented to detect different keyloggers installed or available on a PC for the security of information and datasets. In [16], the authors analysed the influence of window length and frame shift on speech recognition and concluded that a window length of 10 ms with a frame shift between 7.5 and 10 ms can increase the recognition rate by up to 2.5%. The authors of [17] compared different feature extraction techniques using a neural network as the classifier. Self-adapted diversity-based parameters were applied to the particle swarm optimization algorithm in [18] to obtain improved clusters for better detection.
In [19], the authors showed that a signal acquired through a throat microphone can improve speech recognition in noisy environments as compared to a conventional microphone. The authors in [20] presented a comparative analysis of different feature extraction techniques for isolated words in noisy environments. In [21], the authors reported a Social Spider Algorithm for global optimization of between-class variance to obtain improved thresholding. The authors of [22] proposed a novel method of speech segregation for unlabelled stationary noisy audio signals using the deep belief network (DBN) model; the method successfully segregates a music signal from noisy audio streams. Local feature-based supervised learning using Support Vector Machine classification was presented in [23] to recognize Thai characters. In [24], the authors provided a thorough study of the work done using deep learning between 2006 and 2018 in the field of automatic speech recognition.
It is observed from the above that not much work has been done on ASR of Hindi speech in noisy environments. Most of the reported studies were based on the Hidden Markov Model (HMM), the Gaussian Mixture Model and their hybridizations. Therefore, there is ample scope for using a Deep Neural Network (DNN) with hybrid features to further improve the accuracy of automatic speech recognition.
Various parameters affect the performance of automatic speech recognition, and noise is one of them. Improving the recognition rate in noisy environments is a tedious task, especially for continuous speech, since each word depends heavily on its neighbours to convey meaning; recognition therefore degrades further in the presence of noise. A speech presence probability (SPP) [25] based method is preferred for noise power spectrum estimation, as it is a good estimator of speech presence in both stationary and non-stationary environments. The first 20 frames are usually considered to initialize the power spectrum estimate. The objective of this study is to increase the stability of speech recognition systems in real-time reverberant environments. MFCC and PLP are widely used to model the human auditory system in automatic speech recognition. Both show almost comparable results for a small number of parameters, but PLP performs better for a large number of parameters [26]. In this paper, the coefficients of these two sets are combined to obtain better results.
This paper is organized in four different sections. First section introduces different types of speech recognition systems and discusses the present state-of-the-art. Second section presents the proposed methodology. Third section covers results and discussion. Fourth section concludes the findings followed by future scope.
2 Proposed methodology
The proposed methodology comprises the acquisition of datasets (including noisy ones), feature extraction and the proposed algorithm, as presented in the following subsections.
2.1 Speech datasets
Hindi speech signals of different speakers are recorded using Audacity 2.3.2, an open-source software package for audio editing and recording. For this purpose, 3 males and 3 females of different age groups were selected. The acquired datasets are described in Table 1.
A total of 600 voice samples were recorded. These were divided into two sets: 75% of the samples were used for training and the remaining 25% for testing. A WO Mic client interface was used to connect with the Audacity tool for sound recording. Audacity is a free sound recording tool with various options for clipping, storing and mixing sounds. Speech signals were recorded with the following parameters: sampling frequency = 16 kHz, coding technique = PCM, recording mode = mono and resolution = 16 bits per sample.
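These recording parameters can be illustrated programmatically. The Python sketch below (the paper's experiments used Audacity and MATLAB; this snippet, the file name and the test tone are purely illustrative assumptions) writes a one-second 16 kHz, mono, 16-bit PCM file with the standard-library `wave` module and reads the stored parameters back:

```python
import wave
import struct
import math

# Illustrative sketch: one second of a 440 Hz tone stored with the
# recording parameters used in the paper (16 kHz, mono, 16-bit PCM).
FS = 16000           # sampling frequency (Hz)
BITS = 16            # resolution (bits per sample)
path = "sample.wav"  # hypothetical file name

with wave.open(path, "wb") as w:
    w.setnchannels(1)          # mono recording
    w.setsampwidth(BITS // 8)  # 16-bit PCM -> 2 bytes per sample
    w.setframerate(FS)
    samples = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / FS))
               for n in range(FS)]
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Reading the file back confirms the stored parameters.
with wave.open(path, "rb") as r:
    print(r.getnchannels(), r.getsampwidth() * 8, r.getframerate())
```

Reading the parameters back in this way is a quick sanity check that a recorded corpus matches the intended format before feature extraction.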
Waveforms of a speech signal in Audacity (with several parameter settings) and MATLAB are shown in Figs. 1 and 2, respectively. The collected speech data are mixed with various noises that commonly exist in the environment (e.g., car, fan and diesel-engine noise) to study their effects on ASR. The noise files are obtained from online sources, as explained in the next subsection.
2.2 Noisy database
Noise samples of different types (car, diesel engine and fan) were obtained from www.freesound.org, a website hosting various sounds that can be used for research purposes. Noise from these sources is mixed with clean speech samples to obtain noisy speech with SNR values between 0 and 15 dB at equal intervals of 5 dB. These noisy speech signals are used to train and test the performance of the speech recognizers. Waveforms of the various noises are shown in Figs. 3, 4 and 5. The noise reduction option under the Effect menu of Audacity is used to adjust the signal-to-noise ratio.
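The mixing of clean speech with noise at a chosen SNR can be sketched as follows (a hedged NumPy illustration; the function name and the synthetic signals are assumptions, and the actual corpus was prepared with Audacity):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` and add it to `speech` so that the result has the
    requested signal-to-noise ratio (illustrative sketch)."""
    noise = noise[:len(speech)]          # trim noise to the speech length
    p_speech = np.mean(speech ** 2)      # average speech power
    p_noise = np.mean(noise ** 2)        # average noise power
    # Gain that makes p_speech / (gain^2 * p_noise) equal the target SNR.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # stand-in "speech"
noise = rng.standard_normal(16000)                           # stand-in noise
for snr in (0, 5, 10, 15):                                   # SNRs used in the paper
    noisy = mix_at_snr(speech, noise, snr)
```

Each pass through the loop produces one noisy copy of the utterance, mirroring the 0–15 dB corpus described above.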
A portion of noise can be added to the signal to obtain noisy speech data using two separate windows for the two signals. Voice activity detection (VAD) is applied as a filter to separate speech from non-speech and noise. After filtering, features are extracted from the speech data to train the recognizer using a deep neural network (DNN), which serves as the classifier in MATLAB. MFCC and PLP features are utilized individually and collectively to analyze the speech recognition rate with and without VAD. The steps to extract the different features are presented in the next subsection.
2.3 Feature extraction
Mel frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features are used collectively in this paper. Both feature extraction techniques model the human auditory system well. The mechanisms to extract these features are presented in the following subsections.
2.3.1 Mel frequency cepstral coefficients (MFCC)
The Mel frequency cepstral coefficient (MFCC) is a widely accepted frequency-domain feature extraction technique for modeling the human auditory system [27]. Human pitch perception is not linear in nature: it is approximately linear below 1 kHz and logarithmic above it, and can be well approximated using the Mel scale, as shown in Fig. 6.
Mel-scaled frequency domain features provide better modeling as compared to time domain features [28, 29]. MFCC features are extracted using the following steps:
1. Pre-emphasis: The recorded speech signal needs pre-emphasis to boost the signal energy at high frequencies, since the high-frequency components are more affected by noise than the lower ones. Hence, a suitable high-pass filter is needed to maintain the signal-to-noise ratio at high frequencies. The transfer function of this filter in the z-domain, H(z), is given as
$$ H\left( z \right) = 1 - 0.97z^{ - 1} , $$(1)
where 0.97 is the pre-emphasis factor.
2. Framing and windowing: The speech signal is non-stationary in nature but can be regarded as stationary over short intervals of time, so framing is applied to divide it into small overlapping pieces known as frames. Windowing is then performed to eliminate discontinuities at the frame edges. The Hamming window [30] performs this as
$$ W\left( n \right) = \left\{ {\begin{array}{*{20}l} {0.54 - 0.46\cos \left( {\frac{2\pi n}{N - 1}} \right)} & {0 \le n \le N - 1} \\ 0 & {\text{otherwise,}} \\ \end{array} } \right. $$(2)
where W(n) is the Hamming window and n indexes the N samples of a frame.
3. Fast Fourier Transform (FFT): The Fast Fourier Transform is applied to transform each frame into the frequency domain as \(X(k)\), defined by
$$ X(k) = \sum\limits_{n = 0}^{N - 1} {x(n)W_{N}^{nk} } , \quad 0 \le k \le N - 1, $$(3)
where N is the size of the FFT, x(n) is the input signal and \(W_{N} = e^{ - j2\pi /N}\).
4. Mel-scale conversion: The Mel scale has better adaptability to the human auditory system, so linear frequency is converted to the Mel-scale filter bank as
$$ M\left( f \right) = 2595\log_{10} \left( {1 + \frac{f}{700}} \right), $$(4)
where f is the frequency on the linear scale and M(f) is the Mel-scale frequency.
5. Discrete Cosine Transform (DCT): The DCT is performed on the log Mel spectrum of the previous output to decorrelate the filter outputs. Its coefficients are grouped with the log energy coefficient to form the final feature vector. The DCT is given as
$$ c_{n} = \sum\limits_{m = 1}^{M} {\log \left( {E_{m} } \right)\cos \left[ {\frac{\pi n}{M}\left( {m - 0.5} \right)} \right]} , $$(5)
where M is the number of triangular filters, \(E_{m}\) is the output of the m-th filter and m runs from 1 to M.
A set of 13 features (coefficients) is generated from the above steps: 12 coefficients are computed by the DCT and one energy feature is appended to them. The next 13 features are delta coefficients obtained from the first-order derivative, giving a feature set of 26 coefficients. The delta coefficients help to capture the dynamic nature of speech [31]. Finally, 20 coefficients are selected for analysis to reduce the complexity while describing the non-uniform nature of the speech.
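The five steps above can be sketched end-to-end as follows (an illustrative NumPy implementation; the frame length, hop, FFT size and filterbank size are assumed values, not parameters mandated by the paper, which used MATLAB):

```python
import numpy as np

# Standard Mel-scale mapping and its inverse, as in Eq. (4).
def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_filt=26, n_ceps=13):
    # (1) pre-emphasis with factor 0.97
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # (2) framing (25 ms frames, 10 ms hop at 16 kHz) + Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # (3) FFT power spectrum
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # (4) triangular Mel filterbank between 0 Hz and fs/2
    mels = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # (5) DCT decorrelates the log filterbank energies
    n = np.arange(n_ceps)[:, None]
    m = np.arange(n_filt)[None, :]
    dct = np.cos(np.pi * n * (m + 0.5) / n_filt)
    return log_energy @ dct.T          # shape: (frames, n_ceps)

coeffs = mfcc(np.sin(2 * np.pi * 300 * np.arange(16000) / 16000))
```

Delta coefficients would then be appended by differencing `coeffs` along the frame axis, as described above.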
2.3.2 Perceptual linear prediction (PLP)
Perceptual linear prediction (PLP) is another feature extraction technique, which follows the psychophysics of hearing. It is similar to LPC, with the difference that it shapes the spectral characteristics to match the human auditory system using three steps: (1) critical-band analysis, (2) the equal-loudness curve and (3) intensity-loudness (power-law) compression. The speech signal cannot be estimated well on a linear scale as done by LPC; therefore, PLP is preferred [32,33,34].
PLP involves the following computational steps, as shown in Fig. 7:
- Initially, steps similar to (1), (2) and (3) used for MFCC are also followed for PLP.
- Band-pass filtering is then performed to approximate the power spectrum of each frequency band as
$$ P\left( \omega \right) = {\text{Re}} \left( {s\left( \omega \right)} \right)^{2} + {\text{Im}}\left( {s\left( \omega \right)} \right)^{2} . $$(6)
- The audio frequency is then converted to the Bark scale for better mapping of the human auditory process as
$$ f\left( {Bark} \right) = 6\ln \left[ {\frac{f}{600} + \left[ {\left( \frac{f}{600} \right)^{2} + 1} \right]^{0.5} } \right] . $$(7)
The Bark filter bank provides a better model of hearing under the equal-loudness operation. These matched values are then boosted according to the power law.
- Finally, the LP model is applied to predict the feature coefficients by matching the power spectra \(P(\omega )\) and \(P^{\prime}(\omega )\) as
$$ \frac{1}{M}\sum\limits_{m = 1}^{M} {\frac{P(\omega )}{{P^{\prime}(\omega )}}} = 1, $$(8)
where \(P(\omega )\) and \(P^{\prime}(\omega )\) are the input and predicted power spectra of the speech signal.
The initial steps, such as windowing and the Fourier transform, are the same for both MFCC and PLP, the difference being the use of the Mel scale in MFCC and the Bark scale in PLP. In PLP, the equal-loudness function is applied before linear prediction to amplify weak signal components, and trapezoidal filters are used in place of the triangular filters of MFCC. Recursive cepstrum computation is applied to obtain the first 13 PLP coefficients; as with MFCC, this yields a separate 13-dimensional feature vector, which is extended with delta coefficients as explained above. From the combined feature vector, 20 features are selected to reduce the computational complexity.
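The formation of the hybrid feature set can be illustrated as below (a sketch under the assumption that the 20 MFCC-derived and 20 PLP-derived coefficients per frame are simply concatenated; the random matrices stand in for real feature extractors, and the Bark mapping of Eq. (7) is included for reference):

```python
import numpy as np

# Bark-scale mapping as in Eq. (7).
def hz_to_bark(f):
    return 6 * np.log(f / 600 + np.sqrt((f / 600) ** 2 + 1))

rng = np.random.default_rng(1)
mfcc_feats = rng.standard_normal((98, 20))   # placeholder MFCC features (frames x 20)
plp_feats = rng.standard_normal((98, 20))    # placeholder PLP features  (frames x 20)

# Hybrid MFCC_PLP feature set: one 40-dimensional vector per frame.
hybrid = np.hstack([mfcc_feats, plp_feats])
```

Concatenation keeps both auditory representations available to the classifier, which is the core of the hybrid technique evaluated in Sect. 3.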
2.4 Proposed algorithm
This section presents the proposed algorithm. The performance of the speech recognition system mainly depends upon the efficiency of voice activity detection (VAD) in sorting speech from non-speech, which proceeds through the following steps.
Step 1: Apply a silence indicator to find idle portions of the signal; the noise estimates are updated during these intervals.
Step 2: Apply the short-time Fourier transform (STFT) to transform the time-domain signal into the frequency domain; the STFT is followed by a magnitude operator. The speech signal is processed over short intervals of time (10–50 ms), so the DFT is performed after windowing, which is known as the STFT:
$$ X_{n} \left( {{\text{e}}^{j\omega } } \right) = \mathop \sum \limits_{m = - \infty }^{\infty } x\left( m \right)w\left( {n - m} \right)e^{ - j\omega m} , $$(9)
where w(n − m) is the window that selects the portion of the input x(n) for further computation.
Step 3: Apply a high-pass filter (HPF) to reduce the noise variance. This is essential to decrease misrepresentations due to noise deviations; the HPF suppresses the noise and increases the energy at high frequencies, as given in (1).
Step 4: Apply a post-processor to eliminate the remaining misrepresentations by spectral subtraction.
Step 5: Apply the inverse short-time Fourier transform (ISTFT) to transform the processed signal back to the time domain as
$$ x(n) = \frac{1}{2\pi w(0)}\int\limits_{ - \pi }^{\pi } {X_{n} (e^{j\omega } )} e^{j\omega n} d\omega , $$(10)
where x(n) is the time-domain signal and w(0) is the real window sequence evaluated at zero.
Step 6: Apply a grouping rule to categorize each section of the signal as speech or non-speech. The rule compares the VAD output against a threshold defined in terms of speech parameters: if the value exceeds the threshold, the section is speech; otherwise, it is non-speech. Values close to the threshold are uncertain and reduce the performance of the speech recognizer.
Step 7: Compute a set of features to distinguish speech from non-speech.
Step 8: Combine the evidence from these features in a classifier for classification.
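A minimal energy-threshold variant of this frame-dropping procedure can be sketched as follows (illustrative only: the paper's VAD works in the STFT domain with spectral subtraction, while this sketch reduces Steps 1 and 6 to a noise-floor estimate and a fixed decision margin, both of which are assumed values):

```python
import numpy as np

def vad_frame_drop(signal, frame_len=400, hop=160, margin_db=6.0):
    """Keep frames whose short-time energy exceeds a noise-derived
    threshold; drop the rest (sketch of VAD-based frame dropping)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    # Noise floor estimated from the quietest 10% of frames (cf. Step 1);
    # frames within `margin_db` of it are treated as non-speech (cf. Step 6).
    noise_floor = np.percentile(energy_db, 10)
    keep = energy_db > noise_floor + margin_db
    return frames[keep], keep

fs = 16000
t = np.arange(fs) / fs
# Quiet first half (stand-in for a pause), loud second half (speech).
sig = np.where(t < 0.5, 0.01, 1.0) * np.sin(2 * np.pi * 250 * t)
speech_frames, mask = vad_frame_drop(sig)
```

In the proposed system the kept frames would then go on to MFCC/PLP extraction, so that pauses and distorted segments never reach the phoneme models.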
Multilayer training is carried out by dividing the dataset into three subsets. The first is the training set, used to compute the gradients and biases of the network. The second is the validation set, used to monitor the validation error. The third is the test set; the validation error and the test-set error are compared to choose the appropriate stopping point. The default ratios for training, validation and testing are 0.7, 0.15 and 0.15; in this paper, the values used are 1.0, 0.15 and 0.15, respectively. The test-set error varies with the iteration number, which may happen due to poor division of the dataset. The regression layer computes the mean squared error with respect to the targeted output [35].
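The three-way division of the dataset can be sketched as follows (a hedged Python illustration using the default 0.7/0.15/0.15 ratios; the function name and the shuffling scheme are assumptions, since the actual split was performed inside MATLAB's training routine):

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle sample indices and divide them into training,
    validation and test subsets (illustrative sketch)."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = round(ratios[0] * n_samples)   # gradient/bias computation
    n_val = round(ratios[1] * n_samples)     # validation-error monitoring
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 600 voice samples, as in the recorded corpus.
train_idx, val_idx, test_idx = split_dataset(600)
```

Fixing the random seed makes the division reproducible, which helps diagnose the iteration-to-iteration variation in test-set error mentioned above.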
3 Results and discussion
The accuracy of the speech recognizer is computed in three stages. In the first stage, MFCC features are used, with and without VAD, to classify the speech signal for various noises at different signal-to-noise ratios. In the second stage, PLP features are utilized, and finally the two feature sets are combined to form a hybrid feature set. Voice activity detection (VAD) is evaluated in all three stages.
3.1 Performance analysis using MFCC
The accuracy of the speech recognition system using MFCC features with and without VAD is shown in Tables 2 and 3. It is observed from the tables that as the signal-to-noise ratio increases, the recognition performance also improves. The recognition rate is highest in the presence of car noise and lowest for diesel-engine noise. The average recognition rate between 0 and 15 dB follows the same pattern.
It is observed from Fig. 8 that there is a considerable improvement in accuracy when VAD is used. The maximum recognition is noted for fan noise at 10 and 15 dB. It is also noticed that the (0–5) dB and (5–10) dB ranges show almost similar improvement, but above 10 dB the improvement is comparatively low. It is found from Fig. 8 and Table 4 that diesel-engine noise shows the maximum improvement in accuracy, 13.88%, while car noise shows the least; the accuracy in the presence of fan noise lies between these two.
3.2 Performance analysis using PLP (perceptual linear prediction)
It is observed from Tables 2, 3, 4, 5 and Fig. 9 that PLP follows the same pattern as MFCC, but the performance of PLP in noisy environments is somewhat higher than that of MFCC, both with and without VAD. The average recognition rate for PLP without VAD is 51.7%, compared to 51.28% for MFCC; with VAD, it increases to 63.48%, compared to 63.31% for MFCC. Thus, VAD yields an increase of 11.78% for PLP, compared to 12.03% for MFCC.
3.3 Performance analysis using hybrid features i.e., MFCC_PLP
It is observed from Tables 2, 3, 4, 5, 6, 7 and Fig. 10 that the hybrid MFCC_PLP feature extraction performs better than MFCC and PLP used individually. The average performance in this case is 53.54% without VAD, compared to 51.7% for PLP and 51.28% for MFCC; with VAD, the average increases to 64.58%, compared to 63.48% for PLP and 63.31% for MFCC. Relative to the system without VAD, the proposed methodology improves recognition by 11.78% for PLP, 12.03% for MFCC and 12.88% on average.
4 Conclusion
A hybrid MFCC and PLP feature-based ASR system has been implemented successfully for Hindi speech in noisy environments. The two techniques show comparable results in noise-free environments when applied individually, but in noisy environments PLP provides slightly better results than MFCC.
VAD can differentiate well between speech and non-speech data in noisy environments, and the proposed hybrid technique based on VAD increases the efficiency of the ASR system in such environments. The proposed methodology shows an average improvement of 12.88% with VAD as compared to the case without it. This work can be further extended by integrating VAD and deep neural networks with evolutionary algorithms such as particle swarm optimization (PSO) and differential evolution (DE) to further improve system performance by optimizing the number of filter banks.
References
Kurzekar PK, Deshmukh RR, Waghmare VB, Shrishrimal P (2014) Continuous speech recognition system: a review. Asian J Comput Sci Inform Technol (AJCSIT) 4(6):62–66
Agarwal RK, Dave M (2008) Implementing a speech recognition interface for Indian Languages. In: Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages. pp. 105–112
Keronen S, Remes U, Palomaki KJ, Virtanen T, Kurimo M (2010) Comparison of noise robust methods in large vocabulary speech recognition. In: 18th European Signal Processing Conference (EUSIPCO-2010), 1973–1977
Li Q, Zheng J, Tsai A, Zhou Q (2002) Robust endpoint detection and energy normalization for real-time speech and speaker recognition. IEEE Trans Speech Audio Process 10(3):146–157
Cui X, Alwan A (2005) Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans Speech Audio Process 13(6):1161–1172. https://doi.org/10.1109/TSA.2005.853002
Le Prell CG, Clavier OH (2017) Effects of noise on speech recognition: challenges for communication by service members. Hear Res 349:76–89
Wright SJ, Kanevsky D, Deng L, He X, Heigold G, Li H (2013) Optimization algorithms and applications for speech and language processing. IEEE Trans Audio Speech Lang Process 21(11):2231–2243
Nasef A, Marjanovic-Jakovlijevic M, Njegus A (2017) Optimization of the speaker recognition in noisy environments using a stochastic gradient descent. Intern Sci Conf Inform Technol Data Relat Res Sinteza 2017:369–373
Healy EW, Yoho SE, Wang Y, Wang D (2013) An algorithm to improve speech recognition in noise for hearing-impaired listeners. J Acoust Soc Am 134(4):3029–3038. https://doi.org/10.1121/1.4820893
Geiger JT, Weninger F, Gemmeke JF, Wollmer M, Schuller B, Rigoll G (2014) Memory-enhanced neural networks and NMF for robust ASR. IEEE/ACM Trans Audio Speech Lang process 22(6):1037–1046. https://doi.org/10.1109/TASLP.2014.2318514
Sahu SK, Kumar P, Singh AP (2018) Modified K-NN algorithm for classification problems with improved accuracy. Intern J Inform Technol 10:65–70. https://doi.org/10.1007/s41870-017-0058-z
Bouafif L, Ouni K (2012) A speech tool software for signal processing applications. In: 6th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT). pp. 788–791
Sumithra MG, Ramya MS, Thanuskodi K (2011) Speech recognition in noisy environment using different feature extraction techniques. Intern J Computat Intell Telecommun Syst 2(1):57–62
Rahman MM, Saha SK, Hossain MK, Islam MB (2012) Performance evaluation of CMN for Mel-LPC based speech recognition in different noisy environments. Intern J Comput Appl 58(10):6–10. https://doi.org/10.5120/9316-3548
Pillai D, Siddavatam I (2019) A modified framework to detect keyloggers using machine learning algorithm. Int J Inf Technol 11:707–712. https://doi.org/10.1007/s41870-018-0237-6
Eringis D, Tamulevicius G (2014) Improving speech recognition rate through analysis parameters. Electr Contr Commun Eng 5(1). https://doi.org/10.2478/ecce-2014-009
Dave N (2013) Feature extraction methods LPC PLP and MFCC in speech recognition. Intern J Adv Res Eng Technol 1(6):1–5
Patil S, Anandhi RJ (2020) Diversity based self-adaptive clusters using PSO clustering for crime data. Int J Inf Technol 12:319–327. https://doi.org/10.1007/s41870-019-00311-z
Dekens T, Verhelst W, Capman F, Beaugendre F (2010) Improved speech recognition in noisy environments by using a throat microphone for accurate voicing detection. In: 18th European Signal Processing Conference (EUSIPCO-2010), 1978–1982
Sharma K, Sinha HP, Agarwal RK (2010) Comparative study of speech recognition system using various feature extraction techniques. Intern J Inform Technol Knowl Manage 3(2):695–698
Rahkar Farshi T, Orujpour M (2019) Multi-level image thresholding based on social spider algorithm for global optimization. Intern J Inform Technol 11:713–718. https://doi.org/10.1007/s41870-019-00328-4
Qazi KA, Nawaz T, Mehmood Z, Rashid M, Hafiz AH (2018) A hybrid technique for speech segregation and classification using a sophisticated deep neural network. PLoS ONE 13:e0194151. https://doi.org/10.1371/journal.pone.0194151
Joseph FJJ (2020) Effect of supervised learning methodologies in offline handwritten Thai character recognition. Int J Inf Technol 12:57–64. https://doi.org/10.1007/s41870-019-00366-y
Nassif AB, Shanin I, Attili I, Azzeh M, Shaalan K (2019) Speech recognition using deep neural networks: a systematic review. IEEE Access 7:19143–19165
Gerkmann T, Hendriks RC (2011) Noise power estimation based on the probability of speech presence. In: 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 145–148
Psutka J, Muller L, Psutka JV (2001) Comparison of MFCC and PLP Parameterizations in the speaker independent continuous speech recognition task, Eurospeech 2001, Scandinavia
Xie L, Liu ZQ (2006) A comparative study of audio features for audio to visual conversion in MPEG-4 compliant facial animation. In: Proc. of ICMLC, Dalian, 13–16 Aug 2006
Leong ATK (2003) A music identification system based on audio content similarity. In: Thesis of Bachelor of Engineering, Division of Electrical Engineering, The School of Information Technology and Electrical Engineering, The University of Queensland, Queensland
Murugappan M, Selvaraj J (2012) DWT and MFCC based human emotional speech classification using LDA. In: International Conference on Biomedical Engineering (ICoBE), Penang, pp. 203–206
Prithvi P, Kumar TK (2016) Comparative analysis of MFCC, LFCC, RASTA-PLP. Intern J Sci Eng Res (IJSER) 4(5):4–7
Dua M, Agarwal RK, Biswas M (2018) Performance evaluation of hindi speech recognition using optimized filter banks. Eng Sci Technol Intern J 21(2018):389–398. https://doi.org/10.1016/j.jestch.2018.04.005
Hermansky H (1990) Perceptual linear predictive (PLP) analysis for speech. J Acoust Soc Am 87(4):1738–1752. https://doi.org/10.1121/1.399423
Hermansky H, Hanson B, Wakita H (1985) Perceptually based linear predictive analysis of speech. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 85), vol 10, pp 509–512
Hermansky H, Morgan N, Bayya A, Kohn P (1991) The challenge of inverse-E: the RASTA-PLP method. IEEE 2:800–804. https://doi.org/10.1109/ACSSC.1991.186557
Kim P (2017) MATLAB deep learning. Apress. https://doi.org/10.1007/978-1-4842-2845-6
Acknowledgements
A special note of thanks is due to ECE Department, NIT Kurukshetra, Haryana, India for providing the required infrastructure and research environment.
Kumar, A., Mittal, V. Hindi speech recognition in noisy environment using hybrid technique. Int. j. inf. tecnol. 13, 483–492 (2021). https://doi.org/10.1007/s41870-020-00586-7