1 Introduction

The speech signal conveys information at several levels. At the primary level it carries the message itself; at the secondary level it carries information about the speaker. Speech production is a complex phenomenon involving several stages of processing: the message is first planned in the mind and encoded into language, a neuromuscular command is generated from this coding, and sound is finally produced through the vocal cords. Every utterance differs from every other because of linguistic (lexical, syntactic, semantic, pragmatic), paralinguistic (intentional, attitudinal, stylistic) and non-linguistic (physical, emotional) factors. The speech signal therefore contains segmental and supra-segmental features that can be extracted for both speaker and speech recognition [1]. Speech signals are highly non-stationary; they are difficult to observe and prone to noise. A number of feature extraction techniques have been developed whose primary aim is to preserve the relevant variations and information of the speech signal in a compact and reasonable representation. From a biological point of view, features can be classified as production based or perception based; from the point of view of the processing domain, they can be divided into temporal, eigen, cepstral and frequency-domain features. Temporal features include amplitude, power and zero-crossing rate, which are used for voicing decisions, whereas brightness, tonality, loudness and pitch are perceptual features.

In speech recognition systems, detecting the presence of speech in a noisy environment is a crucial problem for real-time applications. Speech recognition is a wide field in terms of feature extraction techniques, recording environments, databases and classification methods. Common feature extraction techniques include linear predictive coding coefficients (LPCC), perceptual linear prediction (PLP), relative spectral filtering (RASTA) and Mel-frequency cepstral coefficients (MFCC). Several databases are available for speech and speaker recognition, from isolated to continuous speech, such as TIDIGITS, RSR 2015 and TIMIT. There are three basic approaches to matching/modeling: the acoustic-phonetic approach, the pattern recognition approach (dynamic time warping, hidden Markov models, vector quantization) and the artificial intelligence approach (neural networks, deep neural networks). In this paper, a combined speaker and speech recognition system is proposed using enhanced MFCC features with a DBN for real-time applications.

2 Related Work

Research in the speaker and speech recognition field has made tremendous progress in the last 60 years. Speech recognition systems can be divided into four generations. In the 1st generation (1950s–1960s), the work was based on acoustic-phonetic approaches. Template matching techniques such as LPC and DTW were used in the 2nd generation (1960s–1970s). In the 3rd generation (1970s–2000s), statistical modeling techniques such as HMM were mostly used. In the current 4th generation (2000s onwards), the focus is on deep learning [2,3,4,5]. Nascimento, T.P., et al., 2011 [6] proposed a speech recognition system using ANN and HMM for English words; the recognition rate achieved was 96% for HMM and 97% for ANN, and an adaptive learning rate together with a large dataset increased the recognition rate. Guojiang, F., 2011 [7] implemented a system using two types of classifiers, a multilayer perceptron (MLP) and a radial basis function (RBF) network, with LPC coefficients as features; the RBF classifier outperformed the MLP for 16 speaker-dependent LPCC coefficients. Seyedin, S., et al., 2013 [8] proposed a new type of feature based on the MVDR spectrum of the filtered autocorrelation sequence. They used the TIDIGITS database and achieved 76.6% accuracy. Initially, researchers faced many problems in training hidden layers, such as propagating training errors to the hidden layers and getting stuck in local minima [9, 10]. In 2006, G. Hinton et al. [11] introduced the Deep Belief Network (DBN) with layer-wise training. Many researchers have since used DBNs with supervised or unsupervised training algorithms [12,13,14]. The literature reports that DBN-based systems outperform HMM, GMM and GMM-HMM techniques [15,16,17,18,19,20,21,22]. Jaitly, N., et al., 2011 [23] used a DBN built from Restricted Boltzmann Machines (RBM) trained with the contrastive divergence (CD) algorithm on speech signals. They tested it on the TIMIT corpus and achieved better performance in terms of phone error rate (PER) [24, 25]. This work was carried forward by Mohamed, A., et al. [26], who used MFCC features with a DBN and achieved 20.7% PER, an improvement over using raw speech features. Dhanashri, D., and Dhonde, S.B., 2016 [27] used HMM for acoustic modeling with a DBN as the classifier. They used the TIDIGITS database and achieved 96.58% accuracy.

Speech recognition systems have been deployed in various domains, such as domestic use, education, entertainment, and medical science for healthcare and medical transcription. It is observed that comparatively little work has been done on rehabilitation applications of speech recognition [28]. This study therefore aims at an application-oriented combined speaker and speech recognition system for handicapped persons, for activities such as a voice-operated wheelchair.

3 Proposed Work

Existing speech recognition systems perform efficiently on stored databases, but for real-time applications the performance is affected by variability in speaking style and background noise. To deal with these effects, enhanced MFCC features are calculated. Two types of variation that can occur in the generalized MFCC features are computed, named tolerance 1 and tolerance 2. The features used in the proposed model are a fusion of this two-level mathematical analysis of the feature extraction method. The features are calculated in three phases: calculation of tolerance 1, calculation of tolerance 2, and PCA fusion.

3.1 Calculation of Tolerance 1

Despite the samples being recorded in the same environment and with the same speakers, variations among samples were observed due to intra-speaker variability. The TIDIGITS dataset is used, which is available for eleven isolated words (zero, one, two, three, four, five…ten) spoken by 326 speakers. First, the speech signal is converted into the frequency domain using the FFT to obtain its frequency content. The FFT output contains a lot of data that is not required, because at higher frequencies the differences between frequencies are not perceptually significant. This reflects the behaviour of human hearing, whose frequency scale is approximately linear up to 1000 Hz and logarithmic above it. So, to calculate the energy level at each frequency, Mel-scale analysis is done using Mel filters and the filter bank energies are computed. The logarithm of the filter bank energies is then taken, which brings the features closer to human hearing. Finally, the DCT of the log filter bank energies is taken to decorrelate the overlapping energies. The first 13 coefficients are selected as the MFCC features, because higher-order coefficients degrade the recognition accuracy of the system and do not carry speaker- or speech-related information.
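For reference, the FFT → Mel filter bank → log → DCT pipeline described above can be reproduced with standard tooling. The following is a minimal sketch using librosa; the file name, frame length and hop size are illustrative assumptions, not the exact settings of the proposed system.

```python
# Minimal MFCC extraction sketch (illustrative settings, not the exact
# configuration of the proposed system).
import librosa

def extract_mfcc(path, n_mfcc=13):
    """Return an (n_frames, 13) matrix of MFCC features for one utterance."""
    y, sr = librosa.load(path, sr=None)              # keep the native sampling rate
    # FFT -> Mel filter bank -> log energies -> DCT, keeping the first 13 coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)
    return mfcc.T                                    # one 13-dim vector per frame

# Example (hypothetical file name):
# feats = extract_mfcc("tidigits_speaker1_one.wav")
```

Each utterance can then be reduced to a single 13-dimensional vector (for example the frame-wise mean) before the tolerance analysis that follows; this reduction step is an assumption of the sketch.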

Due to variations across utterances, the feature values differ for every repetition of a word, even when spoken by the same speaker. The difference between all samples of the same word is therefore calculated, as represented in Eq. (1).

$$D_{ij}^{K} = \sqrt{\left( X_{i} - X_{j} \right)^{2}}, \quad 1 \le i,j \le n,$$
(1)

where n is the number of samples of each word, \(X_{i}\) and \(X_{j}\) are voice samples, and K is the number of isolated words.

Hence the distance matrix for all words is calculated as represented in Eq. (2).

$$Dist^{k} = \begin{bmatrix} D_{11} \\ \vdots \\ D_{ij} \end{bmatrix}$$
(2)

where i and j vary with the number of samples.

Then the maximum and minimum values are calculated as \(\max\left( D_{ij} \right)\) and \(\min\left( D_{ij} \right)\). After this, the variation for each word is calculated by subtracting the minimum value from the maximum value, as represented in Eq. (3).

$$Var^{k} = \max\left( D_{ij} \right) - \min\left( D_{ij} \right)$$
(3)

After this, the mean of all samples of every word is calculated as given in Eq. (4).

$$M^{k} = \frac{1}{n}\sum_{i = 1}^{n} X_{i},$$
(4)

where \(X_{i}\) is the voice sample of each word. The variation is then subtracted from this mean value for each word, and the result is called tolerance 1, as shown in Eq. (5).

$$Tol1^{k} = M^{k} - Var^{k}$$
(5)
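A minimal numpy sketch of Eqs. (1)–(5) for one word follows, assuming each of the n samples of the word is represented by a single 13-dimensional MFCC vector and that the trivial i = j distances are not used when taking the minimum in Eq. (3).

```python
# Tolerance 1 sketch for one word, following Eqs. (1)-(5).
import numpy as np

def tolerance1(samples):
    """samples: (n, 13) array, one MFCC vector per sample of the same word."""
    n = samples.shape[0]
    # Eq. (1): per-coefficient distances D_ij = sqrt((X_i - X_j)^2)
    d = np.sqrt((samples[:, None, :] - samples[None, :, :]) ** 2)
    # drop the i = j pairs, which are zero by construction (assumption)
    mask = ~np.eye(n, dtype=bool)
    pairs = d[mask]                                  # shape (n*(n-1), 13)
    # Eq. (3): variation = max(D_ij) - min(D_ij), per coefficient
    var = pairs.max(axis=0) - pairs.min(axis=0)
    # Eq. (4): mean MFCC vector of the word
    mean = samples.mean(axis=0)
    # Eq. (5): tolerance 1 = mean - variation
    return mean - var
```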

3.2 Calculation of Tolerance 2

In the second step, mean variations are calculated; this is called tolerance 2. For calculating tolerance 2, instead of taking differences between individual samples, the difference between the mean value and the samples of the same word is calculated, as represented in Eq. (6).

$$MD_{ij}^{K} = \sqrt{\left( M_{ij} - X_{ij} \right)^{2}}$$
(6)

where \(M_{ij}\) is the mean of the word and \(X_{ij}\) is the voice sample.

The rest of the procedure is the same as for tolerance 1, following Eqs. (2) to (5).
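A corresponding sketch of tolerance 2 follows, under the same assumptions and with the difference in Eq. (6) interpreted as in Eq. (1).

```python
# Tolerance 2 sketch for one word: distances are taken between the word mean
# and each sample (Eq. 6), then Eqs. (3)-(5) are applied as before.
import numpy as np

def tolerance2(samples):
    """samples: (n, 13) array, one MFCC vector per sample of the same word."""
    mean = samples.mean(axis=0)
    # Eq. (6): per-coefficient distance between the mean and each sample
    md = np.sqrt((mean[None, :] - samples) ** 2)
    var = md.max(axis=0) - md.min(axis=0)
    return mean - var
```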

3.3 PCA Fusion

There are now features calculated by the tolerance 1 and tolerance 2 methods, giving a total of 26 features for each word. To decide how much of the tolerance 1 and tolerance 2 features is selected, principal component analysis (PCA) is used [29]. The algorithm determines what proportion of the tolerance 1 features and of the tolerance 2 features is taken to form the final tolerance features. PCA is a mathematical procedure that transforms the features into principal components, yielding a reduced and important feature set. Figure 1 shows the flow chart of the PCA fusion process.

Fig. 1 PCA fusion
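One reading of the fusion step, consistent with the 2 × 2 covariance matrix reported for each word in Sect. 4, is that the 13 tolerance 1 values and 13 tolerance 2 values of a word are treated as two variables, and the dominant eigenvector of their covariance matrix supplies the mixing weights. The sketch below follows that reading; normalizing the weights to sum to one is an additional assumption.

```python
# Hedged PCA-fusion sketch for one word: fuse the 13 tolerance 1 values and
# 13 tolerance 2 values into 13 final tolerance features.
import numpy as np

def pca_fuse(tol1, tol2):
    """tol1, tol2: length-13 tolerance vectors -> length-13 fused vector."""
    data = np.column_stack([tol1, tol2])     # shape (13, 2): two variables
    cov = np.cov(data, rowvar=False)         # 2 x 2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    w = np.abs(eigvecs[:, -1])               # dominant eigenvector
    p1, p2 = w / w.sum()                     # fusion weights (assumed to sum to 1)
    return p1 * np.asarray(tol1) + p2 * np.asarray(tol2)
```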

3.4 Deep Neural Network

A deep neural network contains many hidden layers. The idea is inspired by the visual cortex: the brain processes information through several sections, and the neurons in each section behave differently. A neural network can therefore be modeled as a multilayer network that builds from lower-level to higher-level features [30,31,32]. Training is the major issue in deep neural networks because optimization is difficult; the system may under-fit or over-fit. Under-fitting is due to the vanishing gradient problem, and over-fitting arises from a high-variance, low-bias situation. One solution is the unsupervised pre-training approach, in which pre-training is done one layer at a time: the features are fed to the first hidden layer, the second layer takes combinations of features from the first layer, and this process continues up to the last layer. After that, supervised training is done for the entire network, as illustrated in the sketch below.
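As a rough illustration of this greedy layer-wise scheme (not the exact implementation used here), stacked Bernoulli RBMs followed by a supervised output classifier can be prototyped with scikit-learn; its BernoulliRBM is trained with persistent contrastive divergence, and the logistic regression stands in for the supervised fine-tuning stage. The layer sizes and epoch count mirror the configuration reported in Sect. 4; the other hyperparameters are illustrative.

```python
# Sketch of unsupervised layer-wise pre-training followed by a supervised
# output stage (illustrative; layer sizes and epochs follow Sect. 4).
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_dbn():
    rbm1 = BernoulliRBM(n_components=200, n_iter=20,
                        learning_rate=0.05, random_state=0)
    rbm2 = BernoulliRBM(n_components=300, n_iter=20,
                        learning_rate=0.05, random_state=0)
    clf = LogisticRegression(max_iter=1000)   # supervised output layer
    # fitting the pipeline trains rbm1, then rbm2 on rbm1's outputs,
    # then the classifier on the final hidden representation
    return Pipeline([("rbm1", rbm1), ("rbm2", rbm2), ("clf", clf)])
```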

4 Results and Discussion

In this research work, the standard TIDIGITS database is used. TIDIGITS is available for eleven isolated words (zero, one, two, three, four, five…ten) spoken by 326 speakers. A total of 2260 isolated words are taken, spoken by 57 female and 56 male speakers. First, the MFCC features are calculated. Table 1 shows the MFCC feature coefficients (Cf.1 to Cf.13) of speaker 1 for the first five words, i.e., ‘ONE’, ‘TWO’, ‘THREE’, ‘FOUR’, ‘FIVE’. The results are shown for five words of one speaker.

Table 1 MFCC features of speaker 1 for first set of words

Even when the same speaker utters the same words, there is variability in the speech signal. Table 2 shows the variations in the MFCC features when the same set of words is spoken by the same speaker 1.

Table 2 MFCC features of speaker 1 for second set of words

To deal with these variations, enhanced MFCC features are calculated. First, tolerance 1 is calculated, which captures the variation of a single word across its repetitions by the same speaker. Therefore, the difference between all samples of the same word is calculated. Table 3 shows the difference of the MFCC features between word set 1 and word set 2.

Table 3 Difference of MFCC features of speaker 1

The above results show the distance matrix (\(D_{ij}^{K}\)) represented in Eq. (1). There are five words, so five distance matrices are obtained. Then the maximum and minimum values are calculated from each distance matrix, represented as \(\max(D_{ij})\) and \(\min(D_{ij})\). In Table 3, maximum values are shown in italics and minimum values in bold.

Then the variation is calculated by subtracting the minimum value from the maximum value and is represented as Var^k as per Eq. (3). The variation values for all words are shown in Table 4.

Table 4 Variations in words

After this, the mean (M^k) of all samples of every word is calculated, as shown in Table 5.

Table 5 Mean of MFCC features for speaker 1

Then tolerance 1 is calculated by subtracting the variation from the mean value, as shown in Table 6.

Table 6 Tolerance 1 for speaker 1

Hence, based upon the number of speakers, tolerance values can be calculated. In the second scenario, for calculating tolerance 2, mean variations are calculated instead of word variations. In this method, the difference between the mean value and each sample value is first calculated as per Eq. (6). These difference values depend upon the number of samples for each word. Table 7 shows the tolerance 2 features.

Table 7 Tolerance 2 for speaker 1

Now there are tolerance 1 and tolerance 2 features, giving a total of 26 values for each word. The PCA algorithm is used to fuse these values into 13 important features. In the PCA algorithm, the covariance matrix is first found, as shown below for the word ONE.

$$C = \begin{bmatrix} 272.2445 & 272.4905 \\ 272.4905 & 272.8740 \end{bmatrix}$$

The eigenvalues and eigenvectors are calculated for the word ONE as shown below.

$$\text{Eigenvalues} = \begin{bmatrix} 0.0686 & 0 \\ 0 & 545.0499 \end{bmatrix}$$
$$\text{Eigenvectors} = \begin{bmatrix} -0.7075 & 0.7067 \\ 0.7067 & 0.7075 \end{bmatrix}$$
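This eigen-decomposition can be checked directly from the covariance matrix given above; a short numpy verification (eigenvector signs may differ):

```python
# Verify the eigen-decomposition reported for the word ONE.
import numpy as np

C = np.array([[272.2445, 272.4905],
              [272.4905, 272.8740]])
eigvals, eigvecs = np.linalg.eigh(C)
print(np.round(eigvals, 4))   # approx. [0.0686, 545.0499]
print(np.round(eigvecs, 4))   # columns approx. [-0.7075, 0.7067] and [0.7067, 0.7075], up to sign
```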

Then the principal components P1 = 0.4993 and P2 = 0.5003 are calculated. These steps are repeated to obtain the principal components for all words. The final fused tolerance features for all words are shown in Table 8.

Table 8 Fused tolerance features for speaker 1

Now there are 13 final enhanced features for every word. The mean of the MFCC features and the enhanced MFCC features is taken, and after normalization these are fed to a deep belief network for training. Two hidden layers with 200 and 300 neurons are used in the DBN, with the contrastive divergence learning rule, for 20 epochs. The final DBN input for speaker 1 is shown in Table 9.

Table 9 Input to DBN for speaker 1
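Continuing the build_dbn() sketch from Sect. 3.4, one plausible way to prepare this input is to scale the combined feature vectors to [0, 1] before training; the choice of MinMaxScaler is an assumption, since the exact normalization is not specified here.

```python
# Usage sketch: normalize the per-word feature vectors and train the DBN.
# X: one row per utterance (mean MFCC plus fused tolerance features); y: word labels.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                # scale features into [0, 1] for the RBMs
# X_scaled = scaler.fit_transform(X)
# dbn = build_dbn()                    # 200- and 300-unit hidden layers, 20 epochs
# dbn.fit(X_scaled, y)
```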

Similarly, enhanced features are calculated for all speakers and the data is fed to the DBN for training.

A comparison is made between MFCC features and enhanced MFCC features. The results show that when the system is trained with enhanced MFCC features, the accuracy at different SNRs is much better than with the baseline MFCC features. The reason is that the enhanced MFCC features are calculated by taking account of the variation in the speech signal, whether due to intra-speaker variability or environmental effects. For clean speech the accuracy is about 94% with MFCC features and 97% with enhanced MFCC features. At 15 dB, the accuracy is 56% with MFCC features and 96% with enhanced MFCC features. This shows that the accuracy roughly halves under signal variability when MFCC features are used, whereas for the DBN trained with enhanced MFCC features the accuracy remains almost the same as for clean signals. Thus, the system trained with enhanced MFCC features achieves good accuracy even in noisy conditions, as shown in Table 10.

Table 10 Comparison of % accuracy between MFCC and enhanced MFCC features on TIDIGITS at different SNR
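The exact noise type and mixing procedure behind the SNR conditions in Table 10 are not detailed here; as one common recipe, additive white noise can be scaled to a target SNR as in the hedged sketch below.

```python
# Hedged sketch: corrupt a clean signal with additive white noise at a target
# SNR in dB.  The actual noise conditions used in Table 10 may differ.
import numpy as np

def add_noise(signal, snr_db, rng=None):
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that 10*log10(p_signal / p_noise_scaled) == snr_db
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise
```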

A comparison is made between the proposed system and baseline systems on the standard TIDIGITS dataset. Seyedin, S., et al. [8] used features computed from the minimum variance distortionless response (MVDR) spectrum by modifying the PLP technique, known as PMSR features. In this technique, the sub-band weighting is modified to obtain the MVDR spectrum, and the LP coefficients are then transformed to obtain robust PMSR (R-PMSR) features. A comparison between R-PMSR, MFCC and the proposed enhanced MFCC features at different SNRs is shown in Fig. 2.

Fig. 2 Comparison of % accuracy with SNR

The results show that the proposed system achieves better accuracy at different SNRs than the baseline systems. In the baseline systems, the accuracy decreases sharply in high-noise environments, whereas the enhanced MFCC features work well in noisy environments. Therefore, for real-time applications, where both intra-speaker and environmental effects are of greatest interest, the enhanced MFCC features are the best choice.

5 Conclusion

In the proposed work, enhanced MFCC features are calculated in terms of tolerance 1 and tolerance 2. The recognition accuracy improves by around 3% when enhanced MFCC features are used instead of baseline MFCC features. Experimentation is done on TIDIGITS, and a deep belief network is used for classification. A comparison is made between the proposed technique and existing techniques: the baseline system using MFCC features gives 94.59% accuracy, the R-PMSR feature-based system gives 95.12% accuracy, and the enhanced MFCC based system gives 97.29% accuracy.