1 Introduction

The speech signal conveys information at several levels. At the primary level it carries the message itself; at the secondary level it carries information about the speaker. Speech production is a complex phenomenon involving several stages of processing: the message is first planned in the mind and encoded into language, a neuromuscular command is generated from this coding, and sound is finally produced through the vocal cords. Every utterance differs from every other because of linguistic (lexical, syntactic, semantic, pragmatic), paralinguistic (intentional, attitudinal, stylistic) and non-linguistic (physical, emotional) factors. The speech signal therefore contains segmental and supra-segmental features that can be extracted for both speaker and speech recognition [1]. Speech signals are highly non-stationary; they are difficult to observe and prone to noise. A number of feature extraction techniques have been developed whose primary aim is to preserve the relevant variations and information of the speech signal in a compact and reasonable representation. From a biological point of view, features can be classified as production based or perception based; from the point of view of the processing domain, they can be divided into temporal, eigen, cepstral and frequency-domain features. Temporal features include amplitude, power and zero-crossing rate, which are used for voicing decisions, whereas brightness, tonality, loudness and pitch are perceptual features.

In speech recognition systems, detecting the presence of speech in a noisy environment is a crucial problem for real-time applications. Speech recognition is a wide field in terms of feature extraction techniques, recording environments, databases and classification methods. Common feature extraction techniques include linear predictive coding coefficients (LPCC), perceptual linear prediction (PLP), relative spectral filtering (RASTA) and Mel-frequency cepstral coefficients (MFCC). Several databases are available for speech and speaker recognition, from isolated to continuous speech, such as TIDIGITS, RSR 2015 and TIMIT. There are three basic approaches to matching/modeling: the acoustic-phonetic approach, the pattern recognition approach (dynamic time warping, hidden Markov models, vector quantization) and the artificial intelligence approach (neural networks, deep neural networks). In this paper, a combined speaker and speech recognition system is proposed using enhanced MFCC features with a DBN for real-time applications.

2 Related Work

Research in the speaker and speech recognition field has made tremendous progress in the last 60 years. Speech recognition systems can be divided into four generations. In the 1st generation (1950s–1960s), the work was based on acoustic-phonetic approaches. Template matching techniques such as LPC and DTW were used in the 2nd generation (1960s–1970s). In the 3rd generation (1970s–2000s), statistical modeling techniques such as HMM were mostly used. In the current 4th generation (2000s onwards), the focus is on deep learning [2,3,4,5]. Nascimento, T.P., et al., 2011 [6] proposed a speech recognition system using ANN and HMM for English words; the recognition rate achieved was 96% for HMM and 97% for ANN, and an adaptive learning rate together with a large dataset increased the recognition rate. Guojiang, F., 2011 [7] implemented a system using two types of classifiers, a multilayer perceptron (MLP) and a radial basis function (RBF) network, with LPC coefficients as features; the RBF classifier outperformed the MLP for 16 speaker-dependent LPCC coefficients. Seyedin, S., et al., 2013 [8] proposed a new type of feature based on the MVDR spectrum of the filtered autocorrelation sequence. They used the TIDIGITS database and achieved 76.6% accuracy. Initially, researchers faced many problems in training hidden layers, such as propagating training errors to the hidden layers and getting stuck in local minima [9, 10]. In 2006, G. Hinton et al. [11] introduced the Deep Belief Network (DBN) with layer-wise training. Many researchers have since used DBNs with supervised or unsupervised training algorithms [12,13,14]. The literature reports that DBN-based systems outperform HMM, GMM and GMM-HMM techniques [15,16,17,18,19,20,21,22]. Jaitly, N., et al., 2011 [23] used a DBN built from Restricted Boltzmann Machines (RBM) trained with the contrastive divergence (CD) algorithm on speech signals. They tested it on the TIMIT corpus and achieved better performance in terms of phone error rate (PER) [24, 25]. This work was carried forward by Mohamed, A., et al. [26], who used MFCC features with a DBN and achieved 20.7% PER, an improvement over using raw speech features. Dhanashri, D., and Dhonde, S.B., 2016 [27] used HMM for acoustic modeling with a DBN as the classifier. They used the TIDIGITS database and achieved 96.58% accuracy.

Speech recognition systems have been deployed in various domains, such as domestic use, education, entertainment, and medical science for healthcare and medical transcription. It is observed that comparatively little work has been done on rehabilitation applications of speech recognition [28]. This study therefore aims at an application-oriented combined speaker and speech recognition system for handicapped persons, for activities such as a voice-operated wheelchair.

3 Proposed Work

Existing speech recognition systems perform efficiently on stored databases, but for real-time applications the performance is affected by variability in speaking style and background noise. To deal with these effects, enhanced MFCC features are calculated. Two types of variation that can occur in the generalized MFCC features are computed, named tolerance 1 and tolerance 2. The features used in the proposed model are a fusion of this two-level mathematical analysis of the feature extraction method. The features are calculated in three phases: calculation of tolerance 1, calculation of tolerance 2, and PCA fusion.

3.1 Calculation of Tolerance 1

Despite the samples being recorded in the same environment and with the same speakers, variations among samples were observed due to intra-speaker variability. The TIDIGITS dataset is used, which is available for eleven isolated words (zero, one, two, three, four, five…ten) spoken by 326 speakers. First, the speech signal is converted into the frequency domain using the FFT to obtain its frequency content. The FFT output contains a lot of data that is not required, because at higher frequencies the differences between frequencies are not perceptually significant. This reflects the behaviour of human hearing, whose frequency scale is approximately linear up to 1000 Hz and logarithmic above it. So, to calculate the energy level at each frequency, Mel-scale analysis is done using Mel filters and the filter bank energies are computed. The logarithm of the filter bank energies is then taken, which brings the features closer to human hearing. Finally, the DCT of the log filter bank energies is taken to decorrelate the overlapping energies. The first 13 coefficients are selected as the MFCC features, because higher-order coefficients degrade the recognition accuracy of the system and do not carry speaker- or speech-related information.
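For reference, the FFT → Mel filter bank → log → DCT pipeline described above can be reproduced with standard tooling. The following is a minimal sketch using librosa; the file name, frame length and hop size are illustrative assumptions, not the exact settings of the proposed system.

```python
# Minimal MFCC extraction sketch (illustrative settings, not the exact
# configuration of the proposed system).
import librosa

def extract_mfcc(path, n_mfcc=13):
    """Return an (n_frames, 13) matrix of MFCC features for one utterance."""
    y, sr = librosa.load(path, sr=None)              # keep the native sampling rate
    # FFT -> Mel filter bank -> log energies -> DCT, keeping the first 13 coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)
    return mfcc.T                                    # one 13-dim vector per frame

# Example (hypothetical file name):
# feats = extract_mfcc("tidigits_speaker1_one.wav")
```

Each utterance can then be reduced to a single 13-dimensional vector (for example the frame-wise mean) before the tolerance analysis that follows; this reduction step is an assumption of the sketch.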

Due to variations across utterances, the feature values differ for every repetition of a word, even when spoken by the same speaker. The difference between all samples of the same word is therefore calculated, as represented in Eq. (1).

$$D_{ij}^{K} = \sqrt{\left( X_{i} - X_{j} \right)^{2}}, \quad 1 \le i,j \le n,$$
(1)

where n is the number of samples of each word, \(X_{i}\) and \(X_{j}\) are voice samples, and K is the number of isolated words.

Hence the distance matrix for all words is calculated as represented in Eq. (2).

$$Dist^{k} = \begin{bmatrix} D_{11} \\ \vdots \\ D_{ij} \end{bmatrix}$$
(2)

where i and j vary with the number of samples.

Then the maximum and minimum values are calculated as \(\max\left( D_{ij} \right)\) and \(\min\left( D_{ij} \right)\). After this, the variation for each word is calculated by subtracting the minimum value from the maximum value, as represented in Eq. (3).

$$Var^{k} = \max\left( D_{ij} \right) - \min\left( D_{ij} \right)$$
(3)

After this, the mean of all samples of every word is calculated as given in Eq. (4).

$$M^{k} = \frac{1}{n}\sum_{i = 1}^{n} X_{i},$$
(4)

where \(X_{i}\) is the voice sample of each word. The variation is then subtracted from this mean value for each word, and the result is called tolerance 1, as shown in Eq. (5).

$$Tol1^{k} = M^{k} - Var^{k}$$
(5)
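A minimal numpy sketch of Eqs. (1)–(5) for one word follows, assuming each of the n samples of the word is represented by a single 13-dimensional MFCC vector and that the trivial i = j distances are not used when taking the minimum in Eq. (3).

```python
# Tolerance 1 sketch for one word, following Eqs. (1)-(5).
import numpy as np

def tolerance1(samples):
    """samples: (n, 13) array, one MFCC vector per sample of the same word."""
    n = samples.shape[0]
    # Eq. (1): per-coefficient distances D_ij = sqrt((X_i - X_j)^2)
    d = np.sqrt((samples[:, None, :] - samples[None, :, :]) ** 2)
    # drop the i = j pairs, which are zero by construction (assumption)
    mask = ~np.eye(n, dtype=bool)
    pairs = d[mask]                                  # shape (n*(n-1), 13)
    # Eq. (3): variation = max(D_ij) - min(D_ij), per coefficient
    var = pairs.max(axis=0) - pairs.min(axis=0)
    # Eq. (4): mean MFCC vector of the word
    mean = samples.mean(axis=0)
    # Eq. (5): tolerance 1 = mean - variation
    return mean - var
```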

3.2 Calculation of Tolerance 2

In the second step, mean variations are calculated; this is called tolerance 2. For calculating tolerance 2, instead of taking differences between individual samples, the difference between the mean value and the samples of the same word is calculated, as represented in Eq. (6).

$$MD_{ij}^{K} = \sqrt{\left( M_{ij} - X_{ij} \right)^{2}}$$
(6)

where \(M_{ij}\) is the mean of the word and \(X_{ij}\) is the voice sample.

The rest of the procedure is the same as for tolerance 1, following Eqs. (2) to (5).
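A corresponding sketch of tolerance 2 follows, under the same assumptions and with the difference in Eq. (6) interpreted as in Eq. (1).

```python
# Tolerance 2 sketch for one word: distances are taken between the word mean
# and each sample (Eq. 6), then Eqs. (3)-(5) are applied as before.
import numpy as np

def tolerance2(samples):
    """samples: (n, 13) array, one MFCC vector per sample of the same word."""
    mean = samples.mean(axis=0)
    # Eq. (6): per-coefficient distance between the mean and each sample
    md = np.sqrt((mean[None, :] - samples) ** 2)
    var = md.max(axis=0) - md.min(axis=0)
    return mean - var
```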

3.3 PCA Fusion

There are now features calculated by the tolerance 1 and tolerance 2 methods, giving a total of 26 features for each word. To decide how much of the tolerance 1 and tolerance 2 features is selected, principal component analysis (PCA) is used [29]. The algorithm determines what proportion of the tolerance 1 features and of the tolerance 2 features is taken to form the final tolerance features. PCA is a mathematical procedure that transforms the features into principal components, yielding a reduced and important feature set. Figure 1 shows the flow chart of the PCA fusion process.

Fig. 1 PCA fusion
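One reading of the fusion step, consistent with the 2 × 2 covariance matrix reported for each word in Sect. 4, is that the 13 tolerance 1 values and 13 tolerance 2 values of a word are treated as two variables, and the dominant eigenvector of their covariance matrix supplies the mixing weights. The sketch below follows that reading; normalizing the weights to sum to one is an additional assumption.

```python
# Hedged PCA-fusion sketch for one word: fuse the 13 tolerance 1 values and
# 13 tolerance 2 values into 13 final tolerance features.
import numpy as np

def pca_fuse(tol1, tol2):
    """tol1, tol2: length-13 tolerance vectors -> length-13 fused vector."""
    data = np.column_stack([tol1, tol2])     # shape (13, 2): two variables
    cov = np.cov(data, rowvar=False)         # 2 x 2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    w = np.abs(eigvecs[:, -1])               # dominant eigenvector
    p1, p2 = w / w.sum()                     # fusion weights (assumed to sum to 1)
    return p1 * np.asarray(tol1) + p2 * np.asarray(tol2)
```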

3.4 Deep Neural Network

A deep neural network contains many hidden layers. The idea is inspired by the visual cortex: the brain processes information through several sections, and the neurons in each section behave differently. A neural network can therefore be modeled as a multilayer network that builds from lower-level to higher-level features [30,31,32]. Training is the major issue in deep neural networks because optimization is difficult; the system may under-fit or over-fit. Under-fitting is due to the vanishing gradient problem, and over-fitting arises from a high-variance, low-bias situation. One solution is the unsupervised pre-training approach, in which pre-training is done one layer at a time: the features are fed to the first hidden layer, the second layer takes combinations of features from the first layer, and this process continues up to the last layer. After that, supervised training is done for the entire network, as illustrated in the sketch below.
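As a rough illustration of this greedy layer-wise scheme (not the exact implementation used here), stacked Bernoulli RBMs followed by a supervised output classifier can be prototyped with scikit-learn; its BernoulliRBM is trained with persistent contrastive divergence, and the logistic regression stands in for the supervised fine-tuning stage. The layer sizes and epoch count mirror the configuration reported in Sect. 4; the other hyperparameters are illustrative.

```python
# Sketch of unsupervised layer-wise pre-training followed by a supervised
# output stage (illustrative; layer sizes and epochs follow Sect. 4).
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_dbn():
    rbm1 = BernoulliRBM(n_components=200, n_iter=20,
                        learning_rate=0.05, random_state=0)
    rbm2 = BernoulliRBM(n_components=300, n_iter=20,
                        learning_rate=0.05, random_state=0)
    clf = LogisticRegression(max_iter=1000)   # supervised output layer
    # fitting the pipeline trains rbm1, then rbm2 on rbm1's outputs,
    # then the classifier on the final hidden representation
    return Pipeline([("rbm1", rbm1), ("rbm2", rbm2), ("clf", clf)])
```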

4 Results and Discussion

In this research work, the standard TIDIGITS database is used. TIDIGITS is available for eleven isolated words (zero, one, two, three, four, five…ten) spoken by 326 speakers. A total of 2260 isolated words are taken, spoken by 57 female and 56 male speakers. First, the MFCC features are calculated. Table 1 shows the MFCC feature coefficients (Cf.1 to Cf.13) of speaker 1 for the first five words, i.e., ‘ONE’, ‘TWO’, ‘THREE’, ‘FOUR’, ‘FIVE’. The results are shown for five words of one speaker.

Table 1 MFCC features of speaker 1 for first set of words

Even when the same speaker utters the same words, there is variability in the speech signal. Table 2 shows the variations in the MFCC features when the same set of words is spoken by the same speaker 1.

Table 2 MFCC features of speaker 1 for second set of words

To deal with these variations, enhanced MFCC features are calculated. First, tolerance 1 is calculated, which captures the variation of a single word across its repetitions by the same speaker. Therefore, the difference between all samples of the same word is calculated. Table 3 shows the difference of the MFCC features between word set 1 and word set 2.

Table 3 Difference of MFCC features of speaker 1

The above results show the distance matrix (\(D_{ij}^{K}\)) represented in Eq. (1). There are five words, so five distance matrices are obtained. Then the maximum and minimum values are calculated from each distance matrix, represented as \(\max(D_{ij})\) and \(\min(D_{ij})\). In Table 3, maximum values are shown in italics and minimum values in bold.

Then the variation is calculated by subtracting the minimum value from the maximum value and is represented as Var^k as per Eq. (3). The variation values for all words are shown in Table 4.

Table 4 Variations in words

After this, the mean (M^k) of all samples of every word is calculated, as shown in Table 5.

Table 5 Mean of MFCC features for speaker 1

Then tolerance 1 is calculated by subtracting the variation from the mean value, as shown in Table 6.

Table 6 Tolerance 1 for speaker 1

Hence, based upon the number of speakers, tolerance values can be calculated. In the second scenario, for calculating tolerance 2, mean variations are calculated instead of word variations. In this method, the difference between the mean value and each sample value is first calculated as per Eq. (6). These difference values depend upon the number of samples for each word. Table 7 shows the tolerance 2 features.

Table 7 Tolerance 2 for speaker 1

Now there are tolerance 1 and tolerance 2 features, giving a total of 26 values for each word. The PCA algorithm is used to fuse these values into 13 important features. In the PCA algorithm, the covariance matrix is first found, as shown below for the word ONE.

$$C = \begin{bmatrix} 272.2445 & 272.4905 \\ 272.4905 & 272.8740 \end{bmatrix}$$

The eigenvalues and eigenvectors are calculated for the word ONE as shown below.

$$\text{Eigenvalues} = \begin{bmatrix} 0.0686 & 0 \\ 0 & 545.0499 \end{bmatrix}$$
$$\text{Eigenvectors} = \begin{bmatrix} -0.7075 & 0.7067 \\ 0.7067 & 0.7075 \end{bmatrix}$$
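This eigen-decomposition can be checked directly from the covariance matrix given above; a short numpy verification (eigenvector signs may differ):

```python
# Verify the eigen-decomposition reported for the word ONE.
import numpy as np

C = np.array([[272.2445, 272.4905],
              [272.4905, 272.8740]])
eigvals, eigvecs = np.linalg.eigh(C)
print(np.round(eigvals, 4))   # approx. [0.0686, 545.0499]
print(np.round(eigvecs, 4))   # columns approx. [-0.7075, 0.7067] and [0.7067, 0.7075], up to sign
```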

Then the principal components P1 = 0.4993 and P2 = 0.5003 are calculated. These steps are repeated to obtain the principal components for all words. The final fused tolerance features for all words are shown in Table 8.

Table 8 Fused tolerance features for speaker 1

Now there are 13 final enhanced features for every word. The mean of the MFCC features and the enhanced MFCC features is taken, and after normalization these are fed to a deep belief network for training. Two hidden layers with 200 and 300 neurons are used in the DBN, with the contrastive divergence learning rule, for 20 epochs. The final DBN input for speaker 1 is shown in Table 9.

Table 9 Input to DBN for speaker 1
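Continuing the build_dbn() sketch from Sect. 3.4, one plausible way to prepare this input is to scale the combined feature vectors to [0, 1] before training; the choice of MinMaxScaler is an assumption, since the exact normalization is not specified here.

```python
# Usage sketch: normalize the per-word feature vectors and train the DBN.
# X: one row per utterance (mean MFCC plus fused tolerance features); y: word labels.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                # scale features into [0, 1] for the RBMs
# X_scaled = scaler.fit_transform(X)
# dbn = build_dbn()                    # 200- and 300-unit hidden layers, 20 epochs
# dbn.fit(X_scaled, y)
```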

Similarly, enhanced features are calculated for all speakers and the data is fed to the DBN for training.

A comparison is made between MFCC features and enhanced MFCC features. The results show that when the system is trained with enhanced MFCC features, the accuracy at different SNRs is much better than with the baseline MFCC features. The reason is that the enhanced MFCC features are calculated by taking account of the variation in the speech signal, whether due to intra-speaker variability or environmental effects. For clean speech the accuracy is about 94% with MFCC features and 97% with enhanced MFCC features. At 15 dB, the accuracy is 56% with MFCC features and 96% with enhanced MFCC features. This shows that the accuracy roughly halves under signal variability when MFCC features are used, whereas for the DBN trained with enhanced MFCC features the accuracy remains almost the same as for clean signals. Thus, the system trained with enhanced MFCC features achieves good accuracy even in noisy conditions, as shown in Table 10.

Table 10 Comparison of % accuracy between MFCC and enhanced MFCC features on TIDIGITS at different SNR
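The exact noise type and mixing procedure behind the SNR conditions in Table 10 are not detailed here; as one common recipe, additive white noise can be scaled to a target SNR as in the hedged sketch below.

```python
# Hedged sketch: corrupt a clean signal with additive white noise at a target
# SNR in dB.  The actual noise conditions used in Table 10 may differ.
import numpy as np

def add_noise(signal, snr_db, rng=None):
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that 10*log10(p_signal / p_noise_scaled) == snr_db
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise
```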

A comparison is made between the proposed system and baseline systems on the standard TIDIGITS dataset. Seyedin, S., et al. [8] used features computed from the minimum variance distortionless response (MVDR) spectrum by modifying the PLP technique, known as PMSR features. In this technique, the sub-band weighting is modified to obtain the MVDR spectrum, and the LP coefficients are then transformed to obtain robust PMSR (R-PMSR) features. A comparison between R-PMSR, MFCC and the proposed enhanced MFCC features at different SNRs is shown in Fig. 2.

Fig. 2 Comparison of % accuracy with SNR

The results show that the proposed system achieves better accuracy at different SNRs than the baseline systems. In the baseline systems, the accuracy decreases sharply in high-noise environments, whereas the enhanced MFCC features work well in noisy environments. Therefore, for real-time applications, where both intra-speaker and environmental effects are of greatest interest, the enhanced MFCC features are the best choice.

5 Conclusion

In the proposed work, enhanced MFCC features are calculated in terms of tolerance 1 and tolerance 2. The recognition accuracy improves by around 3% when enhanced MFCC features are used instead of baseline MFCC features. Experimentation is done on TIDIGITS, and a deep belief network is used for classification. A comparison is made between the proposed technique and existing techniques: the baseline system using MFCC features gives 94.59% accuracy, the R-PMSR feature-based system gives 95.12% accuracy, and the enhanced MFCC based system gives 97.29% accuracy.