Keywords

1 Introduction

Speech is the most natural and efficient form of exchanging information between human. Automatic speech recognition (ASR) is defined as the process of converting a speech signal to a sequence of words by means of an algorithm implemented by a computer program. Speech recognition systems help users who cannot be able to use the traditional Input and Output (I/O) devices [1, 2]. The man–machine interface using speech recognition has helpful ways to for enable the visually impaired and computer laymen to use the updated technologies [3]. There are growing interests in developing machines that can accept speech as input. Given the substantial research efforts in the speech recognition worldwide and the rate at which computer becomes faster and smaller, we can expect more applications of speech recognition. The concept of machines being able to interact with people in natural form is very interesting. It is desirable to have machine interaction in voice mode in one’s native language. This is exclusively important in multilingual country such as India, where a majority of the people is not comfortable with speaking, reading, and listening English language [4]. Research in ASR by machine has attracted a great deal of attention over the past six decades. For centuries, researcher has tried to develop machines that can produce and understand speech as humans do so naturally in native language. Some successful applications of speech recognition are virtual reality, multimedia searches, auto-attendants, IVRS, natural language understanding and many more applications [57].

This paper is organized as follows: Sect. 2 describes the related work of this experiment. Section 3 describes the speech recognition with feature extraction techniques. Section 4 explains the design of the Marathi speech interface system. Section 5 describes the experimental results and their extensive application for talking calculator and Sect. 6 gives the concluding remark of paper followed by references.

2 Related Work

The main goal of speech recognition is to develop techniques and recognition systems for speech input to the machine. Human beings are comfortable speaking directly with computers rather than depending on primitive interfaces such as keyboards and pointing devices. The primitive interfaces like keyboard and pointing devices require a certain amount of skill for effective usage. Use of mouse requires good hand-eye coordination. It is very difficult for visually handicapped person to use the computer. Moreover, the current computer interface assumes a certain level of literacy from the user. It expects the user to have certain level of proficiency in English apart from typing skill. Speech interface helps to resolve these issues [8]. Computers which can recognize speech in native languages enable the common man to make use of those benefits in education technology [9]. The researcher turns toward performance improvement of speech recognition system for significant innovation of real-time applications [10]. Many factors are affecting the speech recognition such as regional, sociolinguistic, or related to the environment. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, nonstationary, etc.), especially when resources for system training are infrequent. Performance is affected by speech variability [11]. The majority of technological changes have been directed toward the purpose of increasing robustness of the recognition system [12].

N.S. Nehe et al., proposed an efficient feature extraction method for speech recognition. The features were obtained Linear Predictive Codding (LPC) and Discrete Wavelet Transformation (DWT). The proposed approach provides effective (better recognition rate), efficient (reduced feature vector dimension) features. Continuous Density Hidden Markov Model (CDHMM) has been implemented for system classification. The proposed algorithms were evaluated using an isolated Marathi digits database in the presence of white Gaussian noise [13]. Rajkumar S. Bhosale et al., worked on speech-independent recognition system using multi-class support vector machine and LPC [14].

The research work for Indian languages in speech recognition has not yet grasped to a critical level for a real-time communication as compare to other languages of developed countries. Countable attempts to develop a speech recognition system had been attempted by HP Labs India and IBM research lab, Google, IIT Powai, CDAC Pune [15, 16].

However, there is lots of opportunity to develop a speech recognition system for Indian languages. To achieve such aspiring motivation, the research is to develop Marathi speech interface system for talking calculator application.

3 Speech Recognition and Feature Extraction Techniques

For the development of Marathi speech activated calculator (MSAC) system, the recognition of speech is necessary. The speech is recognized and then the action is performed. For the speech recognition system speech recording (database creation), feature extraction, training, classification, and testing are fundamental steps. The step by step flow diagram of speech recognition system is described in Fig. 1.

Fig. 1
figure 1

Working of speech recognition system

3.1 Feature Extraction Techniques

The MSAC system is trained and tested using three approaches of feature extraction, such as basic feature extraction techniques, fusion approach of feature extraction techniques and Wavelet Decomposed Cepstral Coefficients (WDCC): Proposed Approach. In the basic feature extraction techniques, the Mel Frequency Cepstral Coefficient (MFCC), Linear Discriminative Analysis (LDA), Principal Component Analysis (PCA), LPC, Rasta-PLP Analysis, and Discrete Wavelet Transformation were implemented.

  1. (a)

    Mel Frequency Cepstral Coefficient

The enriched literature available on speech recognition, hence reported that the MFCC is most popular and robust technique for feature extraction [17, 18]. The MFCC is based on the known variation of the human ear’s critical bandwidth frequencies with filters spaced linearly at low frequencies [19]. In this experiment, we extracted the 13 and 39 features of MFCC. Figure 2 shows the graphical representation of shows 39 Mel Frequency Cepstral coefficients (MFCC) for गणकयंत्र of first seven frames. The MFCC extracted the basic feature, variation of energy feature.

Fig. 2
figure 2

Graphical representation of MFCC 39 coefficients of word गणकयंत्र

  1. (b)

    Linear Discriminative Analysis (LDA)

LDA algorithm provides better classification compared to principal components analysis [20]. From the literature the Linear Discriminant Analysis (LDA) is commonly used technique for data classification and dimensionality reduction, but in this research we used for feature extraction. Figure 3 shows the graphical representation of extracted LDA feature of speech signal गणकयंत्र for first 10 frames.

Fig. 3
figure 3

The LDA feature set for the speech signal

The LDA feature is the combination of the Projection matrix feature, eigenvalues, Mean square representation error, Bias feature, and mean of training data.

  1. (c)

    Principal Component Analysis (PCA)

The principal component analysis is the techniques for classification, but here we are used as feature extraction and dimension reduction. Figure 4 represents the graphical representation of interclass identification of PCA feature.

Fig. 4
figure 4

PCA with class classification

The principal component analysis extracts the 30 feature which is the combinations of projection feature, in class variation, the mean and eigenvalues of the speech signal.

  1. (d)

    LPC

The LPC is one of the robust and dynamic speech analysis techniques. In this research, we used the LPC for feature extraction. The graphical representation of the mean extracted LPC coefficient of speech signal गणकयंत्र is described in Fig. 5. The LPC 7 coefficients contain the Pitch, gain, and duration coefficient parameters of energy.

Fig. 5
figure 5

Mean of extracted LPC coefficient of the speech signal

  1. (e)

    RastaPLP

Rasta—PLP techniques is used for feature extraction. This extracted feature is the combination of the graphical band of voice signal. Figure 6 represents the graphical representation of the extracted Rasta PLP coefficients.

Fig. 6
figure 6

RASTA-PLP coefficient of speech signal गणकयंत्र

The Rasta feature contains the pitch, gain and duration of energy and short-term noise coefficients.

  1. (f)

    DWT

The wavelet series is just a sampled version of continuous wavelet transformation and its computation may consume a significant amount of time and resources, depending on the resolution required [21]. The DWT is also used for feature extraction as well as dimension reduction approach. The graphical representation of DWT approximation coefficient is described in Fig. 7.

Fig. 7
figure 7

The extracted approximation coefficient of the speech signal गणकयंत्र

DWT coefficient is calculated at approximation and detail level. Extracted DWT coefficients include the frequency variation of each frequency band in approximation level and energy variation with time duration in detail coefficients.

3.1.1 Fusion-Based Feature Extraction Techniques

The fusion approach means combination of different techniques. Total 13 MFCC features were extracted and feature vector was formed. The formed feature vector was passed to fusion technique as an input. The detail fusion approach of different techniques with MFCC and their properties is explained in Table 1.

Table 1 Fusion approach with MFCC and their properties

For the fusion-based approach, this research implemented the Mel Frequency Linear Discriminative Analysis (MFLDA), Mel Frequency Principal Component Analysis (MFPCA), Mel Frequency Discrete Wavelet Transformation (MFDWT), Mel Frequency Principal Discrete Wavelet Transformation (MFPDWT), and Mel Frequency Linear Discrete Wavelet Transformation (MFLDWT) fusion approach as feature extraction techniques. Figure 8 represents the graphical representation of extracted MFLDA feature for word गणकयंत्र”

Fig. 8
figure 8

The fusion approach of MFLDA (MFCC and LDA) of speech signal “गणकयंत्र”

3.1.2 WDCC: Proposed Approach

In proposed WDCC, the original speech signal is decomposed to second level. The packet coefficient offers different time, frequency representation qualities and consequently potential, for adaptation of the time series phenomenon. This strategy of decomposition offers richest analysis of signal [2226]. In the WDCC techniques, the original speech signal is decomposed second level. The approximation and detail coefficient is a distinguished output from decomposition step. The DCT operation is performed on horizontal coefficient, which is fused with basic acoustic coefficient are derived to first and second derivation where we got 18, 36, and 54 WDCC coefficients. The graphical representation of the extracted WDCC 18, 36, and 54 coefficient is shown in Figs. 9, 10 and 11 respectively.

Fig. 9
figure 9

The WDCC extracted 18 features of word “गणकयंत्र”

Fig. 10
figure 10

The WDCC extracted 36 features of word “गणकयंत्र”

Fig. 11
figure 11

The WDCC extracted 54 features of word “गणकयंत्र”

4 Design of Marathi Speech Interface System

For the speech recognition system speech recording (database creation), feature extraction, training, classification, and testing are fundamental steps.

4.1 Database Design

The collection of utterances in the proper manner is called the database. We implemented this prototyping application as a speaker dependent. The total number of words with probability 372 utterance is 20, and the data were collected in 03 session so the overall 22,320/- vocabulary size are collected in the database. The sampling frequency for all recordings was 16,000 Hz at the room temperature and normal humidity. The speech data are collected with the help of microphone realtech and matlab software using the single channel. The preprocessing is done with the help of computerized Speech Laboratory (CSL).

4.2 Marathi Speech Activated Calculator (MSAC)

In this research, our objective is to develop MSAC application. Figure 12 describes the basic structural diagram for talking calculator. The voice is recognized and specific action taken toward voice commands. The MSAC is the speaker-dependent interface system.

Fig. 12
figure 12

The basic structural diagram for talking calculator

5 Experimental Analysis

This application deals with defined set of experiments related to calculator applied on the database designed for this research work.

5.1 Performance of the Marathi Speech Activated Calculator (MSAC) System

The performance of the system is calculated on the basis of accuracy as well as a real time factor. The real-time factor (RTF) is the time required for recognition in response to the operation. The accuracy is calculated on the basis of confusion matrix in which number of token was passed randomly.

$$ {\text{Accuracy}} = \frac{N - C}{N}*100 $$

where N is a number of token passed and C is a number of token confuse. The RTF is a common metric for computing the speed of an ASR system. If it takes time P to process an input of duration I, the RTF is defined as

$$ {\text{RTF}} = \frac{P}{I} $$

5.2 Training of MSAC System

  • MSAC system was trained using individually for the MFCC (13 feature), MFCC (39 feature), LDA, PCA, LPC, Rasta-PLP, DWT techniques, and tested the performance on the basis of the Euclidian distance approach.

  • MSAC system was also trained for MFPCA, MFLDA, MFDWT, MFPDWT, and MFLDWT techniques. The performance of MSAC was tested on the basis of the Euclidian distance approach.

  • MSAC system was trained using WDCC with 18, 36, and 54 coefficients separately and evaluated for the performance.

5.3 Testing of MSAC System

  1. (a)

    Basic Feature Extraction

In this approach, MSAC system was tested using MFCC (13 feature) and MFCC (39 feature), LDA, PCA, LPC, Rasta-PLP, and DWT techniques. The 13 isolated words are used for testing. The performance of these techniques is calculated on the basis of average accuracy and RTF.

  • MFCC based MSAC performance

    The performance of MSAC using MFCC is considered on the basis of 18 and 39 coefficients. Total 13 words were tested for 32 trials. The average performance of MFCC for 18 and 39 coefficients are calculated as 75.78 and 78.03, respectively. The RTF for MFCC 18 and 39 coefficients is the 26 and 38 s, respectively.

  • LDA-based MSAC performance

    Total 13 words were tested for 32 trials. The average performance of LDA is calculated as 67.17 %. The responding time (RTF) for the action taken is 46 s.

  • PCA-based MSAC performance

    Total 13 words were tested for 32 trials. The average performance of PCA is calculated as 62.19 %. The RTF for MSAC using PCA techniques is 38 s.

  • LPC-based MSAC performance

    Total 13 words were tested for 32 trials. The average performance of LPC is calculated as 61.23 %. The RTF for recognition and action taken in calculator is 51 s.

  • Rasta-PLP-based MSAC performance

    Total 13 words were tested for 32 trials. The average performance of Rasta-PLP is calculated as 68.27 %. The RTF for recognition and action taken in calculator is 48 s.

  • DWT based MSAC performance

    Total 13 words were tested for 32 trials. The average performance of Rasta-PLP is calculated as 71.02 %. The responding time for the calculator for performing the action is 32 s.

The comparative performance of the MSAC with different feature extraction techniques is described in Table 2.

Table 2 The performance of MSAC of available feature extraction technique in literature

From Table 2, it is observed efficient accuracy is achieved with MFCC for 39 coefficients but RTF is bit increased than MFCC with 13 coefficients. MFCC 13 coefficient proved to be effective in term of accuracy and RTF.

  1. (b)

    Fusion-Based Feature Extraction Techniques

In the fusion feature extraction techniques base MSAC testing, we have tested the system using MFLDA, MFPCA, MFDWT, MFLPDWT, and MFLDWT techniques. We explored fusion approach for system implementation. If the dimension of the features is reduced without loss of information, this will reduce the RTF and the MFCC provides higher accuracy so we fuse these techniques with MFCC.

  • MFLDA-based MSAC performance

Total 13 words were tested for 35 trials. The average performance of MFLDA is calculated as 87.90 %. The responding time for the calculator for performing the action is 22 s.

  • MFPCA-based MSAC performance

Total 13 words were tested for 35 trials. The average performance of MFPCA is calculated as 87.12 %. The responding time for the calculator for performing the action is 20 s.

  • MFDWT-based MSAC performance

. Total 13 words were tested for 35 trials. The average performance of MFPCA is calculated as 86.17 %. The responding time for the calculator for performing the action is 6 s.

  • MFPDWT-based MSAC performance

Total 13 words were tested for 35 trials. The average performance of MFPCA is calculated as 89.09 %. The responding time for the calculator for performing the action is 12 s.

  • MFLDWT-based MSAC performance

Total 13 words were tested for 35 trials. The average performance of MFPCA is calculated as 90.12 %. The responding time for the calculator for performing the action is 10 s.

The comparative performance of MFLDA, MFPCA, MFDWT, MFPDWT, and MFLDWT are described in Table 3.

Table 3 Performance of the MSAC system using fusion approach

The MSAC Application is the real interface application which requires lowest real time factor. From the above results, it is observed that the MFLDWT is proven a higher accuracy than other techniques with acceptable RTF.

In the above fusion approach, we observed that the wavelet play an important role for the reducing the RTF so we proposed our own techniques on the basis of wavelet basic properties known as WDCC.

  1. (c)

    WDCC: A proposed approach

This wavelet decomposed Cepstral Coefficient is used for basic feature extraction. This technique extracted basic feature, first derivative as well as second derivative. From this technique extracted the 18 basic features, 36 features (18 + First Derivative) and 54 features (18 + First derivative + Second derivatives).

The performance of MSAC is tested on the basis of the proposed WDCC approach. The MSAC system is tested using WDCC 18, 36, and 54 coefficients approach. Total 13 words were tested for 35 trials. The average performance of WDCC with 18, 36, and 54 coefficients is calculated as 94.03, 95.01, and 98.04, respectively. The responding time for the calculator for performing the action using WDCC 18, 36, and 54 coefficients are 9, 20, and 27 s. The comparative performance of the MSAC system using WDCC technique of different features is described in Table 4.

Table 4 Performance of MSAC system using WDCC approach

From the above result, the WDCC with 54 coefficients gives the high accuracy with the RTF is slightly greater than other 18 and 36 coefficient. We observe that WDCC with 18 coefficients provided lowest RTF so we tested the MSAC system using WDCC using 18 coefficients.

For this MSAC interface system, we did 6 h training of the dataset. The testing environment varied the system performance. We test the MSAC interface system real time in the air conditioners in office environment, system processor sound, and group talking environment.

Error Rate in Noisy Environment for MSAC System

This MSAC system is the speaker-dependent system. The experiment is tested in noisy environment. The testing environment varied the system performance. We test the MSAC interface system real time in the following noisy environment. The error rate according to the acquisition environment is presented in Table 5.

Table 5 Error rate of MSAC in the different testing environment
  • Air conditioners in an office environment (when the air conditions are started in the office and the sound of the air conditions is mixed with test recorded samples).

  • System Processor sound (The system processor sound is become maximum and it mixed with the test sample)

  • Group talking environment (This system tested in real laboratory, number of people talking of each other naturally, they not knowing the system testing background).

5.4 Significance of MSAC

The salient feature of this research as below…

  • In the current era of technology, the evolution of speech recognition is done day by day so it is very necessary to bring this technology to the societies in regional language. It is observed that work done in Marathi has not received much more attention. Thus, we have attempted to design and developed Marathi Speech Activated Calculator (MSAC).

  • The MSAC system is directly beneficial for society people, where no need of computer literacy and knowledge of English.

  • The clustering approach such as PCA, LDA, and LPC was considered for feature extraction, and this gives a new path for speech researcher towards an implementation of real-time application.

  • From 1939 to till date, MFCC is one of the robust and dynamic feature extraction techniques; in this experiment, we proposed WDCC feature extraction technique, which gives better performance than MFCC.

  • It is very important to adapt research in real-time application; MSAC is the real-time application in Marathi regional language. This will be a chance for Marathwada people to adapt new technology in speech recognition through which they become the part of today’s modern technology.

6 Conclusion

From the enrich literature, it is observed that MFCC, PCA, LDA, LPC Rasta-PLP, and much more techniques available for feature extraction. Individual techniques have their own limitation. We tried to come up with these limitations using fusion approach. The database for the said application is recorded by standard protocol.

The MSAC system is tested on the basis of individual feature extraction techniques, fusion approach of MFCC and proposed WDCC approach. From the analysis, we observed the following:

  • Efficient accuracy is achieved with MFCC for 39 coefficients but RTF is bit increased than MFCC with 13 coefficients. MFCC 13 coefficient proved to be effective in term of accuracy and RTF.

  • MFLDWT is proven a higher accuracy than other fused techniques with acceptable RTF.

  • WDCC with 54 coefficients gives the high accuracy but the real time factor is slightly greater than other 18 and 36 coefficients.

  • WDCC with 18 coefficients provided acceptable accuracy with lowest RTF.

  • The testing environment such as air conditioners in office, system processor sound and group talking environment varies the performance of MSAC.

From the above study the author recommended the WDCC feature extraction techniques are robust and dynamic as compare to MFCC, LPC, Rasta, LDA, and PCA.