1 Introduction

Speech recognition converts spoken words into text so that a machine can interpret natural voice input and act on it. Word-based models require large amounts of training data, so sub-word models, most commonly phoneme-based ones, are used to overcome this constraint [1]. Phoneme recognition is therefore one of the fundamental problems in speech recognition. A phoneme is the basic unit of a spoken language, and a limited number of phonemes can be combined to generate every possible word in that language. Because there are few phonemes, different rules and procedures can be applied to them. Phoneme recognition technologies are applied in speech recognition, speaker identification, language identification, and speech synthesis [2].

Figure 1 shows the process of speech recognition. Phoneme-based speech recognition systems find the most probable phoneme sequence for the given input speech features [3]. Researchers have applied different feature extraction and classification methods to recognize phonemes. Phoneme recognition can be further divided into vowel recognition and consonant recognition. Vowels are produced with an open vocal tract, that is, without constriction at any point above the glottis. Their three articulatory features are height, backness, and roundness. Reviews of the evolving theories of vowel recognition have been presented [4]. Consonants are an important part of any phoneme-based system. They are articulated with a constriction or closure at some point along the vocal tract. By place of articulation, consonants are categorized as bilabial, alveolar, velar, postalveolar, labiodental, and retroflex. They are further categorized by manner of articulation and by voicing. Plosive (stop), fricative, approximant, and closure are the categories of manner of articulation, while voicing indicates the presence of a voiced region [5]. Recognizing consonants is a challenging task in speech recognition because of the way they are produced, and researchers have tried a variety of techniques to recognize them.

Fig. 1. Speech recognition block diagram

However, phoneme-based speech recognition is subject to contextual effects. Phonemes are categorized into vowels and consonants. Stop consonants are produced by obstructing the airflow and then releasing it. All phases of a stop consonant are important and need to be addressed in speech recognition: the period in which the articulator moves, the period in which it obstructs the airflow, and the release, when the articulators separate to let the air out. Recognizing semivowels is also challenging because they are acoustically similar to vowels [6]. Nasalization is difficult to detect because of antiresonance. Researchers have attempted to reduce contextual effects by using context-dependent triphones [7,8,9] and by using longer acoustic units [10].
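As a minimal sketch of the context-dependent triphone idea, the Python snippet below expands a phoneme sequence into triphone labels, each naming a phoneme together with its left and right neighbours. The `left-centre+right` label format (a convention used, for example, in HTK-style systems), the `sil` boundary symbol, the function name `to_triphones`, and the example phoneme string are illustrative assumptions and are not taken from the cited works.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phoneme sequence into 'left-centre+right' triphone labels.

    `boundary` is an assumed symbol used as the context at both utterance edges.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    # Hypothetical three-phoneme utterance.
    print(to_triphones(["k", "a", "m"]))
    # ['sil-k+a', 'k-a+m', 'a-m+sil']
```

Each phoneme then gets a separate model per context, which captures coarticulation at the cost of many more units to train.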

Hindi speech recognition systems built with various feature extraction techniques are discussed in the literature [2, 11,12,13]. Researchers have also studied phoneme recognition in Hindi for different categories of phonemes, such as vowels and consonants. The works in [2, 14,15,16,17,18,19,20] address issues including the recognition of vowels, consonants, and stop consonants, structural analysis, phoneme aspiration detection, the retroflex effect, and dentals, using different feature extraction techniques and classification methods.

Earlier reviews of Hindi speech recognition deal with speech recognition as a whole. This review instead presents the issues specific to Hindi phoneme recognition, so that researchers can understand the underlying concepts of feature extraction and classification and use them to improve Hindi speech recognition. The characteristics of Hindi speech are defined, and the feature extraction and classification techniques are then described in detail.

The rest of the paper is organized as follows. Related work is discussed in Sect. 2, and Sect. 3 highlights Hindi language issues. Feature extraction methods are explained in Sect. 4. Section 5 describes the classification methods, and Sect. 6 presents the results and analysis. Section 7 concludes the paper and gives future directions.

2 Related Work

A comprehensive review explored Bangla phoneme recognition with the Hidden Markov Model (HMM) and with multilayer versus single-layer neural networks, and a comparative analysis was also presented [21]. A review of phoneme recognition covered three classifiers: HMM, neural network (NN), and vector quantization (VQ) [22]. Classification and feature extraction techniques for phoneme and isolated-word recognition were described in [23]. A survey of recent deep neural network (DNN) based methods for phoneme recognition on a popular speech corpus concluded that a simple feed-forward DNN yields a lower phone error rate (PER) than other DNN-based methods [24]. A study examined important characteristics of pronunciation issues for the evaluation and comparison of speech recognition [25]. A cross-language survey of voice onset time (VOT) addressed the two stop consonants /d/ and /t/ and compared VOT across the investigated languages to explore the differences for these stops [26]. A survey was presented on phoneme recognition with the support vector machine (SVM) classifier [27]. A study using recurrent neural networks (RNNs) concluded that RNNs improved speech recognition [28]. A review of landmark detection, such as VOT and burst release, for detecting stop consonants stated that stops are difficult to recognize because of their low energy, high variability, and random behavior [29]. Discriminative phonetic features (DPFs), a way of representing phonemes, were also reviewed, and it was concluded that Arabic, a Semitic language, needs more DPF-related research.

Monophone and hybrid subword units were used to build a speech recognition system [30]. Domain-based syntactic structures were also applied to improve recognition by reducing the search space during the recognition process. A maximum word accuracy of 88.54% was achieved with perceptual linear prediction (PLP) features with energy coefficients and a hybrid model, and substitution errors were found to be the most frequent. Hindi vowel recognition was explored in [2]. The speech recognition framework was implemented using MFCCs with five-state HMM-based modeling, and the vowels were subgrouped into front, back, and middle vowels. An average recognition score of 83.19% was recorded. The results showed that the consonant score was predicted accurately over a broad range of signal-to-noise ratios. Researchers also studied how vowels and consonants shape the recognition process. It was demonstrated that lexical processing is more strongly connected to consonants than to vowels. Acoustically, vowels are continuous and long, while consonants are transitory in nature [31]. An extensive confusion analysis of Hindi phonemes was presented in [32]. Experiments were carried out using HMMs and PLP coefficients on Hindi continuous speech utterances, and results were reported for both consonants and vowels. The vowels attained 70% recognition accuracy, and the palatal phonemes achieved the maximum recognition score of 94%.

3 Hindi Language

The Hindi alphabet is well defined [33]. Its letters are divided into consonants and vowels. Hindi has about fifty-eight phonemic letters, which include ten vowels, thirty-seven consonants, and an additional five nuktas taken from Farsi/Arabic [34]. Some of the dominant features of Hindi are aspiration, gemination, nasalization, and retroflexion [14, 35]. Hindi sounds may be voiced or unvoiced, and Hindi vowels are divided into short and long vowels. Table 1 shows the acoustic classification of Hindi vowels. Table 2 presents Hindi consonants with IPA symbols [36]. Table 3 presents Hindi semivowels and fricatives with IPA symbols [37].

Table 1. Hindi vowel acoustic classification
Table 2. Hindi consonants
Table 3. Hindi semivowels and fricatives

4 Feature Extraction Methods

Vowel recognition using Mel-frequency cepstral coefficients (MFCCs) was presented in [2]. Hindi consonants were classified using electromyography (EMG) based sub-vocal features in [38] and linear prediction coefficients (LPCs) in [17]. For the recognition of Hindi phonemes, the features used are wavelet sub-band based temporal features in [19] and MFCCs in [14]. A phoneme-based system for the Hindi language was also developed using MFCCs, PLPs, and LPCs with their variants [39]. The results show that PLPs and MFCCs performed better than LPCs. Hybrid subword units with PLPs were also used to improve phoneme-based speech recognition [30].
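To make the "Mel-frequency" part of MFCCs concrete, the following minimal Python sketch computes one widely used form of the Hz-to-mel mapping, m = 2595 log10(1 + f/700), and the centre frequencies of a bank of filters spaced evenly on the mel scale. It covers only this one step of MFCC extraction. The function names, the choice of 26 filters, and the 0 to 8000 Hz range are illustrative assumptions and do not describe the configuration of any cited work.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (common 2595 * log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Centre frequencies (Hz) of n_filters filters spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


if __name__ == "__main__":
    # Assumed setup: 26 filters between 0 Hz and 8000 Hz.
    centers = mel_center_frequencies(26, 0.0, 8000.0)
    print([round(c) for c in centers[:3]], "...", round(centers[-1]))
```

Because the spacing is uniform in mels, the filters are dense at low frequencies and increasingly sparse at high frequencies, which roughly mirrors the frequency resolution of human hearing.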

5 Classification Methods

Two kinds of models, generative and discriminative, are generally used for classification in speech recognition. Generative models learn the joint probability distribution of the observed acoustic features and the corresponding speech labels and classify using Bayes' rule. In contrast, discriminative training is used to optimize the model parameters [40]. The HMM is simple in design and practical for representing variability in speech, but HMMs are not efficient at modeling nonlinear functions. In contrast to HMMs, artificial neural networks (ANNs) allow efficient discriminative training. Other research presented deep neural networks (DNNs) for improving speech recognition [41]. Several other methods, such as Gaussianization and methods based on discriminative training, were tried experimentally [42]. Sequence-to-sequence acoustic modeling was proposed for speech recognition [43].

Researchers have applied different classification methods for speech recognition, many of them based on HMMs. ANN-based classification was applied in [14]. Other works reported Gaussian mixture modeling (GMM) for vowel recognition [44]. Vector quantization was used for the recognition of Hindi consonants in [17]. Context-dependent HMMs (CDHMMs) were also used for vowel classification [45].

Researchers have applied different metrics to evaluate phoneme recognition, such as phoneme error rate (PER), phoneme accuracy, and phoneme correctness. Phoneme accuracy and PER were used by most researchers, and both are defined in [5, 46, 47].

Hindi phonemes were characterized using time-delay neural networks (TDNNs) [48]. Queries related to Indian railways, consisting of 207 Hindi vocabulary words, were used in the experiment. The features used in the study were frame-based MFCCs with cepstral mean normalization. Different TDNNs were trained and tested for Hindi phoneme categorization. Studies have also predicted consonant recognition and confusion in background noise using microscopic speech recognition [49].
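The evaluation metrics named in this section can be illustrated with a short Python sketch. It assumes the conventional edit-distance definitions, which may differ in detail from those in the cited works: with N reference phonemes and S substitutions, D deletions, and I insertions in a minimum-cost alignment, PER = 100 (S + D + I) / N, accuracy = 100 - PER, and correctness = 100 (N - S - D) / N. The function names and the example phoneme sequences are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if ref[i - 1] == hyp[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + diff,  # match or substitution
                cost[i - 1][j] + 1,         # deletion
                cost[i][j - 1] + 1,         # insertion
            )
    # Trace back through the table to count each error type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        diff = 0 if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + diff:
            subs += diff
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def phoneme_metrics(ref, hyp):
    """Return PER, accuracy, and correctness as percentages of the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    s, d, i = align_counts(ref, hyp)
    n = len(ref)
    per = 100.0 * (s + d + i) / n
    return {"per": per, "accuracy": 100.0 - per, "correctness": 100.0 * (n - s - d) / n}


if __name__ == "__main__":
    # Hypothetical reference and recognizer output: one substitution, one deletion.
    print(phoneme_metrics(["k", "a", "m", "a", "l"], ["k", "a", "n", "a"]))
```

Under these definitions insertions lower accuracy but not correctness, so accuracy can fall below zero when a recognizer inserts many spurious phonemes.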

6 Results and Analysis

The results indicate that researchers have worked on the recognition of Hindi phonemes, consonants, and vowels, with most works reporting on vowel recognition. Some findings cover only a small group of phonemes, and results are reported for different categories of phonemes. Researchers used different speech recognition systems. Phoneme recognition was also reported for different environmental conditions, both clean and noisy.

MFCCs, LPCs, and wavelet-based methods were applied. Feature extraction techniques based on wavelet sub-bands, and the combination of wavelet cepstral features with harmonic energy features, improved speech recognition. It was suggested that stops are the most difficult sounds to recognize in speech signals because of their short-duration frequency bursts. Research work also presented different segmentation techniques. Researchers also experimented with subvocal speech recognition based on electromyography (EMG) signals and applied Gammatone frequency cepstral coefficients (GFCCs). Table 4 lists the features used in different research works. The comparative analysis is based on the extracted features, classification methods, phoneme types, and accuracy.

Table 4. Phoneme recognition comparative analysis

The findings show that researchers applied different classification methods: HMM-based, ANN-based, GMM-based, and vector quantization. Methods based on backpropagation and time-delay neural networks were also applied. Recognition results were reported in terms of accuracy. It was also observed that vowel recognition was the most explored task.

7 Conclusion

A review of Hindi phoneme recognition is presented to understand the issues related to Hindi speech recognition. Issues related to Hindi phonemes, such as Hindi speech characteristics, the features used in phoneme recognition, and classification methods, are described together with related work. Classifiers based on HMM, ANN, GMM, and VQ have been used in experiments, and feature extraction methods improved phoneme recognition. Researchers also worked on subcategories such as vowels and consonants in addition to phonemes in general. Research on Hindi speech recognition using deep learning was also presented. Studies of phoneme confusion analysis were presented to understand and improve speech recognition, and they revealed that substitution errors occurred most often. The main outcome of this study is that researchers have worked mainly on vowel recognition. Further research may include more studies exploring Hindi phonology and applying hybrid feature extraction and classification methods.