1 Introduction

Speech has long been the primary mode of human communication, and voice interaction with machines has evolved from recognizing isolated digits to today's complex Automatic Speech Recognition (ASR) systems. The desire to automate tasks, from the simple to the complex, has driven the development of human-machine interaction. Over the past decades, a great deal of research has aimed at building a system that can understand and analyse continuous speech in real time and act on it. Applications include speech-to-text conversion, biometric identification and home automation, and the technology has greatly benefited persons with disabilities. Advances in deep neural networks have made much of this possible. Hybrids of the Hidden Markov Model (HMM) with Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) have achieved remarkable performance in many large-vocabulary speech recognition tasks [42].

Progress to this point was not easy. The evolution of speech recognition systems can be organised into generations, as shown in Table 1.

Table 1. Evolution of Speech Recognition Systems

1.1 First Generation

The earliest attempts in the field of ASR were made in the 1950s and 1960s, when researchers were experimenting with the fundamental ideas of acoustic phonetics. In 1952, at Bell Laboratories in the USA, Davis, Biddulph and Balashek built a system that could recognize digits spoken in isolation by a single speaker, using formant frequencies measured during the vowel regions of each digit. Later, at University College in England, Fry and Denes built a system that could recognize four vowels and nine consonants. This was the benchmark of its time, recognizing phonemes with much better accuracy than before. In the 1960s, computers were not fast enough, which made hardware a limiting factor. The non-uniformity of time scales in speech was a further hurdle. To overcome this problem, Martin and his colleagues at RCA Laboratories developed a set of elementary time-normalisation methods. These methods reliably detected the start and end of speech, which reduced the variability of the recognition scores.

1.2 Second Generation

During the late 1960s and the 1970s, ASR reached many milestones. Dynamic programming methods, in particular Dynamic Time Warping (DTW), were introduced to align a pair of speech utterances in time, along with algorithms for connected word recognition. Many efforts were made during this period, for example at IBM, at AT&T Bell Labs and under the DARPA program.

1.3 Third Generation

During the late 1980s, the focus shifted to building more robust systems that could recognize a fluently spoken string of words. A wide range of algorithms was developed and tested for recognizing a concatenated string of spoken words. One of the key technologies developed during this time was the Hidden Markov Model (HMM), which was refined and came to be applied in virtually every speech recognition research laboratory. The idea of neural networks was also reintroduced to speech recognition; the concept had first appeared in the 1950s but had not proved useful because of practical limitations [44]. In the 1990s, after neural networks were reintroduced, many innovations appeared in the area of pattern recognition.

1.4 Fourth Generation

Over the past two decades, after many successful efforts to improve speech recognition systems, deep neural networks have taken speech recognition from laboratory experiments to real-world applications. People can now interact with devices and get many tasks done through voice commands alone, for example with "Ok Google", Siri and Alexa.

2 Speech Utterances

Speech recognition approaches are grouped according to the types of utterances they are able to recognize. These types are isolated speech, connected words, continuous speech and spontaneous speech; they are shown in Fig. 1 and discussed below:

Fig. 1. Types of utterances

2.1 Isolated Speech

Recognizers that work with isolated speech require each word to be surrounded by silence, that is, the absence of an audio signal on both sides of the sample window. Such a recognizer accepts a single utterance at a time. The procedure has two states, “Listen” and “Not-Listen”: the user must pause between words, and processing is carried out during the pauses. This type is also termed isolated utterance.

2.2 Connected Words

Connected-word systems require only a minimal pause between utterances, which lets speech flow more smoothly. In other respects, this type of utterance is similar to isolated speech.

2.3 Continuous Speech

Recognizers that work with continuous speech allow users to speak almost naturally while the computer determines the content. In essence, this is computer dictation. Such recognizers are harder to build because they need special methods to determine utterance boundaries.

2.4 Spontaneous Speech

Spontaneous speech is natural, unrehearsed speech. An ASR system for this type of speech must be able to handle a variety of natural speech features, such as words run together, fillers such as “ums” and “ahs”, and even slight stutters.

3 Speech Recognition Overview

Automatic Speech Recognition (ASR) is the process that allows humans to use their voice to communicate with a system in a way that, in its most sophisticated forms, resembles natural human conversation. It can be divided into five components, shown in Fig. 2 and discussed as follows:

3.1 Pre-processing

In this step, basic operations are performed on the signal before any features are extracted, for example noise removal, endpoint detection, pre-emphasis and normalisation.
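Three of the operations named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline of any surveyed system: the pre-emphasis coefficient 0.97, the frame length, the energy threshold and all function names are hypothetical choices made for the example.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly seen value; the first sample
    is passed through unchanged.
    """
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def peak_normalise(signal):
    """Scale the signal so its largest absolute sample is 1.0.

    An all-zero (silent) signal is returned unchanged.
    """
    peak = max((abs(s) for s in signal), default=0.0)
    if peak == 0.0:
        return list(signal)
    return [s / peak for s in signal]


def detect_endpoints(signal, frame_len=4, threshold=0.1):
    """Energy-based endpoint detection.

    Splits the signal into non-overlapping frames and returns the
    (start, end) sample indices spanning the first to the last frame
    whose mean energy exceeds the threshold, or None if no frame does.
    """
    active = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            active.append(start)
    if not active:
        return None
    return active[0], min(active[-1] + frame_len, len(signal))
```

Real systems usually work on overlapping frames of 20 to 30 ms and adapt the threshold to the background noise; the fixed values here only keep the example short.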

3.2 Feature Extraction

Features that will be used to differentiate between phonemes, and eventually between words and sentences, are extracted in this step. The most commonly extracted features for ASR are Mel-frequency cepstral coefficients (MFCCs) [43], which have been the most widely used features in ASR since the mid-1980s. Many other features, such as the Discrete Wavelet Transform (DWT), the Wavelet Packet Transform (WPT) and Linear Prediction Cepstral Coefficients (LPCCs), are available, each with its own strengths and weaknesses, and can be used as required [40].
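The "Mel" in MFCC refers to a perceptual frequency scale on which filter bands are spaced before the cepstral coefficients are computed. The sketch below shows only that spacing step, assuming one widely used form of the Hz-to-mel mapping (2595 log10(1 + f/700)); other variants exist, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (assumed 2595 * log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_centre_frequencies(n_filters, f_min, f_max):
    """Centre frequencies in Hz of n_filters bands equally spaced in mels.

    Equal spacing on the mel scale gives narrow bands at low
    frequencies and progressively wider bands at high frequencies.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

A full MFCC computation would additionally frame and window the signal, take the power spectrum, apply triangular filters at these centre frequencies, take logarithms and apply a discrete cosine transform.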

3.3 Classification

Numerous approaches have been tried in order to find an optimal classifier that can correctly recognize speech segments under various conditions. Classification techniques in use include Artificial Neural Networks (ANNs) and the Hidden Markov Model (HMM).
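To make the HMM option concrete, the following sketch scores a sequence of discrete observation symbols against one HMM per class with the forward algorithm and picks the best-scoring class. It is an illustration only: it assumes discrete (vector-quantised) observations and tiny hand-written models, whereas practical recognizers use continuous densities and log-probabilities. All names are hypothetical.

```python
def forward_probability(observations, initial, transition, emission):
    """Return P(observations | model) for a discrete HMM.

    initial[s]        probability of starting in state s
    transition[p][s]  probability of moving from state p to state s
    emission[s][o]    probability of emitting symbol o in state s
    observations      non-empty list of symbol indices
    """
    n_states = len(initial)
    # Initialisation: start in each state and emit the first symbol.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    # Induction: sum over all predecessor states, then emit the next symbol.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s][obs]
            for s in range(n_states)
        ]
    # Termination: total probability over all final states.
    return sum(alpha)


def classify(observations, models):
    """Pick the class whose HMM assigns the observations the highest probability.

    models maps a class label to an (initial, transition, emission) tuple.
    """
    return max(
        models,
        key=lambda name: forward_probability(observations, *models[name]),
    )
```

For long utterances the products underflow, which is why real implementations scale the forward variables or work in the log domain.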

3.4 Language Model

The language model contains the knowledge specific to a language. It is required to recognise phonemes and, eventually, to produce meaningful representations of the speech signal [41].
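One simple way to encode language-specific knowledge is an n-gram model, which scores candidate word sequences so that plausible ones are preferred. The text does not say which kind of language model is meant, so the bigram model with add-one smoothing below is an assumption chosen for brevity, and the function names and sentence-boundary tokens are hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigram histories and bigrams over whitespace-tokenised sentences.

    Each sentence is wrapped in <s> ... </s> boundary tokens.
    Returns (history_counts, bigram_counts).
    """
    history, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history.update(tokens[:-1])              # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return history, bigrams


def bigram_probability(sentence, history, bigrams, vocab_size):
    """Add-one smoothed probability of a sentence under the bigram model.

    vocab_size is the number of distinct tokens that may follow a
    history (all words plus the end token </s>).
    """
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (history[prev] + vocab_size)
    return prob
```

In a recognizer, scores of this kind are combined with acoustic scores so that, among acoustically similar hypotheses, the more likely word sequence wins.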

3.5 Acoustic Modelling

Acoustic modelling establishes the relationship between acoustic information and language constructs in speech recognition (SR) [35].

Fig. 2. Automatic speech recognition process

4 Speech Recognition Approaches

Speech recognition techniques can be grouped into three major categories, shown in Fig. 3 and discussed below:

Fig. 3. ASR approaches

4.1 Acoustic Phonetic Approach

The acoustic-phonetic approach postulates that speech is built from phoneme units, each of which can be characterised by a set of acoustic properties. These properties are highly variable with respect to the speaker and the environment [36]. The first step is spectral analysis of the speech, followed by feature extraction, which converts the spectral measurements into a set of features that describe the phonetic characteristics. Next, the speech signal is segmented into isolated regions and each region is labelled. Finally, the appropriate word is identified from the resulting sequence of phonetic labels. However, this approach is not widely used.

4.2 Pattern Recognition Approach

The pattern recognition approach identifies patterns according to certain conditions and assigns them to classes. It involves four steps, namely:

  • feature measurements,

  • pattern training,

  • pattern classification and

  • decision logic.

Measurements are made on the input speech signal to define a test pattern. A reference pattern is generated for each speech sound to be recognized, either from speech templates or by using a statistical model such as an HMM. The model can be applied to a sound, a word or a phrase [46]. In pattern classification, the unknown pattern is compared with the reference patterns. Finally, decision logic determines the identity of the unknown. This is the approach primarily used in ASR systems.
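The last three steps can be sketched on fixed-length feature vectors. This is a toy illustration under stated assumptions: feature measurement is taken as already done (each utterance is a short list of numbers), reference patterns are plain per-class averages, the comparison is Euclidean distance, and the optional rejection threshold is a hypothetical addition to the decision logic.

```python
import math


def train_reference_patterns(training_data):
    """Pattern training: average each class's feature vectors into one reference.

    training_data maps a class label to a list of equal-length vectors.
    """
    references = {}
    for label, vectors in training_data.items():
        dim = len(vectors[0])
        references[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return references


def classify_pattern(test_vector, references):
    """Pattern classification: distance from the test pattern to every reference."""
    return {label: math.dist(test_vector, ref) for label, ref in references.items()}


def decide(distances, reject_above=None):
    """Decision logic: choose the closest class, or None if even that is too far."""
    label = min(distances, key=distances.get)
    if reject_above is not None and distances[label] > reject_above:
        return None
    return label
```

Speech patterns vary in length, so a real system replaces the fixed-length Euclidean comparison with a time-aligned distance or a statistical model.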

4.2.1 Template Based Approach

The underlying idea of the template-based approach is simple. A collection of prototypical speech patterns is stored as reference patterns representing the vocabulary of candidate words. Recognition is then performed by matching the unknown utterance against every reference pattern and selecting the category of the best-matching pattern. Usually, templates are constructed for whole words. One key idea in this approach is to derive a typical sequence of speech frames for a word by means of some averaging procedure and to rely on local spectral distance measures to compare patterns [45]. Another key idea is to use some form of dynamic programming to align the patterns in time, to account for differences in speaking rate across speakers as well as across repetitions of the word by the same speaker.

Depending on the context, this approach handles the varying shape of a global contour while postulating only a reasonable number of contexts. It is considerably more robust and supports long-range prediction. The templates are designed to capture and reproduce the most significant syllable-level structures [39] of the contour without much smoothing. To represent the template shape, the approach exploits flexible alignment of the utterance. The benefit of this approach is that it avoids errors caused by segmenting or classifying smaller, more variable units such as phonemes [38].

4.2.2 DTW and HMM Based Approaches

In dynamic time warping (DTW), each class to be recognized is represented by one or more templates. To better model speaker variability or pronunciation variation, it is preferable to use two or more reference templates per class. During recognition, a distance is computed between the observed speech sequence and each class template. Stretched and warped versions of the reference templates are also used in the distance calculation, to remove the influence of duration mismatch between the observed sequence and the template. The recognized word corresponds to the path through the model that minimises the total distance [41]. The performance of DTW can be improved by increasing the number of template variants per class and by loosening the warping constraints, but at the expense of storage space and computation. Because of its better generalisation properties and lower memory requirements, the HMM-based approach is more widely used than DTW in state-of-the-art systems.
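The distance computation described above can be sketched with the classic dynamic-programming recurrence for dynamic time warping. The example is a minimal version under simplifying assumptions: frames are single numbers compared by absolute difference, no warping-window constraint is applied, and the function names are hypothetical. Real systems compare feature vectors such as MFCCs.

```python
def dtw_distance(seq_a, seq_b, frame_distance=lambda x, y: abs(x - y)):
    """Minimum cumulative frame distance over all monotonic alignments.

    cost[i][j] holds the best total distance for aligning the first i
    frames of seq_a with the first j frames of seq_b.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: stretch seq_b, stretch seq_a, advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognise(utterance, templates):
    """Return the label of the class with the closest template.

    templates maps a class label to a list of reference sequences,
    so each class may carry several template variants.
    """
    best_label, best_distance = None, float("inf")
    for label, references in templates.items():
        for reference in references:
            distance = dtw_distance(utterance, reference)
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label
```

The table has one cell per pair of frames and recognition repeats the computation for every stored template, which illustrates the storage and computation costs noted above.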

4.3 Artificial Intelligence Approach

The artificial intelligence approach is a combination of the acoustic-phonetic and pattern recognition approaches. Some researchers have built knowledge bases of acoustic-phonetic features for speech recognition systems, encoding classification rules for speech sounds [37]. Template-based methods provide little insight into human speech processing, yet they have been very effective in the development of a variety of automatic speech recognition systems. In contrast, linguistic and phonetic studies do provide insight into human speech processing. However, this approach has had only partial success because of the difficulty of quantifying expert knowledge.

5 Various Speech Recognition Systems

This study attempts to examine all published work on automatic speech recognition for Indic languages. Works that address speech recognition systems for Indic languages, or related investigations on Indian ASR datasets, whether experimental or non-experimental, are included in this review. The purpose of this critical study is to establish the current state of research on automatic speech recognition. ASR studies for Indian languages have focused on:

  • LPC (Linear Predictive Coding),

  • MFCC (Mel-Frequency Cepstral Coefficients),

  • RASTA (Relative Spectral Processing),

  • ZCPA (Zero Crossing with Peak amplitude), and

  • DTW (Dynamic Time Warping) features.

The features and accuracy of various ASR systems are summarised in Table 2.

Table 2. Survey based on feature extraction techniques

Various classification techniques are available for automatic speech recognition, such as HMM, GMM, RNN, SOM, DE-HMM, DE-GMM, MPE, MMI and MLE. The techniques used in various research works, together with the features extracted and the language addressed, are summarised in Table 3.

Table 3. Survey based on classification techniques

6 Challenges and Future Directions in Speech Recognition

The robustness of an ASR system is its ability to deal effectively with the various sources of variability in the input speech signal. The accuracy of a speech recognition system is affected by a number of prominent factors, the most noticeable being variability in speaker, pronunciation, region, speech rate, context, channel and environment. In developing ASR systems, these challenging factors must be taken into account, and efficient models must be built that deliver good recognition accuracy regardless of these variabilities. At an advanced level, ASR system development requires algorithms for the automatic generation of word lexicons, the automatic generation of language models for new tasks, automatic speech segmentation and optimal utterance verification and rejection, with the goal of attaining or exceeding human performance on ASR tasks. Some of the challenges and future directions are discussed below:

  • Many ASR systems lack a large speech corpus. Such a corpus should include tonal, dialectal and prosodic information to allow more analytical processing.

  • Languages such as Punjabi, Bodo and Dogri are tonal Indic languages. Studies are needed that use vocal tract and pitch information for these languages and their dialects.

  • Another major problem with vernacular languages is the variability of dialectal data. Few studies have extracted dialectal information for Indic languages. Such information needs to be combined with speech recognition methods to reduce the Word Error Rate.

  • Several works in this field have used bottleneck features. Most speech databases developed for Indic languages were recorded in noise-free conditions. In future, researchers can build noisy datasets and develop speech recognition systems on them, using pitch characteristics and robust methods to improve system performance.

  • By applying optimisation algorithms to the model parameters, an attempt can be made to improve the acoustic features. Very few works have addressed optimising or refining the features. Studies in other languages rely on established feature extraction techniques such as MFCC, and only a few have used hybrid feature extraction techniques to refine the features.
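Word Error Rate, the quantity the list above aims to reduce, is conventionally computed as the word-level edit distance between the recognizer output and a reference transcript, divided by the number of reference words. The sketch below assumes whitespace tokenisation and equal cost for substitutions, insertions and deletions; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count.

    Uses a row-by-row Levenshtein distance over words. The reference
    must contain at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] = edit distance between the words read so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.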

7 Conclusion

Speech recognition is one of the most enabling areas of machine intelligence, since humans perform speech recognition as an everyday activity. In this survey, various speech recognition techniques and related works are reviewed, and the features extracted and classifiers applied to the input speech signal are tabulated. Three factors in particular, namely the approach, the features extracted and the accuracy measure, were considered when comparing and studying the existing works. The analysis carried out in this study summarises the progress made in the field of automatic speech recognition and should help researchers formulate ideas for surpassing existing benchmark results. Finally, some research challenges and future directions are presented to guide further research. In future, researchers can build noisy datasets and develop speech recognition systems on them, using pitch characteristics and robust methods to improve system performance.