1 Introduction

Speech is the most natural, efficient and preferred mode of communication between humans. It can therefore be assumed that people are more comfortable using speech as an input mode for machines than more primitive input modes such as keypads and keyboards. An automatic speech recognition (ASR) system helps achieve this goal. Such a system allows a computer to take an audio file, or speech captured directly from a microphone, as input and convert it into text, preferably in the script of the spoken language. An ideal ASR should be able to “perceive” the given input, “recognize” the spoken words, and then use the recognized words as input to another machine so that some “action” can be performed on them [42, 126, 160]. Accordingly, ASRs can be regarded as a future means of communication between humans and machines.

Human speech and accents vary enormously, and this variation in speech patterns is one of the biggest obstacles in creating an automatic speech recognition system. Bilingual or multilingual people tend to show more of these variations in their speech patterns than people who speak only one language. The same problem also arises when we add factors such as gender, social style/dialect, speaking style and speed into the equation [40, 112]. Another obstacle to creating an ASR is finding enough resources to train the model. Currently, adequate training resources are available for only a handful of the approximately 6500 languages spoken in the world.

Over the past few years, many survey papers have been published to review and examine various aspects of ASR models presented over time. A recently published survey [160] discussed the challenges an ASR has to overcome and analyzed the well-known ASR models. The challenges it examined include utterance approach and style, different speaker models, vocabulary size, and channel variability. The paper also highlighted three classification approaches: the acoustic-phonetic approach, the pattern recognition approach, and the artificial intelligence approach. In another work [144], the authors reviewed the efficiency of different feature extraction techniques, including perceptual linear prediction (PLP), revised perceptual linear prediction (RPLP), and Bark frequency cepstral coefficients (BFCC), and compared the results of these techniques on different classification models. [97] also presented challenges to the real-world implementation of an optimal ASR system and classified ASRs on the basis of speaker mode, speaking mode, and vocabulary size. That paper elaborated on both the front-end and the back-end of an ASR system: the front-end discussion covered different feature extraction techniques in detail, whereas the back-end discussion covered various classification techniques extensively. Another survey [42] also reviewed different feature extraction techniques and classification models; in addition, it briefly defined different types of speech, speech analysis techniques and their impact on system performance, and the word error rate (WER), a metric used to calculate the accuracy of the results produced by an ASR. Similarly, [12] focused solely on ASR for under-resourced languages, discussing what defines an under-resourced language, why its preservation is important, data collection methods for such languages, and the basic structure of an ASR for an under-resourced language. Correspondingly, [157] comprehensively explained different hybrid HMM-ANN based ASRs, whereas [32] gave an overview of different ASRs and of the approaches that can be used to recognize speech, along with a brief discussion of different speech recognition techniques. In another endeavor, [86] also discussed different types of ASRs and neural network based speech recognition approaches.

Table 1 presents the highlights of the survey papers discussed in the previous paragraph in a compact and easily comprehensible form. The columns indicate which topics each paper covered and which were missing.

Table 1 Highlights and shortcomings of the discussed surveys

Most of the previously conducted studies failed to review the different feature extraction techniques and language models that play a vital part in the construction of an ASR. Similarly, the latest deep learning techniques were not explained in the above-mentioned survey papers, and the online toolkits and databases that can help train an ASR were also missing from most of the studies. Hence, this study aims to evaluate the different feature extraction techniques and deep learning classification techniques; in addition, different online toolkits, databases, and language models are also assessed.

This study captures all the aspects of an ASR from the feature extraction phase to language models with the following objectives in mind:

  • To understand and explain the basic structure of an ASR (shown in Fig. 2) in detail, as well as discuss how using different techniques at different stages can affect the overall performance of the system.

  • Discuss in detail the different feature extraction and classification techniques being used for the development of an ASR.

  • Evaluate different toolkits and advancements made in language models and how they affect the performance of an ASR.

  • Encapsulate all of the information available regarding the different modules of an ASR, including different state-of-the-art deep learning classification techniques.

The rest of the paper is organized as follows: Section 2 discusses different tools, resources and techniques that were used to perform this literature review. Section 3 presents a brief history of ASR, different techniques and datasets that can be employed to calculate the accuracy of the ASR, as well as the basic structure of an ASR. Section 4 explains the state-of-the-art techniques being used to extract features from an audio signal, whereas, Section 5 discusses techniques that can be used for classifying the extracted features. Section 6 explains language models, why they are needed, and their types. Section 7 presents the toolkits that can be used to perform different ASR related tasks, and finally, the survey is concluded in Section 8.

2 Research methodology

Before writing this survey, a literature review was performed to determine the cutting-edge technologies in this field. In this regard, IEEE, arxiv.org, Microsoft Academic, and Google Scholar were used to search for and obtain papers relevant to the research domain. The relevant scientific seed words were first identified from generic words and their synonyms related to the domain; later on, more specific seed words identified from different publications were used.

This method of searching ensured that all of the keywords were present in the titles of the research articles and publications. The AND operation was used to make sure all of the selected words appeared in the titles, and double quotation marks were used to ensure that the words appeared together as a phrase rather than as individual words. Of all the keywords used, “Speech Recognition” yielded the most results, but also the noisiest. Hence, to get better results, additional query terms were added to the seed words. The acquired articles were studied, and the state-of-the-art classification techniques, datasets, and feature extraction techniques were determined. Fig. 1 presents an overview of the methodology followed to perform the research for this survey.

Fig. 1 Overview of search method

The three factors that determined how the literature was filtered were relevance to the survey topic, how recently the research was conducted, and how thoroughly the paper covered the chosen topic. Table 2 shows the details of the databases used to obtain the literature.

Table 2 Databases used for acquiring literature

3 Background

Before we get into the technical details of ASR systems, it is important to be familiar with the history of ASR. Hence, this section discusses the first speech recognition system followed by the advancements made to date. This section also highlights different datasets that can be used for training and testing an ASR, as well as different evaluation techniques that can be used to measure its performance.

Most of the speech recognition models are developed using a generic model. This generic model and its different types are also discussed in this section.

3.1 History and early developments

For quite some time, computer scientists have been trying to create a machine that can talk and communicate like a human. Since the early 1950s, researchers have been trying to make a computer understand, interpret, and reproduce human languages and speech [53]. The first speech recognition system, called Audrey, was developed at Bell Laboratories; it could distinguish between different digits spoken by a single user [33]. Another system, developed at the MIT Lincoln Laboratories in 1959, could distinguish between 10 phonemes for a single speaker [39]. In the 1970s, a lot of important research was carried out in the area of speech recognition: Russian scientists developed a system that could be used to distinguish words [164], and the ideas of using dynamic programming [138] and pattern recognition algorithms [164] were also presented during these years. In the early 1980s, the hidden Markov model (HMM) was introduced. Even though the HMM was considered too simple to identify human languages [62], it still managed to replace the dynamic time warping technique that was being used [69]. In the later years of the 1980s, the n-gram model was introduced. In the early 2000s, the HMM began to be used in combination with a feed-forward artificial neural network (ANN) [14]. Nowadays, long short-term memory (LSTM) [14], a type of recurrent neural network (RNN), is used for speech recognition in combination with different deep learning techniques.

3.2 Evaluation techniques

Evaluation is one of the most important aspects of any piece of research; because of its importance, this section explains in detail the different metrics that can be used to evaluate the performance of an ASR. The performance of a speech recognition system usually depends on two factors: the accuracy of the output produced and the processing speed of the ASR.

3.2.1 Speed

The following method can be used to calculate the processing speed of an ASR:

3.2.2 Real-time factor

The real-time factor (RTF) is the most commonly used metric for calculating the speed of a proposed model. The RTF can be computed by using the following formula:

$$ RTF=\frac{P}{I} $$

where P is the time taken by the system to process the input and I is the duration of the input audio. If RTF equals 1, then the input audio was processed in “real time”. RTF is highly hardware-dependent, and it is not limited to measuring the speed of speech recognition models; it can be used to measure the speed of any model that processes an audio or video input.

3.2.3 Accuracy

The following methods can be used to measure the accuracy of an ASR:

Word error rate

The accuracy of an ASR is hard to calculate, as the output produced by the ASR may not have the same length as the ground truth. The word error rate (WER) is the most commonly used metric for estimating the performance of an ASR, as it calculates the error at the word level rather than the phoneme level [124]. The WER can be calculated using the following formula:

$$ WER=\frac{S+D+I}{N} $$

Where S is the number of substitutions performed in the output text as compared to the ground truth. D is the number of deletions performed, and I is the number of insertions performed. N is the total number of words in the ground truth.

Word recognition rate

Word Recognition Rate (WRR) is a variation of WER that can also be used to evaluate the performance of an ASR. It can be calculated using the following formula:

$$ WRR = 1 - WER = \frac{N-S-D-I}{N} = \frac{H-I}{N} $$

Where H = N - (S + D) represents the total number of correctly guessed words.
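Computing S, D, and I in practice amounts to a word-level edit-distance (Levenshtein) alignment between the recognizer output and the ground truth. The following Python sketch is our own illustration of how WER and WRR can be computed from two transcripts; it is not taken from any of the surveyed systems.

```python
def word_error_rate(reference: str, hypothesis: str):
    """Compute WER and WRR via a word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # deletions only
    for j in range(1, m + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Backtrace to count substitutions (S), deletions (D), and insertions (I)
    S = D = I = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1

    wer = (S + D + I) / max(n, 1)
    return wer, 1.0 - wer                # (WER, WRR)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In this toy example, one substitution (sat/sit) and one deletion (the) give WER = 2/6 ≈ 0.33 and WRR ≈ 0.67.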

3.3 Datasets

A dataset is essential for the training and testing of an ASR. This section discusses in detail some of the commonly used open-source as well as paid datasets. Table 3 provides a list of the available speech datasets; and their salient features such as total time and spoken languages.

Table 3 List of speech datasets that can be used for training an ASR

3.3.1 LibriSpeech

LibriSpeech [116] is one of the most frequently used open-source speech-to-text corpora. It consists of 1000 h of audiobooks along with their transcriptions. Because of the large volume of collected data, the training data was divided into three sets: the first comprises 100 h, the second 360 h, and the last 500 h of training data. The development set and the testing set contain 10.8 and 10.1 h of data, respectively.

3.3.2 2000 HUB5 English evaluation transcripts

The 2000 HUB5 English evaluation transcripts dataset was used in the Deep Speech model [50]. It consists of conversational audio and the corresponding transcriptions, distributed as forty source files, each with its corresponding text. Twenty of these files are scripted: a robot operator announces the topic of conversation before the conversation starts. The remaining twenty files consist of unscripted conversations between native English speakers.

3.3.3 TIMIT acoustic-phonetic continuous speech Corpus

Another commonly used dataset for speech recognition is the TIMIT acoustic-phonetic continuous speech corpus [45]. It consists of recordings of 6300 phonetically rich sentences read by 630 speakers, of whom 30% are female and the rest male. The training set consists of 3.14 h of recordings; the rest is divided between the test and development sets.

3.3.4 CHiME-5

CHiME-5 [8] is another dataset that can be used for training an ASR. The main idea behind this dataset was to aid the creation of a genuinely robust speech recognition system. It contains 50.12 h of conversations recorded in real home environments. The training set consists of 40.33 h of data with almost 80,000 utterances, the development set has 4.27 h of data with a little over 7000 utterances, and the testing set contains 5.12 h of data with 11,000 utterances.

3.3.5 TED-LIUM Corpus

The TED-LIUM corpus [131] is an open-source speech dataset containing 452 h of TED talks and their corresponding transcriptions.

3.3.6 Common voice

Common Voice is a crowd-sourcing project started by Mozilla to gather speech data. It is an open-source project in which people can donate their voice by reading out a given sentence, or donate their time by validating whether a particular audio file matches its corresponding transcription. The project has gathered 2400 h of data in different languages, of which 1900 h have been validated. Currently, it provides speech datasets for English, German, French, Welsh, Turkish, and 13 other languages.

3.3.7 The spoken Wikipedia

This free dataset contains 1005 h of audio files in three languages: English, German, and Dutch. The English portion is the largest, consisting of 395 h of audio covering 1339 Wikipedia pages spoken by 465 speakers. The German portion consists of 386 h of audio covering the content of 1014 pages spoken by 350 speakers. The Dutch portion is quite small compared to the other two: it consists of only 224 h of data, even though it covers the largest number of pages (3171), spoken by 145 speakers.

3.3.8 CSTR VCTK Corpus

This dataset consists of 400 sentences spoken by 109 distinct speakers. All of the speakers are native English speakers of varying age, gender, and accent. The dataset contains almost 9 h of audio data.

3.3.9 AISHELL-1

This open-source dataset offers 170 h of Mandarin speech data from 400 unique speakers of all genders and ages. To make the dataset more robust, speech on different subjects, such as finance, science and technology, entertainment, and sports, was included.

3.4 The architecture of an ASR

The function of an ASR is to take a sound wave as input and convert the spoken speech into text; the input can be captured directly with a microphone or provided as an audio file. The problem can be stated as follows: given an input sequence of acoustic observations X = X1, X2, …, Xn, where n is the length of the input sequence, the ASR must find the corresponding word sequence W = W1, W2, …, Wm, where m is the length of the output sequence, that has the highest posterior probability P(W|X), which can be calculated using the following formula:

$$ \hat{W} = \underset{W}{\operatorname{argmax}}\ P(W \mid X) = \underset{W}{\operatorname{argmax}}\ \frac{P(W)\,P(X \mid W)}{P(X)} $$

where P(W) is the prior probability of the word sequence W (given by the language model), P(X) is the probability of observing the acoustic signal X, and P(X|W) is the probability of observing the acoustic signal X given the word sequence W (given by the acoustic model).
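As a toy illustration of this decision rule, the sketch below scores two hypothetical word sequences with invented language-model and acoustic-model probabilities and picks the one maximizing P(W)P(X|W); since P(X) is the same for every candidate, it can be dropped from the argmax.

```python
# Toy illustration of W* = argmax_W P(W) * P(X|W). All probabilities are
# invented for the example; in a real ASR, P(W) comes from a language model
# and P(X|W) from an acoustic model. P(X) is omitted because it is identical
# for every candidate and does not change the argmax.
candidates = {
    "recognise speech":   {"P_W": 1e-4, "P_X_given_W": 1e-6},
    "wreck a nice beach": {"P_W": 1e-6, "P_X_given_W": 2e-6},
}

best = max(candidates, key=lambda w: candidates[w]["P_W"] * candidates[w]["P_X_given_W"])
print(best)  # -> "recognise speech"
```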

An ASR can generally be divided into four modules: a pre-processing module, a feature extraction module, a classification model, and a language model, as shown in Fig. 2. Usually the input given to an ASR is captured using a microphone, which means that noise may be carried alongside the audio. The goal of pre-processing is to improve the signal-to-noise ratio of the audio [176]. There are different filters and methods that can be applied to a sound signal to reduce the associated noise; framing, normalization, end-point detection, and pre-emphasis are some of the frequently used methods [105, 114, 135]. Pre-processing methods also vary based on the algorithm being used for feature extraction, as certain feature extraction algorithms require a specific type of pre-processing to be applied to their input signal.

Fig. 2 Basic structure of an ASR

After pre-processing, the cleaned speech signal is passed to the feature extraction module. The performance and efficiency of the classification module are highly dependent on the extracted features [3, 78, 178]. There are different methods of extracting features from speech signals; features are usually a predefined number of coefficients or values obtained by applying various methods to the input speech signal. The feature extraction module should be robust to factors such as noise and echo. The most commonly used feature extraction methods are Mel-frequency cepstral coefficients (MFCCs), linear predictive coding (LPC), and the discrete wavelet transform (DWT) [40, 78, 112, 127].

The third module is the classification model, which is used to predict the text corresponding to the input speech signal. The classification model takes the features extracted in the previous stage as input to predict the text. As with feature extraction, different approaches can be applied to perform the task of speech recognition. The first type of approach forms a joint probability distribution from the training dataset and uses it to predict future outputs; this is called the generative approach, and HMMs and Gaussian mixture models (GMMs) are the most commonly used models based on it. The second approach estimates a parametric model from a training set of input vectors and their corresponding output vectors; this is called the discriminative approach, and support vector machines (SVM) and ANNs are its most common examples [11, 87]. Hybrid approaches can also be used for classification purposes; one example of such a hybrid model is the combination of an HMM and an ANN [151].

The language model is the last module of the ASR; it encodes the rules and semantics of a language. A language model is needed to interpret the phonemes predicted by the classifier, and it is also used to form trigrams, words, or sentences from the predicted phonemes of a given input. Most modern ASRs can also work without a language model; such ASRs can still predict the words and sentences spoken in the given input, but their accuracy can be increased significantly by using a language model [18].

3.4.1 Types of ASR

As shown in Fig. 3, an ASR can be classified on the basis of speaker mode, vocabulary size, channel variability, and speaking style, where speaking style can be further divided into two types: utterance approach and utterance style.

Fig. 3 Types of ASR

  • Speaker Mode

The goal of creating an ASR is that it can transcribe any language for any speaker. Languages differ in terms of phonetics, character set, and grammar rules; speakers vary in terms of voice pitch, accent, and personality. Every speaker has a unique voice and speaking style; on this basis, an ASR can be classified into the following three types:

Speaker-independent models

Speaker-independent ASRs are developed to recognize multiple speakers. Such systems are not trained for a particular user and are one of the most complex types of systems to design. These systems might offer less accuracy than other methods but are more flexible and can have wide usage in the real world.

Speaker-dependent models

Speaker-dependent ASRs are developed to recognize a single user or multiple pre-trained users. Such systems are easy to train and also offer better accuracy than speaker-independent ASRs, but they will not produce the same level of accuracy for voices outside the user pool they were trained on.

Speaker adaptive models

Speaker adaptive ASRs lie somewhere in between speaker-independent and speaker-dependent ASRs. These systems are trained in such a way that they can learn new speech patterns whenever a new speaker is encountered.

  • Vocabulary Size

The vocabulary of an ASR matters a great deal, as it affects the complexity, processing time, and accuracy of the system. The larger the vocabulary, the more complex the system will be and the more time will be required to train it. The accuracy of the system will also decrease, because a larger vocabulary contains more similar-sounding words. Some ASRs may require a vocabulary of only tens of words, for example a digit recognition or character recognition system, while for others even tens of thousands of words may not be enough; an ASR that recognizes the full English language, for example, requires a far larger vocabulary than a digit-recognizing ASR.

Small

A small vocabulary can consist of tens of words.

Medium

A vocabulary containing hundreds of words is considered to be a medium-sized vocabulary.

Large

A large vocabulary can consist of thousands of words.

Very large

A very large vocabulary usually has tens of thousands of words.

Out of vocabulary

All words that are not part of the vocabulary are mapped to an unknown-word token.

  • Speaking Style

In speech recognition, an utterance is a unit of spoken input: a single word, a few words, a single sentence, or a few sentences can all be considered an utterance. Based on the utterance type, multiple approaches can be used to develop an ASR.

Utterance approach

An utterance is divided into two types: isolated and connected words.

  (a) Isolated Words

A system that is based on the isolated word type of utterance requires its users to take a well-defined pause between each spoken word. This does not necessarily mean that the system will only take one-word input at a time and produce one-word output. Such systems can take multiple words as input but will only process one of them at a time.

  (b) Connected Words

A connected-word system, on the other hand, works with connected utterances, in which there is minimal or no pause between two or more words. Such systems can take an input of multiple words at a time and process them as a whole rather than individually.

Utterance style

Since every speaker has their own speaking style, utterances can also be divided into two types on this basis: continuous and spontaneous speech [82].

  (a) Continuous Speech

In continuous speech utterances, the users of the system are allowed to speak almost naturally. These types of utterances do not require a pause between words. The input given to the system is considered as a whole and is not divided into individual words based on pauses.

  (b) Spontaneous Speech

Spontaneous speech utterances are completely natural. Such utterances may include false starts, coughing, laughter, and filler words like “um” and “ah”. These systems are very difficult to develop, as the system requires a very large vocabulary and must also be able to differentiate between valid words and other sounds.

  • Channel Variability

Another way of classifying ASRs is on the basis of the quality of the input channel. Some ASRs require input signals that are recorded in a clean environment, i.e. without any background noise. Noise is unnecessary or unwanted information in the input speech signal; it can be anything from the chirping of birds in the background to distortion caused by the sound not being recorded correctly. The input sound wave can also become distorted when it is converted between channels or formats by different software.

Besides noise, the difference in ages, gender, accent, environment, and speaking speed are also considered as variations in the input signal. An ASR should be able to cope with all of the different types of background noises or variations in the input speech signal [40, 71].

4 Feature extraction

Feature extraction is applied to remove irrelevant information from the signal. A good feature extraction algorithm should be able to extract the features in real time and should retain as much relevant information as possible. Feature extraction algorithms can be classified according to the kind of speech features they use: temporal or spectral features. Temporal analysis techniques analyze the audio signal in its original form, the time domain; in spectral analysis, as the name implies, the spectral representation of the speech signal, the frequency domain, is used. Some of the methods used for feature extraction are MFCC, PLP, DWT, relative spectral-perceptual linear prediction (RASTA-PLP), and LPC.

4.1 Spectral feature analysis

4.1.1 Mel-frequency cepstral coefficients

MFCC [37, 114] is one of the most powerful and most commonly used techniques for feature extraction [22, 27, 76, 98, 111, 156].

The human ear does not perceive the pitch of a sound linearly. To account for this, a perceptual scale called the Mel scale was introduced in the 1940s, when researchers were experimenting with how the human ear perceives pitch. It maps frequency onto a scale that is approximately linear with respect to perceived pitch [156]. The experiments performed to develop this scale concluded that only frequencies between 0 and 1000 Hz map approximately linearly onto the Mel scale; values outside this range map logarithmically [147]. The following formula can be used to convert a frequency to the Mel scale:

$$ F_{mel}=\frac{1000}{\log (2)}\cdot \log\left(1+\frac{F_{Hz}}{1000}\right) $$

Here Fmel is the resulting Mel-scale frequency, and FHz is the original frequency in Hertz.
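A direct implementation of this mapping (assuming the logarithmic form given above; the function name is ours) is shown below.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # F_mel = (1000 / log 2) * log(1 + F_Hz / 1000)
    return (1000.0 / math.log(2)) * math.log(1.0 + f_hz / 1000.0)

print(hz_to_mel(1000))  # 1000 Hz maps to 1000 mel on this scale
```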

A continuous audio signal has different values at different points in time. To simplify processing, the signal is divided into small frames of either 25 ms [22, 35, 111, 137, 155, 156] or 30 ms [54, 147], with 10 ms of overlap between consecutive frames. Once the audio signal is divided into frames, each frame is multiplied by a Hamming window function, and the discrete Fourier transform (DFT) is applied to the result [105]. Sometimes the fast Fourier transform (FFT) is used instead to reduce the processing time of the overall process [114]. The result of the Fourier transform is then passed through a Mel filter bank, and the filter-bank outputs are used to calculate log energies with the formula given below:

$$ X_i = \log_{10}\left(\sum_{k=0}^{N-1} \left| X(k) \right| \, H_i(k)\right), \quad \text{for } i=1,\dots,M $$

Where Hi(k) is the i-th Mel filter, X(k) is the k-th coefficient of the Fourier transform of the frame, N is the length of the Fourier transform, M is the number of filters in the filter bank, and Xi are the log energy outputs. In the end, the discrete cosine transform (DCT) is applied to the log energy outputs using the formula given below:

$$ C_j=\sum_{i=1}^{M} X_i \cos\left(j\left(i-\frac{1}{2}\right)\frac{\pi }{M}\right),\quad \text{for } j=0,\dots,J-1 $$

Where Cj are the Mel-frequency cepstral coefficients, j is the serial index, and J is the total number of MFCC features. The DCT concentrates most of the energy in the lower-order coefficients, so dimensionality reduction can be achieved by discarding the higher-order coefficients, which carry little energy [105, 147]. A block diagram that summarizes the MFCC process is shown in Fig. 4.

Fig. 4 Block diagram of MFCC process

Although the input audio is divided into frames of 25 or 30 ms, the influence of one phoneme can extend over more than one frame; thus, the timing correlation between multiple frames should also be considered for more accurate results. This can be taken into account by using the delta and delta-delta features of MFCC: the delta MFCCs add the dynamic features, whereas the delta-delta MFCCs add the acceleration features. The feature vector obtained from the MFCC algorithm therefore contains three types of features: the static features, the differences between static features of successive frames (delta features), and the differences between successive dynamic features (delta-delta features). An MFCC feature vector usually consists of thirty-nine dimensions, thirteen for each type of feature: static, dynamic (delta), and acceleration (delta-delta). Another variation of the MFCC feature vector contains the normalized log energy as well; this vector also has thirty-nine dimensions, but the static part has twelve dimensions instead of the usual thirteen [30, 35, 98, 156].

MFCC may be the most commonly used feature extraction method, but it is not without limitations. One drawback of this algorithm is that it is not robust to noise: if even one of the frequency bands in the input signal is distorted, the MFCC results suffer greatly [47, 63, 91, 106, 111]. Another drawback is the assumption made during framing that one phoneme can be mapped to 25 to 30 ms of audio; different speaking styles and accents can stretch one phoneme over two frames or squeeze the information of two phonemes into one frame, so this assumption may not yield the best results. Mean and variance normalization (MVN) [63], cepstral mean normalization [91, 111], and histogram equalization [63] are some of the techniques that can be used to make MFCC more robust.
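For reference, the full MFCC pipeline of Fig. 4, together with the delta and delta-delta features discussed above, is available in common signal-processing toolkits. The sketch below uses the librosa library purely as an illustration; the file name, frame length, and hop size are our own assumptions rather than settings prescribed by the surveyed works.

```python
import numpy as np
import librosa

# Load audio, resampled to 16 kHz; "speech.wav" is a placeholder path.
y, sr = librosa.load("speech.wav", sr=16000)

# 13 static MFCCs per frame: framing, windowing, FFT, Mel filter bank, log, DCT.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 25 ms frames
                            hop_length=int(0.010 * sr))  # 10 ms hop

# Delta (dynamic) and delta-delta (acceleration) features.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])  # shape: (39, n_frames)
print(features.shape)
```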

4.1.2 Linear predictive coding

Linear predictive coding (LPC) [37, 114], introduced in 1984 [113], is one of the most powerful methods for extracting features from a speech signal and has hence become one of the most commonly used feature extraction algorithms [107, 108, 151, 174]. Unlike MFCC, which models the human auditory system, LPC imitates the basic structure of the vocal tract [30]. It also corresponds closely to the basic model of speech production, in which speech is modelled as a linear, time-varying system driven either by periodic pulses (voiced sounds) or by random noise (unvoiced sounds) [48, 135, 157].

The basic idea behind this algorithm is that the current sample can be represented as a linear combination of previous samples. The LPC analysis is computed by first dividing the input audio into frames and then windowing these frames to make sure there are no discontinuities at the beginning or end of any frame. The last step of the process is to calculate the autocorrelation of each frame; LPC analysis is then performed on the obtained autocorrelation values, either by using Durbin's method [48, 108, 113] or by using the formula given below [48, 178]:

$$ s\left[n\right]\approx \sum \limits_{k=1}^pa\left[k\right]s\left[n-k\right] $$

Where s[n] is the current sample, p is the number of previous samples used, which are also called predictors [4], and a[k] are the predictor coefficients.

The main goal of LPC is to calculate the coefficients a[k] for each frame such that E, the total squared prediction error, is minimized. Once the LPC analysis is performed, the total squared prediction error can be calculated using the formula given below:

$$ E=\sum \limits_n{\left(s\left[n\right]-\sum \limits_{k=1}^pa\left[k\right]s\left[n-k\right]\right)}^2 $$
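As an illustration, the librosa library computes LPC coefficients for a frame via the Levinson-Durbin recursion; the sketch below (file name, frame length, and prediction order are our assumptions) also computes the squared prediction error of the frame. Note that librosa returns the coefficients of the error filter A(z) = 1 + a1 z^{-1} + ..., so the predictor coefficients in the notation above correspond to the negated a[1:].

```python
import numpy as np
import librosa
from scipy.signal import lfilter

y, sr = librosa.load("speech.wav", sr=16000)   # placeholder file name

frame_len = int(0.025 * sr)                    # 25 ms frame
order = 12                                     # prediction order p

frame = y[:frame_len] * np.hamming(frame_len)  # windowing to avoid edge discontinuities
a = librosa.lpc(frame, order=order)            # a[0] = 1, a[1:] define the error filter A(z)

# Prediction residual e[n] obtained by filtering the frame with A(z);
# its energy is the total squared prediction error E of the frame.
e = lfilter(a, [1.0], frame)
print("total squared prediction error:", np.sum(e ** 2))
```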

4.1.3 Linear predictive cepstral coefficients

After performing an LPC analysis on the given input audio, the following formula is applied to get linear predictive cepstral coefficients (LPCC) [125]:

$$ \hat{v}[n]=\begin{cases} \ln (p), & n=0 \\ a[n]+\sum_{k=1}^{n-1}\left(\frac{k}{n}\right)\hat{v}[k]\,a[n-k], & 1\le n\le p \end{cases} $$

Where p is the prediction order of the LPC analysis, \( \hat{v}\left[n\right] \) are the cepstral coefficients, and a[n] are the predictor coefficients of the analysis frame.
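The recursion above translates directly into a short function; the sketch below is our own illustration and assumes the LPC coefficients a[1..p] are already available (here filled with made-up values).

```python
import math

def lpc_to_lpcc(a, p):
    """Convert LPC predictor coefficients a[1..p] (index 0 unused) into LPCC
    coefficients, following the recursion given in the text."""
    c = [0.0] * (p + 1)
    c[0] = math.log(p)                       # v[0] = ln(p), as given above
    for n in range(1, p + 1):
        acc = a[n]
        for k in range(1, n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c

# Example with hypothetical predictor coefficients for p = 4
print(lpc_to_lpcc([None, 0.5, -0.3, 0.1, 0.05], p=4))
```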

Recent research [21] examined the performance of LPCC compared to MFCC. The system used to study these feature extraction algorithms could identify twelve Hindi words spoken by five different speakers, and it showed that LPCC and MFCC gave similar results. Another study [165] showed that LPCC was 10% more efficient and 5.5% faster than MFCC. Fig. 5 sums up the LPCC process in the form of a block diagram.

Fig. 5 Block diagram of LPCC

4.1.4 Perceptual linear prediction

PLP uses transformations that are based on the human auditory system. The algorithm has three main characteristics: critical-band spectral resolution, application of the intensity-loudness power law, and equal-loudness curve weighting. By remapping the frequency axis to the Bark scale, PLP incorporates critical-band spectral resolution into its spectrum estimate and produces a critical-band spectrum approximation that integrates the energy in critical bands. Human hearing is more sensitive to the middle of the audible frequency range at conversational speech levels; PLP incorporates this phenomenon by multiplying the critical-band spectrum by the equal-loudness curve, which suppresses the high- and low-frequency regions relative to the mid-range of roughly 400 Hz to 1200 Hz. Finally, a nonlinear relationship exists between the perceived loudness and the intensity of a sound; cube-root amplitude compression of the loudness-equalized critical-band spectrum estimate is used to approximate this power law of hearing [88].

To calculate the coefficients of PLP, windowing is performed on the input signal, and then an FFT is applied on the windowed input signal. The resultant signal is then converted into Bark Scale using the formula given below:

$$ \theta \left({B}_i\right)=\sum \limits_{B=-1.3}^{2.5}{\left|X\left(B-{B}_i\right)\right|}^2\ \psi (B) $$

Where θ(Bi) is the Bark-scaled critical-band spectrum and X is the input signal. The Bark-scaled frequency ensures that the critical-band frequency selectivity of the human cochlea is modelled [105, 125]. Once the Bark-scaled spectrum is calculated, it is weighted according to the equal-loudness curve, and then the intensity-loudness power law is applied to the weighted spectrum. An inverse Fourier transform (IFT), linear predictive analysis, and cepstral analysis are then performed to obtain the PLP coefficients [105, 125]. Fig. 6 summarizes the steps performed in PLP in the form of a block diagram.

Fig. 6 Block diagram of PLP analysis

The research in [55] presented an HMM-ANN system that recognized English-language phonemes and used PLP as its feature extraction algorithm. The system used the TIMIT corpus for training and testing and achieved an accuracy of 64.9%; however, when it was tested on HTIMIT, which consists of speech data collected over different telephone channels, the accuracy dropped to 34.4%. The research in [44] compared the performance of PLP with MFCC in noisy environments, using two types of noise: white noise and street noise. The system was a multilingual one that could recognize words in six languages: Hungarian, English, French, Italian, Spanish, and German. The results showed that PLP achieved 0.2% higher accuracy than MFCC.

4.2 Temporal feature analysis

4.2.1 Relative spectra–perceptual linear prediction

RASTA-PLP analysis is designed for noisy environments and merges RASTA and PLP analysis. Training and testing conditions often differ: testing data usually contains more real-life factors such as noise, inter-speaker variation, intra-speaker variation, and differences in the transmission channel. The basis of RASTA [139] analysis is that the temporal properties of the environment in which the input signal was recorded vary from the temporal properties of the speech itself. By band-pass filtering the trajectory of each spectral sub-band, short-term noise is smoothed out, and the difference between training and testing environments is reduced significantly. The block diagram shown in Fig. 7 explains the steps performed to calculate RASTA-PLP features.

Fig. 7 Block diagram of the process of RASTA-PLP

Another study [56] compared LPC, MFCC, and RASTA-PLP as feature extraction techniques for a system that recognized digits of the Kannada language. The input signals were pre-processed using wavelet transforms: the DWT was used for clean signals, whereas the wavelet packet transform (WPT) was used for noisy signals. For clean speech signals, MFCC had the highest accuracy at 94%, followed by LPC at 82% and RASTA-PLP at 54%. For noisy signals without pre-processing, RASTA-PLP had the highest accuracy at 73%, followed by MFCC at 60% and LPC at 53%. After applying the WPT, the accuracies of all three feature extraction methods increased, with RASTA-PLP having the highest accuracy of 83%.

Hence, for noisy datasets, RASTA-PLP performs considerably better than the other feature extraction methods compared, whereas it may not perform as well for clean speech signals. It was also observed in [56] that RASTA-PLP can perform even better when combined with WPT.

4.2.2 Discrete wavelet transform

Speech signals are not stationary and contain both temporal and frequency information. Even though most algorithms focus only on frequency information, temporal information is equally important [5, 121, 137]. The DWT captures the temporal information present in the input audio signal by re-scaling and shifting a mother wavelet and analyzing the signal against it. As a result, the input signal is analyzed not only at different frequency levels but at different resolutions as well [5, 121].

The DWT is based on multi-resolution analysis, according to which lower-frequency components persist for a much longer duration than higher-frequency components in a speech signal. For this reason, instead of using a single window size, different window sizes are used for lower- and higher-frequency components: a narrow window for higher frequencies and a wider one for lower frequencies [121]. The DWT was created to replicate the working of the human auditory system, in which decreasing frequency resolution is used to analyze increasing frequencies in a signal [135].

The DWT analysis divides the input speech signal into two types of coefficients: detail and approximation coefficients. The detail coefficients represent the low-scale, high-frequency components of the input signal, and the approximation coefficients represent the high-scale, low-frequency components [78, 127]. The DWT can be computed using the fast pyramidal algorithm proposed by Stephane G. Mallat [96], which uses multi-rate filter banks and is known as Mallat-tree decomposition. The algorithm decomposes the signal into detail and approximation coefficients, as shown in Fig. 8.

Fig. 8 Decomposition of speech signal into high-frequency and low-frequency components

The input speech signal is passed through a high-pass and a low-pass filter to obtain the detail and approximation coefficients, and the outputs of the filters are then down-sampled by a factor of two; the down-sampled outputs are the required coefficients. The process of filtering and down-sampling can be expressed mathematically by the formulas given below [5]:

$$ y_{low}[k]=\sum_{n} x[n]\, h[2k-n], \qquad y_{high}[k]=\sum_{n} x[n]\, g[2k-n] $$

Where x[n] is the input signal, h[n] is the low-pass filter, and g[n] is the high-pass filter. The approximation coefficients can be decomposed further by repeating the same steps.

Speech energy lies mostly in the lower-frequency components of a signal: even if the higher-frequency components are removed, the speech in the signal remains understandable, although its overall sound is different. The research done in [43] shows that using the approximation coefficients, rather than the detail coefficients, to generate the octaves achieves better accuracy.

The DWT coefficients are obtained by concatenating the approximation and detail coefficients, starting from the last decomposition level. The total number of decomposition levels is chosen based on the frame size; decompositions of between 3 and 6 octaves are commonly used. The filters used for computing the DWT should form a quadrature mirror filter (QMF) pair, which can be obtained using the formula given below:

$$ g\left[L-1-n\right]={\left(-1\right)}^n\times h\left[n\right] $$

where L is the length of the filter. The QMF relationship will ensure that the original input can be perfectly reconstructed from the decomposed signal.
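In practice, the filter-bank decomposition and perfect reconstruction described above are available off the shelf; the sketch below uses the PyWavelets (pywt) package with a Daubechies-4 wavelet and a four-level decomposition, both arbitrary choices made for illustration.

```python
import numpy as np
import pywt

# A stand-in for one frame of a speech signal (a real frame would come from a dataset).
signal = np.random.randn(1024)

# Four-level Mallat decomposition with a Daubechies-4 wavelet.
coeffs = pywt.wavedec(signal, wavelet="db4", level=4)
cA4, cD4, cD3, cD2, cD1 = coeffs   # approximation + detail coefficients per level

# Concatenate the coefficients into a single DWT feature vector,
# starting from the last decomposition level as described above.
features = np.concatenate(coeffs)
print([c.shape for c in coeffs], features.shape)

# The QMF relationship guarantees perfect reconstruction of the original signal.
reconstructed = pywt.waverec(coeffs, wavelet="db4")
print(np.allclose(signal, reconstructed[: len(signal)]))
```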

The DWT is very robust to noise, as it works with localized time and frequency information. Hence, if one of the frequency bands of the input signal is altered by noise, not all of the coefficients produced by the algorithm are affected. For this reason, many ASR studies have used the DWT as their feature extraction method [43, 64, 100, 151, 167].

4.2.3 Wavelet packet transform

WPT is very similar to DWT; the only difference is that in WPT the detail coefficients are decomposed further as well, not just the approximation coefficients [78]. The research done in [105] compared the performance of DFT-based algorithms with algorithms based on the discrete wavelet packet transform (DWPT) for the task of speech recognition; one of the DFT-based algorithms under consideration was MFCC. This research showed that DWPT-based methods performed better than DFT-based algorithms: compared against MFCC, a 20% reduction in word error rate was achieved with a DWPT-based method. Another study [105] compared the performance of WPT against DWT, using an ASR that could identify the Malayalam language. Here DWT outperformed WPT, achieving an accuracy of 89% compared to only 61% for WPT.
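For comparison, a wavelet packet decomposition can be obtained with PyWavelets as well; unlike the DWT sketch above, every node is split further, so level 3 yields 2^3 sub-bands. The wavelet, depth, and random input are again illustrative assumptions.

```python
import numpy as np
import pywt

signal = np.random.randn(1024)                 # stand-in for a speech frame

wp = pywt.WaveletPacket(data=signal, wavelet="db4", mode="symmetric", maxlevel=3)

# Unlike the DWT, both approximation *and* detail branches are split further,
# giving 2**3 = 8 frequency sub-bands at level 3.
nodes = wp.get_level(3, order="freq")
features = np.concatenate([node.data for node in nodes])
print(len(nodes), features.shape)
```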

4.3 Summary

From the above discussion, it is clear that, in the past, feature extraction techniques based on spectral analysis were preferred over techniques that used temporal analysis. However, over the past few years it became apparent that spectral analysis alone is not enough to gather maximum information from the input speech signal. Hence, wavelet techniques, which use temporal analysis, have been used in some studies instead of MFCC and LPC; DWT achieved better results for the task of phoneme recognition than the more commonly used MFCC.

Storage space is another factor that should be taken into consideration when discussing feature extraction techniques. DWT is preferred if limited space is available, as its feature vector is much smaller. Other techniques, such as principal component analysis (PCA), vector quantization (VQ), and linear discriminant analysis (LDA), can be used in combination with methods such as MFCC to reduce the dimensionality of their feature vectors. Different studies used VQ with MFCC [153] and DWT [127] to exploit its clustering property and improve the performance of their ASR, whereas PCA and LDA were used to reduce the dimensionality of the feature vector while making the system more robust [38, 60, 163]. Another point to consider when selecting a feature extraction technique is the type of environment the ASR will be deployed in. In clean environments, MFCC, PLP, and LPC achieve good accuracy, whereas for noisy environments DWT, LPCC, and WPT show better results. One way to make an ASR more robust is to combine MFCC, PLP, or LPC with either DWT or WPT; another is to use RASTA-PLP, which performs best in noisy environments but not in clean ones.

Table 4 summarizes the advantages and disadvantages of all of the above-mentioned techniques.

Table 4 Advantages and disadvantages of the discussed feature extraction methods

5 Classification

After features are extracted, they are passed as input to a classifier. This is one of the most important and time-consuming modules, as the classifier predicts the phoneme or word spoken in the input signal. The job of a classifier is to learn the relationship between the input audio features and their corresponding text or phonemes. Classifiers are first trained using training data, which should be large enough for the classifier to recognize the specific patterns present in the speech signal and their correspondence to the output phonemes. Much research has been conducted to find which classifier is best suited for speech recognition; the most commonly used classification techniques are HMM, ANN, and SVM.

5.1 Hidden Markov model

The HMM has been one of the most successful classifiers in terms of speech recognition and, for this reason, is also one of the most commonly used techniques [26, 83, 114, 118, 137]. It is very flexible and can easily adapt to the required structure, which makes it easy to train and implement efficiently [13, 68, 114, 137].

The HMM is a stochastic model, and the number of states established during training is fixed and pre-defined; this number may differ from the number of hidden states actually present in the input speech signal. The HMM assumes that the given speech signal can be characterized as a parametric random process, and thus that its parameters can be determined in a well-defined and precise manner. The algorithm is an extension of the Markov chain, which can produce output symbols regardless of the state it is in [13, 110]. As a result, the output of an HMM is a probabilistic function of the state, and for a given input sequence the state sequence is not observable, hence the word hidden in the name of the algorithm. An example of an HMM is shown in Fig. 9; a left-to-right topology is shown because most ASRs use left-to-right HMMs to properly model the temporal structure of the input speech signal.

Fig. 9 An example of left-to-right HMM with three states

Mathematically, an HMM can be defined as λ(S, M, A, B, π), where S = {S1, S2, …, Sn} is the set of all possible states and M is the number of distinct output symbols per state. A = {aij} contains the state transition probabilities, where aij is the probability of transitioning from state Si to Sj; it can be calculated using the following formula:

$$ a_{ij}=P\left(T_{t+1}=S_j \mid T_t=S_i\right) $$

B = {bj(k)} contains the output symbol probabilities, which can be calculated using the formula given below:

$$ b_j(k)=P\left(v_k\ \text{at}\ t \mid T_t=S_j\right) $$

π is the set of initial state probabilities, containing the probability of every state Si being the start state, and V = {v1, v2, …, vm} is the set of all possible output symbols. For an observation sequence O = o1, o2, …, oT and an HMM model λ = (A, B, π), the following formula can be used to calculate the probability of the observation sequence [13, 123]:

$$ P\left(O \mid \pi, A,B\right)=\sum_{q} \pi_{q_1}\, b_{q_1}\left(O_1\right) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}\left(O_t\right) $$
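In practice this summation over all state sequences is computed efficiently with the forward algorithm rather than by explicit enumeration. The NumPy sketch below evaluates P(O | π, A, B) for a three-state left-to-right model with two output symbols; all parameter values are invented for illustration.

```python
import numpy as np

# Left-to-right HMM with 3 states and 2 discrete output symbols (toy numbers).
pi = np.array([1.0, 0.0, 0.0])                 # initial state probabilities
A = np.array([[0.7, 0.3, 0.0],                 # a_ij = P(S_j at t+1 | S_i at t)
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],                      # b_j(k) = P(v_k | S_j)
              [0.2, 0.8],
              [0.5, 0.5]])

def forward_probability(observations):
    """P(O | pi, A, B) via the forward algorithm (sums over all state paths)."""
    alpha = pi * B[:, observations[0]]         # initialisation
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]          # induction step
    return alpha.sum()                         # termination

print(forward_probability([0, 1, 1, 0]))
```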

A combination of the wavelet transform and HMM was introduced in [68]: HMM and wavelet transforms were used together to boost the performance of wavelet-based algorithms. This hybrid model was called the hidden Markov tree (HMT) model. Even though wavelet transformation algorithms produced great results for speech recognition, their performance could improve if the dependencies between their coefficients were also modelled, since each wavelet was otherwise treated independently. With the HMT model, Markov structures were created between the wavelet coefficients to model these dependencies; the structures were not applied directly to the wavelets but between the wavelet coefficient states, and the resulting binary tree had wavelets connected vertically across scales. The performance comparison between HMT and some wavelet transformations was done by applying both to a simple classification problem, and as expected, HMT showed better results than the wavelet-based algorithms. As mentioned in Table 4, wavelet-based algorithms are very robust; de-noising of different noisy speech signals was therefore also performed to compare the performance of HMT, and again HMT showed better results than the wavelet-based algorithms. [1] presented an enhanced version of HMT that can be used for feature extraction.

The continuous-density HMM (CDHMM) [29, 98, 99] is the most recently developed HMM-based approach. This technique uses a maximum likelihood (ML) algorithm for the training and recognition stages of the HMM, and with it, variations occurring within and between phonemes can be modelled [98]. CDHMM can be further improved by using large-margin classifiers in the training process; when compared with conventional machine learning techniques, this approach reduced error rates [19, 29, 70].

5.2 Artificial neural networks

ANNs are powerful classifiers and produce excellent results for pattern recognition problems. They are used for their ability to learn and organize themselves according to the dataset provided at the training stage; they work exceptionally well with unknown data and can classify it effectively. The drawbacks of ANNs are that they tend to over-train and can get stuck in local minima. They also ignore the time variability present in the speech signal; this problem can be solved by using hybrid HMM-ANN models, which combine the advantages of both models [137].

Some of the widely used ANN architectures are discussed below.

5.2.1 Multilayer perceptrons

Multilayer perceptrons (MLP) have proven to be the most efficient, successful, and commonly used type of ANN [137, 156]. An MLP is a simple feed-forward neural network containing at least three layers: input, output, and hidden. Fig. 10 shows the basic structure of an MLP.

Fig. 10 An example of a simple MLP

The MLP is trained with the backpropagation algorithm and uses the concept of lateral inhibition; the generated output is based on the output neuron with the highest activation. One of the major drawbacks of this model is that it can only take fixed-length input, which makes it unable to handle the variable duration of the input speech signal. Another problem is that the algorithm can only deal efficiently with small vocabularies, which makes it a good phoneme recognizer but not an efficient word recognizer [67].
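As a minimal illustration of an MLP used as a frame-level classifier, the sketch below trains scikit-learn's MLPClassifier on synthetic 39-dimensional “MFCC-like” vectors with hypothetical phoneme labels; the data, layer sizes, and label set are stand-ins, not a configuration used in the cited studies.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: 2000 frames of 39-dimensional MFCC-like features
# and one of 10 hypothetical phoneme labels per frame.
X = rng.normal(size=(2000, 39))
y = rng.integers(0, 10, size=2000)

# A feed-forward MLP with two hidden layers, trained with backpropagation.
clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=50)
clf.fit(X, y)

# Frame-level prediction: the class of the output neuron with the highest activation.
print(clf.predict(X[:5]))
```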

The work proposed in [141] used an MLP to recognize digits of the Urdu language. The dataset used for training was composed of speech signals from a single user, recorded in a clean environment; FFT and MFCC were used to extract the features from the speech signal, and an accuracy of 94% was achieved in the testing phase. Another study [145] used an MLP to recognize Persian digits; an accuracy of 98% was achieved by first using MFCC to denoise the dataset and then DWT to extract the features, with a training set consisting of data from a single male speaker. [102] used a deep MLP network to perform speech emotion recognition on the speech data in the IEMOCAP database [146]. The network was composed of an input layer, five hidden layers, and three output layers, one for each metric, and achieved mean scores of 0.453 and 0.469 when tested with speaker-independent and speaker-dependent data, respectively.

Sparse multilayer perceptrons (SMLP) [6, 67] are a technique based on the concept of the MLP. An SMLP is almost identical to an MLP in structure; the only difference is that one of its hidden layers must produce a sparse matrix as output.

5.2.2 Self-organising maps

Self-organizing maps (SOM) were introduced in 1982 by Teuvo Kalevi Kohonen [6]. The main idea behind the SOM is that input signals are placed in such a way that they produce a contour map from a higher-dimensional input space onto a lower-dimensional feature space. The input signals are first placed randomly in the input feature space and are then organized into different clusters, each of which represents a unique feature of the input signals. Because of this, a SOM can easily differentiate between the different features present in the input signal [16, 21, 147].

The SOM differentiates between signals without supervision and therefore has no examples of the desired output; hence, for the SOM network to be trained satisfactorily, a significant number of training samples is needed. The algorithm proceeds in the following steps: the first step is to calculate the level of similarity between the pattern present in the input signal and the neurons of the output layer, which is done using the Euclidean distance; after that, the synaptic weights are updated using the formula given below:

$$ {w}_j\left(n+1\right)={w}_j(n)+\alpha (n){h}_{j,i(x)}(n)\left(x(n)-{w}_j(n)\right) $$

Where x is the input, wj(n) are the weights of neuron j at time n, α(n) is the learning rate, and hj,i(x)(n) is the neighbourhood function centred on the winning neuron i(x).
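The update rule above can be written in a few lines of NumPy. The sketch below performs one training step of a small SOM; the map size, learning rate, and Gaussian neighbourhood width are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, dim = 25, 13                  # 5x5 map, 13-dimensional input features
grid = np.array([(i, j) for i in range(5) for j in range(5)])  # neuron positions on the map
W = rng.normal(size=(n_neurons, dim))    # synaptic weights w_j

def som_step(x, W, lr=0.1, sigma=1.0):
    """One SOM update: find the best-matching unit, then pull its neighbours towards x."""
    bmu = np.argmin(np.linalg.norm(W - x, axis=1))            # Euclidean similarity
    dist = np.linalg.norm(grid - grid[bmu], axis=1)           # distance on the map grid
    h = np.exp(-dist**2 / (2 * sigma**2))                     # neighbourhood function h_{j,i(x)}
    return W + lr * h[:, None] * (x - W)                      # w_j(n+1) = w_j(n) + a*h*(x - w_j(n))

W = som_step(rng.normal(size=dim), W)
print(W.shape)
```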

The research performed in [75] used SOM in combination with DWT to perform the task of vowel recognition. This system, named wavelet self-organizing maps (WSOM), used the SOM to model the input speech signals, and the resulting SOM mapping was used to adapt the wavelets; the WSOM obtained an accuracy of 55%. Another study [31] used a SOM to convert variable-length feature vectors into fixed-length feature vectors. This technique ensured that the MLP classification model used in the system would always receive fixed-length feature vectors even though the length of the input signal can vary.

The research performed in [21] used SOM to identify twelve different Hindi words spoken by five speakers. The SOM used in the research consisted of an input layer, a competitive layer, and an output layer; it is a modified version of the basic SOM called supervised SOM. In this research, four different types of features were extracted from the input signal: the intensity and the pitch of the signal, MFCC, and LPC. The accuracy for every speaker was analyzed independently of the other speakers. The highest accuracy was achieved by the intensity features, whose mean-SOM and median-SOM accuracies were 98.17% and 98.54%, respectively; the other feature extraction techniques achieved approximately 89% accuracy.

15.2.3 Radial basis functions

Radial basis function (RBF) networks have the basic ANN structure, i.e., an input layer, a hidden layer, and an output layer. The main difference between RBF networks and other ANN structures is that a Gaussian function is used in the hidden layer. The main task of the RBF model is to generate clusters on the basis of the patterns present in the input speech signal. The Gaussian function is then used to form a relationship between the created clusters by being applied at the centres of these clusters. Hence, the output of this model can be calculated using the formula given below:

$$ \mathrm{y}=\sum \limits_{h=1}^{H-1}{w}_h{\Phi}_h(x) $$

where H is the number of hidden units, wh are the linear weights, x is the input signal, and Фh is the Gaussian basis function, which can be calculated using the following formula:

$$ {\varPhi}_h(x)={e}^{-{\left\Vert x-{c}_h\right\Vert}^2/2{\sigma}_h^2} $$

where ch is the centre and σh is the width of the Gaussian function.
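A minimal numpy sketch of the RBF forward pass defined by the two formulas above is shown below; the number of hidden units, the centres, and the widths are hypothetical (in practice they might come from a clustering step, which the text does not prescribe).

```python
import numpy as np

def rbf_forward(x, centers, widths, weights):
    """RBF network output: y = sum_h w_h * Phi_h(x), with Gaussian basis
    Phi_h(x) = exp(-||x - c_h||^2 / (2 * sigma_h^2))."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)      # ||x - c_h||^2 for every centre
    phi = np.exp(-sq_dist / (2.0 * widths ** 2))      # Gaussian activations of the hidden units
    return weights @ phi                              # linear output layer

rng = np.random.default_rng(2)
centers = rng.random((8, 13))            # hypothetical: 8 hidden units over 13-dim cepstral features
widths = np.full(8, 0.5)
weights = rng.standard_normal((6, 8))    # e.g. 6 output classes (isolated words)
scores = rbf_forward(rng.random(13), centers, widths, weights)
```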

The research done in [17] compared the performance of MLP and RBF networks. The features were extracted using LPCC, and the system used to compare the two classifiers could identify six words of the English language spoken by six speakers. MLP achieved 96% accuracy, while RBF achieved 98.69%; the training and testing of the RBF network were also faster than those of the MLP. The research performed in [117] combined RBF and HMM. The main task of this system was to recognize words spoken in a continuous speech environment. Cepstrum analysis was performed on the input speech signal to extract its features. The RBF-HMM hybrid approach created a new HMM for every word in the training data and associated a target value with each of these HMMs. The target value was then used to calculate the best possible number of neurons for the hidden layer of the network. This system achieved an accuracy of 80% when recognizing ten words with eight neurons in the hidden layer.

[166] investigated the combination of wavelet transformation and RBF to create a robust ASR. The two were combined by replacing the activation function of the RBF network with a wavelet transformation. The accuracy of the system was tested over sixteen speakers speaking different numbers of words in different environments. The wavelet-RBF hybrid model achieved better results than a simple RBF network. However, it was observed that as the number of words in the vocabulary increased, the accuracy of the hybrid model decreased to the point where it was equivalent to that of the simple RBF network. Hence, it was concluded that for large vocabularies a simple RBF network is preferable to a wavelet-RBF model.

[159] proposed a model using temporal RBF features to recognize Arabic letters. The model is divided into three modules: preprocessing, feature extraction, and classification. The preprocessing module removes silence from the input signal and then performs normalization, pre-emphasis, framing, and windowing. Once the signal is pre-processed, different statistical features are calculated from it and used as input to the RBF-based classification model. The research achieved a recognition rate of 98.175%.

15.2.4 Recurrent neural network

An RNN [58] does not require phonetic dictionaries or extra human effort to transcribe the input audio, provided it is trained properly. For a given input sequence x = (x1, …, xT), an RNN calculates two things: the output vector y = (y1, …, yT) and the vector of its hidden states h = (h1, …, hT). An RNN uses the formulas given below to compute the hidden and output vectors:

$$ {\displaystyle \begin{array}{c}{h}_t=\mathcal{H}\left({W}_{ih}{x}_t+{W}_{hh}{h}_{t-1}+{b}_h\right)\\ {}{y}_t={W}_{ho}{h}_t+{b}_o\end{array}} $$

where ℋ is the activation function, Wih denotes the weight matrix between the input and the hidden units, Who represents the weight matrix between the hidden and output units, and ht-1 represents the hidden state at the previous time step.

Since most RNNs use LSTM cells, the following formulas mathematically describe an LSTM cell:

$$ {\displaystyle \begin{array}{c}{i}_t=\sigma \left({W}_{xi}{x}_t+{W}_{hi}{h}_{t-1}+{W}_{ci}{c}_{t-1}+{b}_i\right)\\ {}{f}_t=\sigma \left({W}_{xf}{x}_t+{W}_{hf}{h}_{t-1}+{W}_{cf}{c}_{t-1}+{b}_f\right)\\ {}\begin{array}{c}{c}_t={f}_t{c}_{t-1}+{i}_t\tanh \left({W}_{xc}{x}_t+{W}_{hc}{h}_{t-1}+{b}_c\right)\\ {}\begin{array}{c}{o}_t=\sigma \left({W}_{xo}{x}_t+{W}_{ho}{h}_{t-1}+{W}_{co}{c}_t+{b}_o\right)\\ {}{h}_t={o}_t\ \tanh \left({c}_t\right)\end{array}\end{array}\end{array}} $$

where it is the input gate at the current time step, ft the forget gate, ot the output gate, ct the cell activation (cell state) vector, and σ the logistic sigmoid function.
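For concreteness, the sketch below implements one LSTM time step exactly as written above, including the peephole terms Wci, Wcf, and Wco; the dimensions and random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the gate equations above."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

d_in, d_hid = 39, 16                      # hypothetical sizes: 39 input features, 16 hidden units
rng = np.random.default_rng(3)
p = {name: 0.1 * rng.standard_normal((d_hid, d_in if name.startswith("W_x") else d_hid))
     for name in ["W_xi", "W_hi", "W_ci", "W_xf", "W_hf", "W_cf",
                  "W_xc", "W_hc", "W_xo", "W_ho", "W_co"]}
p |= {b: np.zeros(d_hid) for b in ["b_i", "b_f", "b_c", "b_o"]}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_hid), np.zeros(d_hid), p)
```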

One major shortcoming of a simple RNN is that it only considers the previous context. However, in speech recognition, the future context is just as important as the previous context. So, instead of a simple RNN, a bidirectional RNN can be used to address this shortcoming. As the name suggests, a bidirectional RNN processes the input sequence in both directions and keeps a separate hidden state vector for each direction; the following formulas describe the processing of a bidirectional RNN:

$$ {\displaystyle \begin{array}{c}{\overrightarrow{h}}_t=\mathcal{H}\left({W}_{x\overrightarrow{h}}{x}_t+{W}_{\overrightarrow{h}\ \overrightarrow{h}}{\overrightarrow{h}}_{t-1}+{b}_{\overrightarrow{h}}\right)\\ {}{\overleftarrow{h}}_t=\mathcal{H}\left({W}_{x\overleftarrow{h}}{x}_t+{W}_{\overleftarrow{h}\ \overleftarrow{h}}{\overleftarrow{h}}_{t+1}+{b}_{\overleftarrow{h}}\right)\\ {}{y}_t={W}_{\overrightarrow{h}\ y}{\overrightarrow{h}}_t+{W}_{\overleftarrow{h}\ y}{\overleftarrow{h}}_t+{b}_o\end{array}} $$

Neural networks, both feed-forward and recurrent, can only be used on their own for frame-wise classification of the input audio. This problem can be addressed by using HMMs to obtain the alignment between the input audio and its transcription. Another method is to use CTC [58] as the objective function, as it trains the model without knowing the initial alignment between the input and the transcribed output. There are two ways to decode the output of a CTC network. One is to pick the most probable output at every time step (greedy decoding). The other is to use beam search; if beam search is used, a dictionary and a language model can also be integrated into the model to increase its accuracy. Fig. 11 shows an example of a simple RNN.

Fig. 11 A simple RNN with two hidden layers
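The first (greedy) CTC decoding strategy can be illustrated with the short sketch below: take the most probable symbol at every frame, collapse repeated symbols, and remove the blank; the alphabet and frame posteriors are invented for the example.

```python
import numpy as np

def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best_path = np.argmax(frame_probs, axis=1)         # most probable symbol per time step
    decoded, prev = [], blank
    for idx in best_path:
        if idx != prev and idx != blank:               # collapse repeats and skip the blank
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)

# Hypothetical 6-frame posteriors over the symbols {blank, 'c', 'a', 't'}.
alphabet = ["-", "c", "a", "t"]
probs = np.array([[0.1, 0.8, 0.05, 0.05],
                  [0.1, 0.8, 0.05, 0.05],
                  [0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.1, 0.1, 0.7],
                  [0.8, 0.05, 0.05, 0.1]])
print(ctc_greedy_decode(probs, alphabet))              # -> "cat"
```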

[134] presents an Attention-based Transducer, where the encoder is composed of a pyramid LSTM layer, a simple LSTM layer, and a multi-head self-attention layer. The input signal is fed to the pyramid LSTM layer; the output of this layer is then concatenated with the two previous outputs. The concatenated output is passed to the LSTM layer and finally to the multi-head attention layer. The decoder of the model is composed of two simple LSTM layers. The data used consisted of 10,000 hours of in-house English speech gathered by the authors using the LAIX learning application. The proposed model achieved a WER of 10.3% and a real-time factor (RTF) of 0.19.

The model proposed in [46] used an RNN to recognize Bengali speech. The network consisted of three fully-connected layers, followed by a bidirectional RNN layer and then another fully-connected layer with softmax as the activation function. The authors used 33 h of data from 508 speakers. The model achieved a WER of 34% with a dropout rate of 0.5, in combination with CTC and a language model. The research proposed in [171] presented an audio-visual speech recognition system. It used an RNN transducer in which the encoder was composed of five bidirectional LSTM layers, the decoder consisted of two layers of unidirectional LSTM, and the joint space was 640-dimensional. A five-layer CNN model, called V2P, was used to extract features from the input video. The dataset used for testing and training consisted of 31,000 h of transcribed YouTube videos. They achieved a WER of 21.5% with the audio-only system and 20.5% with the audio-visual system.

[66] compared the performance of commonly used types of RNN, such as GRU and LSTM, with a simple RNN. They used the TED-LIUM corpus [131] for testing and training. The model consisted of one input and one output layer with five hidden layers, where the fourth layer was bidirectional. LSTM performed the best with both the 500-node and the 1000-node architectures, with WERs of 77.55% and 65.04%, respectively. In terms of training time, the simple RNN was the fastest to train and LSTM took the longest.

15.2.5 Convolutional neural network

A convolutional neural network (CNN) is another commonly used type of ANN. Such networks are generally used for computer vision (CV) tasks, but due to their strong feature generation and discrimination capabilities, they are also widely applied in the field of natural language processing (NLP).

A common CNN architecture is formed of alternating convolutional and pooling layers, with fully connected layers at the end. A convolutional layer is composed of a set of neurons, where each neuron acts as a kernel. The kernel divides the input signal into smaller regions, called receptive fields, and then convolves with the input signal by multiplying its weights with the corresponding elements of each receptive field [94]. The convolution operation can be expressed mathematically as:

$$ g\left(x,y\right)=i\left(x,y\right)\ast h\left(x,y\right) $$

where i(x, y) represents the input signal, h(x, y) represents the applied filter, and g(x, y) is the resulting convolved output. The same filter, with the same set of weights, is used over all of the receptive fields. This weight sharing allows a CNN to capture most of the features present in a signal without using a large number of weights.

The focus of a convolutional layer is to extract as many features as possible from a signal. Once those features are identified, the exact location of a particular feature no longer matters, as long as its approximate position relative to the other features is maintained [94]. The pooling layer performs down-sampling by retaining only the dominant value in each receptive field, further reducing the size of the signal. By reducing the signal size, the network not only becomes less complex, but the chances of over-fitting are also reduced and generalization improves.
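The two operations just described can be illustrated with the toy numpy sketch below, which applies a hand-written valid convolution followed by 2x2 max pooling to a small, randomly generated time-frequency patch; it is didactic only and not a production CNN.

```python
import numpy as np

def conv2d_valid(signal, kernel):
    """g(x, y) = i(x, y) * h(x, y): slide the kernel over every receptive field."""
    kh, kw = kernel.shape
    out_h, out_w = signal.shape[0] - kh + 1, signal.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(signal[r:r + kh, c:c + kw] * kernel)  # same weights everywhere
    return out

def max_pool2d(feature_map, size=2):
    """Down-sample by keeping the dominant value of each size x size region."""
    h, w = (feature_map.shape[0] // size) * size, (feature_map.shape[1] // size) * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(4)
spectrogram_patch = rng.random((8, 8))        # hypothetical 8x8 time-frequency patch
feature_map = np.maximum(conv2d_valid(spectrogram_patch, rng.standard_normal((3, 3))), 0)  # ReLU
pooled = max_pool2d(feature_map)              # (6, 6) feature map pooled down to (3, 3)
```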

The research presented in [143] used a combination of CNN and bidirectional LSTM (BLSTM) to recognize Mandarin speech. The proposed model consisted of four CNN blocks, each with four layers, followed by a BLSTM layer and then a fully connected layer. The input signal was batch normalized before being processed by the network. Each of the four CNN blocks consisted of a convolutional layer, followed by a batch normalization layer, then a Rectified Linear Unit (ReLU) activation layer, and finally a max pooling layer. The AISHELL-1 [15] dataset was used for training and testing, and the proposed model achieved a WER of 19.2%. In [156], three different types of input, namely MFCC, the power spectrum, and the raw waveform, were tried with a model containing 12 convolutional layers. The stride of the convolutional model varied with the type of input: increasing the stride did not affect the MFCC results, whereas for the power spectrum and raw waveform the overall stride of the network played a vital role. The LibriSpeech [116] dataset was used for training, validation, and testing. The model produced a WER of 7.2% with MFCC, 9.4% with the power spectrum, and 10.1% with the raw waveform.

15.2.6 Fuzzy neural network

A fuzzy neural network (FNN) is a hybrid technique that incorporates concepts of fuzzy systems into neural networks. Because a fuzzy system is used, a membership function ensures that every element is mapped to a proper degree of membership. This membership function proves very useful for mapping speech signals, as they have no clear boundaries [99]. Another advantage of FNNs is that, whereas an ANN requires a large amount of data to be trained effectively, an FNN can show good results even with small datasets because it converges during the learning phase [73].

The work proposed in [99] used wavelet transforms, CDHMM, and FNN to recognize fifty words. Compared with a simple CDHMM, the hybrid model proved more successful in a noisy environment, achieving 15.2% higher accuracy, although in a clean environment the CDHMM performed better, with a 7.6% difference in accuracy.

The Adaptive Neuro-Fuzzy Inference System (ANFIS) [73, 170] is a widely used FNN-based speech recognition approach, which employs fuzzy inference techniques to perform classification. The work proposed in [73] recognized isolated words of the Persian language. The input dataset was first divided into clusters using SOM and linear VQ, and ANFIS was then used to classify the data. The results showed that ANFIS performed better than a conventional FNN. Another research effort [170] achieved an accuracy of 85.24% when recognizing Malay digits using ANFIS.

15.3 Support vector machines

Recently, SVMs have also been adopted for the task of speech recognition. An SVM can be implemented independently [52] or as a hybrid model with HMM [133, 152]. An SVM constructs a hyperplane as the decision surface in such a way that the margin between the classes is maximized. The formula given below can be used to calculate the decision surface:

$$ f\left({x}_i\right)={w}^T\times \varPhi \left({x}_i\right)+b $$

where w is the weight vector, b is the bias value, and Ф(xi) is the mapping induced by the kernel function. The input feature space is mapped to a higher-dimensional feature space using a kernel function, under the assumption that the classes become linearly separable in that space. One of the commonly used kernel functions is the polynomial kernel [79], which can be calculated using the following formula:

$$ k\left({x}_i,{x}_j\right)={\left({x}_i\cdotp {x}_j+1\right)}^d $$

where d is the degree of the polynomial. Another commonly used function is the Gaussian radial basis function [16], which can be mathematically expressed as:

$$ k\left({x}_i,{x}_j\right)=\exp \left(-\gamma\ {\left\Vert {x}_i-{x}_j\right\Vert}^2\right) $$

Here γ is a positive parameter, commonly set to γ = 1/(2σ2).
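For reference, both kernels can be written directly in a few lines of numpy; the degree d and the value of γ used below are hypothetical hyper-parameters.

```python
import numpy as np

def polynomial_kernel(x_i, x_j, degree=3):
    """k(x_i, x_j) = (x_i . x_j + 1)^d"""
    return (np.dot(x_i, x_j) + 1.0) ** degree

def gaussian_rbf_kernel(x_i, x_j, gamma=0.5):
    """k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)"""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

rng = np.random.default_rng(5)
a, b = rng.random(13), rng.random(13)        # e.g. two 13-dimensional cepstral feature vectors
print(polynomial_kernel(a, b), gaussian_rbf_kernel(a, b))
```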

Even though SVMs are considered to be good classifiers, they are not commonly used for ASR. One of the biggest issues with SVMs in the context of ASR is that they cannot take variable-length input, which speech typically is; the research in [79, 152] discussed solutions to this problem. SVMs also tend to have a high computational cost when classifying more than two classes. Different methods were proposed to tackle the multi-class problem of SVM in [148, 149, 172, 173], but the most commonly used techniques reduce the multi-class problem to a set of binary SVMs. The following two techniques can be used to resolve the multi-class problem:

  1. The one-against-all technique

  2. The one-against-one technique

15.3.1 The one-against-all technique

In this method, a multi-class SVM is divided into multiple binary SVMs, one for each class; the number of binary SVMs is therefore equal to the number of classes.

Each binary SVM creates a decision plane between its corresponding class and all of the other classes, and different voting techniques are then used to choose the output for a given input [149, 172]. The one-against-all technique creates a relatively small number of binary SVMs, but it requires a large dataset for the training of each one. A common problem faced when dividing a multi-class SVM into binary SVMs is the unclassifiable-region problem. For this technique, the problem can be solved either by using continuous decision functions or by implementing fuzzy SVMs. A fuzzy SVM uses a membership function such that different input points contribute differently to the learning process of the SVM [36, 59, 90]. Both techniques perform comparably; however, continuous decision functions are simpler to implement than fuzzy SVMs.

The work presented in [169] compares SVM and ANN. The SVM employed followed the one-against-all technique in combination with the RBF kernel, whereas the ANN employed was an MLP network. Both networks were trained to recognize 12 isolated vowels of the Thai language. The SVM not only took less processing time in the training phase but also achieved a higher accuracy: 87.08%, compared with 82.72% for the MLP.

15.3.2 The one-against-one technique

Unlike the one-against-all technique, which creates one binary SVM per class, the one-against-one technique creates a binary SVM for every possible pair of classes. By creating all possible pairs, this technique distinguishes each class from every other class. As with the one-against-all technique, many voting schemes exist for the one-against-one technique to choose the output class for a given input [149, 172].

The one-against-one method creates a relatively higher number of binary SVMs, but it requires less training data per classifier. This technique also has a lower computational cost, as a classifier can be ignored if its two corresponding classes rarely need to be distinguished [79, 133]. The system presented in [9] was used for phoneme classification. MFCC was used to extract fixed-length feature vectors from the input audio. The classifier used was a one-against-one classifier in combination with a majority voting technique. This system was tested on the TIMIT dataset and compared against an HMM. The SVM system achieved an accuracy of 77.6%, roughly 4% better than the HMM, whose accuracy was 73.7%.
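As an illustration of the two decompositions, the sketch below trains one-against-all and one-against-one SVM classifiers with scikit-learn; the digits dataset is only a stand-in for speech features such as MFCCs, and the accuracies it prints say nothing about ASR performance.

```python
from sklearn.datasets import load_digits   # stand-in multi-class dataset, not a speech corpus
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)        # placeholder features; an ASR would use e.g. MFCCs
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X_tr, y_tr)   # one-against-all
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X_tr, y_tr)    # one-against-one

print("one-against-all accuracy:", ova.score(X_te, y_te))
print("one-against-one accuracy:", ovo.score(X_te, y_te))
```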

15.4 Summary

One of the most commonly used classification methods for speech is the HMM. The reason behind its ubiquitous usage is its ability to successfully model the temporal information present in speech signals. Even though improvements were made in the field of speech recognition using HMMs, the results obtained were not optimal. Therefore, HMMs have been complemented with ANNs and SVMs, which can be employed independently or as hybrid models with HMM.

In the past, MLP was the most commonly used type of ANN, though nowadays RNNs and RBF networks are applied more frequently. As discussed above, SVMs have shown results better than, or at least comparable to, HMMs. Though SVMs are inherently binary classifiers, modification techniques such as one-against-all and one-against-one allow them to classify multiple classes successfully.

Table 5 presents the advantages and disadvantages of using different classification techniques.

Table 5 Advantages and disadvantages of the discussed classification models

16 Language model

Advancements in the field of speech recognition have increased the need for language models, as speech does not necessarily follow rigid grammatical rules. The speaking style of a person and their regional and social dialects also affect the input, so a good language model is required to deal with these variations in real time [40, 41]. A language model consists of a vocabulary set, the search space, and the searching technique. Language models use the structural constraints of a language to predict the probability of the occurrence of a word given a specific word sequence. These structural constraints can vary from language to language.

The difference between a classifier and a language model is that a classifier maps speech signals to their closest possible word sequence, whereas a language model checks the occurrence probability of the word sequence produced by the classifier. A common example is that, in American English, the phrases ‘recognize speech’ and ‘wreck a nice beach’ sound almost the same, but their meanings are entirely different. Such ambiguities are easier to eliminate if a language model is used in combination with a classification model.

16.1 Types of language models

The language models can be divided into two types: static and dynamic.

16.1.1 Static language models

One of the most commonly used static language modeling techniques is the n-gram model. Generally, bigram or trigram language models are used, where a trigram model holds more context [41]. The research presented in [98] used a bigram language model in combination with their system; by adding the language model, they achieved an accuracy that was approximately 3% higher than the original. Another research effort [2] showed a reduction in the occurrence of out-of-vocabulary (OOV) words by using the n-gram model.

A major drawback of such language models is that they cannot adapt if speech from a different domain is given as input.
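A maximum-likelihood bigram model of the kind referred to above can be sketched in a few lines of Python; the toy corpus is invented, and smoothing, which any practical model would need, is omitted.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Maximum-likelihood bigram probabilities P(w_i | w_{i-1}); no smoothing."""
    bigram_counts, context_counts = defaultdict(Counter), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[prev][curr] += 1
            context_counts[prev] += 1
    return {prev: {w: c / context_counts[prev] for w, c in nxt.items()}
            for prev, nxt in bigram_counts.items()}

def sentence_probability(model, sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= model.get(prev, {}).get(curr, 0.0)     # unseen bigram -> probability 0
    return prob

toy_corpus = ["recognize speech", "recognize speech easily", "wreck a nice beach"]
lm = train_bigram(toy_corpus)
print(sentence_probability(lm, "recognize speech"))    # likely under this tiny corpus
print(sentence_probability(lm, "wreck a nice speech")) # unseen bigram, probability 0
```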

16.1.2 Dynamic language models

Dynamic language models update their probabilities based on previously analyzed data and can therefore adapt easily to different speech domains. This property is highly useful when transfer learning is applied, i.e., when a pre-trained model of a specific language or speech domain is used as the basis for training a model for a different language. Some commonly used dynamic language modeling techniques are long-distance n-grams [2], triggers [34, 61], cache models [130], and tree-based models [129].

Once the language modeling technique has been chosen, a decoding search technique must also be chosen to find the best result among the candidate hypotheses.

16.2 Decoding techniques

A decoding technique uses an acoustic model, a language model, and the spoken utterance to find the most likely word sequence. The most obvious method would be to enumerate all possible outputs and pick the most likely one; however, as the number of possible outputs grows exponentially with the length of the word sequence, this technique can only be employed for tasks with a very small vocabulary. Various pruning algorithms [41, 80] can be used to remove low-scoring hypotheses and make the search more efficient.

Viterbi search and n-best search are two of the most commonly used decoding techniques.

16.2.1 Viterbi search

In this approach, all of the hypotheses associated with a particular speech utterance are considered and directly compared with each other. Exact Viterbi search is impractical for even medium-sized tasks due to its huge computational cost; hence, Viterbi beam search [7, 65] is mostly used, as it considerably reduces the size of the search space. In Viterbi beam search, only those hypotheses whose likelihood falls within a particular beam of the best hypothesis are considered.
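The sketch below illustrates beam-pruned Viterbi decoding over a small HMM with hypothetical transition and emission probabilities; at every frame, hypotheses whose log score falls outside a fixed beam of the best one are discarded.

```python
import numpy as np

def viterbi_beam(log_init, log_trans, log_emit, observations, beam=5.0):
    """Viterbi decoding with beam pruning over an HMM.
    log_trans[i, j] = log P(state j | state i); log_emit[j, o] = log P(obs o | state j)."""
    scores = log_init + log_emit[:, observations[0]]
    backptr = []
    for obs in observations[1:]:
        cand = scores[:, None] + log_trans + log_emit[:, obs][None, :]  # (from, to) scores
        backptr.append(np.argmax(cand, axis=0))        # best predecessor of each state
        scores = np.max(cand, axis=0)
        scores[scores < scores.max() - beam] = -np.inf  # prune hypotheses outside the beam
    # Backtrace the best surviving path.
    path = [int(np.argmax(scores))]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

# Hypothetical 3-state HMM with 2 observation symbols.
log_init = np.log([0.6, 0.3, 0.1])
log_trans = np.log([[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]])
log_emit = np.log([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
print(viterbi_beam(log_init, log_trans, log_emit, [0, 0, 1, 1]))  # most likely state sequence
```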

The research performed in [92] presents an HMM-based speaker-independent ASR. The proposed system used the TIMIT dictionary to generate its word transcription dictionary, and the Viterbi algorithm was used to perform sentence decoding over it. Besides the Viterbi algorithm, a word-pair technique was also utilized to obtain smooth transitions between words. The accuracy of the system increased from 60.1% to 92.2% with the use of both the dictionary and the word-pair technique.

16.2.2 N-best search

N-best search [23] is very similar to Viterbi search; the main difference between the two algorithms is that where Viterbi search provides the single best hypothesis, n-best search provides the n best hypotheses. One major drawback of this algorithm is that short hypotheses have a higher chance of being chosen, since long hypotheses are more prone to errors. Two commonly used methods to overcome this problem are the search method of [41] and the pruning method of [80].

16.3 Summary

From the above discussion, it can be concluded that a language model is required for systems with large vocabularies. A lot of research is currently being performed to optimize language models, and n-gram models and n-best search are presently the most commonly used methods.

17 Toolkits and online resources

A lot of work has been done on the task of speech recognition, and some of it is available to us in the form of personal assistants such as Cortana and Siri. Even though a lot of research has been performed in this field, most of the work has not been made publicly available. Table 6 lists some of the toolkits and online resources that have been made publicly available, as well as a few other commonly used tools.

Table 6 Toolkits and online resources available for speech recognition

17.1 Kaldi

Kaldi [179] is an open-source speech recognition toolkit. It was developed in C++ and can easily be deployed on multiple operating systems. Currently, this toolkit only supports the English language. Kaldi can be used to extract features and can perform classification tasks as well. Features can be extracted using multiple methods, including the commonly used MFCC, cepstral mean and variance normalization (CMVN), and i-vectors. Deep neural networks (DNN) are used to perform the classification task.

17.2 CMU Sphinx

CMU Sphinx [24] is another open-source speech recognition toolkit. It was developed in Java and provides pre-trained models for several languages, including English, French, Mandarin, Russian, and German. CMU Sphinx uses MFCC to extract features and an HMM-based model to perform classification. It also provides an online tool that can be used to create language models for Sphinx.

17.3 Julius

Julius [122] is an open-source speech recognition toolkit that was originally designed to recognize the Japanese language. Over the years, with the help of different research efforts, a usable model for the English language was also developed. Julius itself is a language-independent decoding program that can be used to create a recognizer for any language, as long as an acoustic model and a language model are available for that language.

17.4 Hidden Markov model toolkit

The Hidden Markov Model Toolkit (HTK) [81] is a portable toolkit that can be used to build and manipulate hidden Markov models. The original purpose of this toolkit was speech recognition, but it can be used for other pattern recognition problems as well.

17.5 RWTH ASR

RWTH ASR [85] is another toolkit developed for speech recognition. It can be used for both speech recognition and speaker adaptation. It utilizes MFCC and PLP to extract features, and acoustic modeling is performed using GMMs. One significant limitation of this toolkit is that it is only available on Linux and macOS.

17.6 Summary

The above discussion shows that the currently available toolkits and online resources rely mostly on traditional technologies such as HMM and GMM. A significant shortcoming of these tools is that they can only be employed for particular languages. Therefore, more publicly accessible speech recognition tools need to be developed that can cater to different languages and are not confined to high-resource languages only.

18 Concluding summary

Table 7 lists research performed in the field of speech recognition over the past few years.

Table 7 Comparison between different ASRs

It can be seen that research is more focused on creating optimal large-vocabulary, speaker-independent, continuous-speech ASRs. It can also be observed that, despite not being the best feature extraction algorithm, as mentioned before, MFCC is still the favored choice. [37, 175] compared it with other techniques: [37] compared it with PLP and LPCC, where MFCC performed significantly better than the other algorithms, while in [175] WPT consistently performed better than MFCC. The key difference between the two studies is the size of their datasets; [175] used a dataset comprising 1000 different speakers, whereas [37] used a dataset containing the voices of 11 distinct speakers.

For classification, HMM seems to be the most popular choice, particularly HMM hybrid models. The hybrid model in [133] performed well on the task of recognizing English speech, achieving an accuracy of 94.10%. Another hybrid model, introduced in [156], obtained an accuracy of 77.83%, whereas [37, 92, 98] showed average results using a simple HMM. ANN-based studies [101, 132] also showed promising results. Hence, it can be concluded that HMM alone is not enough to achieve the goal of speech recognition; hybrid models and ANNs can achieve much better results.

Deep learning models, such as the ones presented in [77, 136], also show promising results. [77] used a CNN-based model in combination with the CTC loss and achieved a WER of 8.07% using a 6-gram language model. They also used the concept of transfer learning to show how their model could serve as a base model for low-resource speech recognition systems. [136] used an RNN-based encoder-decoder model and performed multiple experiments with it.

Finally, the research that used a relatively larger dataset produced better results by using a language model [98]. Therefore, we can presume that using a language model can prove beneficial when dealing with large vocabularies.

19 Conclusion

This survey paper discussed and reviewed different techniques and approaches that are used to perform the task of speech recognition. Based on the discussion of the basic architecture of an ASR, it is concluded that an ASR depends on three modules: the feature extraction module, the classification module, and the language model. Hence, different feature extraction methods, their advantages and disadvantages, as well as their basic structure, were highlighted. Similarly, from the analysis of classification models, it is inferred that HMM-based approaches performed the best. Many recently employed techniques and their results were also reviewed. Finally, the last module of the speech recognition system, the language model, was examined. It is concluded that the addition of a language model can greatly improve the accuracy of an ASR. Even though only sub-optimal methods are currently being used to create language models, further research in this field will prove beneficial for the task of speech recognition.