
1 Introduction

Speech recognition is the process of transforming spoken utterances into text, and it has been researched for decades. It can be applied to voice interfaces, voice-driven machines, speech-enabled browsers, and Internet-based services. In speech recognition, the speech signal is first transformed into a parametric form, and acoustic models are generated from the extracted features. At the recognition step, the system outputs the hypothesis whose model has the highest probability of matching the parametric observations of the input speech. An Automatic Speech Recognition (ASR) system recognizes utterances from spoken words [1]. Speech recognition systems can be categorized into isolated word, connected word, continuous speech, spontaneous speech, and speaker-adapted systems. In isolated word recognition, the words are spoken in isolation, while in connected word recognition, the words are spoken with some pauses between them. Continuous speech recognition deals with continuously spoken words, and spontaneous speech is the natural way humans speak.

In speaker-adaptive systems, a speaker-independent recognition model is first developed and then adapted to the distinctive characteristics of a target speaker [2, 3]. Researchers have improved recognition through advances in signal processing, acoustic modeling, language modeling, and search algorithms. Over the years, the state of the art has shifted from isolated word recognition toward speaker adaptation, context-dependent modeling, and discriminative training [4]. Progress in speech recognition was made possible by better and more accurate speech modeling techniques, different feature extraction methods, and efficient search algorithms [5, 6]. Speech recognition systems using Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs), and hybrids of HMMs and ANNs have been explored [7, 8].

Extracting useful information from the speech signal is a significant phase of speech recognition. The signal must be transformed into a low-dimensional space, a step called feature extraction [9]. The extracted features should also be resilient to environmental changes. To extract these features from the speech signal, perception-based and production-based methods are widely used. Perception-based methods work on the principles of the human hearing system, while production-based methods model how speech is produced. Because the human hearing system perceives frequency nonlinearly, perception-based scales such as the Mel and Bark scales are used for processing. Further, the selection of the windowing method and the pre-emphasis filter are important criteria in pre-processing the speech [10,11,12]. Various methods, such as cepstral mean normalization, energy thresholding, Linear Discriminant Analysis (LDA), and the application of filter banks, have been investigated to improve the basic features for speaker independence and for recognition in different noisy conditions. Other improvements transform the basic features by differentiating and concatenating the static features; the first and second differences of the static features are widely used for this purpose [13, 14], as sketched below.

Choosing the sub-word unit is also a crucial decision in speech recognition; different sub-word models, such as syllable-based and phoneme-based units, are used [15]. Researchers in speech recognition face problems due to variability in speech [16, 17]. Classifiers such as HMMs, ANNs, and discriminative training methods are used during the recognition step and are presented in the literature [18,19,20,21,22]. The complexity of an ASR system increases as it advances from isolated to spontaneous speech and from speaker-dependent to speaker-independent mode; speaker-independent systems are more difficult to develop than speaker-dependent ones [25]. Another challenge is the coarticulation effect: each phone segment is affected by its neighboring phones [26]. Further problems that need addressing are the lack of speech corpora, pronunciation dictionaries, and transcription files for under-resourced languages. The design of the speech corpus is also crucial for the development of a speech recognition system [36, 37].
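To make the dynamic coefficients mentioned above concrete, the following is a minimal sketch (in Python with NumPy; the window width and the usage shown in the comments are illustrative assumptions, not a prescribed configuration) of the standard regression formula for computing delta coefficients from a matrix of static features.

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta coefficients over a +/- N frame window.

    features: 2-D array (num_frames, num_coeffs) of static features.
    Returns an array of the same shape.
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Repeat the edge frames so every frame has a full context window.
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.empty(features.shape, dtype=float)
    for t in range(num_frames):
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out

# Illustrative usage (names assumed): given a (frames x 13) static MFCC
# matrix, first and second differences are stacked per frame:
#   d1 = delta(static); d2 = delta(d1)
#   observations = np.hstack([static, d1, d2])   # (frames x 39)
```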

The purpose of this paper is to present a review of different technologies and trends in speech recognition. After reading it, readers will understand how a speech recognition system works, its challenges, and current trends. It is an effort to highlight different aspects of speech recognition. The paper describes speech recognition issues such as feature extraction, classification methods, language modeling, speech corpora, widely used tools for developing spoken-language systems, and speech recognition evaluation parameters.

The paper is structured as follows. Section 2 describes continuous speech recognition and the structure of a speech recognition system. Section 3 explains various feature extraction methods. Section 4 describes speech corpora and widely used classification techniques. Section 5 deals with the role of language modeling in speech recognition. Section 6 describes different widely used speech recognition tools. Section 7 illustrates parameters for evaluating speech recognition. Finally, a comparison of different classification and feature extraction techniques is presented.

2 Speech Recognition Structure

During continuous speech recognition, continuously spoken words are converted into text. At a basic level, a speech recognition system can be represented as in Fig. 1. The feature extraction block converts the speech signal into suitable parametric features, from which acoustic models of the speech utterances are generated. The language model block covers all issues related to language modeling in speech recognition. The outputs of both blocks are fed into the speech recognition engine, which outputs the recognized words based on these inputs.

Fig. 1 Speech recognition process

3 Feature Extraction Techniques

Feature extraction is a crucial block in the speech recognition structure. It is the process of computing a set of meaningful properties from the speech wave that are significant for further processing; these properties are termed features of the speech signal. Feature extraction approaches are divided into production based and perception based. Production-based features are calculated on the principle of how speech is produced, while perception-based features are based on how we perceive sound. Linear Predictive Coding (LPC) is based on the voice production mechanism, whereas Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) are perceptually motivated coefficients. Extracted features are further augmented with different dynamic coefficients. Feature extraction methods are also categorized into temporal analysis and spectral analysis [4, 38, 39].

For feature extraction, the behavior of the vocal tract model and the auditory model is captured. The main pre-processing steps are pre-emphasis and windowing. The short-term spectrum is calculated using windowing methods such as the Hamming window. The spectrum can be represented as the signal's Fourier coefficients or as the set of power values at the outputs of a bank of filters. Further, the cepstrum can be calculated as the inverse Fourier transform of the logarithm of the spectrum. Different nonlinear logarithmic scales are applied to the features based on the behavior of the auditory model [14, 36].
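The following is a compact sketch of this pipeline for MFCC-style features (assuming NumPy and SciPy are available; the sampling rate, frame sizes, and filter counts are illustrative defaults, and the DCT is the usual real-valued stand-in for the inverse Fourier transform of the log spectrum).

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, pre_emph=0.97):
    signal = np.asarray(signal, dtype=float)
    # 1. Pre-emphasis boosts the high frequencies.
    signal = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2. Framing and Hamming windowing (short-term analysis).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. Short-term power spectrum via the FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular filter bank spaced on the nonlinear mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    hz = 700.0 * (10.0 ** (np.linspace(mel(0), mel(fs / 2),
                                       n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. Log filter-bank energies, then a DCT gives the cepstrum.
    log_e = np.log(np.maximum(power @ fbank.T, 1e-10))
    return dct(log_e, type=2, axis=1, norm="ortho")[:, :n_ceps]
```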

4 Speech Corpus and Classification Methods for Speech Recognition

Development of a speech corpus is a significant and crucial step toward speech recognition, since enough samples of speech are required to develop a speech recognition system. Researchers have made different efforts to develop standard speech corpora, such as those from the Linguistic Data Consortium (LDC) [27]; for low-resourced languages, such corpora are often lacking. Classification is an important process in speech recognition, and the classification methods are used during recognition. Here we explain the widely known HMM-based and artificial neural network-based models. Experiments have also been conducted to explore speech recognition using hybrid models that combine HMMs and ANNs.

4.1 Hidden Markov Model

HMMs represent the temporal nature of a speech signal. In an HMM, only the observations are visible; the states behind the observations cannot be directly observed.

Any HMM is represented by the triple (π, A, B),

where π is the initial state distribution vector,

A = [a_ij] is the state transition matrix, whose entry a_ij gives the probability of a transition from state q_i to state q_j at a particular time, and

B = [b_j(k)] is the emission matrix, whose entry b_j(k) gives the probability of observing symbol v_k in state q_j.

HMMs are used for estimating the probability of an observation sequence, classifying utterances, and estimating model parameters, using the Baum–Welch (forward–backward) and Viterbi algorithms [5, 12, 28]. The most important phases of HMM-based speech recognition are training and testing. One advantage of HMMs is that temporal and spectral variability can be modeled simultaneously. Another is that HMMs can be initialized from an initial estimate of the parameters without labeled data [4, 27,28,29, 35].
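As a sketch of the decoding side, the Viterbi algorithm below recovers the most likely hidden state sequence for a discrete-observation HMM given (π, A, B). This is a minimal NumPy version; the toy model in the trailing comments is an assumption for illustration only.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path for a discrete-observation HMM (pi, A, B).

    pi : (S,) initial state distribution
    A  : (S, S) transition probabilities, A[i, j] = P(j | i)
    B  : (S, V) emission probabilities,  B[j, k] = P(symbol k | state j)
    obs: list of observed symbol indices
    Log probabilities are used to avoid numerical underflow.
    """
    S, T = len(pi), len(obs)
    logA, logB = np.log(A), np.log(B)
    score = np.log(pi) + logB[:, obs[0]]   # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)     # back-pointers
    for t in range(1, T):
        cand = score[:, None] + logA       # cand[i, j]: from state i to state j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logB[:, obs[t]]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # backtrack the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())

# Hypothetical toy model: 2 states, 3 observable symbols.
# pi = np.array([0.6, 0.4]); A = np.array([[0.7, 0.3], [0.4, 0.6]])
# B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
# path, logp = viterbi(pi, A, B, [0, 1, 2])
```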

4.2 Artificial Neural Networks (ANNs)

In speech recognition, artificial neural networks are the most widely used method after HMMs. They can be used independently or in combination with HMMs. ANNs are very effective classifiers: they learn and organize the data presented during the training phase and are capable of generalizing to unseen data. The most popular ANN architectures are the multilayer perceptron, self-organizing maps, and radial basis function networks [8, 31, 32].
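A minimal sketch of such a classifier, assuming scikit-learn is available: a multilayer perceptron is trained on frame-level feature vectors with phone labels. The data shapes, label count, and hidden-layer size are all illustrative stand-ins, not settings from the literature reviewed here.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in data (assumed shapes): one 39-dimensional vector per frame,
# e.g. MFCCs with deltas, and an integer phone label for each frame.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 39))
y_train = rng.integers(0, 10, size=1000)      # 10 phone classes (dummy labels)

# A multilayer perceptron with one hidden layer of 128 units.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)

# Frame-level phone predictions for unseen feature vectors.
X_new = rng.normal(size=(5, 39))
print(clf.predict(X_new))
```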

5 Language Model

Knowledge of the spoken language is required for speech recognition. This includes knowledge of linguistic sounds, pronunciation variation, the meaning of words, grammatical structure, and the possible meanings of sentences. Language models reduce the search space for the recognition engine. Natural language processing is used to implement these features in automatic speech recognition; examples of such steps are designing a pronunciation dictionary and defining a grammar for the utterances [5, 7]. Nowadays, statistical language models are used, which give the probability of a sequence of words based on the preceding words. The most widely used language models for speech recognition are N-gram models, which predict the probability of a word based on the occurrence of its previous N-1 words [27].
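As an illustration, the minimal sketch below estimates a bigram (N = 2) model with maximum-likelihood counts; the toy corpus is made up, and a real system would add smoothing to handle word pairs unseen in training.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram probabilities P(w_i | w_{i-1})."""
    history, pairs = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        history.update(words[:-1])                 # count each history word
        pairs.update(zip(words[:-1], words[1:]))   # count adjacent word pairs
    return {(h, w): c / history[h] for (h, w), c in pairs.items()}

# Toy corpus (made up for illustration).
probs = train_bigram(["recognize speech", "recognize the speech"])
print(probs[("<s>", "recognize")])     # 1.0 -- every sentence starts this way
print(probs[("recognize", "speech")])  # 0.5 -- "recognize" is followed by
                                       #        "speech" half of the time
```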

6 Speech Recognition Tool Kit

There are different types of open-source speech recognition tools available. These systems have been developed using different programming languages such as Java, C++, and C. The tool kits are based on HMMs, ANNs, or a combination of both. The following sections describe each in detail.

6.1 HTK

HTK is a tool kit for building HMMs, especially for developing speech recognition systems. It was initially developed by the Speech Vision and Robotics Group at Cambridge University. The main stages in developing a speech recognition system with HTK are data preparation, training, and testing. The tool HCopy is used for feature extraction, HCompV for flat-start initialization, HERest for re-estimation, and HVite for recognition. Speech recognition results can be analyzed with the tool HResults. Well-documented material is available in the form of the HTK Book for developing different types of speech recognition systems [3, 30].

6.2 Sphinx

Sphinx has been used to develop large-vocabulary ASR systems since the late 1980s. The versions of Sphinx currently available are the 2, 3, 3.5, and 4 decoder versions, along with a common training tool named SphinxTrain. Sphinx can be used to train context-dependent HMMs, and it supports MFCC and PLP speech features with dynamic coefficients [5, 31, 33, 34].

6.3 Kaldi

Kaldi is a speech recognition tool kit available under the Apache License. It is written in C++ and was developed by Daniel Povey and others; a stable release was made in October 2013. Using Kaldi requires basic concepts of speech recognition and basic speech signal processing. For efficient use of Kaldi, basic knowledge of scripting languages such as Bash, Perl, and Python is also required. Kaldi also has support for deep neural networks [2].

7 Speech Recognition Evaluation

ASR systems are evaluated on different parameters such as recognition accuracy, out-of-vocabulary word rejection, response time, and error recovery [3]. For calculating recognition accuracy, the reference string and the recognized string are compared. There are three types of errors, namely substitution, insertion, and deletion errors [3]. The evaluation parameters for speech recognition are defined below [18, 36]:

$$ \%\,\text{Correct} = \frac{N - D - S}{N} \times 100 $$
(1)
$$ \%\,\text{Accuracy} = \frac{N - D - S - I}{N} \times 100 $$
(2)

where D denotes deletion errors, I denotes insertion errors, S denotes substitution errors, and N is the total number of labels in the reference transcription. Another metric, Word Error Rate (WER), is also used to describe the performance of a speech recognition system. It is defined as follows [35, 36]:

$$ \%\,\text{WER} = \frac{D + S + I}{N} \times 100 $$
(3)
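In practice these counts come from aligning the recognized string against the reference with a minimum edit distance. The sketch below (plain Python; the example strings are made up) computes WER this way; under unit costs the minimum edit distance equals D + S + I, which is all Eq. (3) needs.

```python
def word_error_rate(reference, hypothesis):
    """WER = (D + S + I) / N, computed as a word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: minimum edits turning ref[:i] into hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * dist[-1][-1] / len(ref)

# Illustrative strings: one insertion ("down") against a 3-word reference.
print(word_error_rate("the cat sat", "the cat sat down"))   # 33.33...
```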

8 Comparative Analysis of Classification and Feature Extraction Techniques

Every classification and feature extraction technique has its own merits and demerits. Table 1 [7, 33] shows a comparative analysis of HMM-based and ANN-based classification techniques. HMM-based models best describe the temporal nature of the speech signal, while ANN-based systems are better at learning and need fewer steps in comparison to other speech recognition techniques.

Table 1 Classification techniques [7, 33]

Table 2 [7, 33] shows a comparative analysis of different feature extraction techniques. LPC-based coefficients are easy to compute, while perception-based coefficients such as MFCCs and PLPs provide better recognition accuracy due to the use of nonlinear functions that mimic the auditory model.

Table 2 Feature extraction techniques [7, 33]

9 Conclusion

This paper has highlighted the technical details required for speech recognition and attempted to review the speech recognition area over the past years, focusing on the different trends and techniques used, during which several technical advancements have occurred. Research findings show that significant work has been done on feature extraction, classification, and the development of resources for speech recognition research. Several standard speech corpora have been designed to meet the requirements of speech recognition system development, and different open-source tools based on HMMs and ANNs are available for research work. This work also presented a comparative analysis of feature extraction and classification techniques with their merits and demerits. Because language models limit the search space in speech recognition, their importance has been explained, and finally, evaluation parameters for speech recognition have been discussed. However, despite substantial research, robust speech recognition systems are still needed, especially for under-resourced languages, to bridge the digital divide. We hope this paper helps the research community go deeper into speech recognition research.