1 Introduction

Speech is the primary medium of human interaction. However, interaction is no longer restricted to people; it also includes machines. Automatic Speech Recognition (ASR) is a technology for facilitating human–machine interaction. For example, speech-based conversational assistants such as Apple Siri, Google Assistant, and Amazon Alexa are immensely popular, providing a wide range of services such as managing smart home devices and performing various activities via voice requests (Dua et al. 2022; Izbassarova et al. 2020). In recent decades, researchers have shown great interest in automating simple functions that require human–machine interaction (Nassif et al. 2019). Extensive studies, data collection, and research have been carried out for high-resource languages such as Spanish and English, whereas regional and low-resource languages such as Bangla and Punjabi still offer tremendous opportunities for improvement (Kadyan et al. 2018; Bhatt et al. 2021).

An ASR system consists mainly of a front-end and a back-end. At the front-end, feature extraction techniques are applied to extract relevant features, and at the back-end all the prediction work happens. Traditionally, Mel Frequency Cepstral Coefficients (MFCC) (Mohan 2014) have been used for speech feature extraction. However, in noisy environments the reliability of MFCC diminishes. As a result, noise-resistant techniques such as Gammatone Frequency Cepstral Coefficients (GFCC) (Shao et al. 2009) and hybrids of other approaches are becoming popular. Constant Q Cepstral Coefficients (Yu et al. 2017) were originally designed to extract features in Automatic Speaker Verification (ASV) systems for spoof detection (Wang et al. 2017). As a hybrid approach, Acoustic Ternary Patterns (ATP) (Malik et al. 2020) have also been fused with other feature extraction techniques in ASV, computer vision, and image processing systems. The research in this paper likewise focuses on different front-end combinations, MFCC + ATP, GTCC + ATP, and CQCC + ATP, to extract valuable information from the speech signal.

Similarly, at the back-end of an ASR system, the Hidden Markov Model (HMM) (Renals et al. 1994) and the Gaussian Mixture Model (GMM) (Pujol et al. 2004) have been the most common approaches for classifying speech samples. In recent years, researchers have also experimented with a variety of deep learning and machine learning models, such as the Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Support Vector Machine (SVM) (Ganapathiraju et al. 2004) for audio classification, and with bi-gram, tri-gram, and n-gram models (Isotani et al. 1994) as language models.

The introduction of external language models has recently demonstrated a considerable increase in accuracy for end-to-end ASR and neural machine translation. This method is known as shallow fusion, in which the external language model is combined with the decoder in the log-probability domain during decoding (Kumar et al. 2014; Mori et al. 2021). A lot of research has been done on ASR for the Bangla language utilizing different speech corpora, investigating feature extraction strategies such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), and Dynamic Time Warping (DTW), acoustic models such as Deep Neural Networks (DNN) and the Hidden Markov Model–Gaussian Mixture Model (HMM-GMM), and language modeling techniques such as N-grams, monophone, and triphone models. Most ASR systems for the Bangla language have been developed using small datasets of digits, isolated words, and phonemes. However, large vocabulary continuous speech recognition (LVCSR) faces many challenges due to the non-availability of a large corpus, morphological parsing difficulty, and accent variability in the Bangla language (Samin 2021; Kibria et al. 2020).

The work presented in this paper contributes modified feature extraction techniques combined with a hybrid acoustic modeling and decoding strategy to develop a more robust ASR system.

The remainder of the article is structured as follows: Sect. 2 discusses recent work and our contribution, and Sect. 3 discusses the fundamental techniques required to develop the proposed system. Section 4 presents the architecture of the proposed ASR system; Sect. 5 explains the experimental details and performance analysis, followed by Sect. 6, which concludes the proposed work.

2 Related work and contribution

This section examines the relevant works and our contribution to this field. Experiments on multiple audio feature extraction approaches at the front-end and diverse classification models at the back-end have expanded the literature. For ASR systems to operate even moderately well, a vast amount of training data is required. Major languages such as English, Chinese, and Spanish have several state-of-the-art speech corpora available for successful ASR system training, whereas such large amounts of data are not available for regional languages like Bangla, Punjabi, and Hindi, and only a few scholars are working on them (Wang et al. 2019; Jain et al. 2019). In fact, it is estimated that only about 1% of the world's languages have the voice corpora needed to train an ASR system (Scharenborg et al. 2017).

The first voice recognition systems appeared in the 1920s. Later, the development of voice recognition technology was advanced by the independent efforts of researchers from all around the world. Research on Bengali ASR systems began in the 2000s. Karim et al. (2002) proposed a model for the recognition of Bengali spoken letters. The pioneering work in this field relied heavily on small self-created datasets and statistical methodologies. The first use of neural networks was seen in 2009, by Paul et al. (2009). The authors began by pre-processing the input speech with pre-emphasis and the Hamming window. Then, 12-dimensional Linear Predictive Coding (LPC) was used to generate voice characteristics, which were fed into an Artificial Neural Network (ANN) to recognize speech. However, the study was done with a small sample of four speakers, and no performance evaluation was provided. Muhammad et al. (2009) proposed a model for Bangla digit recognition using MFCC at the front-end and HMM at the back-end, achieved word accuracy of more than 90%, and suggested that a portion of the performance decrease was caused by dialectal divergence. In gender-dependent trials, digits uttered by female speakers exhibited better accuracy than digits uttered by male speakers.

In 2010, Rahman et al. (2010) introduced a method to segment continuous waveforms of Bangla speech. The authors used mean windows for separating each word from continuous speech. Based on the gaps in every fragmented word, each segmented word was then assigned to one of three groups: mono-, di-, or tri-syllable. However, the dataset on which the researchers achieved 98.48% accuracy was very small. Rahman and Khatun (2011) proposed a model for isolated-word recognition on a Bangla corpus. They implemented MFCC at the front-end together with a direct Euclidean distance measure and achieved recognition rates of 84.28% and 96% for multiple and single speakers, respectively. Ahmed et al. (2015) implemented a model using a deep belief network: MFCC features were extracted at the front-end and then trained with a generative ANN model composed of an HMM with multiple layers of Boltzmann machines, achieving an overall accuracy of 94.05%. Nahid et al. (2016) proposed a novel approach to recognize Bangla digits; they used Avro, a Unicode-based writing software, together with the CMU Sphinx4 speech recognition API, and achieved 75% accuracy.

Bhowmik et al. (2017) used a DNN model for the first time on a Bangla corpus. The architecture used stacks of deep convolutional autoencoders that took pre-trained MFCC features as input. After the autoencoders' training, a three-layer multi-layer perceptron was employed to predict the phoneme probabilities. On a self-created dataset that is no longer available, the baseline obtained 82.5% phonetic categorization accuracy. Al Amin et al. (2019) proposed a hybrid approach by combining GMM and DNN models with HMM and reported that GMM-HMM performed better.

Since 2020, a new era of research has started in the area of Bangla speech corpora, with many researchers performing experiments and publishing their work. Paul et al. (2021) proposed a system for continuous speech recognition in which a subspace Gaussian mixture model was used together with a quad-gram language model (LM) and an HMM acoustic model (AM). In a three-fold cross-validation trial, the system's WER using a properly trained LM was observed to be as low as 5%. Mandal et al. (2020) proposed an end-to-end model implemented with a CTC-based CNN-RNN architecture and achieved a total WER of 13.67%.

CQCC is one of the most popular choices for feature extraction in ASV systems. Cai et al. (2017) explored CQCC features that were later fed to a GMM classifier, a fully connected DNN, and a bi-LSTM model; the proposed architecture improved the system's performance significantly in spoofed environments. Saranya et al. (2018) proposed a novel approach using CQCC + MFCC for replay attack detection in an ASV system, where a GMM model was used at the back-end to perform the relevant task (Chakravarty et al. 2023b).

Javed et al. (2021) explored joint ATP and GTCC features to protect voice-controlled systems from voice spoofing attacks in single-hop and multi-hop networks.

Recently, Adhikary et al. (2021) proposed an approach using Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) at the back-end (Joshi et al. 2023) and MFCC at the front-end, achieving an overall accuracy of 47% with GRU and 45.81% with the LSTM model. Das et al. (2021) used a mixed-language corpus of Bangla-English spoken digits, created their dataset in a noisy environment, and used MFCC at the front-end and a CNN model at the back-end.

Recently, Yang et al. (2022) proposed a Conformer-based model in which acoustic modeling is performed using a convolution-augmented attention mechanism. The authors used the CHiME-4 corpus and achieved a WER of 6.25%. The advantage of this model is that it reduces the total training time while having a relatively small model size. Its disadvantage is that increasing the number of encoders also increases the WER, which degrades the model's performance. Dua and Akansha (2023) used hybrid features combining MFCC and CQCC for Gujarati-language automatic speech recognition.

Rakib et al. (2022) proposed work for improving the Bangla ASR system by utilizing an N-gram language model. The authors fine-tuned a pre-trained wav2vec2 model on the Bengali Common Voice 9.0 speech dataset. The advantage of the proposed work is that the WER improved to 4.66% for a robust ASR system, which is relatively low compared to available systems. However, the authors only fine-tuned the available pre-trained model by performing hyper-parameter tuning to obtain optimal parameter values on the training dataset. Showrav et al. (2022) proposed work to improve ASR performance on a Bengali-language corpus by applying a transfer learning framework on an end-to-end (E2E) structure. The proposed work freezes the feature extractor and uses a learning rate of 0.0003, which may degrade the system's performance.

From the literature survey, it is observed that most researchers have generally used MFCC at the front-end and a GMM-HMM based model at the back-end. Motivated by the above approaches employed in the ASR domain (Muhammad et al. 2009; Ahmed et al. 2015; Bhowmik et al. 2017; Rakib et al. 2022), the proposed system uses three cepstral-coefficient-based feature extraction techniques, i.e., MFCC, CQCC, and GTCC (Chakravarty et al. 2023a), and one image-based technique, ATP, to form the novel combinations MFCC + ATP, CQCC + ATP, and GTCC + ATP. Furthermore, as shown in the preceding discussion, a deep-learning-based classification model improves the system's performance. Hence, the proposed system uses a Convolutional Neural Network (CNN) and a bi-directional long short-term memory (bi-LSTM) model at the back-end. The dataset used for training and evaluation is the Bangla speech dataset provided by Google. The following points summarize the contributions of the proposed system:

  • The proposed system uses static features of MFCC, GTCC, and CQCC cepstral coefficients together with image-based ATP features. Dynamic features of the MFCC, GTCC, and CQCC techniques are also fused separately with the ATP features to build the ASR system.

  • Two standalone acoustic models (2D-CNN, bi-LSTM) are used to build the back-end of the proposed ASR system. A hybrid CNN + bi-LSTM model followed by a CTC loss function is also implemented at the back-end as the acoustic model.

  • The Bangla Speech Dataset provided by Google is used to train and evaluate state-of-the-art ASR systems.

  • The performance of these hybrid-feature-based ASR systems is evaluated using WER as the performance metric. The novelty of this paper lies in the fact that fusing ATP with cepstral features improves the performance of the proposed low-resource-language ASR system, where the proposed combination of ATP–dynamic CQCC features with the integrated back-end acoustic model shows a relative improvement of 10–15% in Word Error Rate (WER) over all other experimented combinations.

  • To test the robustness of the GTCC features, we inject two forms of noise, multiplicative street noise and additive babble noise, into the clean dataset at two different signal-to-noise ratios (SNRs): 0 dB and 5 dB. To replicate noisy situations, noise is purposefully added to the clean dataset. In this noisy environment, the ATP–dynamic GTCC features with an integrated CNN–bi-LSTM back-end model are evaluated. This study helps in understanding how well the model operates in the presence of noise and indicates the resilience of the GTCC features under noisy settings.

3 Preliminaries

The fundamentals used to implement the proposed ASR system are discussed in this section.

3.1 Mel-frequency cepstral coefficients (MFCC)

Mel-Frequency Cepstral Coefficients (MFCC) is the most widely used cepstral-analysis-based feature extraction technique for implementing ASR systems. The reason is that MFCC can capture phonetically important features from speech (Kumar et al. 2014; Vergin et al. 1999).

A windowing technique is used to slice the audio waveform into sliding frames; the Fast Fourier Transform (FFT) or Discrete Fourier Transform (DFT) then provides the required data in the frequency domain, and a Mel filter bank maps the observed frequencies onto the Mel scale. The filter and the glottal source are separated in the cepstrum using the log of the power spectrum acquired from the Mel filter bank. The step-wise approach to extract the MFCC features is given below:

  • Initially, we perform the pre-emphasis operation on the recorded speech signal \(S(n)\). As a result, we get a pre-emphasized signal with boosted higher frequencies.

  • Then we slice the input \(S(n)\) into smaller frames and apply framing and windowing \(W(n)\) to remove edge discontinuities.

  • After the windowing operation, the Discrete Fourier Transform (DFT) is applied to separate the energy contained within every frequency band. The power spectrum of the speech signal in the frequency domain is given by:

    $$\left| F\left( j \right) \right|^{2} = \left| \sum\limits_{j = 1}^{P} f\left( j \right)\, e^{ - k2\pi jl/\left( P - 1 \right)} \right|^{2}$$
    (1)

where,

1 ≤ \(l\) ≤ \(P-1\), and \(|{F\left(j\right)|}^{2}\) = Power Spectrum.

  • Diverse band-pass filters are used to filter the spectrum produced by the DFT, and the power in each frequency band is computed. Then the Mel filter bank and the logarithm are applied to the resulting spectrum to obtain the log-Mel spectrogram.

  • The Mel coefficients are transformed back into the time domain using the Discrete Cosine Transform (DCT). From the DCT results, 13 MFCC feature coefficients are produced for each frame.

The following equation represents the filter bank's output when applied to the energy spectrum:

Let \(\alpha_{i}(j)\) = filter response

$$e(i) = \sum\limits_{j = 1}^{P/2} \left| F\left( j \right) \right|^{2} \cdot \alpha _{i} (j)$$
(2)

We can represent the obtained MFCC features as:

$$M\left( f \right) = \sqrt{\frac{2}{X}} \sum\limits_{j = 0}^{X - 1} \log \left[ e\left( j + 1 \right) \right] \cdot \cos \left[ \left( \frac{2k - 1}{2} \right) \cdot \frac{\pi}{X} \right]$$
(3)
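The front-end feature extraction in this work is done in MATLAB (Sect. 5); purely as an illustration, the following is a minimal Python sketch of the same MFCC pipeline, static coefficients plus delta and delta-delta dynamics, assuming the librosa library. The file path, pre-emphasis coefficient, and frame settings are illustrative rather than the paper's exact configuration.

```python
import numpy as np
import librosa

def extract_mfcc_features(wav_path, n_mfcc=13):
    """13 static MFCCs plus delta and delta-delta features (39-D per frame)."""
    y, sr = librosa.load(wav_path, sr=None)                  # keep the original sampling rate
    y = librosa.effects.preemphasis(y, coef=0.97)            # pre-emphasis (coefficient assumed)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # static features, shape (13, frames)
    delta = librosa.feature.delta(mfcc)                      # first-order (delta) dynamics
    delta2 = librosa.feature.delta(mfcc, order=2)            # second-order (delta-delta) dynamics
    return np.vstack([mfcc, delta, delta2])                  # shape (39, frames)

# feats = extract_mfcc_features("bangla_utterance.flac")     # hypothetical file name
```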

3.2 Gammatone cepstral coefficients (GTCC)

Gammatone Cepstral Coefficients (GTCC) belong to the group of noise-resilient feature extraction approaches. GTCC is a more thorough auditory model based on the Equivalent Rectangular Bandwidth (ERB) scale and a collection of Gammatone filter banks (Arafa et al. 2018; Rademacher et al. 2006).

The early operations, such as windowing and the Fourier transform, are comparable to those used for MFCC. The output produced after the DFT stage is then filtered with the Gammatone filter bank, and the final feature vector contains 14 coefficients in total: 13 cepstral coefficients and one energy coefficient. The GTCC feature coefficients can be extracted using the following equation:

$$G\left( x \right) = \sqrt{\frac{2}{K}} \sum\limits_{l = 1}^{K} \log \left( S_{p} \right) \cos \left[ \frac{l\pi}{K} \left( j - \frac{1}{2} \right) \right], \quad 1 \le j \le J$$
(4)

where,

\(K\) = number of filters in the filter bank

\({S}_{p}\) = spectral energy of the \(p\)-th band

\(x\) = total number of coefficients

3.3 Constant Q cepstral coefficients (CQCC)

During the implementation of an ASV system, Constant Q Cepstral Coefficients (CQCC) feature extraction is utilized to extract meaningful information from the captured speech signal. This method has recently been shown to be among the most feasible for building reliable and accurate ASV systems (Oh et al. 2014; Mittal et al. 2021).

The Constant Q Transform (CQT) is used in the CQCC feature extraction procedure, which then takes the log of the power spectrum. It also uses resampling before computing the DCT (Oh and Chung 2014; Mittal et al. 2021). It returns the CQCC features after the number of feature coefficients has been set. The following is a mathematical depiction of the CQCC feature extraction approach:

$${C}_{PQR}\left(s\right)=CQT(p(t))$$
(5)
$${C}_{CQCC}\left(i\right)={\sum }_{s=0}^{I}\log{\left|{C}_{PQR}(s)\right|}^{2}\cos\left\{\frac{i\left(s-0.5\right)\pi }{I}\right\}$$
(6)

Here, Eq. (5) computes the Constant Q Transform (CQT) of the input speech signal \(p(t)\), giving \({C}_{PQR}\left(s\right)\), and Eq. (6) computes the CQCC coefficients \({C}_{CQCC}(i)\), where \(I\) is the number of linearly spaced bins and \(i\) is the coefficient index.
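CQCC extraction in this work is likewise performed in MATLAB; the snippet below is only a simplified Python illustration of the pipeline in Eqs. (5)–(6): constant Q transform, log power spectrum, then DCT. It assumes librosa and scipy and omits the uniform resampling step of the full CQCC algorithm, so it should be read as a sketch rather than the paper's implementation.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_cqcc_like(y, sr, n_coeffs=20):
    """Simplified CQCC-style features: CQT -> log power spectrum -> DCT (resampling omitted)."""
    C = librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12)   # constant Q transform, Eq. (5)
    log_power = np.log(np.abs(C) ** 2 + 1e-10)                 # log of the power spectrum
    cqcc = dct(log_power, type=2, axis=0, norm='ortho')        # cepstral coefficients, Eq. (6)
    return cqcc[:n_coeffs, :]                                  # keep the first 20 coefficients
```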

3.4 Acoustic ternary patterns (ATP)

Local Ternary Patterns (LTP) have been studied as one of the most popular descriptors in the field of image analysis and computer vision. The primary premise of LTP is to compare each pixel in a picture with its neighbors. When an image pixel is compared to its neighbors, it generates binary values of '0' or '1'. This aids in summarizing a local structure in an image as well as generating powerful feature descriptions (Aziz et al. 2019). Face recognition, texture analysis, and ASV systems are some examples of promising uses. LTPs have a minimal processing cost and are resistant to monotonic grey-scale variations. Analogous to the image-descriptor LTP, we calculate 1-D Acoustic Ternary Pattern (ATP) features (Malik et al. 2020) for acoustic signals by dividing the whole speech signal into frames and then, inside each frame, comparing the neighboring values with threshold values. Instead of taking the next pixel, as in image processing, we take the next speech-signal value. A local ATP response is calculated by taking a constant linear distance \(\pm\alpha\) from the central sample and quantizing it to +1, 0, or −1, giving the three-valued function below:

$$F\left( n^{j}, c, \alpha \right) = \left\{ \begin{array}{ll} -1, & n^{j} - \left( c - \alpha \right) \le 0 \\ 0, & \left( c - \alpha \right) < n^{j} < \left( c + \alpha \right) \\ +1, & n^{j} - \left( c + \alpha \right) \ge 0 \end{array} \right.$$
(7)

The detailed procedure to extract these ATP features is explained in Sect. 4.2.
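The exact pattern encoding of the paper's Function 2 is given only as a pseudocode figure in Sect. 4.2; the sketch below merely illustrates the ternary thresholding of Eq. (7) on framed 1-D audio, assuming NumPy. The frame length, hop size, offset α, and the split of the ±1 positions into 'upper' and 'lower' patterns are assumptions; the paper additionally compresses these patterns to 10 upper and 10 lower features, which is not reproduced here.

```python
import numpy as np

def atp_codes(frame, alpha=0.05):
    """Ternary codes of Eq. (7): compare each sample with the center value c +/- alpha."""
    c = frame[len(frame) // 2]                      # center sample of the frame
    codes = np.zeros(len(frame), dtype=int)
    codes[frame >= c + alpha] = 1                   # upper threshold exceeded -> +1
    codes[frame <= c - alpha] = -1                  # lower threshold exceeded -> -1
    return codes

def atp_patterns(signal, frame_len=400, hop=160, alpha=0.05):
    """Per-frame upper/lower binary patterns derived from the ternary codes."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        codes = atp_codes(signal[start:start + frame_len], alpha)
        upper = (codes == 1).astype(int)            # 'upper' pattern (positions coded +1)
        lower = (codes == -1).astype(int)           # 'lower' pattern (positions coded -1)
        feats.append(np.concatenate([upper, lower]))
    return np.array(feats)
```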

3.5 Two-dimensional convolutional neural network (2D-CNN)

A Convolutional Neural Network (CNN) (Haque, M. A., 2020) is particularly adept at processing input with a matrix structure, such as an image. Lately, CNN models have been used in ASR systems as classification models. A typical CNN model is made up of several different layers, such as convolutional, pooling, and fully connected layers. A two-dimensional CNN uses two-dimensional convolutional kernels that move across both dimensions of the input.

3.6 Bi-directional long-short term memory (bi-LSTM) model

Long Short-Term Memory (LSTM) (Kim et al. 2019) is a kind of RNN that can learn long-term dependencies; it excels at memorizing information over long periods of time. A bidirectional LSTM (bi-LSTM) differs from a conventional LSTM in that the input is processed in both directions, i.e., forwards and backwards. Due to this, the bi-LSTM model propagates sequence information both from past to future and from future to past.

3.7 Connectionist temporal classification (CTC) loss

Connectionist Temporal Classification (CTC) loss (Scheidl et al. 2018) is a technique often used with DNNs for sequence problems such as handwriting and speech recognition. This method is required when the input must be aligned with the desired output. It works by summing the likelihoods of all possible input-to-target alignments, which results in a loss value that is differentiable with respect to each input node.
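The back-end in this work is trained with CTC loss and decoded with the Keras greedy decoder (Sect. 4); the following is a minimal sketch of how these are commonly wired in TensorFlow/Keras, with tensor shapes and padding handling simplified.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

def ctc_loss(y_true, y_pred):
    """CTC loss: y_pred is (batch, time, vocab) softmax output, y_true holds padded label ids."""
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_len = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    label_len = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    return keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

def greedy_decode(y_pred):
    """Best-path (greedy) CTC decoding of the network output."""
    input_len = np.ones(y_pred.shape[0]) * y_pred.shape[1]
    decoded = keras.backend.ctc_decode(y_pred, input_length=input_len, greedy=True)[0][0]
    return decoded
```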

4 Proposed architecture

The architecture of the proposed method is described in this section. The proposed ASR system's design is depicted in Figs. 2, 4, and 6. The front-end uses three different cepstral features, MFCC, GTCC, and CQCC, and then combines them with image-based ATP features. The proposed system is trained and evaluated using the state-of-the-art OpenSLR Bangla-language dataset. The Bangla speech corpus is divided into three parts: training, validation, and testing. Training accounts for 70% of the corpus, validation for 20%, and testing for the remaining 10%, with the three subsets containing distinct speech instances.

The steps in the GTCC and CQCC feature extraction methods are quite analogous to those in the MFCC method. As mentioned before, at the back-end the proposed work investigates three different models: a 2D-CNN, a bi-LSTM, and a hybrid model with a CNN-based encoder followed by bi-LSTM layers. As the decoder, we have used the CTC-based greedy decoder provided by Keras, with CTC loss as the loss function. A Bangla dictionary has been used in the data generation. We have classified our proposed systems broadly into three systems: System-1 is based on MFCC + ATP features, System-2 on GTCC + ATP features, and System-3 on CQCC + ATP features.

All three models take the cepstral coefficients as input together with the image-based ATP features and generate the recognized output. Both static and dynamic (delta Δ and delta-delta Δ–Δ) cepstral coefficient features have been used for the evaluation of these models. Among all these features, the CQCC and ATP features are used for the first time in speech recognition. All three of the suggested ASR systems are discussed in the next sub-sections.

4.1 MFCC + ATP based ASR system (System-1)

Figure 1 illustrates the procedure to extract the MFCC and ATP features. In MFCC, the first 13 coefficients are expanded to obtain a maximum of 39 features.

Fig. 1

Proposed Joint MFCC + ATP based ASR system (System 1)

The first 13 features are known as static features, while the rest are dynamic features formed by evaluating the first- and second-order derivatives of the static features, known respectively as delta (\(\Delta\)) and delta-delta (\(\Delta-\Delta\)) features.

figure a

To extract the MFCC features, Function 1 is used, which leverages the built-in \(mfcc()\) function with the energy coefficient set to be ignored.

The output generated by \(mfcc()\) is in matrix form; hence, to merge it with the other features, a median function is used. Motivated by the use of 2D-LTP in image processing (Das et al. 2021), we apply this idea to 1-D voice signals to adequately describe the acoustic signal and call the resulting descriptors Acoustic Ternary Patterns (ATP). When applied to 1-D signals such as audio, the ATP approach helps capture important information about the audio's local temporal dynamics. In total, we compute 20 ATP features, comprising 10 upper and 10 lower features. Function 2 is leveraged to extract the required ATP features; it takes the audio signal as input and produces the ATP features as output.

The Two-Dimensional Convolutional Neural Network (2D-CNN) is trained using the extracted static, delta, and delta-delta MFCC features fused with the 20 ATP features. The suggested 2D-CNN design is made up of four types of layers: convolutional, max-pooling, flatten, and dense layers. Two units with soft-max activation make up the final dense layer. Dropout layers of 20% are also included in the design to prevent overfitting. A learning rate of 0.01 is used in the model. Figure 2 represents the architecture of the 2D-CNN model implemented in our ASR system.

Fig. 2

Architecture of 2D-CNN Model
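Purely as an illustration of the 2D-CNN back-end described above (convolution, max-pooling, flatten and dense layers, a two-unit soft-max output, 20% dropout, learning rate 0.01), here is a hedged Keras sketch; the filter counts, kernel sizes, and the input shape are assumptions, since the paper does not list them.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_2d_cnn(input_shape=(59, 200, 1), n_out=2):
    """2D-CNN back-end: Conv -> MaxPool -> Dropout blocks, then Flatten and Dense layers."""
    model = keras.Sequential([
        layers.Input(shape=input_shape),                    # fused cepstral + ATP feature map (assumed shape)
        layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),                                # 20% dropout against overfitting
        layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(n_out, activation='softmax'),          # final dense layer with soft-max units
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01))  # Adam, learning rate 0.01 (Sect. 5)
    return model
```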

figure b

The proposed bi-LSTM model uses five bi-LSTM layers having 50, 100, 150, 200, and 250 units, respectively, with ReLU as the activation function. The input is applied to the first layer, and a 20% dropout is imposed after each LSTM layer; the output of these layers is transmitted to a dense layer of 24 units. The output of this dense layer is sent to the final layer, a dense layer with a soft-max activation function. The model is trained using the static and dynamic features of the above-described feature extraction method together with the 20 ATP features. Figure 3 represents the proposed ASR system on the MFCC + ATP features using bi-LSTM as the back-end model.

Fig. 3

Proposed bi-LSTM Model Architecture
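A minimal Keras sketch of the stand-alone bi-LSTM back-end described above: five bidirectional layers of 50–250 units with ReLU activation, 20% dropout after each, a 24-unit dense layer, and a soft-max output. The input feature dimension and output vocabulary size are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(input_shape=(None, 59), vocab_size=64):
    """Stacked bi-LSTM back-end with 20% dropout after each recurrent layer."""
    inputs = keras.Input(shape=input_shape)                   # (time, fused feature dimension)
    x = inputs
    for units in (50, 100, 150, 200, 250):                    # five bi-LSTM layers
        x = layers.Bidirectional(
            layers.LSTM(units, activation='relu', return_sequences=True))(x)
        x = layers.Dropout(0.2)(x)                            # 20% dropout
    x = layers.Dense(24, activation='relu')(x)                # 24-unit dense layer
    outputs = layers.Dense(vocab_size, activation='softmax')(x)
    return keras.Model(inputs, outputs)
```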

In our proposed hybrid model, we have implemented an end-to-end model using deep CNN layers with a Very Deep Convolutional Network (VGG Net) style architecture as the encoder, after which stacked bidirectional long short-term memory (bi-LSTM) layers are added. Three two-dimensional (2-D) convolution layers, each having 512 filters of shape 1 × 7, encode the input sequence embedding, each followed by layer normalization. After the first convolutional layer, we add a 2-D max-pooling layer. After this encoder, three bidirectional LSTM layers with 256 units in each direction follow, each with layer normalization. A dense layer with a soft-max activation function is applied at the end. Figure 4 represents the proposed hybrid 2D-CNN + bi-LSTM back-end model of our ASR system.

Fig. 4

Proposed Hybrid Model Architecture
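As an illustration of the hybrid encoder described above (three 1 × 7 convolution layers with 512 filters and layer normalization, max pooling after the first convolution, three 256-unit bidirectional LSTM layers with layer normalization, and a soft-max dense output), here is a hedged Keras sketch; the input shape, pooling size, and vocabulary size are assumptions, and the CTC loss is attached at training time as in Sect. 5.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_hybrid_cnn_bilstm(input_shape=(None, 59, 1), vocab_size=64):
    """VGG-style 2D-CNN encoder followed by stacked bi-LSTM layers and a soft-max output."""
    inputs = keras.Input(shape=input_shape)                        # (time, features, channel)
    x = inputs
    for i in range(3):                                             # three conv layers, 512 filters of shape 1x7
        x = layers.Conv2D(512, (1, 7), padding='same', activation='relu')(x)
        x = layers.LayerNormalization()(x)
        if i == 0:
            x = layers.MaxPooling2D(pool_size=(1, 2))(x)           # 2-D max pooling after the first conv
    x = layers.TimeDistributed(layers.Flatten())(x)                # one feature vector per time step
    for _ in range(3):                                             # three bi-LSTM layers, 256 units per direction
        x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
        x = layers.LayerNormalization()(x)
    outputs = layers.Dense(vocab_size, activation='softmax')(x)    # soft-max output, trained with CTC loss
    return keras.Model(inputs, outputs)
```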

4.2 GTCC + ATP based ASR System (System-2)

Figure 5 illustrates the procedure to extract the GTCC and ATP features. The back-end acoustic models are the same as those implemented in System-1. In the case of GTCC, the early operations, such as windowing and the Fourier transform, are comparable to those used for MFCC.

Fig. 5

Proposed Joint GTCC + ATP based ASR system (System 2)

The output produced after the DFT stage is then filtered with the Gammatone filter bank, and the final feature vector contains 42 coefficients in total. These features are then fused with the 20 ATP features (10 lower-bound and 10 upper-bound features). Function 3 takes the speech signal as input, first generates \({coef}_{i}\)[] containing 14 GTCC feature coefficients in total (1 energy coefficient + the first 13 GTCC features), and returns the extracted GTCC features as output in the form of \({Del}_{del}[]\).

figure c

4.3 CQCC + ATP based ASR System (System-3)

The Constant Q Transform (CQT) is used in the CQCC feature extraction procedure, which then takes the log of the power spectrum. It also uses resampling before computing the DCT (Rahman et al. 2011; Rakib et al. 2022). It returns the CQCC features after the number of feature coefficients has been set. Function 4 takes the speech signal as input and returns the extracted CQCC features as output in the form of \({Del}_{del}\left[\right]\). In the function, \({coef}_{i}[]\) gives the first 20 CQCC features, whereas the output contains 60 CQCC features in total.

This is the first time the CQCC features are introduced in the field of speech recognition; previously, this feature has been utilized only in the area of ASV systems. We calculate a total of 60 CQCC features by extracting 20 static, 20 first-order derivative (\(\Delta\)), and 20 second-order derivative (\(\Delta-\Delta\)) coefficients. Figure 6 illustrates the procedure to extract the CQCC and ATP features, whereas the back-end models implemented for developing the ASR system are similar to those used in System-1.

figure d
Fig. 6

Proposed CQCC + ATP based ASR system (System 3)

5 Experimental details & performance analysis

The MATLAB tool is used to extract the features at the front-end. The back-end code is written in Python, and Jupyter Notebook is used as the execution environment. We compiled our model using the Adam optimizer. Further, the Connectionist Temporal Classification (CTC) loss function is used to train the model. Model-Checkpoint is used to save the model weights so that the weights that produced the best performance can be utilized later.

5.1 Dataset

The dataset used to implement our models is the Large Bengali ASR dataset provided by OpenSLR, the largest existing dataset for the Bangla language, with a total duration of 250 h (Xiao et al. 2018). We have divided the whole corpus into three parts, i.e., training, validation, and testing datasets, where all three datasets contain different speech instances with respect to each other. Training accounts for 70% of the corpus, validation for 20%, and testing for the remaining 10%. Each folder has two subfolders: one named train_clean, test_clean, or val_clean accordingly, containing the .flac audio files and a text file, and another named train_all, test_all, or val_all, containing the pickle dump and a text file. Table 1 summarizes the dataset.

Table 1 Characteristics of the OpenSLR’ Bangla Speech dataset (Hasan et al. 2019)

We injected random street noise and babble noise into the audio dataset using two different techniques: multiplicative and additive mixing. Both noises are taken from the NOIZEUS dataset (Hirsch et al. 2000), which contains noise at 0 dB, 5 dB, 10 dB, and 15 dB; however, we use only the 0 dB and 5 dB noise for our experiments. The lengths of the noisy audio samples and the clean audio samples are different. Function 5 provides the pseudo code to add the street noise to the clean audio sample; a Python sketch of the same mixing follows the line-wise description below.

figure e
  • Line 1: The function \(audioread\) is used to read the audio data from the files specified in the speech and noise variables. After execution, \(a\) is a matrix containing the audio samples of the clean speech, and \(f\) is the sampling frequency of the clean speech audio. Similarly, \(b\) is a matrix containing the audio samples of the street noise, and \(f1\) is the corresponding sampling frequency.

  • Line 3: This line calculates the minimum length between the audio samples of the clean speech and the street noise.

  • Line 4: This line creates the noisy dataset by multiplying the audio samples of the clean speech and the street noise element-wise. The \(.*\) operator performs element-wise multiplication on the corresponding elements of the two matrices \(a\) and \(b.\) The resulting matrix \(q\) will contain the mixed audio samples of the noisy dataset.
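The paper's Function 5 is MATLAB pseudocode; the snippet below is a minimal Python equivalent of the element-wise (multiplicative) mixing it describes, assuming the soundfile library. The file paths are placeholders.

```python
import soundfile as sf

clean, fs = sf.read("clean_utterance.flac")        # Line 1: read the clean speech (placeholder path)
noise, fs_n = sf.read("street_noise_0dB.wav")      # Line 1: read the street noise (placeholder path)

n = min(len(clean), len(noise))                    # Line 3: use the shorter of the two lengths
noisy = clean[:n] * noise[:n]                      # Line 4: element-wise (multiplicative) mixing
sf.write("noisy_utterance.wav", noisy, fs)         # save the resulting noisy sample
```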

Similarly, we added additive babble noise to our clean audio dataset. The babble noise audios are selected from the NOIZEUS dataset (Hirsch et al. 2000). The line-wise description of Function 6 is given below, followed by a short Python sketch of the additive mixing.

figure f
  • Line 1, Line 2: Check if the lengths of \(clean\_signal\) and \(babble\_noise\) are not equal. If they are not equal, it means that the babble noise and clean signal have different lengths.

  • Line 3: Checks if the sampling frequencies of \(clean\_signal\) and \(babble\_noise\) are not equal. If they are not equal, it means that the two audio signals have different sampling frequencies. If the sampling frequencies are different, the \(babble\_noise\) is resampled using the resample function to match the length of the \(clean\_signal\).

  • Line 4: If the sampling frequencies are the same, the \(babble\_noise\) is trimmed to the length of the \(clean\_signal\) using the \(trim\_to\_len\) function.

  • Line 5: The \(babble\_noise\) is normalized to match the amplitude of the \(clean\_signal\). The normalize function scales the \(babble\_noise\) by the ratio of the amplitude of the \(clean\_signal\).

  • Line 6: The \(clean\_signal\_power\) is calculated as the mean of the squared values of the \(clean\_signal\).

  • Line 7: The \(target\_noise\_power\) is computed as the \(clean\_signal\_power\) divided by the power ratio calculated from the desired SNR (0 dB and 5 dB).

  • Line 8: The \(babble\_noise\_power\) is computed as the mean of the squared values of the \(babble\_noise\).

  • Line 9: The \(babble\_noise\) is scaled by the ratio of \(target\_noise\_power\) to \(babble\_noise\_power\). The scale function multiplies each sample of the \(babble\_noise\) by the scaling factor.

  • Line 10: Finally, the \(scaled\_bn\) is added to the \(clean\_signal\) to create the \(noisy\_signal.\)

  • Line 11: The \(noisy\_signal\) represents the clean audio signal with the babble noise added at the desired SNR.
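A minimal Python sketch of Function 6's additive mixing, assuming NumPy: the babble noise is length-matched, scaled to the target SNR, and added to the clean signal. The resampling and trimming helpers of the original pseudocode are simplified to tiling and slicing here.

```python
import numpy as np

def add_babble_noise(clean, babble, snr_db):
    """Add babble noise to a clean signal at the requested SNR (0 dB or 5 dB in this work)."""
    if len(babble) < len(clean):                                     # Lines 1-4: length-match the noise
        babble = np.tile(babble, int(np.ceil(len(clean) / len(babble))))
    babble = babble[:len(clean)]
    clean_power = np.mean(clean ** 2)                                # Line 6: clean-signal power
    target_noise_power = clean_power / (10 ** (snr_db / 10))         # Line 7: noise power for the target SNR
    babble_power = np.mean(babble ** 2)                              # Line 8: babble-noise power
    scaled_bn = babble * np.sqrt(target_noise_power / babble_power)  # Line 9: scale the noise
    return clean + scaled_bn                                         # Lines 10-11: noisy signal
```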

5.2 Metric

The system performance is evaluated using the Word Error Rate (WER) metric; the WER of the model decreased with every additional epoch. To calculate the WER, we use the Levenshtein distance concept, which determines the smallest number of edit operations required to change one string into another (Scharenborg et al. 2017). The formula to calculate the WER is given below:

$$WER=\frac{I+D+S}{N}= \frac{I+D+S}{D+S+H}$$
(8)

Here, D is the number of deletion operations performed, I is the number of insertion operations, H is the total number of hits, S is the number of substitution operations, and N is the total number of words in the reference input.

We can also calculate the WER using the "Percentage Accuracy (PA)" and "Percentage Correct (PC)" concepts, where PA represents the word accuracy rate and PC represents the word correctness rate (Adhikary et al. 2021).

$$WER=100 \%-PA$$
(9)
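A short, dependency-free Python sketch of the WER in Eq. (8), computed via the word-level Levenshtein (edit) distance; the example strings are illustrative.

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1          # hit or substitution
            dp[i][j] = min(dp[i - 1][j] + 1,                     # deletion
                           dp[i][j - 1] + 1,                     # insertion
                           dp[i - 1][j - 1] + cost)              # substitution / hit
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# wer("ref word sequence", "hyp word phrase") -> 1/3, since one of three reference words differs
```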

5.3 Result and analysis

To analyze the performance of our models, a total of eighteen experiments have been carried out. In the given approach, a combination of extracted feature vectors is constructed before classification. As discussed earlier, all the experiments have been carried out on three systems. There are two classification tasks in each system, i.e., WER on static features and WER on delta-delta features.

5.3.1 Experiment 1: stand-alone model scenario

In the first experiment, two models, the bi-LSTM and the 2D-CNN, were tested on the same datasets to calculate the WER.

5.3.1.1 Performance analysis on 2D-CNN model

To demonstrate the efficiency of the cepstral coefficients combined with the image-based features, we evaluated the static and dynamic features of each described system on the 2D-CNN model and analyzed the performance of these systems with this back-end model. Table 2 summarizes the results achieved on the static features, whereas Table 3 summarizes the results obtained on the dynamic features.

Table 2 WER for 2D-CNN model using static features
Table 3 WER for 2D-CNN model using dynamic features

We can observe that System-2 with dynamic cepstral coefficients outperforms the other cepstral features when the 2D-CNN model is used as the back-end. Tables 4 and 5 summarize the System-2 results on static and dynamic features while using the street-noise and babble-noise datasets, respectively.

Table 4 WER for 2D-CNN model using multiplicative street noisy dataset on system-2
Table 5 WER for 2D-CNN model using Additive Babble Noisy dataset on system-2
5.3.1.2 Performance analysis on bi-LSTM model

Similar to the above experiment, to evaluate the performance of all the systems, we carried out experiments on the bi-LSTM model. In these experiments we evaluated each system's performance by taking the static and dynamic features of each cepstral technique. Table 6 summarizes the results achieved on the static cepstral features, whereas Table 7 summarizes the results obtained by applying the dynamic cepstral features to the systems.

Table 6 WER for bi-LSTM model using static features
Table 7 WER for bi-LSTM model using dynamic features

By observing Tables 6 and 7, we can see that System-3 with dynamic cepstral coefficients outperforms the other systems when the bi-LSTM model is used as the back-end. Tables 8 and 9 summarize the System-2 results on static and dynamic features while using the noisy datasets created by adding street noise and babble noise to the clean signals, respectively.

Table 8 WER for bi-LSTM model using Multiplicative Street Noisy dataset on system-3
Table 9 WER for bi-LSTM model using additive Babble Noisy dataset on system-3

5.3.2 Experiment 2: hybrid model scenario

In the hybrid-model scenario, we followed the same evaluation approach, i.e., we experimented with the model on both static and dynamic features of all three systems. Table 10 summarizes the results achieved on the static features, whereas Table 11 summarizes the results obtained on the dynamic features when taking the hybrid CNN + bi-LSTM model as the back-end.

Table 10 WER for CNN + bi-LSTM model using static features
Table 11 WER for CNN + bi-LSTM model using dynamic features

By observing Tables 10 and 11, we can see that System-3 with dynamic cepstral coefficients outperforms the other systems when the hybrid model is used as the back-end. Table 12 summarizes the System-2 results on static and dynamic features while using the noisy dataset created by combining street noise with the clean signals. Table 13 shows the System-2 results on static and dynamic features using the noisy signals created by adding babble noise to the clean signals.

Table 12 WER for CNN + bi-LSTM model using Multiplicative Street Noisy dataset on system-3
Table 13 WER for CNN + bi-LSTM model using Additive Babble Noisy dataset on system-3

5.4 Discussion

The outcomes mentioned above show that System-1 on the 2D-CNN model provides WERs of 29.26% and 23.28% on static and dynamic cepstral coefficients, respectively, whereas System-2 gives WERs of 22.01% and 16.43% on the same model for static and dynamic cepstral features, respectively. Among these techniques, System-3 performs best on the 2D-CNN model, with WERs of 20.59% and 15.73% on its static and dynamic features. In the case of the stand-alone bi-LSTM model, System-3 again outperforms the other systems, with the lowest WERs of 16.02% and 14.32% on its static and dynamic CQCC features, respectively. In the hybrid-model scenario, System-3 again exceeds the other systems' performance, producing WERs of 3.8% and 0.998% on its static and dynamic CQCC features, respectively. From all the above experiments, it is observed that System-2 and System-3 perform close to each other, with a relative difference of 4–5% in WER, whereas System-1 produces the highest WER on every back-end model. Figure 7 represents a comparative analysis of all the systems at their lowest WER.

Fig. 7

Lowest WER on static and dynamic features of all the systems

5.5 Computational cost analysis

In analyzing computational costs for the three audio feature extraction methods, ATP-MFCC, ATP-GTCC, and ATP-CQCC, it is crucial to consider factors such as time complexity, memory usage, and computational requirements. The overall computational requirements depend on hardware and implementation optimizations. Memory usage is directly proportional to the number of frames and the feature-vector dimensionality, and temporary variables and buffers are needed during computation. ATP-MFCC and ATP-GTCC have a time complexity of approximately O(NF log(F)), due to similar processing steps (Chakravarty et al. 2022).

ATP-GTCC involves computationally demanding operations like filtering, envelope extraction, and DCT, which require computational resources such as floating-point arithmetic and memory access. ATP-CQCC, on the other hand, has higher computational requirements due to constant Q transform (CQT) computation and cepstral coefficient computation. The overall time complexity can vary, but is often comparable to ATP-MFCC or ATP-GTCC.

The system described in this study employs 300 epochs for training. The computational complexity is measured in terms of the average time per epoch. In the proposed work, the hybrid 2D-CNN + bi-LSTM acoustic-model-based system requires approximately 7 min per epoch. For the individual acoustic-model-based systems, the bi-LSTM approach takes around 10 min per epoch, while the 2D-CNN approach takes approximately 12 min per epoch. All these models take almost comparable time for training and testing.

5.6 Comparative analysis with existing approaches

In this section, we compare the suggested research work with some existing methodologies. Owing to technological developments, the majority of researchers have carried out research on high-resource languages such as Mandarin, English, and Spanish, for which enough data are available. In the case of Indian-subcontinent languages such as Tamil, Kannada, and Bangla, such state-of-the-art datasets are not available; therefore, it is very difficult to implement state-of-the-art ASR systems for these languages.

Islam et al. (2021) proposed an ASR model built using MFCC at the front-end and an LSTM model at the back-end; the model was trained on a Bangla-word dataset and achieved an overall WER of 20%. Sen et al. (2021) proposed work on a Bangla digits dataset: they extracted features using the MFCC technique, fed them to a CNN model, and achieved an overall accuracy of 97.1%. The dataset utilized by the researchers was recorded in a noisy environment, with a total of 400 samples of both noisy and non-noisy utterances collected. The authors also evaluated the system with ten-fold cross-validation and observed an accuracy of 96.7%. Kibria et al. (2022) proposed a system trained on a Bangladeshi Bangla dataset named SUBAK.KO. To implement their model, they utilized an end-to-end model using an RNN and CTC with an ASR algorithm. As decoders, the authors implemented two approaches, i.e., a beam decoder and a greedy decoder, and achieved the lowest WER of 15.78% using the beam decoder. Guchhait et al. (2022) proposed seven different models for comparison on a Bangla digits speech corpus. They utilized the Kaldi toolkit, and in one of their models the researchers used a Grapheme-to-Phoneme (G2P) module with a Recurrent Neural Network (RNN). The researchers reported that the DNN-HMM based acoustic model with a Light Gated Recurrent Unit (Li-GRU) NN achieved the best WER of 4.16% while using Kaldi's feature extraction techniques. Table 14 compares our proposed system to the existing ones.

Table 14 Comparison between existing and proposed approach

6 Conclusion

The work in this paper proposed the combination of cepstral-based coefficients with image-based features. These features were applied to CNN, bi-LSTM, and a more potent hybrid acoustic model having a 2D-CNN encoder with bi-LSTM, using the Connectionist Temporal Classification (CTC) loss and decoder, to build a state-of-the-art ASR system for a Bangla Large Vocabulary Continuous Speech Recognition (LVCSR) corpus. Three different systems comprising MFCC-ATP, GTCC-ATP, and CQCC-ATP fused features were implemented. The work also tested the system with GTCC-ATP features on a noisy dataset created by adding random street noise to the clean dataset at 0 dB and 5 dB SNRs. The results showed that the CQCC-ATP features with the hybrid back-end model outperformed all other systems, with a relative improvement of 8–10% in WER on Δ–Δ features. We also observed that the performance of the system in noisy conditions improved relatively by 8–10% with the hybrid model compared with the stand-alone models. This work can be further extended by employing these robust features on other low-resource noisy datasets using deep conversion techniques.