1 Introduction

Automatic Speech Recognition (ASR) systems recognize the words in a given speech signal and convert them into the corresponding text transcript. In recent years ASR has been widely used to control electronic gadgets in the form of personal assistants such as Amazon Alexa and Google Assistant. Initially the Hidden Markov Model (HMM) was used to find the phones in a speech signal and, in the early stages, to recognize words from the given speech, using past and future data to predict the present state. The HMM is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters; the challenge is to work out the hidden parameters from the observed data. In the recognition process, several variants of HMM have been used, belonging to either the discrete or the continuous density family [3]. The HMM is a memoryless model: the process does not retain the previous state, so each observation is treated individually, and consequently sentences generated by an HMM are inconsistent. The Recurrent Neural Network (RNN) overcomes this issue by generating each character conditioned on the history of characters generated so far [18]. In particular, LSTM RNNs are successful on time series data such as speech. Deep LSTM models perform excellently on large vocabulary continuous speech recognition because of their noteworthy learning capacity [14]. The LSTM relies on a network that contains three gates, namely the input, forget and output gates, to control the memory cells. Implementing LSTM is difficult because of its complex structure, and its computational complexity is also high [8, 11, 24].

Bidirectional LSTM (BLSTM) is an algorithm that processes sequential data: information from both the forward and backward states is passed simultaneously to the output layer. BLSTM has an inbuilt memory that stores previously processed data, which avoids repeating the work of recognizing the same phone again in a different context. Along with the advantages of LSTM, the bidirectional architecture processes information in both the forward and backward directions to minimize long-range dependency problems [23]. The Gated Recurrent Unit (GRU) is used to overcome the vanishing gradient problem. Like the LSTM, the GRU also has a forget mechanism, but it contains fewer parameters than the LSTM; it is better suited to smaller datasets and also produces excellent results in speech signal modeling compared with LSTM. The LSTM encompasses three gates, namely the input gate, forget gate and output gate: the input gate controls how much of the new cell state to keep, the forget gate controls how much of the present memory to discard, and the output gate controls how much of the cell state should be exposed to the following layers of the network. The GRU works using an update and a reset gate: the reset gate, placed between the previous and the candidate activation, allows the unit to forget the previous state, and the update gate chooses how much of the candidate activation to use in refreshing the cell state [20].

Although LSTM models have achieved impressive results for large vocabulary continuous speech recognition, they still struggle when applied to specific tasks, for example training for low-resource languages. Conventional LSTM (CLSTM) and Bidirectional LSTM (BLSTM) are hard to implement because of their complex training mechanism, and the vanishing gradient problem across multiple layers is a major issue. We aim to resolve these deficiencies by using a simpler gating mechanism that reduces the complexity of training and also handles the temporal dependencies [10]. In the proposed work, the SGU is introduced to reduce the complexity further while maintaining accuracy. Data flow in the GRU is controlled by two gates, which exhibit correlation and redundancy, since the contributions of the input and the previous hidden state to the current hidden state are controlled by the reset and update gates. Micro et al. [20] showed by cross-correlation analysis that the reset and update gates share nearly the same values. Based on this, the two gates of the GRU are reduced to a single gated unit (a forget gate) by coupling the reset and update gates. The deep bidirectional design is then combined with the SGU to make a hybrid acoustic model that reduces the time taken to train the model with a smaller number of parameters while maintaining comparable accuracy. The proposed model also behaves well with respect to the vanishing gradient issue. A deep SGU is built by stacking multiple SGU layers [7]. Similarly, the bidirectional design processes data in both the forward and backward directions with separate parameters, which provides both previous and future context. The proposed architecture reduces the training time per epoch by about 20%.

The remainder of the paper is organized as follows. Section 2 presents the background, Section 3 explains the proposed DBSGU, Section 4 describes the experimental setup, Section 5 reports the experimental results, and Section 6 concludes the paper.

2 Background

2.1 Recurrent Neural Network (RNN)

Sequence prediction problems have been around for quite a while and are considered among the most difficult problems in the data science industry. They cover a wide range of tasks such as stock market prediction, predicting the next word from a speech signal, and machine translation.

LSTM plays a major role in sequence prediction and performs well compared with the traditional RNN in various respects. The inbuilt memory available in LSTM avoids repeating the recognition process for the same phone when it occurs in different parts of the speech signal. In sequence prediction, the context between past and present information plays an important role; because the LSTM maintains this context better than a traditional RNN, the learning of the model improves steadily [12].

LSTM has several advantages over the RNN and the conventional feed-forward neural network. In sequence prediction problems the past history of the data plays a major role, but a conventional feed-forward neural network treats each test case individually; the RNN accommodates this temporal dependency [12]. The network structures of an RNN and an unrolled RNN are shown in Fig. 1(a) and (b).

Fig. 1
figure 1

(a) Recurrent Neural Network. (b) An unrolled RNN. At – output, Bt – input

2.2 Long-term dependencies problem

The appeal of RNNs is that they can associate past data with the present task. Sometimes only recent information is needed for the present task, but in many situations a considerable gap separates the relevant data [2]. Unfortunately, as this gap grows, the RNN becomes unable to learn how to connect the information.

2.3 LSTM networks

All RNNs have a chain of repeating neural network modules. The LSTM has an equivalent structure, but its repeating module contains four interacting layers rather than a single neural network layer, as shown in Fig. 2. Equation (1) describes the forget gate, which decides which information to discard from the cell state. Equation (2) describes the input gate, which decides what new information to store in the cell state. Equation (5) describes the output gate, which generates the LSTM output for timestamp t by applying the activation function. Equations (3), (4) and (6) give the candidate cell state, the updated cell state and the final output.

Fig. 2
figure 2

LSTM with four interacting layers

Steps involved in LSTM:

  • Step 1: The sigmoid layer of the forget gate chooses what data the cell state should discard.

$${f}_{t}=\sigma ({W}_{f} \cdot [{A}_{t-1} , {B}_{t}] + {x}_{f})$$
(1)
  • Step 2: Information is fed into the input layer, which decides what data from the candidate should be added to the new cell state [21].

    $${i}_{t}=\sigma ({W}_{i} \cdot [{A}_{t-1} , {B}_{t}] + {x}_{i})$$
    (2)
    $${\tilde{C}}_{t}=tanh({W}_{C} \cdot [{A}_{(\text{t}-1)},{B}_{t}]+{x}_{C})$$
    (3)
    $${C}_t={f}_t\ast {C}_{t-1}+{i}_t\ast {\tilde{C}}_t$$
    (4)
    $${O}_{t}=\sigma ({W}_{O} \cdot [{A}_{t-1} , {B}_{t}] + {x}_{O})$$
    (5)
    $${A}_{t}={O}_{t} * \text{tan}h\left({C}_{t}\right)$$
    (6)

At−1 represents the output of the previous LSTM cell, Bt the input at time t; Ct−1, Ct and \(\tilde{C}\)t represent the old cell state, the new cell state and the new candidate value; ft is the forget gate state, Ot the output gate and it the input gate; \(\sigma\) is the sigmoid function, x the bias of the respective gate and W the weight of the respective gate.
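For illustration, the following minimal NumPy sketch implements one LSTM time step following Eqs. (1)–(6) in the notation above. It is not the implementation used in this work; the dictionary layout and weight shapes are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(B_t, A_prev, C_prev, W, x):
    """One LSTM time step following Eqs. (1)-(6).

    B_t    : input vector at time t
    A_prev : previous output A_{t-1}
    C_prev : previous cell state C_{t-1}
    W, x   : dicts of gate weights / biases for 'f', 'i', 'C', 'o'
             (illustrative shapes, not taken from the paper)
    """
    z = np.concatenate([A_prev, B_t])         # [A_{t-1}, B_t]
    f_t = sigmoid(W['f'] @ z + x['f'])        # Eq. (1): forget gate
    i_t = sigmoid(W['i'] @ z + x['i'])        # Eq. (2): input gate
    C_tilde = np.tanh(W['C'] @ z + x['C'])    # Eq. (3): candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde        # Eq. (4): new cell state
    o_t = sigmoid(W['o'] @ z + x['o'])        # Eq. (5): output gate
    A_t = o_t * np.tanh(C_t)                  # Eq. (6): new output
    return A_t, C_t

# Example usage with random weights (hidden size 4, input size 3)
rng = np.random.default_rng(0)
H, D = 4, 3
W = {g: rng.standard_normal((H, H + D)) for g in 'fiCo'}
x = {g: np.zeros(H) for g in 'fiCo'}
A_t, C_t = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, x)
```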

2.4 Bidirectional Long Short-Term Memory (BLSTM)

One shortcoming of ordinary RNNs is that they can only use the previous context, whereas in speech recognition whole utterances are transcribed at once. Bidirectional RNNs (BRNNs) [21] address this by processing the data in two distinct hidden layers, whose outcomes are then fed to a common output layer. We take output[-1, :, :hidden_size] for the forward RNN \(\overrightarrow{A}\) and output[0, :, hidden_size:] for the reverse RNN \(\overleftarrow{A}\), combine them, and feed the result to the subsequent dense layer y [1]. BLSTM can handle the information in both the forward and backward directions with two individual hidden layers [16]. Long-range dependencies can be handled by the BLSTM along with the feedback for the next layer. Equations (7) and (8) carry the information in the forward and backward direction respectively. Equation (9) is the output layer, which receives the combined data from \(\overrightarrow{{A}_{t}}\) and \(\overleftarrow{{A}_{t}}\) as its input (Fig. 3) [6].

Fig. 3
figure 3

Bidirectional Recurrent Neural Network (BLSTM)

$$\overrightarrow{{A}_{t}}=H({W}_{B\overrightarrow{A}}{B}_{t}+{W}_{\overrightarrow{A}\overrightarrow{A}} {\overrightarrow{A}}_{t-1}+{x}_{\overrightarrow{A}})$$
(7)
$$\overleftarrow{{A}_{t}}=H({W}_{B\overleftarrow{A}}{B}_{t}+{W}_{\overleftarrow{A}\overleftarrow{A}} {\overleftarrow{A}}_{t-1}+{x}_{\overleftarrow{A}})$$
(8)
$${y}_t={W}_{\overrightarrow{A}y}{\overrightarrow{A}}_t+{W}_{\overleftarrow{A}y}{\overleftarrow{A}}_t+{x}_y$$
(9)

Combining a BRNN with LSTM gives the bidirectional LSTM [6], in which long-range context can be accessed in both directions, forward as well as backward.
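A rough sketch of Eqs. (7)–(9), with the recurrent cell abstracted as a generic step function, is shown below; it illustrates the data flow rather than the code used in this work.

```python
import numpy as np

def bidirectional_layer(B, step_fwd, step_bwd, W_fy, W_by, x_y):
    """Bidirectional recurrent layer following Eqs. (7)-(9).

    B        : input sequence, shape (T, input_dim)
    step_fwd : function (B_t, A_prev) -> A_t for the forward direction
    step_bwd : the same for the backward direction (separate parameters)
    W_fy, W_by, x_y : output projection weights and bias of Eq. (9)
    """
    T, H = len(B), W_fy.shape[1]
    A_fwd, A_bwd = np.zeros((T, H)), np.zeros((T, H))
    a = np.zeros(H)
    for t in range(T):                 # Eq. (7): forward pass
        a = step_fwd(B[t], a)
        A_fwd[t] = a
    a = np.zeros(H)
    for t in reversed(range(T)):       # Eq. (8): backward pass
        a = step_bwd(B[t], a)
        A_bwd[t] = a
    # Eq. (9): combine both directions at every time step
    return A_fwd @ W_fy.T + A_bwd @ W_by.T + x_y
```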

2.5 Gated Recurrent Unit (GRU)

Introduced by Cho et al. in 2014, the Gated Recurrent Unit (GRU) aims to solve the vanishing gradient problem that comes with a standard recurrent neural network. Like the LSTM unit, but without a separate memory cell, the GRU has gating units that regulate the stream of data inside the unit [4, 15] (Fig. 4) [15].

Fig. 4
figure 4

Gated Recurrent Unit (GRU)

Here the activation At of the memory cell at time t is a linear interpolation of the previous activation At−1 and the candidate activation A′t at time t; rt is the reset gate and zt is the update gate. The W terms denote weight matrices [5].

The update gate for time step t is computed using Eq. (10); it decides how much of the past information should be passed on to the future.

$${z}_{t}=\sigma ({\text{W}}^{\left(\text{z}\right)}{B}_{t}+{U}^{\left(z\right)}{A}_{t-1})$$
(10)

The reset gate decides how much of the past information should be discarded; Eq. (11) performs the reset gate operation.

$${r}_{t}=\sigma ({\text{W}}^{\left(\text{r}\right)}{B}_{t}+{U}^{\left(r\right)}{A}_{t-1})$$
(11)

A new memory cell is introduced to store the relevant information from the past, as shown in Eq. (12).

$${A^\prime}_{t}=tanh(\text{W}{B}_{t}+{r}_{t}\odot U{A}_{t-1})$$
(12)

Equation (13) shows the final memory of the current time step

$${A}_t={z}_t\odot {A}_{t-1}+\left(1-{\mathrm{z}}_{\mathrm{t}}\right)\odot A{\prime}_t$$
(13)
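The GRU update of Eqs. (10)–(13) can be sketched for one time step as follows; the weight shapes are illustrative assumptions and biases are omitted, as in the equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(B_t, A_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU time step following Eqs. (10)-(13)."""
    z_t = sigmoid(W_z @ B_t + U_z @ A_prev)          # Eq. (10): update gate
    r_t = sigmoid(W_r @ B_t + U_r @ A_prev)          # Eq. (11): reset gate
    A_cand = np.tanh(W @ B_t + r_t * (U @ A_prev))   # Eq. (12): candidate memory
    A_t = z_t * A_prev + (1.0 - z_t) * A_cand        # Eq. (13): final memory
    return A_t
```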

2.6 Single Gated Unit (SGU)

The Single Gated Unit (SGU) was proposed [25] to minimize the number of gates. The two gates of the GRU are reduced to a single gate in the SGU: the update gate of the GRU is shared with the reset gate, and the resulting single gate is computed with Eq. (14).

The forget gate is critical and its bias bf must be initialized to large values; the input gate is important, while the output gate is less significant, and GRU and LSTM show comparable performance [11]. The output and forget gates are essential, and many LSTM variants (mainly simplified ones) behave similarly to the LSTM [8]. Gated units work far better than simple units with no gates, and GRU and LSTM reach practically identical accuracy with a similar number of parameters [4] (Fig. 5) [25].

Fig. 5
figure 5

Single Gated Unit (SGU)

The single gate of the SGU, shown in Eq. (14), is computed in the same way as the GRU reset gate in Eq. (11); it couples the update and reset gates into a single forget gate.

$${f}_{t}^{j}={\left(\sigma ({U}_{f}{A}_{t-1}+{W}_{f}{B}_{t}+{x}_{f})\right)}^{j}$$
(14)

where the superscript j indicates the j-th element of the gate vector. Compared with the GRU, the activation update and the j-th element of the candidate activation become:

$${A}_{t}^{j}={\left((1-{f}_{t})\odot {A}_{t-1}+{f}_{t}\odot {\hat{A}}_{t}\right)}^{j}$$
(15)
$${\hat{A}}_{t}^{j}={\left(\text{tanh}(U({f}_{t}\odot {A}_{t-1})+W{B}_{t}+x)\right)}^{j}$$
(16)

This roughly doubles the number of adaptive parameters compared with a plain RNN. Compared with the GRU, however, the SGU has fewer parameters, so it can be trained faster [25].
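The SGU update of Eqs. (14)–(16) can be sketched for one time step as follows; the weight shapes are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgu_step(B_t, A_prev, U_f, W_f, x_f, U, W, x):
    """One Single Gated Unit time step following Eqs. (14)-(16).
    The single forget gate replaces both the GRU reset and update gates."""
    f_t = sigmoid(U_f @ A_prev + W_f @ B_t + x_f)      # Eq. (14): forget gate
    A_hat = np.tanh(U @ (f_t * A_prev) + W @ B_t + x)  # Eq. (16): candidate activation
    A_t = (1.0 - f_t) * A_prev + f_t * A_hat           # Eq. (15): new hidden state
    return A_t
```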

3 Proposed work: Deep Bidirectional Single Gated Unit (DBSGU)-RNN model description

A vital component of the recent success is the use of deep bidirectional networks that can build progressively higher-level representations of the acoustic data. Deep RNNs are formed by stacking RNN hidden layers on top of one another, with the input of each layer taken from the output of the previous layer, as shown in Fig. 6. In a bidirectional RNN, data are processed in both the forward and backward directions with two separate hidden layers, and the processed information is fed to the same output layer. The hidden vector sequences A^n are computed iteratively, with time t running from 1 to T and layer n from 1 to N, assuming the same hidden layer function is used in every layer of the architecture:

$${A}_{t}^{n}=H({W}_{{A}^{n-1}{A}^{n}}{A}_{t}^{n-1}+{W}_{{A}^{n}{A}^{n}}{A}_{t-1}^{n}+{x}_{A}^{n})$$
(17)

where A0 = B. The network outputs yt are

$${y}_t={W}_{A^Ny}{A}_t^N+{x}_y$$
(18)

The proposed model combines the deep bidirectional architecture with the SGU to form the DBSGU. Figure 7 shows the overall structure of the proposed framework.

Fig. 6
figure 6

BRNN

Fig. 7
figure 7

Deep Bidirectional Single Gated Unit

In the bidirectional architecture each hidden layer contains one forward SGU layer and one backward SGU layer. Since it is difficult to decide in advance whether forward or backward propagation will fit the data better, the model is designed to operate in both directions. In the bidirectional design not every layer depends only on its previous layer; each layer can transmit information to more than one layer. A sketch of such a stack is given below.
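The following minimal NumPy sketch, which is not the authors' implementation, stacks bidirectional layers built from the SGU step of Eqs. (14)–(16); the layer sizes and the scaled random initialization are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgu_step(B_t, A_prev, p):
    """One SGU step (Eqs. 14-16); p holds that direction's parameters."""
    f = sigmoid(p['U_f'] @ A_prev + p['W_f'] @ B_t + p['x_f'])
    A_hat = np.tanh(p['U'] @ (f * A_prev) + p['W'] @ B_t + p['x'])
    return (1.0 - f) * A_prev + f * A_hat

def bidirectional_sgu_layer(B, fwd, bwd):
    """Run one forward and one backward SGU pass and concatenate them."""
    T, H = len(B), fwd['U'].shape[0]
    A_f, A_b = np.zeros((T, H)), np.zeros((T, H))
    a = np.zeros(H)
    for t in range(T):                   # forward direction
        a = sgu_step(B[t], a, fwd)
        A_f[t] = a
    a = np.zeros(H)
    for t in reversed(range(T)):         # backward direction
        a = sgu_step(B[t], a, bwd)
        A_b[t] = a
    return np.concatenate([A_f, A_b], axis=1)   # context from both directions

def init_params(in_dim, hidden, rng):
    return {'U_f': 0.1 * rng.standard_normal((hidden, hidden)),
            'W_f': 0.1 * rng.standard_normal((hidden, in_dim)),
            'x_f': np.zeros(hidden),
            'U':   0.1 * rng.standard_normal((hidden, hidden)),
            'W':   0.1 * rng.standard_normal((hidden, in_dim)),
            'x':   np.zeros(hidden)}

# Stack several bidirectional SGU layers: each layer consumes the output of
# the previous one, as in Eq. (17). The sizes here are placeholders.
rng = np.random.default_rng(0)
out = rng.standard_normal((50, 39))      # 50 frames of 39-dim features
for H in [128, 256, 128]:
    fwd = init_params(out.shape[1], H, rng)
    bwd = init_params(out.shape[1], H, rng)
    out = bidirectional_sgu_layer(out, fwd, bwd)   # output dimension is 2*H
```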

The proposed SGU uses the minimum number of gates compared with other gated units. We set only one forget gate, which combines the functionality of both the reset and the update gate. This is denoted as

$${r}_t={f}_t,\forall t$$
(19)

\({f}_{t}\) indicates that only one gate, the forget gate, is used. In Eq. (19), \(f\) is used instead of \(z\) to denote the single gate, which is treated as the forget gate. In the proposed method we first generate the forget gate \({f}_{t}\), then compute the element-wise product of \({1-f}_{t}\) and \({A}_{t-1}\) and use it to form the new hidden state \({A}_{t}\), which is obtained by combining the forget gate with the candidate activation computed from the current input.

High performance is achieved with a gated RNN architecture, and in the overall evaluation the forget gate carries the most importance. By reducing the number of gates, accuracy is maintained while the complexity is reduced.

From Eqs. (14) to (16) and Fig. 5 it is clear that the SGU is much simpler than the LSTM and the Gated Recurrent Unit. Table 1 lists the number of parameters required by the LSTM, GRU and SGU. The SGU requires the fewest parameters, which makes it easier to process; with fewer parameters there are fewer factors to tune, which also helps to avoid the vanishing gradient problem.

Table 1 Set of parameters
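The gap in parameter counts can be checked with the standard per-cell formulas; the sketch below assumes the textbook counts of four, three and two weight blocks for LSTM, GRU and SGU respectively, not the exact figures reported in Table 1.

```python
def block_params(n_in, n_hidden):
    """Parameters of one gate / candidate block: W (n_hidden x n_in),
    U (n_hidden x n_hidden) and a bias vector."""
    return n_hidden * n_in + n_hidden * n_hidden + n_hidden

def count_params(n_in, n_hidden):
    return {'LSTM': 4 * block_params(n_in, n_hidden),  # input, forget, output, candidate
            'GRU':  3 * block_params(n_in, n_hidden),  # update, reset, candidate
            'SGU':  2 * block_params(n_in, n_hidden)}  # single forget gate, candidate

print(count_params(39, 256))
# {'LSTM': 303104, 'GRU': 227328, 'SGU': 151552}
```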

4 Experimental setup

4.1 Corpus details

Table 2 describes the properties of the dataset. Both male and female volunteers, aged between 21 and 35, were recorded to create the corpus; the prompt text was collected from Wikipedia. This crowd-sourced high-quality multi-speaker speech dataset contains speech corpora for several languages such as Tamil, Telugu, Malayalam, Gujarati and so on [9]. In the proposed work only the Tamil language is used for training and testing.

Table 2 Dataset details along with properties

Along with each .wav file, the dataset also contains the text transcript of the corresponding audio. It consists of 153 h of male and 7440 min of female training data, together with text transcriptions, for the Tamil, Telugu and Malayalam languages. The experiments were run on Windows 10 with an NVIDIA GTX 1650 GPU, and the entire work is implemented in Python 3.7.

4.2 Data preprocessing

Automatic Speech Recognition converts a raw audio file into a character sequence; the pre-processing stage converts the raw audio file into feature vectors over several frames. Each audio file is first split into 32 ms Hamming windows with an overlap of 12 ms; 20 static, 20 delta and 20 acceleration mel-frequency cepstral coefficients are then calculated, and an energy value is appended to each frame. The frequency range is set to 0–8000 Hz with 40 mel bands, and the delta and acceleration coefficients are calculated with a width of 9 frames. In other words, each audio file is split into frames using the Hamming window function, and each frame is mapped to a feature vector of length 39.
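A feature-extraction sketch along these lines could look as follows. librosa is our choice of library, not named in the paper, and since the text mentions both 20 coefficients per stream and a final 39-dimensional vector, the sketch assumes the common 13 static + delta + acceleration = 39 layout.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """MFCC + delta + acceleration features per frame (assumed 13 + 13 + 13 = 39)."""
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(0.032 * sr)                     # 32 ms Hamming window
    hop = win - int(0.012 * sr)               # 12 ms overlap -> 20 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop,
                                window='hamming', n_mels=40, fmin=0, fmax=8000)
    delta = librosa.feature.delta(mfcc, width=9)
    accel = librosa.feature.delta(mfcc, order=2, width=9)
    feats = np.vstack([mfcc, delta, accel]).T  # shape: (frames, 39)
    # Per-dimension zero mean / unit variance, as described in Section 4.3
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```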

4.3 Parameter settings

Our proposed model is a 4-layer Deep Bidirectional Single Gated Unit (DBSGU) network [128 256 512 256 128] containing 320 cells, with the layers arranged in a sequence projection. Each hidden layer of the bidirectional SGU incorporates one forward SGU layer and one backward SGU layer. Before training, the samples are standardized to zero mean and unit variance in each dimension. The weights are initialized with a uniform distribution and trained with statistically analysed patterns. We used a learning rate of 0.0005 and a per-sample gradient clipping value of 0.0003. Early stopping on the validation set is used to select the best model. The most probable character sequence is produced by the model in a greedy manner; the final output sequence is then obtained by removing any blank symbols and repeated characters from the output and substituting each capitalized letter with a space followed by its lowercase counterpart. The output layer is split into two softmax layers, and the hidden-layer activation is the rectifier non-linearity. Adam is used to optimize the cross-entropy error [13]; training runs for 24 epochs, and dropout is used for regularization. Instead of regular standard dropout, recurrent dropout is used so that long-term dependencies can still be learned. The sentences used to train the model are arranged in sorted order, and training starts from the shortest sentence; this sentence-sorting approach minimizes zero padding.
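The sentence-sorting step could be implemented along the following lines; this is a hypothetical helper rather than the authors' code, and the data layout is assumed.

```python
import numpy as np

def sorted_batches(features, transcripts, batch_size=128):
    """Yield (padded_features, transcripts) batches ordered by utterance length.

    features    : list of (frames_i, 39) arrays
    transcripts : list of target character sequences
    Sorting by length keeps utterances of similar duration together,
    which minimizes the zero padding added to each batch.
    """
    order = sorted(range(len(features)), key=lambda i: len(features[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(features[i]) for i in idx)
        batch = np.zeros((len(idx), max_len, features[idx[0]].shape[1]))
        for row, i in enumerate(idx):
            batch[row, :len(features[i])] = features[i]   # zero pad the tail
        yield batch, [transcripts[i] for i in idx]
```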

4.4 Tools and performance measures

The Kaldi toolkit is used for speech recognition [19]. The LibriSpeech recipe was used for all experiments, including the extraction of audio features, training and decoding [17]. The SRILM toolkit is used for language modelling [22]. ASR performance is evaluated using the Word Error Rate (WER) as the metric.

5 Experimental results

The experiments were conducted on the CSMS dataset. Our main objective is to evaluate the quality of the hybrid DBSGU-RNN for large vocabulary continuous speech recognition, and specifically to compare the approach with the already available DNN system and with DBLSTM. Experiments were carried out for DBLSTM-, DBGRU- and DBSGU-based frameworks.

The accuracy is computed using Eq. (20); a high accuracy value indicates better speech recognition performance.

$$Accuracy\left(\%\right)=\frac{No.\ of\ words\ correctly\ recognized}{Total\ No.\ of\ words} \times 100$$
(20)
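Equation (20) and the WER reported later can be computed with short helpers such as the ones below; the WER uses the standard word-level edit distance, and the accuracy helper assumes position-aligned word sequences, both of which are our assumptions about how the metrics were computed.

```python
def accuracy(reference, hypothesis):
    """Eq. (20): percentage of reference words recognized correctly
    (assumes the two word sequences are position-aligned)."""
    ref, hyp = reference.split(), hypothesis.split()
    correct = sum(r == h for r, h in zip(ref, hyp))
    return 100.0 * correct / len(ref)

def wer(reference, hypothesis):
    """Word Error Rate via Levenshtein distance over words, in %."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```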

The Phoneme Error Rate (PER), Frame Error Rate (FER) and Cross Entropy error rate (CE) are reported in Table 4 for the DBSGU and DBLSTM frameworks. For the DBLSTM we fixed 4 bidirectional layers with 400 tanh units each, giving it about the same number of weights as the DBSGU system.

From Table 4, it is clear that the proposed DBSGU performs well compared with DBLSTM. The SGU is a form of GRU without the reset gate, and the performance of the proposed DBSGU is comparable to that of DBGRU. Removing the reset gate addresses the recognition of long-dependency speech signals in the most crucial way. The learning speed of the model is also increased by around 30%: training the DBGRU model takes nearly 42 min per epoch, whereas the model without the reset gate learns the features within 24 min per epoch (Table 4).

Table 3 Details of the training and testing data
Table 4 Hybrid training results in %PER, %FER and %CE on the Tamil language

As shown in Fig. 8, our proposed methodology outperforms the DBLSTM technique (the baseline) with and without dynamic features.

Fig. 8
figure 8

(a) Phoneme error rate (Tamil). (b) Frame error rate (Tamil)

Stochastic gradient descent was used to train the DBLSTM; initially we fixed the learning rate at 0.1 and the momentum at 0.9. The proposed DBSGU system performs well compared with DBGRU and DBLSTM. We used the LibriSpeech recipe in our methodology. Our findings are shown in Table 6: the Word Error Rate (WER) is considerably decreased compared with DBLSTM and DBGRU (Table 5).

Table 5 Epoch with average accuracy

From Table 5 above, it is clear that the highest accuracy achieved is 88.40%, when the DBSGU model has 4 layers and 256 cells with a learning rate of 0.0005 after 810 epochs. From Table 5 we also conclude that as the number of layers increases, the average accuracy increases consistently. We also tried increasing and decreasing the learning rate, but the average accuracy decreased in both scenarios (Fig. 9).

Fig. 9
figure 9

DBSGU based WER for Tamil languages

In Table 6, the WER of the proposed model is compared with DBLSTM and DBGRU. Accuracy is measured as the number of correctly identified words out of the total number of words present in the speech signal. The proposed model runs more epochs than DBGRU for the same amount of training time. For the experiments, various learning rates were used, namely \(10^{-3}\), \(10^{-4}\) and \(10^{-5}\); among these, \(10^{-3}\) performed best with increased accuracy.

Table 6 WER in % for Tamil languages

5.1 Performance

DBSGU requires fewer parameters for training than DBGRU and DBLSTM, so it needs less memory and its training speed is increased. Figure 10 shows the comparison between the proposed DBSGU and both DBLSTM and DBGRU. The time taken to train the model is measured for different batch sizes and input sizes; the performance of the proposed model is best with a batch size of 128 and an input size of 256.

Fig. 10
figure 10

(a)–(d) Time taken to train the DBSGU model for different hidden sizes

6 Conclusions

In this paper we have implemented the DBSGU, a hybrid of a deep bidirectional network with the SGU, on the CSMS dataset for word prediction from a given audio speech signal. We compared the accuracy of our DBSGU model with DBGRU and DBLSTM. The results show that DBSGU reaches a remarkably faster training speed than the standard DBLSTM and achieves better performance, and the Word Error Rate (WER) for the Tamil language is also decreased considerably. The proposed model is similar to DBGRU with the reset gate removed, which increases the learning speed during the training phase by around 30% compared with DBLSTM. The performance of the proposed system is similar to that of DBGRU, and the model maintains its accuracy even after the removal of the reset gate, with the smallest number of parameters. In future work, the parameters can be tuned further to reduce the training time of the model while achieving better accuracy.