1 Introduction

People commonly interact through hearing and vision. In day-to-day life, speech recognition is a popular and effective method for understanding a person's emotions and expressions, and modern systems comprehend spoken language with remarkable precision. Previous research has concentrated primarily on the auditory mode of communication because of its outstanding ability to recognize spoken words. Unfortunately, purely acoustic techniques are of little help to deaf and mute individuals, and they also fail ordinary users when the acoustic data is corrupted, which is especially common in noisy environments or when the acoustic signal is unavailable. Audio speech recognition struggles to transcribe spoken language accurately because of background noise, variations in speech patterns, speaker accents, and homophones that cause different words to be transcribed identically. To address these challenges, researchers have explored combining audio features with visual features to improve the overall accuracy of speech recognition systems [1, 2]. Despite its numerous advantages, audio-visual speech recognition faces challenges that can reduce accuracy, including operation in noisy environments, speaker variability, computational complexity, and limited availability of training data. Ensuring proper synchronization between the audio and visual modalities is also crucial for reliable speech recognition.

To overcome these challenges, continuous research and advancement in AVSR algorithms, data collection techniques, noise robustness, speaker adaptation, and synchronization methods is imperative. Researchers have therefore moved towards the VSR technique, also known as lip-reading technology, which identifies spoken content by analyzing the lip-movement characteristics of the speaker without relying on audio signals [3, 4]. Lip reading can identify a speaker's speech in noisy environments, even when no audio signal is available. Lip movements in visual speech recognition systems have proven effective in mitigating background noise and aiding individuals with auditory impairments [5]. This promising area of research can strengthen the precision and stability of automatic speech recognition systems. VSR is a fascinating study area in computer vision and image processing that has gained recent attention. The goal of VSR systems is to use visual information from lip movements to recognize speech content. In VSR, the data are videos that are converted into frames for further processing, and each frame carries rich information about the variability of visual cues across different speakers. VSR has emerged as an essential research topic with potential uses in areas such as speech recognition [6, 7], facial biometrics [8, 9], healthcare [10, 11], and security [12]. To enhance the accuracy of VSR, extracting spatio-temporal characteristics from video recordings is essential. In a broader sense, spatio-temporal pertains to phenomena in which data are gathered simultaneously in space and time; in VSR, these video features are indispensable for accomplishing the task. The general architecture of a VSR system is given in Fig. 1.

Fig. 1 General architecture of VSR system

The study conducted by [13] illuminates the pivotal role played by the morphology of the mouth and lips in the processing of speech. Particularly in scenarios where speech may be compromised by environmental noise or other factors, lip reading emerges as a valuable adjunct that augments our comprehension of spoken language. In this work, we introduce a deep learning based architecture for VSR that leverages spatio-temporal features and uses a Haar cascade [14] for face and lip localization and detection. We optimize a Three-Dimensional Convolutional Neural Network (3D-CNN) to extract essential spatio-temporal features from video data and use a Bidirectional Long Short-Term Memory (BiLSTM) network [15] to process the input frame sequence in both directions, capturing dependencies between past and future elements for accurate speech recognition. The proposed framework analyzes the visual characteristics of the user's lips in the video to transcribe the spoken speech accurately: spatial features encompass the shape and position of the lips, while temporal features capture details such as the speed and direction of lip movements. This work also employs data shuffling during training to mitigate bias and promote model generalization. In addition, model training incorporates a variable Learning Rate Scheduler (LRS), presented in Algorithm 3, to optimize the learning process effectively. Learning rate tuning is essential for striking the proper balance between convergence speed and avoiding overshooting the optimal parameter values; it affects how rapidly the model learns and adapts during training, so determining a good learning rate is frequently a vital stage of hyper-parameter tuning. Furthermore, we employ the Connectionist Temporal Classification (CTC) loss function [16] to compute the training loss, followed by a CTC decoder that performs the final speech-to-text conversion.

Researchers have previously used Bidirectional Gated Recurrent Units (BiGRU) [17] to process data forward and backward, capturing bidirectional context and dependencies in sequential data. BiGRU is computationally lighter and better suited to tasks with moderate dependencies and shorter sequences, whereas BiLSTM has an advantage in capturing complex dependencies and offers greater memory capacity through its multiple gates, potentially performing better on large datasets. In terms of gating mechanisms, the Long Short-Term Memory (LSTM) [18] units used in BiLSTM have three gates (input, forget, and output), which allow fine-grained control over the information flow through the cell state and hidden state. This intricate gating enables LSTM to capture and retain longer-term dependencies in sequences, making it better suited to complex sequence modeling. GRU units used in BiGRU, by contrast, have two gates (update and reset), which lead to slightly fewer parameters and potentially simpler training dynamics. We monitored the LRS to observe its impact on the model output and, after careful analysis, concluded that it significantly affects training performance. Consequently, we propose an optimized LRS to adjust the learning rate more effectively, aiming for improved model adaptation. The significant contributions of this paper are as follows:

  1. This paper reviews state-of-the-art approaches and datasets for character, digit, word, and sentence-level VSR.

  2. The proposed deep learning model exhibits versatility, making it suitable for character, word, sentence, and digit datasets that rely only on visual information.

  3. We have optimized the 3D-CNN architecture and implemented a dynamic Learning Rate Scheduler (LRS) to adaptively regulate the learning rate throughout model training.

  4. We showcase the enhanced accuracy and effectiveness of the proposed model compared to existing deep learning frameworks used to implement VSR systems.

This article is structured as follows. Section 2 discusses relevant work on various techniques and datasets. Section 3 describes the proposed methodology in multiple phases, from lip dataset development to model architecture, experimental results, and evaluation details, followed by a comparison of the proposed methodology with state-of-the-art methods. Section 4 includes the conclusion, future endeavors, and acknowledgment.

2 Related work

The major problem in VSR systems is the visual ambiguity that arises when words or characters are homophones, because they generate identical lip movements (e.g., ‘m’, ‘b’, ‘p’, ‘pat’, ‘mat’, ‘bat’). The phoneme is the smallest standard unit capable of distinguishing the meaning of a word in speech processing; similarly, the viseme is the standard unit many researchers use for analyzing visual information in the video domain. Our research primarily focuses on word-level performance, even though we utilize a sentence-level dataset for VSR. We have identified several existing datasets suitable for VSR tasks: LRW and LRS2-BBC [19], LRW-1000 [20], AVLetters [21], AVLetters2 [22], LRS3-TED [23], MIRACLE-VC1 [24], LIPAR [25], AV Digits [26], OuluVS2 [27], and GRID [28].

A deep learning methodology has been proposed in [29] for recognizing complete words. This study trained an LSTM network with discrete cosine transform and deep bottleneck features for word recognition. Similarly, [30] employed an LSTM with the Visual Geometry Group Network (VGGNet) to recognize complete words. Distinguishing different words or characters with similar phonemes poses a challenge in lipreading due to the similarity in the associated lip movements. The model aims to overcome phoneme-similarity challenges by training a recurrent neural network on spatio-temporal feature patterns, resulting in state-of-the-art performance on the challenging lip-reading task. Similarly, [31] introduced a lip-reading model incorporating a multi-grained spatio-temporal approach to capture the distinctions between words and the individual speaking styles of different speakers. The model was trained for Word Recognition Rate (WRR) using Dense-Net3D and BiLSTM. A more complex model than LipNet [3] has been introduced in [32], which uses residual networks with 3D convolutions to extract more robust features.

Existing datasets contain recordings with a limited number of participants and a limited lexicon, which impedes progress in the field of VSR. To tackle the challenge of a small lexicon, [19] contributed the LRW and LRS2-BBC datasets, which encompass a larger vocabulary. Their approach introduced the Watch, Listen, Attend, and Spell (WLAS) model, which effectively converts mouth movements into words. The WLAS model transcribes spoken sentences into characters and can handle input solely from the visual stream; it achieved a WRR of 76.20% on the LRW dataset. Likewise, [20] provided a comprehensive and naturally distributed benchmark dataset, LRW-1000, for the lip-reading task. A non-autoregressive lipreading model was proposed by [33] for fast lip reading and target-word generation; it is designed to recognize silent source videos and generate all target text tokens simultaneously.

Wang et al. [11] created HMM-based lip-reading software for people with speech impairments. The software employs a pre-trained VGGNet model with a user-friendly graphical user interface; however, its disadvantage is a lower testing accuracy. Similarly, [34] focuses on appearance-based visual features for people with learning disabilities, and the proposed system obtains a visual speech accuracy of 76.60% on the test set. In 2022, Huang et al. [35] proposed a lip-reading model that utilizes a pre-trained neural network for feature extraction and processes the features through a transformer network; the overall assessment was based on word-level accuracy, achieving 45.81%. Vayadande et al. [48] recognized the significance of lip reading for individuals with hearing impairments and developed an advanced deep learning model, LipReadNet. This model combines a 3D-CNN with LSTM networks and demonstrated substantial efficacy, achieving a 93% WRR on the GRID corpus. Time complexity is essential when analyzing model performance; therefore, a lightweight deep learning model based on isolated word-level recognition was introduced in [36]. This system investigates efficient models for VSR, achieving accuracy similar to available models while reducing computational cost by a factor of eight. Recognizing the slow training pace of conventional lipreading models, He et al. [37] proposed a new approach called the batch group training strategy. Their architecture combines 3D-CNN, MouthNet, and Bi-LSTM networks with a CTC loss function, resulting in 93.8% accuracy on the GRID corpus. The performance of existing models on the GRID dataset and on other datasets for VSR is provided in Tables 1 and 2, respectively. Lip reading poses significant challenges due to the similar lip movements observed across various consonant sounds.

Table 1 VSR comparison on different existing dataset and model
Table 2 VSR comparison on GRID dataset

In response to the inherent complexities of the VSR task, Rastogi et al. [45] proposed a deep learning approach involving a sequential model. The model incorporates the context of preceding and subsequent words to better interpret lip movements. To address the limitations stemming from the assumption of conditional independence in the CTC framework, they introduced a hybrid CTC/Attention model, effectively integrating the strengths of both approaches. Typical seq2seq models face two main challenges: exposure bias and a mismatch between the optimization target and the evaluation metric. To tackle these issues, [46] introduced a GRU-based deep learning model known as the pseudo-convolutional policy gradient. This model applies a pseudo-convolutional operation on the reward and loss dimensions, allowing it to consider more context around each time step and producing a more robust reward and loss for the overall optimization process. To enhance lip-reading accuracy, [41] introduced a neural-network-based model that relies only on visual information and operates without a lexicon; the system effectively reads sentences with a diverse vocabulary, even including words absent from its training data. A 3D-CNN architecture has been customized by [25] for VSR, extracting spatio-temporal features to recognize words. A thirteen-layer CNN integrated with batch normalization has been proposed in [49] for lip reading; however, the model exhibited lower testing accuracy. Sarhan et al. [47] proposed a hybrid lip-reading model that uses encoding at the front end and decoding at the back end: the front end contains inception, gradient preservation, and a BiGRU layer, while the back end has an attention layer, a fully connected layer, and a CTC layer. The model achieved a 1.4% CER and a 3.3% WER for overlapping speakers. LCSNet [50], an architecture designed by Xue et al., utilizes a channel attention mechanism to capture essential lip-movement features, which are subsequently fed into a BiGRU to acquire long-term spatio-temporal features.

The performance of the VSR task is also degraded by pose variations, lighting, and speaking speed. To address these challenges, Speaker Adaptive Training (SAT) is used in [51] to train Deep Neural Networks, allowing them to recognize words efficiently regardless of the speaker's viewing angle. Similarly, [39] employed two constraints to capture lip movements: local mutual information maximization for fine-grained lip-movement detection and global mutual information maximization for essential-frame identification. Three categories of speech data (normal, whispered, and silent) were utilized by [26] to facilitate precise transcription of spoken words. Their methodology trained an LSTM classifier on extracted DCT features, and the model demonstrated a commendable 68% WRR on the AVDigits dataset. A comprehensive analysis of three pivotal works for VSR, proposed by [1, 43, 44], is given in Table 2; these approaches trained LSTM classifiers with feed-forward networks and obtained a maximum of 84.70% WRR on the GRID dataset.

3 Proposed methodology

This section describes the architecture of the proposed approach for the GRID dataset. The approach begins with a detailed dataset description, followed by video frame extraction and conversion. The next steps extract Regions of Interest (ROI) and normalize them. The normalized frames are fed to the model, which extracts features and uses them for frame-sequence prediction. The final stage is model evaluation, whose primary goal is to rigorously assess the proposed framework's performance, particularly in terms of WER and WRR.

3.1 Dataset, image pre-processing and feature concatenation

In the following subsections, we cover the dataset description, the image pre-processing techniques employed to localize the subject's face and lip regions and crop the lip areas, and the further image processing applied to the cropped lip region.

3.1.1 Dataset description

Our study utilizes the publicly available GRID dataset [52], which is widely recognized and extensively employed in speech recognition. The dataset consists of .mpg video files and corresponding .align files that contain a series of time intervals paired with related labels. These files describe a sequence of events, each with its start and end times, representing the duration of the specific event.
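As a concrete illustration, the sketch below parses one .align file into (start, end, label) triples. The whitespace-separated column layout and the ’sil’ silence marker are assumptions based on the standard GRID distribution and may need adjusting for other copies of the corpus.

```python
def load_alignment(path):
    """Parse a GRID .align file, assuming each line reads '<start> <end> <label>'
    and that 'sil' marks silence at the utterance boundaries."""
    entries = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip malformed or empty lines
            start, end, label = parts
            entries.append((int(start), int(end), label))
    return entries

# Example: keep only the spoken words, dropping the silence markers
# words = [w for _, _, w in load_alignment("s1/bbaf2n.align") if w != "sil"]
```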

Fig. 2 Subjects in GRID dataset

The dataset offers a comprehensive collection of videos and alignments of 34 speakers (s1 to s34), each delivering 1000 sentences. These sentences follow a fixed structure, consisting of a command (4 options), a colour (4 options), a preposition (4 options), a letter (25 options), a digit (10 options), and an adverb (4 options). There are 51 exclusive words represented in the dataset, and the random word alternatives avoid dependency on contextual cues for classification. The chosen dataset offers valuable temporal insights with a comprehensive collection of diverse recordings. It includes speakers from various backgrounds, ethnicities, and age groups, making it suitable for training and evaluating VSR models that can handle variations in speech and visual cues. Significantly, our research on this dataset is centered solely on the video content without audio; many earlier researchers have likewise chosen to use only the video portion of this dataset for the VSR task (refer to Table 2). Each sentence within the corpus lasts 3 s, captured at 25 frames per second. As a result, the data for each speaker encompass 3000 s (50 min) in total. In this paper, we have utilized 7200 videos, a subset of the corpus comprising 30 subjects with 240 videos each. The subjects available in the dataset are shown in Fig. 2.

Fig. 3 a Original frame, b grayscale frame, c face localization, d lip localization, e lip extraction

3.1.2 Face localization and cropping region of interest (ROI)

The first stage of this pipeline takes a video in which the subject pronounces a sentence from the provided corpus. We divide the video into separate frames, convert them into grayscale, and treat each frame as a sub-input for the pipeline. Converting a color frame to grayscale is important because it reduces the dimensionality of the input vector and, consequently, the number of parameters to train. Algorithm 1 outlines the entire process from face localization to ROI cropping, while Fig. 3 provides a visual representation of this procedure: Fig. 3a is the original color frame, (b) the converted grayscale frame, (c) and (d) the face localization and lip detection, and (e) the final lip-region (ROI) extraction for lip data preparation. In this process, the ROI is the lip region, which becomes the model input for tracking lip movements for word recognition.
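A minimal sketch of this stage is shown below, assuming OpenCV's bundled frontal-face Haar cascade and a separately supplied mouth cascade (the mouth-cascade file name and the detection thresholds are illustrative, not taken from the original implementation); the ROI is resized to 140×81 pixels to match the frame dimensions used by the model.

```python
import cv2

# The frontal-face cascade ships with OpenCV; the mouth cascade below is a
# hypothetical local file and must be supplied separately.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")

def extract_lip_roi(frame, roi_size=(140, 81)):
    """Grayscale conversion -> face localization -> mouth detection -> cropped ROI."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Search for the mouth only in the lower half of the detected face
    lower_face = gray[y + h // 2 : y + h, x : x + w]
    mouths = mouth_cascade.detectMultiScale(lower_face, scaleFactor=1.1, minNeighbors=11)
    if len(mouths) == 0:
        return None
    mx, my, mw, mh = mouths[0]
    lip = lower_face[my : my + mh, mx : mx + mw]
    return cv2.resize(lip, roi_size)  # cv2.resize expects (width, height)
```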

3.1.3 Extraction and normalization of ROI

As described in the previous section, the image processing technique depicted in Fig. 3 is utilized to precisely crop the ROI of the subject.

Algorithm 1 Video Pre-processing and Extraction of ROI

The subsequent part of this step is ROI normalization. This stage of the pipeline applies image Contrast Normalization (CN), during which the pixel values of the converted image are adjusted to lie within the range of [0, 128] intensity values. This normalizes image contrast and intensity, rendering the frames more uniform and better suited for later processing or analysis. The CN method is applied to mitigate the contrast disparity between light and dark pixels in an image (refer to Algorithm 2).

Algorithm 2 Contrast Normalization (CN)

In Algorithm 2, \(\mu \) represents a parameter that controls the strength of contrast normalization, sf is a scaling factor, and \(\epsilon \) (epsilon) is a small positive constant used to prevent division by zero and to ensure numerical stability.
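Because the individual steps of Algorithm 2 are not reproduced in the text, the sketch below is only one plausible reading of contrast normalization with these parameters: the frame is centred, divided by a \(\mu \)-controlled contrast estimate stabilized by \(\epsilon \), and rescaled by sf into the [0, 128] intensity range.

```python
import numpy as np

def contrast_normalize(frame, mu=1.0, sf=128.0, eps=1e-8):
    """Illustrative contrast normalization: centre the frame, scale by its
    contrast, and map the result into the [0, sf] intensity range."""
    frame = frame.astype(np.float32)
    centred = frame - frame.mean()
    contrast = np.sqrt(mu + (centred ** 2).mean())   # mu controls normalization strength
    normalized = centred / (contrast + eps)          # eps prevents division by zero
    normalized = (normalized - normalized.min()) / (normalized.ptp() + eps)
    return normalized * sf                           # rescale into [0, 128] when sf = 128
```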

3.1.4 ROI concatenation

The last stage of the image pre-processing pipeline combines the normalized ROI frames to prepare the feature matrix. Because the number of frames captured for each sentence varies, the number of features obtained per sentence also varies, leaving the feature vector uneven and creating imbalanced data features that pose challenges when training the model. To address this issue, we concatenate silent frames denoted as ’sil’. A silent frame is a video frame captured when the subject remains silent and does not articulate any word from the sentence. In the alignment of a video, ’sil’ is added at the beginning and end of the alignment. The alignment of a video file with the spoken words “bin blue at f two now” is illustrated in Table 3.

Fig. 4 Concatenated ROI frames including silent frames

Table 3 Alignment of a video

Let max be the maximum number of frames utilized as input to the 3D-CNN. The maximum frame count (max) is set to 75, and the specially designated augmented (silent) frame ’sil’ is represented by b. For a person p with n frames, we define the qth frame as \(f_{pq}\). The concatenated frames \(F_c\) of one person are defined by Eq. 1, and the visual representation of \(F_c\) is shown in Fig. 4.

$$\begin{aligned} F_{c} = \left( \sum _{q=1}^{n} f_{pq} + \sum _{k=1}^{75-\text {n}} b_k \right) \end{aligned}$$
(1)
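In code, the '+' in Eq. 1 corresponds to concatenation: the n ROI frames of an utterance are followed by \(75-n\) copies of the silent frame b, so every sample reaches the fixed length of 75. The sketch below assumes the silent frame is a pre-cropped ROI captured while the speaker is not articulating.

```python
import numpy as np

MAX_FRAMES = 75  # max in Eq. 1

def pad_with_silence(frames, silent_frame):
    """Concatenate the n ROI frames with (75 - n) silent frames, as in Eq. 1."""
    frames = list(frames)[:MAX_FRAMES]
    padding = [silent_frame] * (MAX_FRAMES - len(frames))
    return np.stack(frames + padding, axis=0)  # shape: (75, H, W)
```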

3.2 Model architecture

This section explores the components of the proposed model architecture given in Fig. 5. First, feature extraction is performed by an optimized 3D-CNN, which effectively captures spatial and temporal information. Activation functions are pivotal, introducing non-linearity to enhance the model's capacity to learn intricate patterns, while pooling layers reduce dimensionality and preserve essential information. BiLSTM layers are then employed to enable bidirectional context processing, and dropout layers are integrated to mitigate over-fitting. A dense layer with softmax activation is incorporated for class prediction, and an appropriate loss function is selected to quantify the model's performance effectively. Finally, a learning rate scheduler dynamically adjusts the learning rate during training, optimizing the model's convergence and performance.

Fig. 5 Proposed model architecture
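The sketch below assembles the described pipeline in Keras. The number of output classes, the use of same-padding, and folding ReLU into the Conv3D layers are assumptions or simplifications for illustration rather than details stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(75, 81, 140, 1), num_classes=41):
    """3x (Conv3D + ReLU + MaxPool3D) -> TimeDistributed(Flatten)
    -> 2x (BiLSTM + Dropout) -> Dense softmax, as described in Sect. 3.2."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (128, 256, 75):            # three 3D-convolution blocks
        model.add(layers.Conv3D(filters, kernel_size=3, padding='same', activation='relu'))
        model.add(layers.MaxPool3D(pool_size=(1, 2, 2)))
    model.add(layers.TimeDistributed(layers.Flatten()))
    for _ in range(2):                        # two BiLSTM blocks with dropout 0.5
        model.add(layers.Bidirectional(layers.LSTM(
            128, kernel_initializer='orthogonal', return_sequences=True)))
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation='softmax'))  # per-frame class probabilities
    return model
```

Here num_classes would be the size of the label set plus the CTC blank; the exact value depends on the vocabulary used.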

3.2.1 2D CNN

Two-dimensional convolution (2D-CNN) is employed at the convolutional layers to extract features from the immediate neighborhood of the feature maps in the preceding layer. In 2D-CNN, the convolution layers primarily extract spatial features from the input data. The output of a neuron located at coordinates (p, q) in the \(n^{th}\) feature map of the \(m^{th}\) layer, denoted \(S_{m,n}^{p,q}\), is given by:

$$\begin{aligned} S_{m,n}^{p,q} = ReLU\left( \sum _{g}\sum _{h=0}^{H-1}\sum _{w=0}^{W-1} K_{m,n,g}^{h,w} S_{(m-1)g}^{(p+h)(q+w)}\right) \end{aligned}$$
(2)

where ReLU is the activation function, \(S_{(m-1)g}^{(p+h)(q+w)}\) signifies the value of the unit at position \((p+h, q+w)\) in the gth feature map of the previous \((m-1)th\) layer, H and W are the kernel's height and width, and g indexes the set of feature maps in the previous \((m-1)th\) layer connected to the current feature map. \(K_{m,n,g}^{h,w}\) is the kernel weight value at position (h, w). In this analysis, bias terms are omitted.

3.2.2 3D CNN

In 2D-CNN, convolution focuses only on capturing spatial features; to capture motion information from video, it is desirable to use 3D-CNN, which captures both spatial and temporal feature maps. The proposed model has three 3D-convolution layers with 128, 256, and 75 filters of size \(3\times 3\times 3\), expecting input data shaped as a sequence of 3D frames with dimensions \(75\times 81\times 140\), i.e., the number of frames per video and the frame size. The feature maps within a convolution layer establish connections with several adjacent frames from the preceding layer, effectively capturing motion details. Given a 3D-CNN with multiple layers, the value at a specific position (p, q, r) in the \(n^{th}\) feature map of the \(m^{th}\) layer is computed as:

$$\begin{aligned} S_{m,n}^{p,q,r} = ReLU\left( \sum _{g}\sum _{h=0}^{H-1}\sum _{w=0}^{W-1}\sum _{t=0}^{T-1} K_{m,n,g}^{h,w,t} S_{(m-1)g}^{(p+h),(q+w),(r+t)}\right) \end{aligned}$$
(3)

where \(S_{(m-1)g}^{(p+h),(q+w),(r+t)}\) denotes the value of the unit at position \((p+h, q+w, r+t)\) in the gth feature map of the preceding \((m-1)^{th}\) layer, H and W are the height and width of the kernel, and T represents the extent of the 3D kernel along the temporal axis. g indexes the set of feature maps in the \((m-1)^{th}\) layer connected to the current feature map, and \(K_{m,n,g}^{h,w,t}\) is the kernel weight value at position (h, w, t). A ReLU activation function processes the output of the CNN before passing it to subsequent layers, introducing non-linearity so that the model can learn complex patterns and relationships in the data: it replaces all negative values in the CNN output with zeros and leaves positive values unchanged.

3.3 ReLU

A good activation function improves CNN performance significantly. The proposed architecture adds a ReLU layer after each of the three convolutional layers. ReLU is a widely recognized and frequently used activation function in neural networks [53] and is a notable choice due to its non-saturating nature. In this work, Fig. 6 is adapted from [54] to represent the mathematical form of ReLU. The definition of the ReLU activation function is given in Eq. 4.

Fig. 6 ReLU [54]

$$\begin{aligned} a_{m,n,k} = \max (S_{u,v,k}, 0) \end{aligned}$$
(4)

where \(S_{u, v, k}\) represents the input to the activation function at index (u, v) in the \(k^{th}\) channel. ReLU is a piecewise linear function that zeroes out negative values while keeping positive values unchanged. The max operation employed by ReLU enables faster computation than sigmoid or tanh activation functions. Additionally, it encourages sparsity within the hidden units and facilitates the network's acquisition of sparse representations. Deep networks can be trained efficiently with ReLU even without any pre-training [55]. Although the discontinuity of the ReLU derivative at 0 may degrade performance, prior research has indicated its empirical superiority over sigmoid and tanh activation functions [56]. The combination of ReLU activation and max pooling helps CNNs learn complex features, reduces computational complexity, and improves the network's ability to recognize patterns regardless of their exact location in the input data.

3.4 Max pooling

In a CNN, pooling reduces computational complexity by minimizing connections between convolution layers. The proposed model uses three 3D max-pooling layers with a pool size of (1, 2, 2) to reduce the feature maps. This layer performs down-sampling, reducing the spatial dimensions of the data while preserving important features. The pooling procedure slides a three-dimensional filter over each channel of the feature map and aggregates the features within the region covered by the filter. For a feature map of dimensions \(n_{hi} \times n_{wi} \times n_{ch}\), the resulting dimensions after applying a max pooling layer can be calculated as follows:

$$\begin{aligned}{}[ \left( \frac{{n_{hi} - a + 1}}{b}\right) \times \left( \frac{{n_{wi} - a + 1}}{b}\right) \times n_{ch} ] \end{aligned}$$
(5)

where \(n_{hi}\), \(n_{wi}\), and \(n_{ch}\) are the height, width, and number of channels of the feature map, and a and b are the filter size and stride. Here, \(n_{ch}\) is one because the analysis is conducted on grayscale frames. A TimeDistributed flattening layer is added before the down-sampled features are passed to the BiLSTM: the TimeDistributed wrapper applies the Flatten layer to each temporal slice of the input tensor, independently flattening the features at each timestep so that the model can capture temporal patterns. This layer is beneficial when dealing with sequential data.
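As a quick shape check consistent with the input dimensions given above (the 128 channels are assumed from the first convolution block), one MaxPool3D stage with pool size (1, 2, 2) preserves the 75-frame temporal axis while halving the spatial dimensions:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.zeros((1, 75, 81, 140, 128))            # (batch, frames, height, width, channels)
y = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
print(y.shape)                                  # (1, 75, 40, 70, 128)
```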

3.5 BiLSTM

The LSTM architecture is based on three gates and a cell, which help to store, forget, and retain information during data processing. The proposed architecture contains two BiLSTM layers for better context embedding; each BiLSTM layer contains 128 memory units, uses an orthogonal weight initializer, and sets the return-sequences parameter to true. To elaborate, the input sequence vector X can be represented as \((x_1, x_2, \dots , x_n)\), where n is the length of the input sentence. The LSTM structure is made up of three distinct gates: an input gate, an output gate, and a forget gate. These gates are essential components for controlling the flow of information within the LSTM cell. The symbolic circuit representation of the LSTM cell is derived from [57] and illustrated in Fig. 7. The initial phase in LSTM determines which information in the cell state should be forgotten, and this is facilitated by the forget gate.

Fig. 7 Architecture of LSTM cell [57]

3.5.1 Forget gate

Typically, a sigmoid function within this gate determines what information to eliminate from the LSTM memory, and this determination relies primarily on \(h_{t-1}\) and \(x_{t}\). The output of this gate, denoted \(f_{t}\), ranges from 0 to 1: a value of 0 indicates that the learnt information is completely removed, whereas a value of 1 indicates that it is entirely retained. The output is calculated as follows:

$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} f_t = \sigma (W_{fh}h_{t-1} + W_{fx}x_{t} + b_f)\\ c_{tf} = f_t {.} c_{t-1} \end{array}\right. } \end{aligned} \end{aligned}$$
(6)

where \(\sigma \) is the sigmoid activation function, \(W_{fh}\) is the weight applied to the previous hidden state \(h_{t-1}\), \(W_{fx}\) is the weight applied to the current input state \(x_{t}\) at time step t, and \(b_f\) is the bias of the forget gate.

3.5.2 Input gate

This gate is responsible for determining whether or not the incoming data should be stored in the LSTM memory. It comprises two parts: a sigmoid segment and a tanh segment. The sigmoid segment decides which values must be updated, whereas the tanh component generates a vector of potential new values for LSTM memory integration. The outputs of these two components are computed as follows:

$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} i_t = \sigma (W_{ih}h_{t-1} + W_{ix}x _{t} + b_i) \\ g_t = \tanh (W_{gh}h_{t-1} + W_{gx}x_{t} + b_g) \\ c_{ti} = i_t {.} g_t \\ c_t= c_{ti} + c_{tf} \\ \end{array}\right. } \end{aligned} \end{aligned}$$
(7)

where \(W_{ih}\) is the weight applied to the previous hidden state \(h_{t-1}\), \(W_{ix}\) is the weight applied to the current input state \(x_{t}\) at time step t, and \(b_i\) is the bias of the input gate. A similar notation is used for the weight matrices of the previous hidden state and the current input state, and for the bias, in the tanh branch.

3.5.3 Output gate

The process begins with a sigmoid layer determining the influence of a segment of LSTM memory on the output. The values are then adjusted with tanh functions to fall inside the range of \(-1\) to 1. Finally, the result is multiplied by the output of the sigmoid layer. The equations below illustrate the computation process:

$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} o_t = \sigma (W_{oh}h_{t-1} + W_{ox}x_{t} + b_o) \\ h_t= \tanh (c_t)\cdot o_t \\ \end{array}\right. } \end{aligned} \end{aligned}$$
(8)

The sigmoid activation function (\(\sigma \)) is applied at the output gate. In this context, \(W_{oh}\) is the weight applied to the previous hidden state, \(W_{ox}\) the weight applied to the current input state at time step t, and \(b_o\) the bias of the output gate. A single LSTM cell captures only the preceding context and cannot incorporate future information; this limitation is addressed by the bidirectional recurrent neural network proposed by [15]. A BiLSTM processes the input sequence \(X= (x_1, x_2, \dots , x_n)\) in both the forward and backward directions, generating forward hidden states \(\overset{\rightarrow }{h}_t = (\overset{\rightarrow }{h}_1, \overset{\rightarrow }{h}_2, \dots , \overset{\rightarrow }{h}_n)\) and backward hidden states \(\overset{\leftarrow }{h}_t= (\overset{\leftarrow }{h}_1, \overset{\leftarrow }{h}_2, \dots , \overset{\leftarrow }{h}_n)\). The resulting encoded vector is constructed by concatenating the final forward and backward outputs, denoted as \(Y = [\overset{\rightarrow }{h}_t, \overset{\leftarrow }{h}_t]\).

$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} \overset{\rightarrow }{h}_t = \sigma (W_{\overset{\rightarrow }{h}x}x_t + W_{\overset{\rightarrow }{h}\overset{\rightarrow }{h}}\overset{\rightarrow }{h}_{t-1} + b_{\overset{\rightarrow }{h}})\\ \overset{\leftarrow }{h}_t = \sigma (W_{\overset{\leftarrow }{h}x}x_t + W_{\overset{\leftarrow }{h}\overset{\leftarrow }{h}}\overset{\leftarrow }{h}_{t+1} + b_{\overset{\leftarrow }{h}}) \\ Y_t = W_{y\overset{\rightarrow }{h}}\overset{\rightarrow }{h}_t + W_{y\overset{\leftarrow }{h}}\overset{\leftarrow }{h}_t + b_y \end{array}\right. } \end{aligned} \end{aligned}$$
(9)

Y denotes the output sequence of the initial hidden layer, expressed as \((y_1, y_2, \ldots , y_t, \ldots , y_n)\). The BiLSTM output is passed to a dropout layer to improve model resilience and prevent overfitting; this layer randomly deactivates neurons during training, encouraging more generalized representations and reducing reliance on specific patterns.
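For clarity, the didactic NumPy sketch below performs a single LSTM step following Eqs. 6-8; the weight layout (one matrix per gate applied to the concatenated \([h_{t-1}, x_t]\)) is an illustrative choice, not the library implementation used in the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b are dicts of per-gate weight matrices and biases."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate (Eq. 6)
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate (Eq. 7)
    g_t = np.tanh(W["g"] @ z + b["g"])      # candidate memory (Eq. 7)
    c_t = f_t * c_prev + i_t * g_t          # updated cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate (Eq. 8)
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t
```

A BiLSTM simply runs one such recurrence forward and one backward over the sequence and concatenates the two hidden states, as in Eq. 9.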

3.6 Dropout layer

Dropout is a strategy that prevents neural networks from relying too heavily on individual neurons or groups of neurons, enabling the network to maintain accuracy even when specific inputs are missing. The proposed architecture includes two dropout layers, each with a dropout rate of 0.5, one after each BiLSTM layer. This strategic integration seeks to reduce overfitting and improve the model's generalization ability during training. Dropout was first applied to fully connected layers by [58], demonstrating its effectiveness in reducing overfitting. Applying dropout to the output of the BiLSTM layer (Y) is defined as follows:

$$\begin{aligned} Z = r \star b (W^A Y ) \end{aligned}$$
(10)

where \(Y = \left[ y_1, y_2, \ldots , y_n \right] ^A\) is the input to the dense layer, \(W \in \mathbb {R}^{u \times v}\) is a weight matrix, and r is a binary vector whose elements are independently drawn from a Bernoulli distribution. The BiLSTM output, with overfitting mitigated by the dropout layer, is subsequently fed into a dense layer, which predicts class probabilities using the softmax function.

3.7 Dense layer

The proposed model includes a fully connected layer with a softmax activation function. This layer establishes connections between all neurons of the preceding layer and the current one. Using the softmax activation function, the model transforms a real-valued vector into a probability distribution spanning multiple classes. The output of the dropout layer is passed to the dense layer containing the softmax function. Given the input vector \(Z = (z_1, z_2, \ldots , z_n)\), where n denotes the total number of categories, the softmax function calculates the probability \(p_i\) for each class i through the following formula:

$$\begin{aligned} p_i = \frac{e^{z_i}}{\sum _{j=1}^n e^{z_j}} \end{aligned}$$
(11)

where \(e^{z_i}\) represents the exponential of the \(i^{th}\) element of the input vector; this exponential transformation ensures that all values are positive. \(p_i\) is the probability that the input belongs to class i, and the denominator \({\sum _{j=1}^n e^{z_j}}\) is the sum of exponentials over all classes. The softmax function assigns higher probabilities to categories with higher scores and lower probabilities to categories with lower scores, transforming raw scores into a probability distribution over the classes. The softmax layer produces class probabilities, while the CTC loss ensures that the predicted sequence aligns correctly with the ground truth, making it a crucial component in sequence-to-sequence tasks such as speech recognition.

3.8 CTC LOSS

CTC loss is used as the objective function to train the proposed model; it permits end-to-end training without requiring frame-level alignment between input and target labels. The set of label tokens at each time step is denoted \(\chi \). Through CTC, the size-T sequence produced by the temporal module constitutes the output, augmented with the blank symbol \(\phi \) and with repeated consecutive symbols. Because blank symbols may appear in the processed string, we define a function \(F: (\chi \cup \{\phi \})^* \rightarrow \chi ^*\) that removes adjacent repeated characters and blank symbols. The probability of observing a labeled sequence \(\alpha \) can be computed by summing over all possible alignments as \(\gamma (\alpha |\beta ) = \sum _{u \in \mathcal {F}^{-1}(\alpha )} \gamma (u_1|\beta ) \ldots \gamma (u_T|\beta )\). The conventional CTC loss, denoted \(L_{\text {CTC}}\), is defined as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} \gamma ^{\text {ctc}}(\alpha |\beta ) = \sum \limits _{\eta \in \mathcal {F}^{-1}(\alpha )} \gamma ^{\text {ctc}}(\eta |\alpha ) \\ = \sum \limits _{\begin{array}{c} \eta \in \mathcal {F}^{-1}(\alpha ) \\ \end{array}} \displaystyle \prod \limits _{t=1}^{T} \tau ^{t}_{\rho _t} \\ L_{\text {ctc}} = -\ln \gamma ^{\text {ctc}}(\alpha |\beta ) \end{array}\right. } \end{aligned}$$
(12)

T represents the input time duration of the frame sequence, while \(\rho _t\) is the output label produced with softmax probability \(\tau ^t_{\rho _t}\), where \(\rho _t\) is chosen from the set {la, le, bn, bl, ..., pl, blank} at frame t. The sequence path of CTC is defined as \(\rho = (\rho _1, \rho _2, \ldots , \rho _{T})\), and \(\alpha \) denotes the sentence label (ground truth). The set of all viable paths within CTC that can be mapped to the ground truth \(\alpha \) is represented by \(\mathcal {F}^{-1}(\alpha )\). CTC does not use auto-regressive connections to model dependencies between time steps in a label sequence; this conditional independence follows from the model being unaffected by the marginal distributions produced at each successive step. Consequently, CTC output is often decoded with a beam search method that reintroduces temporal label dependencies by combining the probabilities with a language model.
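A minimal way to attach this loss in Keras is sketched below. It assumes every input is padded to 75 frames (Sect. 3.1.4) and that labels are zero-padded integer sequences with index 0 reserved for padding; these conventions are illustrative rather than taken from the original implementation.

```python
import tensorflow as tf

class CTCLoss(tf.keras.losses.Loss):
    """CTC loss via the Keras backend helper, assuming a fixed 75-frame input
    length and zero-padded integer label sequences (0 = padding)."""
    def call(self, y_true, y_pred):
        batch = tf.shape(y_true)[0]
        input_len = tf.fill([batch, 1], 75)                     # fixed frame count
        label_len = tf.math.count_nonzero(y_true, axis=-1,
                                          keepdims=True, dtype=tf.int32)
        return tf.keras.backend.ctc_batch_cost(
            tf.cast(y_true, tf.int32), y_pred, input_len, label_len)

# After training, predictions are decoded back to label sequences, e.g. with a
# beam search: tf.keras.backend.ctc_decode(y_pred, input_length, greedy=False)
```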

3.9 Learning rate scheduler

This article presents a dynamic learning rate scheduler for adjusting the learning rate during training. The learning rate is a hyper-parameter that defines the step size at each iteration, and reducing it as training advances is advantageous for model performance. To adopt a variable learning rate, we first determine the maximum number of epochs required to train the model; the epoch range is then divided into three parts (\(epoch \le EH1\), \(EH1< epoch \le EH2\), \(epoch > EH2\)), and a fixed learning rate of 0.0001 is employed for the first EH1 epochs.

Algorithm 3 Learning Rate Scheduler

Subsequently, an exponential decay technique is applied, causing the learning rate to decrease exponentially with each epoch; this gradual reduction allows the model to adjust its weights more precisely. The learning rate is therefore decayed exponentially with rate \(\lambda _1\) for epochs between EH1 and EH2. It is then reset to the initial value of 0.0001 and decayed exponentially with rate \(\lambda _2\) from EH2 up to max_epochs. These adaptive adjustments resulted in a training accuracy of 98.85%. In our experiments, the proposed model requires max_epochs = 80 to achieve its best performance, and we set \(EH1 = 15\), \(EH2 = 30\), \(\lambda _1 = -0.2\), and \(\lambda _2 = -0.1\) experimentally (refer to Algorithm 3). After max_epochs, the training loss and accuracy plateau and show no significant improvement or degradation. The graph of model evaluation with respect to the learning rate includes 21 data points, sampled every 4th epoch (including the last) over the range from 1 to max_epochs. The visualization in Fig. 8 helps to show how the learning rate influences the model's convergence and performance as training progresses.
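A sketch of Algorithm 3 as a Keras callback is given below, using the reported values \(EH1 = 15\), \(EH2 = 30\), \(\lambda _1 = -0.2\), \(\lambda _2 = -0.1\), and an initial rate of 0.0001; the handling of the exact epoch boundaries is an assumption.

```python
import math
from tensorflow.keras.callbacks import LearningRateScheduler

INIT_LR, EH1, EH2 = 1e-4, 15, 30
LAMBDA1, LAMBDA2 = -0.2, -0.1   # decay exponents reported above

def lr_schedule(epoch, lr):
    """Constant rate up to EH1, exponential decay with LAMBDA1 until EH2,
    then a reset to INIT_LR followed by a slower decay with LAMBDA2."""
    if epoch < EH1:
        return INIT_LR
    if epoch < EH2:
        return float(lr * math.exp(LAMBDA1))
    if epoch == EH2:
        return INIT_LR                      # reset before the final phase
    return float(lr * math.exp(LAMBDA2))

lrs_callback = LearningRateScheduler(lr_schedule, verbose=1)
```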

Fig. 8 Training and validation accuracy and loss with learning rate

Table 4 Metrics for loss and accuracy based on data split ratio

Figure 8 depicts the relationship between loss, accuracy, and learning rate throughout the training epochs, with data points at every 4th epoch up to a maximum of 80 epochs. Initially set at 0.0001, the learning rate remains constant for some epochs before decreasing gradually through exponential decay; the implementation of the dynamic learning rate scheduler is given in Algorithm 3. As the number of epochs increases, the learning rate decreases, and the training and validation losses decrease as well, converging towards zero. Simultaneously, accuracy trends towards 1.00, indicating optimal model performance. Ultimately, as the learning rate approaches zero, the model achieves peak performance, with optimized loss and accuracy values.

3.10 Model evaluation

An exhaustive evaluation of the model's performance is carried out using the GRID dataset. This assessment encompasses various metrics, including WER and WRR, as well as monitoring the model's loss and accuracy throughout the training and validation phases.

3.10.1 Training and validation

Training is the process of teaching a model to make predictions by adjusting its parameters based on labeled data, while validation evaluates the model's performance on a separate dataset to ensure its accuracy and generalization capability. Our model undergoes an extensive training process spanning 80 epochs, employing the Adam optimizer to maximize its performance. The experimental dataset is partitioned into distinct segments to ensure a robust evaluation: 80% of the data is allocated for training, and the remaining 10% each is designated for validation and testing, allowing us to scrutinize the model's capabilities and effectiveness meticulously. The model obtained 99.5% training accuracy and 98.8% validation accuracy, with a 1% training loss and a 1.5% validation loss. The total number of examined frames can be computed as the product of the number of videos (\(V_d\)), frames per video (\(N_f\)), and subjects (\(T_s\)): total analyzed frames = \(V_d \times N_f \times T_s\).
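A sketch of this protocol is given below, reusing the hedged helpers from the earlier sketches (build_model, CTCLoss, lrs_callback) and assuming X and y are the pre-processed frame tensors and padded label sequences; the batch size and random seed are illustrative.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# 80/10/10 split of the 7200 samples, with shuffling as described above
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2,
                                                  shuffle=True, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                shuffle=True, random_state=42)

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=CTCLoss())
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=80, batch_size=4, shuffle=True, callbacks=[lrs_callback])
```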

Fig. 9 Model loss and accuracy for training and validation

Table 5 WER and WRR comparison with existing methods

Table 4 offers a comprehensive breakdown of the video dataset, encompassing the distribution ratio of the total data sample (Data Size), as well as metrics such as model training loss (Train Loss), training accuracy (Train Acc.), validation loss (Valid Loss), and validation accuracy (Valid Acc.), expressed in percentages. The last three columns of Table 4 indicate the percentage of data allocated for model training (Train), validation (Valid), and testing (Test). We conducted model testing on a subset of 720 videos, representing 10% of the whole dataset (7200 videos), to assess both WER and WRR. In addition, Fig. 9 plots the loss and accuracy during model training and validation, aiding visualization and understanding. The model demonstrates high training and validation accuracy and low loss, indicating its effectiveness in capturing patterns and its potential for accurate predictions.

3.10.2 WER and WRR

WER and WRR are essential measures for determining the predictive power of the VSR model. These metrics measure the dissimilarity between the words recognized by the VSR model and the words spoken in the visual input, which serve as the reference or ground truth. The WER, evaluated for overlapping speakers, is given in Eq. 13: it is the sum of the minimum numbers of insertions, substitutions, and deletions (\(M_i, M_s, M_d\)) divided by the total number of words N in the ground truth. WRR serves as a complementary metric to WER, computed as 1-WER.

$$\begin{aligned} \text {WER} = \frac{(M_i + M_s + M_d)}{N} \end{aligned}$$
(13)
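Equation 13 can be computed with a word-level edit distance; the sketch below is a straightforward dynamic-programming implementation, with WRR obtained as 1-WER.

```python
def word_error_rate(reference, hypothesis):
    """WER as in Eq. 13: minimum insertions, substitutions and deletions
    divided by the number of words in the ground-truth reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# WRR = 1 - WER, e.g.:
# word_error_rate("bin blue at f two now", "bin blue at f two now")  # -> 0.0
```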

Table 5 compares the WER and WRR of the proposed model with those of existing models and contains significant observations on the overall effectiveness of the proposed model. Specifically, when tested on overlapped speakers, the model achieves a WER of 1.11% and a WRR of 98.89%. These results highlight the model's effectiveness in differentiating spoken words in challenging scenarios involving overlapping speakers. The results reveal that the proposed model outperforms the existing VSR models, demonstrating better performance in terms of both WER and WRR.

4 Conclusion and future scope

This study has presented a comprehensive deep-learning approach for word-level VSR on the GRID dataset. The proposed methodology results in a more robust system and significant enhancements compared to existing VSR models. This work optimized the 3D-CNN architecture to extract detailed local information from the lip region, while the incorporation of BiLSTM introduced temporal relationships, enriching the context embeddings that are crucial for VSR. The proposed dynamic LRS plays a vital role in adjusting the learning rate during model training, considerably improving its performance. Additionally, the CTC loss measures the difference between actual and predicted outputs, which improves model training and performance. The model is verified experimentally and compared with existing techniques, obtaining a 1.11% WER and a 98.89% WRR for overlapped speakers, which outperforms the existing state-of-the-art methods. Future work involves leveraging multi-modal features to articulate lip motion and prioritizing speaker independence by incorporating a self-attention mechanism.