1 Introduction

Automatic speech recognition, which converts what a person says into text, has become part of everyday life through assistants such as Google Assistant and Amazon Alexa. However, audio-only speech recognition is mostly used indoors and performs poorly outdoors, because noise is added to the audio signal and much of the necessary information is lost. Visual Speech Recognition (VSR) does not have this limitation and offers two advantages over audio speech recognition: (a) it is insensitive to acoustic noise, so changes in the audio environment have no effect on the data, and (b) it does not require the user to make a sound. Moreover, large amounts of data and high computational power are now readily available.

AVSR consists of two main parts: audio recognition and visual recognition (lip reading). The audio branch consists of feature extraction and recognition, while the video branch consists of face detection, lip localization, feature extraction, and recognition. Combining these two sources of speech information yields better automatic recognition rates than either source alone. We chose to map the visual signal into an acoustic representation closely related to the vocal tract's transfer function; given such a mapping, the visual signal can be converted and then integrated with the acoustic signal prior to any symbolic encoding. The objectives of the work are given below.

  (a) Develop a database for the English language.

  (b) Extract audio features using MFCC and classify them with a 1D CNN.

  (c) Develop an algorithm for lip localization.

  (d) Develop an LSTM algorithm for visual speech recognition.

  (e) Integrate audio and visual speech using a deep neural network.

  (f) Compare the proposed results with existing results.

The rest of this paper is organized as follows: Section 2 reviews existing audiovisual speech recognition methods. Section 3 describes the database. Section 4 explains the proposed methodology. Section 5 presents the results and discussion. Section 6 concludes the work.

2 Literature review

An extensive literature survey was conducted before the proposed work began. This section discusses the existing algorithms used for audiovisual speech recognition and points out the drawbacks of the existing systems.

This survey also guides the choice of machine learning and deep learning algorithms that are most likely to give good results. The LRS2 database is one of the most commonly used databases [1]. Feature extraction is carried out within a Region of Interest. In noisy conditions, audio-only speech recognition performed noticeably worse than AVSR. The noise can be of different types, such as street or train noise; the results are similar regardless of the noise type. Lip reading is the task of deciphering a transcript from the movements of a speaker's mouth. Ahmad B. A. Hassanat explained different approaches to lip localization [2]. Ayaz A. Shaikh et al. used a depth-sensor camera to add a third dimension to the dataset; during dataset creation, head movement was controlled with a headrest [3]. Themos Stafylakis et al. proposed residual networks combined with LSTMs for the LRW database and obtained 83% accuracy [4]. Shillingford et al. used LipNet, which addresses the problem in two stages, learning the visual features and prediction; the reported word error rates of this work are 89.8% and 76.8% [5].

Another architecture used for lip reading is the Long Short-Term Memory (LSTM) network [6]. An LSTM model determines words from video input by selectively attending to the spatiotemporal cues that matter for a given dataset; on the LRS2 dataset it achieves 85.2%. G. Sterpu et al. [7] investigated modern deep neural network architectures for lip reading based on a sequence-to-sequence recurrent neural network. Their work evaluates both pretrained and 2D or 3D convolutional neural network visual front ends, online monotonic attention, and a joint Connectionist Temporal Classification sequence-to-sequence loss. The system is evaluated with fifty-nine speakers and a vocabulary of over six thousand words on the publicly available TCD-TIMIT dataset. Kumar et al. [8] reported detailed experiments for speaker-dependent, out-of-vocabulary, and speaker-independent settings. To demonstrate the real-time nature of the audio produced by the system, the latency of Lipper was compared with that of other speech-reading systems. The audio-only accuracy is 80.25% with an annotation accuracy variance of 2.72%, and the audio-visual accuracy is 81.25% with an annotation accuracy variance of 1.97%. One of the common datasets used in lip reading is the GRID audio-visual dataset, on which the work in [9] is based. The video is recorded at 25 frames per second, giving a total of 75 frames per 3 s sample.

That work uses LCANet, an end-to-end deep neural network, and achieves a 1.3% character error rate and a 3.0% word error rate. Dilip Kumar et al. proposed the new SD-2D-CNN-BLSTM [10] architecture. They analyzed two approaches: a 3D-2D convolutional neural network with Bidirectional Long Short-Term Memory (CNN-BLSTM) trained with CTC loss on characters, and the same network trained with CTC loss on word labels. For the first approach, the word error rate is 3.2% for seen words and 15.2% for unseen words. On the GRID dataset, the second approach gives word error rates of 1.3% and 8.6% for seen and unseen words respectively. On an Indian English dataset of unseen words, the word error rates are 19.6% and 12.3% for the two approaches. One of the best-known datasets for lip reading is "Lip Reading in the Wild (LRW)" [11], collected from BBC TV, which contains 500 target words. Themos Stafylakis et al. used residual networks and bidirectional LSTMs; the misclassification rate of the architecture is 11.92%. Using the same database and the same method, an accuracy of 83% was obtained.

Audiovisual speech recognition is one prospective solution for speech recognition in noisy environments [12]. Shiliang Zhang et al. used a bimodal DFNN trained on 150 h of multi-condition data and achieved a 12.6% phone error rate on clean test data; the word error rate is 29.98%. Kuniaki Noda et al. introduced a multi-stream HMM for integrating audio and visual features [13]; the word recognition rate of the MSHMM is 65% at a signal-to-noise ratio of 10 dB. Stavros Petridis et al. proposed a Long Short-Term Memory based end-to-end visual speech recognition classifier [14]. The model contains two streams that extract features directly from the mouth region, and the two streams are combined via bidirectional Long Short-Term Memory. On the OuluVS2 and CUAVE databases, the reported improvements are 9.7% and 1.5% respectively. The structure proposed by Fei Tao et al. [15] is compared with a conventional Hidden Markov Model whose observation models are Gaussian mixture models; the channel-matched word error rate is 3.70% and the channel-mismatched word error rate is 11.48%. A hybrid Connectionist Temporal Classification architecture for audiovisual recognition of speech in the wild is used in [16]. Audio features come in many kinds; the three used in [17] are LPC, PLP, and MFCC.

That study shows that MFCC gives the highest accuracy, about 94.6%, for the Hindi language in a noiseless environment, although it takes a long time to create and process the data into the format required for the application. The objective defined in [18] can be affected by varying light intensity, head movement, and the distance from the camera. Ochiai et al. proposed attention-based feature extraction in which the most significant speaker cues are extracted from the dataset; they used 3 BLSTM layers with 512 units each [19]. Joon Son Chung et al. introduced a new database called LRS, which contains 100,000 natural sentences from BBC television [20]. Namboodiri et al. used Charlie Chaplin videos; their word-spotting technique achieves 35% higher mean average precision than a recognition-based technique on the large LRW dataset, and they demonstrate the approach by spotting words in the well-known speech video "The Great Dictator" by Charlie Chaplin [21]. Thabet et al. applied machine learning methods to lip interpretation; the three best-performing classifiers were Gradient Boosting, Support Vector Machine, and Logistic Regression, with accuracies of 64.7%, 63.5%, and 59.4% respectively [22].

Yaman Kumar et al. describe speech reading, or lip reading, as the process of understanding and extracting phonetic information from a speaker's visual features such as the movement of the mouth, face, teeth, and tongue [23]. Lu et al. proposed a technology for visual speech recognition that combines machine vision and linguistic perception [24].

Iain et al. created a custom database called AVLetters and used three approaches: a hidden Markov model for recognition, a top-down method for lip features, and a bottom-up method based on nonlinear scale analysis [25]. Abderrahim Mesbah et al. proposed a Hahn CNN and obtained accuracies of 59.23%, 93.72%, and 58.02% on the AVLetters, OuluVS2, and BBC LRW databases respectively [26]. Shashidhar et al. proposed a VGG16 CNN method for visual speech recognition; using a custom dataset, they obtained 76% accuracy [27]. Xinmeng et al. proposed a multi-layer feature fusion convolutional neural network (MFFCNN) for audiovisual speech recognition and, applying it to the TCD-TIMIT and GRID corpora, obtained 82.7% accuracy [28]. Weijiang Feng et al. proposed a multimodal recurrent neural network (MRNN) for audiovisual speech recognition and obtained 84.4% accuracy on the AVLetters dataset [29].

3 Database

This section describes the dataset creation steps and the dataset features, and discusses the challenges faced when creating the dataset for the English language.

3.1 Dataset creation

The dataset was created for English words using a setup that includes an electronic gimbal for stable video and a smartphone with sufficient storage space. Table 1 lists the dataset parameters. The dataset comprises paired audio and lip-movement data from videos of multiple subjects uttering the same words. It was created to enable the development and validation of procedures used to train and test lip-motion based methods; it is a collection of videos of subjects reciting a fixed script, intended for training software to recognize lip-motion patterns.

Table 1 Dataset features

The recordings were collected in a controlled, noise-free, indoor setting with a smartphone capable of recording at 4K resolution. Eleven male and thirteen female subjects, aged 18 to 30, volunteered for the dataset creation process, and around 240 video samples were collected per subject. The dataset can be used for speech recognition and lip-reading applications.

3.2 Challenges while creating dataset

Various challenges were encountered during the dataset creation process; they are explained below.

  • Interference of external noise may disrupt audio feature extraction. A noise-free environment is an important requirement of data-set creation.

  • The lip movements of an individual should be consistent so that lip features can be extracted reliably; random lip movement leads to errors.

  • Each volunteer has to spend around 30–45 min reciting the words, which can be tedious.

  • Recording a video of a person with a mustache or beard leads to difficulty in detecting lip movement.

  • The selection of English and Kannada words to prepare the database was difficult as some of the words have similar pronunciations.

4 Methodology

This section presents the pipeline and the methodologies implemented in AVSR. To implement AVSR, a custom database was created in English, comprising seven different words: "About", "Bottle", "Dog", "English", "Good", "People", and "Today". Fifteen individuals pronounced each word five times and were recorded, so the dataset consists of 15 persons × 7 words × 5 repetitions = 525 videos in total. Of these, 420 videos were used for training the model and the remaining 105 for testing.
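The listing below is a minimal sketch of how such a dataset could be organised and split; the directory layout, the file naming, and the use of scikit-learn's stratified train/test split are illustrative assumptions and are not specified in the paper.

```python
# Illustrative sketch of organising the 525-clip dataset (15 speakers x 7 words x
# 5 repetitions) and splitting it 420/105 for training and testing. The directory
# layout root/<word>/<file>.mp4 and the stratified split are assumptions.
import os
from sklearn.model_selection import train_test_split

WORDS = ["About", "Bottle", "Dog", "English", "Good", "People", "Today"]

def list_clips(root):
    """Collect (video_path, label_index) pairs, one per recorded clip."""
    clips = []
    for label, word in enumerate(WORDS):
        word_dir = os.path.join(root, word)
        for name in sorted(os.listdir(word_dir)):
            if name.endswith(".mp4"):
                clips.append((os.path.join(word_dir, name), label))
    return clips

clips = list_clips("dataset")                      # 525 entries expected
paths, labels = map(list, zip(*clips))
train_paths, test_paths, train_y, test_y = train_test_split(
    paths, labels, test_size=105, stratify=labels, random_state=42)
```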

4.1 Audio speech model

First, audio files are created from the video dataset and saved in .wav format using FFmpeg. The features are then extracted from the audio using Librosa, an open-source Python module. Five features are extracted from the audio: MFCC, chroma, mel spectrogram, spectral contrast, and tonnetz. These features are concatenated to form a feature vector of size 193 × 1. Next, a convolutional neural network is created with one Conv1D layer, followed by a MaxPooling1D layer, a batch normalization layer, a dropout layer, and two dense layers. One-dimensional CNNs operate on sequences in one dimension and tend to be useful for analysing fixed-length signals such as audio. The outputs of the corresponding layers of the audio model are as follows.

$$a^{1} + b^{1} = \sum\limits_{i = 1}^{193} \mathrm{conv1D}\left( w_{i}, X_{\mathrm{train\_audio}}\left[ i \right] \right)$$
(1)
$$y^{1} = R\left( a^{1} \right)$$
(2)
$$y^{2} = R\left( w^{2} y^{1} + b^{2} \right)$$
(3)
$$y^{3} = R\left( w^{3} y^{2} + b^{3} \right)$$
(4)
$$y^{4} = R\left( w^{4} y^{3} + b^{4} \right)$$
(5)

where $y^{i}$ is the output vector of layer i, R is the ReLU activation function, $w^{i}$ are the weights of layer i, and $b^{i}$ is the bias of layer i. Finally, a softmax layer is attached to this network for classification. The loss function used to train the model is the cross-entropy loss, given by

$$L^{\left\langle t \right\rangle } \left( \hat{y}^{\left\langle t \right\rangle } ,y^{\left\langle t \right\rangle } \right) = - y^{\left\langle t \right\rangle } \log \left( \hat{y}^{\left\langle t \right\rangle } \right) - \left( 1 - y^{\left\langle t \right\rangle } \right)\log \left( 1 - \hat{y}^{\left\langle t \right\rangle } \right)$$
(6)
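The following sketch illustrates the audio pipeline described above: audio is exported from each clip with FFmpeg, a 193-dimensional feature vector (40 MFCC, 12 chroma, 128 mel, 7 spectral contrast, and 6 tonnetz coefficients) is built with Librosa, and a small Conv1D network of the kind outlined in Eqs. (1)–(6) is assembled in Keras. The exact feature split and the filter counts, kernel size, dense-layer widths, and dropout rate are illustrative assumptions, since the paper does not state them.

```python
# Sketch of the audio pipeline: 193-dimensional Librosa features + a 1D CNN
# classifier of the kind outlined in Eqs. (1)-(6). Audio is assumed to have been
# exported from each clip beforehand, e.g. "ffmpeg -i clip.mp4 -vn clip.wav".
# Filter counts, kernel size, dense widths and dropout rate are assumptions.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def extract_audio_features(wav_path):
    """Return a 193-dim vector: 40 MFCC + 12 chroma + 128 mel + 7 contrast + 6 tonnetz."""
    y, sr = librosa.load(wav_path, sr=None)
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr).T, axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).T, axis=0)
    return np.hstack([mfcc, chroma, mel, contrast, tonnetz])      # shape (193,)

def build_audio_model(n_classes=7):
    """Conv1D -> MaxPooling1D -> BatchNorm -> Dropout -> two Dense layers -> softmax."""
    model = models.Sequential([
        layers.Input(shape=(193, 1)),            # feature vector reshaped to (193, 1)
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

During training, each 193 × 1 feature vector is fed to the network and the softmax output is compared with the one-hot word label using the cross-entropy loss of Eq. (6).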

4.2 Visual model

First, the mouth region is extracted from each frame of the video using the dlib library, which is available for Python 3, as shown in Fig. 1. The regions are then converted to grayscale to reduce the complexity of the model, as shown in Fig. 2. Finally, the positions of the outer-lip coordinates are extracted and saved in the feature vector.

Fig. 1
figure 1

Mouth ROI Extraction

Fig. 2
figure 2

Conversion to gray scale
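A minimal sketch of this extraction step is shown below, using dlib's standard 68-point facial landmark model, in which the outer-lip contour corresponds to points 48–59; the predictor file name and the ROI margin are assumptions.

```python
# Sketch of the mouth-ROI / outer-lip extraction step with dlib's 68-point
# landmark model (outer lip = points 48-59). The predictor file path and the
# ROI margin are assumptions; the model file must be downloaded separately.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def outer_lip_coordinates(frame_bgr):
    """Return the 12 outer-lip (x, y) points of the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)   # grayscale reduces complexity
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 60)],
                    dtype=np.int32)

def mouth_roi(frame_bgr, margin=10):
    """Crop a rectangle around the outer-lip landmarks (as in Fig. 1)."""
    pts = outer_lip_coordinates(frame_bgr)
    if pts is None:
        return None
    x, y, w, h = cv2.boundingRect(pts)
    return frame_bgr[max(y - margin, 0):y + h + margin, max(x - margin, 0):x + w + margin]
```

One possible arrangement, assumed here, is to flatten the 12 outer-lip coordinates of each frame into a 24-value vector and stack the frames of a clip into the sequence fed to the LSTM described next.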

Visual speech is recognized using a Long Short-Term Memory (LSTM) network; Fig. 3 shows the structure of an LSTM cell. A model consisting of a stack of LSTM layers and dense layers (a deep LSTM network) is created. An LSTM network is a type of recurrent neural network that can learn order dependence in sequence prediction problems. It contains three gates: an input gate, a forget gate, and an output gate. The forget gate decides whether to keep or discard information from previous time steps, the input gate quantifies the importance of the incoming input, and the output gate produces the most relevant output. The first layer of the model is an LSTM layer with 8 hidden units and 30 time steps.

Fig. 3
figure 3

Structure of an LSTM cell

The LSTM uses the tanh activation function; its behaviour is shown in Fig. 4.

Fig. 4
figure 4

Performance of Tanh Activation Function

Tanh is the hyperbolic tangent function and is similar to the sigmoid activation function. It accepts any real value as input and returns a value between −1 and 1: the more positive the input, the closer the output is to 1, and the more negative the input, the closer the output is to −1.

$$\tanh \left( z \right) = \frac{e^{z} - e^{ - z} }{e^{z} + e^{ - z} }$$
(7)

As noted above, the first layer of the model is an LSTM layer with 8 hidden units and 30 time steps. The operations inside an LSTM cell are as follows.

$$\tilde{c}^{\left\langle t \right\rangle } = \tanh \left( W_{c} \left[ a^{\left\langle t - 1 \right\rangle } ,x^{\left\langle t \right\rangle } \right] + b_{c} \right)$$
(8)
$$\Gamma_{u} = \sigma \left( W_{u} \left[ a^{\left\langle t - 1 \right\rangle } ,x^{\left\langle t \right\rangle } \right] + b_{u} \right)$$
(9)
$$\Gamma_{f} = \sigma \left( W_{f} \left[ a^{\left\langle t - 1 \right\rangle } ,x^{\left\langle t \right\rangle } \right] + b_{f} \right)$$
(10)
$$\Gamma_{o} = \sigma \left( W_{o} \left[ a^{\left\langle t - 1 \right\rangle } ,x^{\left\langle t \right\rangle } \right] + b_{o} \right)$$
(11)
$$c^{\left\langle t \right\rangle } = \Gamma_{u} * \tilde{c}^{\left\langle t \right\rangle } + \Gamma_{f} * c^{\left\langle t - 1 \right\rangle }$$
(12)
$$a^{\left\langle t \right\rangle } = \Gamma_{o} * \tanh \left( c^{\left\langle t \right\rangle } \right)$$
(13)

where c represents the memory cell, t the time step, Γu the update gate, Γf the forget gate, Γo the output gate, σ the sigmoid function, Wu, Wf, and Wo the weights of the update, forget, and output gates respectively, b the bias, and c̃ the candidate cell value.
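To make Eqs. (8)–(13) concrete, the following NumPy sketch performs a single LSTM forward step; the weight values are random placeholders and the input size of 24 (12 outer-lip points with two coordinates each) is an assumption.

```python
# Single LSTM forward step implementing Eqs. (8)-(13) in NumPy.
# Weights are random placeholders; in the model they are learned. The input
# size of 24 (12 outer-lip points x 2 coordinates) is an assumption.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, p):
    """One LSTM time step: x_t is the input at time t, a_prev/c_prev the previous states."""
    concat = np.concatenate([a_prev, x_t])                 # [a<t-1>, x<t>]
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])          # Eq. (8)  candidate cell
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])          # Eq. (9)  update (input) gate
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])          # Eq. (10) forget gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])          # Eq. (11) output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev             # Eq. (12) new cell state
    a_t = gamma_o * np.tanh(c_t)                           # Eq. (13) new hidden state
    return a_t, c_t

n_h, n_x = 8, 24                                           # 8 hidden units as in the model
rng = np.random.default_rng(0)
p = {k: 0.1 * rng.standard_normal((n_h, n_h + n_x)) for k in ("Wc", "Wu", "Wf", "Wo")}
p.update({k: np.zeros(n_h) for k in ("bc", "bu", "bf", "bo")})
a, c = np.zeros(n_h), np.zeros(n_h)
for _ in range(30):                                        # 30 time steps as in the model
    a, c = lstm_step(rng.standard_normal(n_x), a, c, p)
```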

Next, a second LSTM layer is introduced, so the output of the second layer is

$$\tilde{c}1^{\left\langle t \right\rangle } = \tanh \left( W1_{c} \left[ a1^{\left\langle t - 1 \right\rangle } ,a^{\left\langle t \right\rangle } \right] + b1_{c} \right)$$
(14)
$$\Gamma 1_{f} = \sigma \left( W1_{f} \left[ a1^{\left\langle t - 1 \right\rangle } ,x1^{\left\langle t \right\rangle } \right] + b1_{f} \right)$$
(15)
$$\Gamma 1_{o} = \sigma \left( W1_{o} \left[ a1^{\left\langle t - 1 \right\rangle } ,x1^{\left\langle t \right\rangle } \right] + b1_{o} \right)$$
(16)
$$c1^{\left\langle t \right\rangle } = \Gamma 1_{u} * \tilde{c}1^{\left\langle t \right\rangle } + \Gamma 1_{f} * c1^{\left\langle {t - 1} \right\rangle }$$
(17)
$$a1^{\left\langle t \right\rangle } = \Gamma 1_{o} * \tanh \left( c1^{\left\langle t \right\rangle } \right)$$
(18)

This is followed by three dense layers, whose output equations are

$$y^{2} = R\left( W2 * a1 + b2 \right)$$
(19)
$$y^{3} = R\left( W3 * y^{2} + b3 \right)$$
(20)
$$y^{4} = R\left( W4 * y^{3} + b4 \right)$$
(21)

Finally, a softmax layer is attached to this network for classification. The loss function used to train the model is the cross-entropy loss, given by

$$L^{\left\langle t \right\rangle } \left( \hat{y}^{\left\langle t \right\rangle } ,y^{\left\langle t \right\rangle } \right) = - y^{\left\langle t \right\rangle } \log \left( \hat{y}^{\left\langle t \right\rangle } \right) - \left( 1 - y^{\left\langle t \right\rangle } \right)\log \left( 1 - \hat{y}^{\left\langle t \right\rangle } \right)$$
(22)
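A compact Keras sketch of the visual model described in this section is given below: an LSTM layer with 8 hidden units over 30 time steps, a second LSTM layer, three dense layers, and a softmax classifier trained with cross-entropy. The per-frame feature size and the dense-layer widths are assumptions, since they are not stated in the text.

```python
# Sketch of the deep LSTM network of Sect. 4.2: two stacked LSTM layers followed
# by three Dense layers and a softmax classifier trained with cross-entropy
# (Eq. (22)). The per-frame feature size and Dense widths are assumptions; the
# text specifies 8 hidden units and 30 time steps for the first LSTM layer.
from tensorflow.keras import layers, models

def build_visual_model(time_steps=30, n_features=24, n_classes=7):
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_features)),
        layers.LSTM(8, return_sequences=True),   # first LSTM layer, Eqs. (8)-(13)
        layers.LSTM(8),                          # second LSTM layer, Eqs. (14)-(18)
        layers.Dense(64, activation="relu"),     # Eq. (19)
        layers.Dense(32, activation="relu"),     # Eq. (20)
        layers.Dense(16, activation="relu"),     # Eq. (21)
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```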

4.3 Fusion model

Audio and visual speech recognition are integrated using a deep neural network. The integration consists of three parts: an audio-only part, a visual-only part, and the combination of the audio and visual parts. Figure 5 shows the deep convolutional neural network used in this model, which has one input layer, two hidden layers, and one output layer.

Fig. 5
figure 5

Deep convolutional neural networks

In the audio-only part, features are extracted in the same way as in the audio model, and a deep convolutional neural network is created. This model is a replica of the audio model, except that the softmax layer is replaced with an additional dense layer.

$$y^{5 } = R\left( {W^{5} y^{4} + b^{5} } \right)$$
(23)

In the video-only part, video features are extracted in the same way as in the video model, and a deep LSTM network is created. The first layer is an LSTM layer containing 128 hidden units with 8 time steps, followed by a dropout layer and a dense layer. The resulting equation for this dense layer is

$$y_{v} = R\left( {W2*a1^{\left\langle t \right\rangle } + b2} \right)$$
(24)

To combine the audio-only and video-only parts, the feature map from the first dense layer of the audio-only part is concatenated with the feature map from the first LSTM layer of the video-only part. From Eqs. (3) and (13),

$$a_{c } = \left[ {a1^{\left\langle t \right\rangle } ,y^{2} } \right]$$
(25)

The resulting feature map is passed to a deep neural network containing three dense layers, where the first two dense layers are followed by a batch normalization layer and a dropout layer respectively:

$$y_{d1} = R\left( W_{d1} * a_{c} + b_{d1} \right)$$
(26)
$$y_{d2 } = R\left( {W_{d2 } *y_{d1} + b_{d2} } \right)$$
(27)
$$y_{d3 } = R\left( {W_{d3 } *y_{d2} + b_{d3} } \right)$$
(28)

As the final step, all three parts are combined, so the resulting vector is the concatenation of the output vectors of the three parts. Therefore, from Eqs. (23), (24), and (28),

$$ac_{2 } = \left[ {y^{5} ,y_{v} ,y_{d3} } \right]$$
(29)

This is then passed to a deep neural network containing three dense layers followed by a batch normalization layer and a dropout layer:

$$yc_{1} = R\left( {W_{c1} *a_{c2} + b_{c1} } \right)$$
(30)
$$yc_{2} = R\left( {W_{c2} *y_{c1} + b_{c2} } \right)$$
(31)
$$yc_{3} = R\left( {W_{c3} *y_{c2} + b_{c3} } \right)$$
(32)
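The following Keras functional-API sketch mirrors the fusion network described in this section: an audio branch (the audio model with an extra dense layer in place of the softmax, Eq. (23)), a video branch (an LSTM with 128 hidden units over 8 time steps followed by dropout and a dense layer, Eq. (24)), a combined branch built from the concatenated intermediate features (Eqs. (25)–(28)), and a final concatenation of the three branch outputs feeding three dense layers and a softmax (Eqs. (29)–(32)). All layer widths and dropout rates are illustrative assumptions.

```python
# Keras functional-API sketch of the fusion network of Sect. 4.3. Three parts:
# audio-only branch, video-only branch, and a combined branch built from
# concatenated intermediate features (Eq. (25)); their outputs are concatenated
# again (Eq. (29)) and classified. Layer widths and dropout rates are assumptions.
from tensorflow.keras import layers, models

def build_fusion_model(n_classes=7):
    # Audio branch: same layout as the audio model, with a dense layer (Eq. (23))
    # in place of the softmax.
    audio_in = layers.Input(shape=(193, 1), name="audio")
    a = layers.Conv1D(64, 5, activation="relu")(audio_in)
    a = layers.MaxPooling1D(2)(a)
    a = layers.BatchNormalization()(a)
    a = layers.Dropout(0.3)(a)
    a = layers.Flatten()(a)
    a_feat = layers.Dense(128, activation="relu")(a)     # first dense feature map (y^2)
    a_out = layers.Dense(64, activation="relu")(a_feat)  # Eq. (23)

    # Video branch: LSTM with 128 hidden units over 8 time steps, then dropout + dense.
    video_in = layers.Input(shape=(8, 24), name="video")
    v_feat = layers.LSTM(128)(video_in)                  # first LSTM feature map (a1<t>)
    v = layers.Dropout(0.3)(v_feat)
    v_out = layers.Dense(64, activation="relu")(v)       # Eq. (24)

    # Combined branch: concatenate intermediate audio and video features (Eq. (25)),
    # then three dense layers, the first two followed by batch norm and dropout
    # respectively (Eqs. (26)-(28)).
    c = layers.Concatenate()([v_feat, a_feat])
    c = layers.Dense(128, activation="relu")(c)
    c = layers.BatchNormalization()(c)
    c = layers.Dense(64, activation="relu")(c)
    c = layers.Dropout(0.3)(c)
    c_out = layers.Dense(32, activation="relu")(c)

    # Final fusion: concatenate the three branch outputs (Eq. (29)), pass through
    # three dense layers with batch norm and dropout (Eqs. (30)-(32)), then softmax.
    f = layers.Concatenate()([a_out, v_out, c_out])
    f = layers.Dense(64, activation="relu")(f)
    f = layers.BatchNormalization()(f)
    f = layers.Dropout(0.3)(f)
    f = layers.Dense(32, activation="relu")(f)
    f = layers.Dense(16, activation="relu")(f)
    out = layers.Dense(n_classes, activation="softmax")(f)

    model = models.Model(inputs=[audio_in, video_in], outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

In this arrangement the two inputs are the 193 × 1 audio feature vector and the per-clip lip-coordinate sequence, and the model would be trained on paired inputs, e.g. model.fit([X_audio, X_video], Y_onehot).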

5 Result and discussion

This section presents the results of the audio-only, visual-only, and audio-visual models, with accuracy curves, loss curves, confusion matrices, and classification reports.

5.1 Audio speech recognition evaluation

Figure 6 shows the training epochs for the audio-only model, with a training accuracy of 90.48% and a testing accuracy of 96.62%. Figure 7 shows the accuracy curve of the audio-only model.

Fig. 6
figure 6

Training of Epochs for audio

Fig. 7
figure 7

Model accuracy curve for audio, epoch vs accuracy

Figure 8 shows the loss curve of the audio-only model, and Fig. 9 shows the accuracy and misclassification of each word. The first word, "About", is recognized with 80% accuracy and 20% misclassification: the algorithm predicted 80% of the samples as "About" and 20% as "Today". The second word, "Bottle", is recognized with 86% accuracy and 14% misclassification (86% as "Bottle" and 14% as "Dog"). The third word, "Dog", is recognized with 86% accuracy and 14% misclassification (86% as "Dog" and 14% as "About"). The fourth word, "English", is recognized with 100% accuracy and no misclassification. The fifth word, "Good", is recognized with 93% accuracy and 7% misclassification (93% as "Good" and 7% as "Bottle"). The sixth word, "People", is recognized with 86% accuracy and 14% misclassification (86% as "People" and the remainder as "Dog" and "Today"). The seventh word, "Today", is recognized with 100% accuracy and no misclassification. Table 2 shows the classification report for the audio model: the precision, recall, accuracy, and F1-score of the proposed system are 91%, 90%, 90%, and 91% respectively.

Fig. 8
figure 8

Model loss curve for audio, epoch vs loss

Fig. 9
figure 9

Confusion matrix for audio-only model

Table 2 Classification report for audio model
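The per-word accuracies, the confusion matrix, and the precision/recall/F1 figures reported in this section are the standard classification metrics; a minimal sketch of how they could be computed from the model's predictions with scikit-learn is shown below (the variable names are placeholders).

```python
# Sketch of how the confusion matrix (Fig. 9) and the classification report
# (Table 2) are typically computed with scikit-learn. y_true and y_pred are
# placeholders for the test labels and the model's predicted classes.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

WORDS = ["About", "Bottle", "Dog", "English", "Good", "People", "Today"]

# y_prob = model.predict(X_test)               # softmax outputs, shape (n_samples, 7)
# y_pred = np.argmax(y_prob, axis=1)
# y_true = np.argmax(Y_test_onehot, axis=1)
y_true = np.array([0, 1, 2, 3, 4, 5, 6])        # toy stand-ins so the snippet runs
y_pred = np.array([0, 1, 2, 3, 4, 5, 6])

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=WORDS))
```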

5.2 Visual speech recognition evaluation

Figure 10 shows the training epochs for the visual model, with a training accuracy of 71.43% and a testing accuracy of 82.73%. Figure 11 shows the accuracy curve (epoch versus accuracy) for the visual model, including both the training and validation curves. Figure 12 shows the loss curve of the visual model with training and validation loss. Figure 13 shows the accuracy and misclassification of each word for visual speech recognition. The first word, "About", is recognized with 86% accuracy and 14% misclassification (86% predicted as "About" and 14% as "Good"). The second word, "Bottle", is recognized with 33% accuracy and 67% misclassification (33% as "Bottle", 26% as "English", 20% as "Good", 13% as "People", and 6.6% as "Today"). The third word, "Dog", is recognized with 60% accuracy and 40% misclassification (60% as "Dog" and the remainder as "Bottle" and "Good"). The fourth word, "English", is recognized with 80% accuracy and 20% misclassification (80% as "English" and 20% as "Dog" and "Good"). The fifth word, "Good", is recognized with 86% accuracy and 14% misclassification (86% as "Good" and the remainder as "About" and "Today"). The sixth word, "People", is recognized with 80% accuracy and 20% misclassification (80% as "People" and 20% as "Dog" and "Good"). The seventh word, "Today", is recognized with 73% accuracy and 27% misclassification (73% as "Today" and 27% as "Good" and "People"). Table 3 shows the classification report for the visual model: the precision, recall, accuracy, and F1-score of the proposed system are 75%, 71%, 71%, and 71% respectively.

Fig. 10
figure 10

Training of Epochs for Visual Speech

Fig. 11
figure 11

Model accuracy curve for Visual Speech, epoch vs accuracy

Fig. 12
figure 12

Model loss curve for Visual Speech, epoch vs loss

Fig. 13
figure 13

Confusion matrix for Visual Speech Recognition

Table 3 Classification report for visual model

5.3 Audiovisual speech recognition evaluation

Figure 14 shows the training epochs for the audiovisual model, with a training accuracy of 88.57% and a testing accuracy of 91.93%. Figure 15 shows the accuracy curve (epoch versus accuracy) for the audiovisual model. Figure 16 shows the loss curve of the audiovisual model with training and validation loss. Figure 17 shows the accuracy and misclassification of each word for audiovisual speech recognition. The first word, "About", is recognized with 73% accuracy and 27% misclassification (73% predicted as "About" and the remainder as "Bottle" and "Dog"). The second word, "Bottle", is recognized with 80% accuracy and 20% misclassification (80% as "Bottle" and the remainder as "About", "Dog", and "English"). The third word, "Dog", is recognized with 100% accuracy and no misclassification. The fourth word, "English", is recognized with 100% accuracy and no misclassification. The fifth word, "Good", is recognized with 86% accuracy and 14% misclassification (86% as "Good" and the remainder as "People" and "Today"). The sixth word, "People", is recognized with 86% accuracy and 14% misclassification (86% as "People" and 14% as "Today"). The seventh word, "Today", is recognized with 93% accuracy and 7% misclassification (93% as "Today" and 7% as "English"). Table 4 shows the classification report for the audiovisual model: the precision, recall, accuracy, and F1-score of the proposed system are 89%, 89%, 89%, and 88% respectively. Table 5 compares the proposed visual results with existing methods, and Table 6 compares the proposed audiovisual results with existing audiovisual speech recognition methods.

Fig. 14
figure 14

Training of Epochs for Audiovisual Speech

Fig. 15
figure 15

Model accuracy curve for Audiovisual Speech, epoch vs accuracy

Fig. 16
figure 16

Model loss curve for Audiovisual Speech, epoch vs loss

Fig. 17
figure 17

Confusion matrix for Audiovisual Speech Recognition

Table 4 Classification report for audiovisual model
Table 5 Obtained visual results on the custom dataset in comparison with the existing method
Table 6 Obtained audiovisual results on a custom dataset in comparison with the existing method

6 Conclusion

In this work, we developed audiovisual speech recognition for a custom dataset of English words. First, we extracted audio features from the videos and used a 1D CNN for classification, obtaining 90% accuracy; visual speech was recognized with the LSTM technique, obtaining 71.42% accuracy. The audio and visual modalities were then combined using a deep neural network to obtain better accuracy, and the combined audiovisual model achieved 91% accuracy. The limitations are that the proposed AVSR model recognizes only single words, it cannot recognize sentences, and it is not an end-to-end model. In future work, we plan to use larger datasets for training and testing, to experiment with different neural networks, and to create a database with camera angles other than directly facing the speaker.