1 Introduction

Visual phrase recognition (VPR) is the task of recognizing spoken phrases from lip movement patterns. Audio-visual phrase recognition (AVPR) likewise recognizes spoken phrases, but from acoustic information together with the corresponding lip movement information. Machine-based VPR and AVPR are useful for many applications, most notably for hearing-impaired people [1] and for biometric authentication. Moreover, lip movement related visual features are not affected by acoustic noise and are therefore more robust in unfamiliar acoustic conditions. Such visual features are commonly used as supplementary information to acoustic features when developing a robust AVPR system.

The lip movement related visual features can be divided into four groups: (1) geometric visual features [2, 3], (2) motion visual features [4, 5], (3) hybrid features [6, 7] and (4) appearance based visual features [8, 9]. Geometric visual features such as the mouth's height and width [3] require a reliable lip contour tracking method and an accurate face detection algorithm [3]. This is very difficult when the facial image includes beards and mustaches [10]. The optical flow technique can be used to estimate the motion visual features of the mouth region images [5], but this approach is sensitive to the speaker's facial orientation and motion. In [7], the authors used geometric and motion visual features of the lip movements together and referred to the combination as a hybrid feature. This approach recognizes English words effectively, but only with the support of the corresponding acoustic information. A lip contour tracking algorithm is not needed for appearance-based visual feature extraction; hence, this method of feature extraction is straightforward.

The appearance based visual features can be classified into two types: (1) global appearance based features and (2) local appearance based features. The global appearance based features are estimated directly from the entire region of the image using the discrete cosine transform (DCT) [8] or the discrete wavelet transform (DWT) [9], thereby reflecting global information. Instead of considering the entire image, local appearance information is extracted from small regions or patches of the image to capture its micro-patterns. The local appearance based features are also known as local descriptors. An example of such a local descriptor is the Local Binary Pattern (LBP), which describes the gray intensity variation of image patches. This feature represents the appearance or spatial information extracted in the XY plane of the image, but not the temporal information. In [11], the authors acquire both spatial and temporal information by extracting the LBP in three orthogonal planes (TOP), i.e. the XY, XT and YT planes of the mouth region images. They termed this feature the spatio-temporal LBP feature or LBP-TOP feature. The histogram patterns of this feature are generated from the gray intensity differences between the center and neighborhood pixels of mouth region image patches.
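
For illustration, a minimal sketch of the basic LBP operator on a grayscale patch is given below; the neighbour ordering and the dense 256-bin histogram are assumptions made for illustration and may differ from the exact configuration used in [11].

```python
import numpy as np

def lbp_code(patch, r, c):
    """8-neighbour LBP code for the pixel at (r, c) of a grayscale patch."""
    center = patch[r, c]
    # Clockwise neighbour offsets starting at the top-left pixel (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if patch[r + dr, c + dc] >= center:   # threshold against the center pixel
            code |= 1 << bit
    return code                                # value in [0, 255]

def lbp_histogram(patch):
    """256-bin LBP histogram of an image patch (spatial XY plane)."""
    rows, cols = patch.shape
    codes = [lbp_code(patch, r, c)
             for r in range(1, rows - 1) for c in range(1, cols - 1)]
    hist, _ = np.histogram(codes, bins=256, range=(0, 256), density=True)
    return hist
```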

None of the above-mentioned feature extraction approaches include differential excitation, gradient orientation, or gradient directional information, which provide salient micro-texture information about the image. The differential excitation is the ratio of the gray intensity differences between the central pixel and its neighborhood pixels to the gray value of the central pixel of the image patch. The gradient orientation is the angle determined by the vertical and horizontal intensity differences of the neighborhood pixels. The last component, the gradient directional information, is defined by the gray intensity differences of the neighborhood pixels in four directions of the image patch: vertical, horizontal, and two slant directions. The histogram patterns of the first two components are generated using the Weber Local Descriptor (WLD) [12] approach, while the histogram of the last component is generated using the Gradient Direction Pattern (GDP2) [16, 18] approach. We conjecture that these histogram patterns can represent the micro-patterns of lip movements. Hence, the histogram patterns are derived from three planes of the mouth region images and concatenated to obtain the proposed visual feature, the Spatio-temporal Weber Gradient Directional (SWGD) feature.

The LBP-TOP local descriptor has already been explored for visual speech recognition tasks [11, 14]. However, there are many other local descriptors that are commonly used for face recognition tasks [15]. Twelve commonly used local descriptors are the Binary Pattern of Phase Congruency (BPPC) [17], GDP2 [16, 18], Gradient Local Ternary Pattern (GLTP) [19], Local Directional Pattern (LDP) [20], Local Gradient Increasing Pattern (LGIP) [21], Local Gradient Pattern (LGP) [22], Local Monotonic Pattern (LMP) [23], Local Phase Quantization (LPQ) [24, 25], Local Transitional Pattern (LTP) [26], Median Ternary Pattern (MTP) [27], Pyramid of Histogram of Oriented Gradients (PHOG) [28] and WLD [12]. In this work, we extract these local descriptors in three planes of the mouth region images to obtain both the spatial and temporal information of lip movement patterns. The potential of the proposed SWGD visual feature is demonstrated by experiments and a comparative study with these twelve local spatio-temporal features. The dimension of the SWGD feature is reduced using the soft locality preserving map (SLPM) [29], which improves performance by increasing the feature's ability to discriminate between different phrases. The dimensionally reduced SWGD feature is denoted by SWGD\(_{\text {SLPM}}\).

The performance of an audio-based speech recognizer degrades in noisy environments because acoustic noise distorts the audio speech signal [30, 31]. However, visual features are not affected by acoustic noise [32, 33]. The main aim of this work is to develop a robust phrase recognition system, so we use the SWGD\(_{\text {SLPM}}\) visual information and the speech related audio information together to develop a robust AVPR system. The speech related audio information is prominently represented by the characteristics of the time-varying vocal-tract system and the time-varying excitation source. In audio-visual speech recognition systems, the Mel-frequency cepstral coefficient (MFCC) [34] feature is most commonly used as the audio feature representing the vocal-tract characteristics, and features derived from the glottal flow derivative (GFD) [35] wave are used to represent the excitation source characteristics. In our previous work [34], we used MFCC and glottal MFCC (GMFCC) together to improve the performance of an audio-visual speech recognition system. This motivates us to explore MFCC and GMFCC features together as the audio representation for a robust audio-visual phrase recognition task. For representing the excitation source information, the method used to estimate the GFD signal is very important when the excitation source is used as supplementary evidence in audio-visual speech recognition. Many methods are available in the literature for GFD estimation. The most efficient methods include iterative adaptive inverse filtering (IAIF) [35, 36], the Dynamic Programming Phase Slope Algorithm (DYPSA) [37], the zero-frequency resonator (ZFR) [38], speech event detection using the residual excitation and a mean-based signal (SEDREAMS) [39], yet another glottal closure instants (GCI) algorithm (YAGA) [40] and the dynamic plosion index (DPI) algorithm [41]. We carry out a comparative study of these approaches and select the best GFD estimation method for extracting the GMFCC excitation source feature for the phrase recognition task.

The main contributions of the work presented in this paper are: (1) proposing the spatio-temporal visual features SWGD and SWGD\(_{\text {SLPM}}\) for visual and audio-visual phrase recognition systems, (2) exploring twelve local descriptors commonly used in face recognition and applying them to visual phrase recognition for a comparative study, (3) finding a suitable GFD estimation method for extracting the GMFCC feature for the AVPR system, and (4) analyzing the advantages of using the SWGD\(_{\text {SLPM}}\) visual feature together with the MFCC and GMFCC audio features for audio-visual phrase recognition in different noisy conditions.

The rest of the paper is organized as follows: The literature survey of visual features used in lip reading is given in Sect. 2. The proposed visual feature is discussed in Sect. 3. The description of the database used for experimental analyses is provided in Sect. 4. Experimental results are discussed in Sect. 5. The summary and future scope are reported in Sect. 6.

2 Literature survey

In this section, we report the relevant works and compare the visual features that were employed in visual speech recognition tasks. Additionally, we carefully examine each feature to ascertain its benefits and drawbacks.

In [13], the authors used three different approaches, the Active Shape Model (ASM), the Active Appearance Model (AAM) and Multiscale Spatial Analysis (MSA), to represent lip movement patterns. ASM utilizes statistical models constructed from annotated training images to represent the shape variability of lips. Unlike the ASM, which focuses primarily on shape, AAM considers both shape and appearance simultaneously. The third approach employs a nonlinear scale-space decomposition (sieve) algorithm to transform the images into a scale-space domain. The temporal information of the lip movements was not taken into account by any of these methods. The experimental studies were carried out with the AVletters database [13].

In another work [11], the authors proposed a visual feature that includes both spatial and temporal information of the lip movements. The binary codes or vectors of LBP obtained from the XY planes provide the spatial information, whereas the temporal information, such as the horizontal and vertical motion of the lip movements, is described by feature vectors extracted from the XT and YT planes of the images. The distributions or histograms of these feature vectors in the three planes were concatenated to obtain LBP-TOP features. They compared the performance of the proposed feature with shape, motion, and global appearance visual features.

The authors of [42] proposed a visual feature by combining the planar and stereo information of a global appearance visual feature (DCT) and a local appearance visual feature (LBP-TOP). Directly concatenating these features would produce a very high dimensional feature. Hence, they reduced the dimensions of the DCT and LBP-TOP features using Linear Discriminant Analysis (LDA) and minimal-Redundancy-Maximal-Relevance (mRMR), respectively, and then concatenated them. The dimension of this concatenated feature was further reduced using LDA. They termed the final feature vectors the Cascade Hybrid Appearance Visual Feature (CHAVF). This visual feature was employed for connected digit and isolated phrase recognition.

In [14], the authors employed the phase-based Eulerian video magnification (EVM) method to acquire the subtle patterns of lip movements by magnifying the input video. First, the desired frequencies of the pyramid levels were amplified and passed through a temporal filter. A magnification factor was applied to the temporal filter's output to produce the magnified video. Then, a compact representation of the lip movement patterns was obtained from the magnified video using the LBP-TOP feature extraction process. We denote this feature as EVM + LBP-TOP. A support vector machine (SVM) classifier was employed to recognize the phrases.

Table 1 Summary of the literature survey on different types of visual features

The literature review is summarized in Table 1. The ASM feature represents the geometric shape of the lip contours. This approach requires manual annotation of the lip contours, which is very time consuming. We observe that the local spatio-temporal visual feature LBP-TOP outperforms the AAM, ASM, MSA, optical flow and DCT based features for visual speech recognition tasks. This is because the local spatio-temporal visual feature acquires both local spatial and temporal information that effectively represents the lip movement patterns. The performance of the LBP-TOP feature was improved by magnifying the video using the EVM algorithm. However, the video magnification algorithm is a time-consuming process, so it is difficult to employ in real-time applications. For a small database such as the AVletters database, the SVM classifier outperforms the HMM modeling technique for the LBP-TOP feature. This means the choice of classifier has an impact on the performance of the visual speech recognition system. Since the OuluVS database [11] is also a small database, we have chosen the SVM classifier for developing the phrase recognition system.

The gradient orientation, differential excitation, and gradient directional information, which provide important micro-texture information about the mouth portion of the images, were not included in the visual feature extraction methods discussed in the literature review. Hence, we propose the SWGD visual feature to represent the lip movement patterns efficiently. Because the dimension of the proposed visual feature is high, we reduce it using the SLPM algorithm.

3 Proposed methodology

In this section, we discuss the processing steps of the proposed visual feature extraction. First the facial portion is detected, and the mouth portion is then cropped automatically. The differential excitation, gradient orientation, and gradient directional information are estimated from the mouth region images. The WLD histograms are generated from the differential excitation and gradient orientation information, whereas the GDP2 histogram is generated from the gradient directional information. These histograms are obtained in the XY, XT and YT planes to acquire both spatial and temporal information, and are then concatenated to obtain the SWGD visual feature.

The facial portion detected by the Viola-Jones face detection algorithm is further processed to crop the mouth region. In order to extract the spatio-temporal features of lip movements, it is very important to localize the mouth region accurately. The detected facial images are divided into blocks to find the “Region of Interest”, i.e. the mouth region. Based on empirical analysis, we divide the detected facial image into 10 horizontal blocks and 11 vertical blocks, obtained by dividing the number of rows of the image by 10 and the number of columns by 11. The common portion between the last 3 horizontal blocks and the \(3^{rd}\) to \(8^{th}\) vertical blocks is taken as the region of interest, where the mouth portion is well contained. The cropped mouth region frames extracted from the input video are not equal in size, i.e. the numbers of rows and columns of the mouth region images differ. However, to extract the spatio-temporal visual feature, the frames of each video should have an equal number of rows and columns. Therefore, we resize the mouth region image frames to maintain a uniform frame size for each video.
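
For concreteness, the sketch below illustrates this face detection and block-based mouth cropping step using OpenCV's Viola-Jones (Haar cascade) detector; the cascade file and the target frame size are illustrative assumptions, not values taken from this work.

```python
import cv2

# Viola-Jones face detector shipped with OpenCV (assumed available).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_mouth_region(frame, target_size=(70, 40)):
    """Detect the face and crop the mouth ROI as described above:
    the bottom 3 of 10 horizontal (row) blocks intersected with the 3rd-8th
    of 11 vertical (column) blocks, then resized to a fixed frame size."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                       # take the first detected face
    face = gray[y:y + h, x:x + w]
    row_block, col_block = h // 10, w // 11     # 10 row bands, 11 column bands
    mouth = face[7 * row_block:,                # last 3 horizontal blocks
                 2 * col_block:8 * col_block]   # 3rd to 8th vertical blocks
    return cv2.resize(mouth, target_size)       # uniform size across frames
```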

The differential excitation, gradient orientation, and gradient directional information are estimated from the mouth region images or frames to represent the distinctive patterns of lip movements. While uttering speech, different patterns of lip movements are generated. These patterns can be represented by histograms of feature vectors generated from the mouth region images. The differential excitation and gradient orientation feature vectors are generated using WLD, whereas the gradient directional feature vector is obtained using GDP2. The histogram patterns of these feature vectors are concatenated to obtain the proposed visual feature, SWGD.

The differential excitation (\(\Psi\)) is calculated by using the following equation.

$$\begin{aligned} \Psi = \arctan \left( \sum _{x=0}^{S-1} \frac{I_x - I_c}{I_c} \right) \end{aligned}$$
(1)

where S is the total number of neighboring pixels. The intensity values of the neighborhood pixels and the central pixel are denoted by \(I_{x}\) and \(I_{c}\), respectively. Every central pixel has eight neighborhood pixels; therefore, the value of S is 8.

In Eq. 1, the numerator of each term is the intensity difference between a neighboring pixel and the central pixel, whereas the denominator is the intensity value of the central pixel. For simplicity, the values of \(\Psi\) are quantized into N dominant differential excitation levels using Eq. 2. Based on empirical analysis, the value of N is set to 8 for our proposed visual feature.

$$\begin{aligned} \Psi _{n} = floor\left( \frac{{ \Psi + {\pi }/{2}}}{{\pi }/{N}}\right) , ~~~ n = 0, 1, 2,....., N-1 \end{aligned}$$
(2)

The gradient orientation (\(\Theta\)) is defined by the equation below.

$$\begin{aligned} \Theta = \arctan \left( \frac{I_{5}-I_{1}}{I_{7}-I_{3}}\right) = \arctan \left( \frac{I_{V}}{I_{H}}\right) \end{aligned}$$
(3)

where \(I_{5}\) and \(I_{1}\) are the intensity values of the lower and upper neighboring pixels of the central pixel \(I_{c}\), whereas \(I_{7}\) and \(I_{3}\) are the intensity values of the left and right neighboring pixels of \(I_{c}\).

The range of \(\Theta\) is [\(-\pi /2\), \(\pi /2\)]. To obtain more information about the gradient direction, the range of \(\Theta\) is extended by mapping it to \(\Theta '\) \(\in\) [0, 2\(\pi\)]. This mapping is done according to the signs of \(I_{V}\) and \(I_{H}\).

$$\begin{aligned} \Theta '(x,y) = \left\{ \begin{matrix} \Theta (x,y),~~~~~~~~~I_{V}(x,y)> 0, I_{H}(x,y)> 0 \\ \Theta (x,y) + \pi ,~~~~I_{V}(x,y)< 0, I_{H}(x,y)> 0 \\ \Theta (x,y) + \pi ,~~~~I_{V}(x,y)< 0, I_{H}(x,y)< 0 \\ \Theta (x,y) + 2\pi ,~~~I_{V}(x,y) > 0, I_{H}(x,y) < 0 \end{matrix}\right. \end{aligned}$$
(4)

The values of \(\Theta '\) are quantized to T dominant gradient orientations using Eq. 5. In this work, the value of T is set to 4 because the proposed feature performs best at this value.

$$\begin{aligned} \phi _{t} = floor(\frac{{\Theta '}}{{2\pi }/{T}}), ~~~ t = 0, 1, 2,....., T-1 \end{aligned}$$
(5)

Using the quantized differential excitation (\(\Psi _{n}\)) and quantized gradient orientation (\(\phi _{t}\)), the 2D histogram {WLD(\(\Psi _{n},\phi _{t}\))} is generated. The two-dimensional WLD histogram is then flattened into a row feature vector of length \(N\times T\).
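
A minimal sketch of the WLD histogram computation for a single patch, following Eqs. (1)-(5), is given below; the neighbour labelling (\(I_1\) upper, \(I_3\) right, \(I_5\) lower, \(I_7\) left) and the handling of boundary and zero-denominator cases are assumptions made for illustration.

```python
import numpy as np

def wld_histogram(patch, N=8, T=4, eps=1e-6):
    """2D WLD histogram (N x T bins) of a grayscale patch, following Eqs. (1)-(5)."""
    p = patch.astype(np.float64)
    hist = np.zeros((N, T))
    rows, cols = p.shape
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            Ic = p[r, c]
            nbrs = np.array([p[r-1, c-1], p[r-1, c], p[r-1, c+1], p[r, c+1],
                             p[r+1, c+1], p[r+1, c], p[r+1, c-1], p[r, c-1]])
            # Eq. (1): differential excitation.
            psi = np.arctan(np.sum((nbrs - Ic) / (Ic + eps)))
            # Eq. (2): quantise into N dominant excitation levels.
            n = min(int((psi + np.pi / 2) // (np.pi / N)), N - 1)
            # Eq. (3): gradient orientation from I5 - I1 and I7 - I3.
            Iv, Ih = nbrs[5] - nbrs[1], nbrs[7] - nbrs[3]
            theta = np.arctan(Iv / (Ih if Ih != 0 else eps))
            # Eq. (4): map theta to [0, 2*pi) according to the signs of Iv, Ih.
            if Iv >= 0 and Ih >= 0:
                theta_p = theta
            elif Ih >= 0:                 # Iv < 0, Ih >= 0
                theta_p = theta + np.pi
            elif Iv < 0:                  # Iv < 0, Ih < 0
                theta_p = theta + np.pi
            else:                         # Iv >= 0, Ih < 0
                theta_p = theta + 2 * np.pi
            # Eq. (5): quantise into T dominant orientations.
            t = min(int(theta_p // (2 * np.pi / T)), T - 1)
            hist[n, t] += 1
    return hist.flatten() / max(hist.sum(), 1)   # row vector of length N * T
```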

We conjecture that the WLD feature may be suitable for visual and audio-visual phrase recognition applications; therefore, we explored this local descriptor for our task. However, the gradient orientation component of WLD considers only the vertical and horizontal gradient directions, estimated from the pixel pairs {\(I_{1}\), \(I_{5}\)} and {\(I_{3}\), \(I_{7}\)}. The slanting gradient directions between the pixel pairs {\(I_{2}\), \(I_{6}\)} and {\(I_{4}\), \(I_{8}\)} are not considered, as can be seen from the gradient orientation calculation of the WLD feature in Eq. 3.

In order to include the slanting gradient directions, we employ another local descriptor, GDP2, which describes the gradient direction information in four directions: vertical, horizontal and two slant directions. In GDP2 feature extraction, the gray intensity values of the pixels in eight different directions (East, North East, North, North West, West, South West, South and South East) are considered. Then the differences between the intensity values of the pixels in the vertical, horizontal, and two slant directions are calculated. The Gradient Direction Pattern feature is generated from the sums of the gray intensity differences in these four directions [16].
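
As the exact GDP2 formulation is defined in [16], the sketch below is only one plausible reading that is consistent with the 16-bin histogram size used in Sect. 5: each of the four directions contributes one bit, obtained here by thresholding the corresponding opposite-neighbour intensity difference.

```python
import numpy as np

def gdp2_histogram(patch, threshold=0.0):
    """16-bin GDP2-style histogram of a grayscale patch (illustrative sketch;
    the exact coding of the four directional differences follows [16])."""
    p = patch.astype(np.float64)
    rows, cols = p.shape
    codes = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Opposite-neighbour differences in the four directions.
            diffs = [p[r+1, c] - p[r-1, c],        # vertical   (S - N)
                     p[r, c-1] - p[r, c+1],        # horizontal (W - E)
                     p[r+1, c+1] - p[r-1, c-1],    # slant      (SE - NW)
                     p[r+1, c-1] - p[r-1, c+1]]    # slant      (SW - NE)
            code = 0
            for bit, d in enumerate(diffs):
                if d >= threshold:                 # one bit per direction
                    code |= 1 << bit
            codes.append(code)                     # value in [0, 15]
    hist, _ = np.histogram(codes, bins=16, range=(0, 16), density=True)
    return hist
```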

Fig. 1 Methodology of the proposed visual feature extraction algorithm

Instead of calculating the WLD and GDP2 local descriptors only in the spatial domain (XY plane), we also extract them from the XT and YT planes to include temporal information. The WLD and GDP2 features generated from the XY, XT and YT planes are concatenated to produce the proposed visual feature, SWGD, as shown in Fig. 1.
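
A minimal sketch of this three-plane concatenation is given below; it reuses the wld_histogram and gdp2_histogram sketches above and, for brevity, omits the block division (e.g. \(2\times 5\times 3\)) used in the full feature.

```python
import numpy as np

def swgd_feature(volume):
    """Concatenate WLD and GDP2 histograms computed in the XY, XT and YT planes
    of a mouth-region video volume (frames x rows x cols); block division omitted."""
    T, H, W = volume.shape
    planes = {
        "XY": [volume[t, :, :] for t in range(T)],   # spatial appearance
        "XT": [volume[:, y, :] for y in range(H)],   # horizontal motion
        "YT": [volume[:, :, x] for x in range(W)],   # vertical motion
    }
    feature = []
    for name, slices in planes.items():
        # Average the per-slice histograms so each plane yields one WLD (32-D)
        # and one GDP2 (16-D) histogram.
        wld = np.mean([wld_histogram(s) for s in slices], axis=0)
        gdp2 = np.mean([gdp2_histogram(s) for s in slices], axis=0)
        feature.extend([wld, gdp2])
    return np.concatenate(feature)                   # 3 x (32 + 16) = 144-D here
```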

4 Database description

The experimental studies are carried out with the OuluVS audio-visual database [11]. This database was recorded from 20 speakers (17 males and 3 females), 9 of whom wear glasses. Each speaker read 10 phrases, each repeated up to 5 times, making the database suitable for visual and audio-visual phrase recognition experiments. The details of the phrases with their assigned labels are given in Table 2. The speakers come from four different countries and have different speaking rates and accents. The database was recorded in a controlled environment, with the distance between the speaker and the camera maintained at 160 cm. The frame rate of the video data is 25 frames per second (fps) and the image resolution is 720 \(\times\) 576 pixels.

Table 2 The labels P1 to P10 and the ten phrases they represent

5 Experimental result and discussion

First, we conducted experiments with the twelve different local descriptor based visual features and with SWGD for recognizing the ten phrases. The performance of SWGD is also analyzed using three different block sizes (\(2\times 5\times 3\), \(2\times 3\times 2\) and \(2\times 2\times 2\)) to select the best video block size. Because the dimension of the SWGD feature is high, we reduced it using the SLPM method and analyzed the feature vector distributions using t-SNE (t-distributed stochastic neighbor embedding) plots.

As mentioned in the earlier section, there are various methods for GFD estimation. Therefore, we conducted a comparative analysis to select the best GFD estimation method for extracting GMFCC. This glottal excitation source feature (GMFCC) is then concatenated with the vocal-tract feature (MFCC) and the visual feature (SWGD\(_{\text {SLPM}}\)) to develop the audio-visual phrase recognizer.

5.1 Visual phrase recognition experiments

We used SVM classifiers for training and testing the visual and audio-visual phrase recognition systems. To obtain the optimum accuracy of the SVM classifier, multiple training and test data sets are created using the “leave-one-out” cross-validation approach. Each test data set contains only one utterance of each phrase of each speaker, and the remaining utterances of each phrase of each speaker form the training data set. This step is repeated for all the utterances, and the final accuracy of the system is calculated by averaging the individual scores. This cross-validation method is time consuming; however, it is suitable for the OuluVS database because the number of utterances of each phrase per speaker is small.
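
A minimal sketch of this protocol is shown below, assuming the feature matrix, phrase labels and per-sample repetition indices are already available; the variable names and SVM hyperparameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def leave_one_utterance_out(X, y, repetition):
    """Leave-one-out protocol described above: in each round the k-th repetition
    of every phrase of every speaker forms the test set, and the remaining
    repetitions form the training set. `X` holds the feature vectors, `y` the
    phrase labels and `repetition` the repetition index (e.g. 0..4) per sample."""
    accuracies = []
    for k in np.unique(repetition):
        train, test = repetition != k, repetition == k
        clf = SVC(kernel="poly", degree=3, C=1.0)   # polynomial kernel SVM
        clf.fit(X[train], y[train])
        accuracies.append(clf.score(X[test], y[test]))
    return float(np.mean(accuracies))               # averaged accuracy
```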

The local descriptors mentioned in Sect. 1 have been successfully employed in image pattern recognition applications such as face recognition. Each of these local descriptors has its own merits and demerits. For example, a local descriptor like LPQ is computationally simple, but it is sensitive to noise and illumination variation. Similarly, a Gabor-based local descriptor like BPPC is insensitive to illumination variation, but it suffers from high computational requirements and high feature dimensionality. Therefore, finding a robust and discriminative local descriptor is still an interesting research area for image pattern representation and classification.

In this work, we extracted the spatio-temporal information of twelve different local descriptors and evaluated their performance in visual phrase recognition experiments. These local descriptors are extensively used for face recognition, but only a few local spatio-temporal descriptors, such as the LBP-TOP feature, have been used in visual speech recognition systems. Therefore, we explore other types of local spatio-temporal descriptors, particularly for visual phrase recognition applications: GDP2-TOP, GLTP-TOP, BPPC-TOP, LDP-TOP, LGIP-TOP, LGP-TOP, LMP-TOP, LPQ-TOP, LTP-TOP, MTP-TOP, PHOG-TOP and WLD-TOP. We compared the performance of these features with the proposed SWGD visual feature using the SVM classifier.

Table 3 Accuracies (in %) of visual phrase recognition system with different spatio-temporal features

The performance of the SVM classifier depends on the type of kernel function, so we compare the visual phrase recognition performance using three different kernel functions: polynomial, linear and radial basis function (RBF). The experimental results are given in Table 3. From the results, it is evident that for all local spatio-temporal features, the SVM with the polynomial kernel function performs better than with the linear and RBF kernel functions. Further, our proposed SWGD visual feature provides higher accuracy than all twelve local spatio-temporal descriptors. This is because the proposed visual feature considers important micro-texture information, such as differential excitation, gradient orientation and gradient directional information, together for representing the patterns of lip movements.

Table 4 Performance of VPR system with different block sizes

From Table 4, we can observe that the proposed visual feature extracted with the block size of \(2\times 5\times 3\) in the XY, XT and YT planes produces the best result. Therefore, this video block size is used for all the experimental analyses. Since the proposed SWGD visual feature is obtained by combining the WLD-TOP and GDP2-TOP features, the dimension of the SWGD feature is the sum of the dimensions of these two features. The feature dimension of WLD-TOP is determined by multiplying the block size, the number of planes, and the dimension of the WLD histogram: the WLD-TOP feature has a dimension of 2880 {block size (2 \(\times\) 5 \(\times\) 3) \(\times\) three orthogonal planes (3) \(\times\) size of WLD histogram (8 \(\times\) 4)}, whereas the GDP2-TOP feature has a dimension of 1440 {block size (2 \(\times\) 5 \(\times\) 3) \(\times\) three orthogonal planes (3) \(\times\) size of GDP2 histogram (16)}. Therefore, the proposed SWGD visual feature has a dimension of 4320 (2880 + 1440). We employed the SLPM approach to reduce the dimension of the SWGD visual feature to 100; the transformed visual feature is denoted by SWGD\(_{\text {SLPM}}\). The SLPM method not only reduces the dimension of the feature but also increases its discriminative ability for classifying the phrases.

Fig. 2 The t-SNE plots obtained using (a) SWGD and (b) SWGD\(_{\text {SLPM}}\)

We used t-SNE plots to visualize the distributions of the visual features of the 10 phrases in two-dimensional space. The distributions obtained with the SWGD and SWGD\(_{\text {SLPM}}\) visual features are shown in Fig. 2(a) and (b), respectively, with the 10 phrases represented by ten different labels. With the SWGD feature in Fig. 2(a), the distributions of the phrases lie very close to each other and are difficult to classify. On the other hand, with the SWGD\(_{\text {SLPM}}\) feature in Fig. 2(b), the phrases form clearly separable clusters. This shows that the SLPM feature dimensionality reduction approach increases the discriminative ability, or separability, among the classes. The SLPM approach improved the performance of the proposed SWGD visual feature from 73.9 to 81.30%.
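
A minimal sketch of how such t-SNE plots can be produced with scikit-learn is given below; the perplexity value and plotting details are illustrative choices, and the feature matrices are assumed to be available.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project feature vectors to 2-D with t-SNE and colour the points
    by phrase label (P1-P10), as in Fig. 2."""
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for phrase in np.unique(labels):
        pts = emb[labels == phrase]
        plt.scatter(pts[:, 0], pts[:, 1], s=10, label=str(phrase))
    plt.legend(fontsize=7)
    plt.title(title)
    plt.show()

# Example usage (hypothetical variable names):
# plot_tsne(swgd_features, phrase_labels, "SWGD")
# plot_tsne(swgd_slpm_features, phrase_labels, "SWGD_SLPM")
```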

Table 5 Performance comparison of the proposed visual features with state-of-the-art visual features on the OuluVS database for the VPR task

We also compared the performance of our proposed visual feature with six state-of-the-art appearance based visual features used in visual phrase recognition: LBP-TOP [11], DCT+LDA [42], Sequential Pattern Boosting (SP-Boosting) [43], CHAVF [42], EVM+LBP-TOP [14] and Transported Square-Root Vector Fields (TSRVFs) [44]. The results for these state-of-the-art features and our proposed feature are given in Table 5. All of the reported visual phrase recognition experiments were conducted on the OuluVS database. LBP-TOP is also a spatio-temporal feature, but it is obtained by thresholding the gray intensity differences between the center pixel and the neighborhood pixels. DCT is a global visual feature extracted from the entire mouth region, with LDA used to reduce the feature dimension. SP-Boosting is a machine learning approach proposed for visual speech recognition in [43], where the authors used a visual feature based on the intensity differences of the mouth region images. The CHAVF feature proposed in [42] is the combination of the local feature LBP-TOP and the global feature DCT. In [14], the authors applied the EVM technique to the input videos in order to amplify the subtle information of lip movements; the LBP-TOP feature extracted from the magnified video is denoted by EVM+LBP-TOP. The LBP-TOP feature with the EVM approach can effectively represent the patterns of lip movements; nevertheless, this slow video magnification method is not appropriate for real-time lip reading applications. In [44], the authors computed the covariance matrices of the pixel locations, intensities and their derivatives for the mouth region images, obtained the corresponding correlation matrices, and constructed trajectories of the correlation matrices using TSRVF for phrase recognition. None of these reported appearance based visual features consider micro-texture information such as differential excitation, gradient orientation, and gradient directional information extracted from the mouth region images. Hence, the proposed visual feature outperformed all of these state-of-the-art appearance based visual features and was found suitable for representing the patterns of lip movements for different phrases.

From the experimental results and analysis, we can conclude that the proposed SWGD\(_{\text {SLPM}}\) is the best representation of lip movement patterns for visual phrase recognition among those considered. In the following section, we include audio information to develop a more robust audio-visual phrase recognition system.

5.2 Audio-visual phrase recognition experiments

Fig. 3 The t-SNE plots of GMFCC features extracted from the GFD signal estimated using (a) IAIF, (b) DYPSA, (c) ZFR, (d) SEDREAMS, (e) YAGA and (f) DPI. The labels (P1 to P10) represent the ten phrases

The GFD signal can be estimated using the IAIF, DYPSA, ZFR, SEDREAMS, YAGA, and DPI approaches. The GFD signals estimated by these methods can be compared by conducting a discriminative analysis of the phrases using GMFCC features. The GFD signal is used as the input signal, and the MFCC feature extraction procedure is then applied to it to extract the GMFCC feature.

The distribution of the GMFCC feature vectors is plotted in two dimensions using t-SNE. The GMFCC features are extracted from GFD signals estimated using the IAIF, DYPSA, ZFR, SEDREAMS, YAGA, and DPI methods. The t-SNE plots are shown in Fig. 3, with the 10 phrases labeled P1 to P10; the labels and their assigned phrases in the OuluVS database are given in Table 2. The GMFCC features of the 10 phrases form clearly separable clusters in Fig. 3(a), whereas the distributions shown in Fig. 3(b-f) lie very close to each other, making the phrases difficult to classify. This demonstrates that the IAIF GFD estimation method is more suitable than the DYPSA, ZFR, SEDREAMS, YAGA, and DPI methods for extracting the GMFCC feature for phrase recognition. This is because the IAIF approach does not require the locations of glottal closure instants (GCIs), whereas the other approaches do, and it is challenging to accurately estimate GCI locations from noisy speech. Therefore, we employed the IAIF method for GFD estimation when extracting the GMFCC feature for the AVPR system.
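
A minimal sketch of the GMFCC extraction pipeline is given below; the GFD signal is assumed to have been estimated beforehand by an IAIF-based glottal inverse filtering step (not shown), and librosa is used for the MFCC step purely for illustration.

```python
import librosa

def extract_mfcc(signal, sr, n_mfcc=13):
    """Standard MFCC features of an audio signal (vocal-tract evidence)."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

def extract_gmfcc(gfd, sr, n_mfcc=13):
    """GMFCC: the same MFCC pipeline applied to the glottal flow derivative (GFD)
    signal, which here is assumed to come from an IAIF-based estimator."""
    return librosa.feature.mfcc(y=gfd, sr=sr, n_mfcc=n_mfcc).T   # one vector per frame
```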

The main objective of the proposed AVPR system is to recognize spoken phrases in noisy conditions. In this experiment, we removed the silence portions at the start and end of the audio speech files, because they do not carry any information relevant to the phrases. We add white Gaussian noise at four SNR levels (-6 dB, -3 dB, +3 dB and +6 dB) to create noisy speech signals. The experimental results are presented in Table 6. Individually, MFCC provides the better performance, but MFCC and GMFCC together improve it further, reflecting the usefulness of the excitation source based GMFCC feature as supplementary evidence for phrase recognition in noisy conditions.
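
The noisy test signals can be generated as sketched below, where the noise variance is set from the target SNR; the function and variable names are illustrative.

```python
import numpy as np

def add_awgn(signal, snr_db):
    """Add white Gaussian noise to a speech signal at a target SNR (in dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: create the four noisy test conditions used in Table 6.
# noisy = {snr: add_awgn(clean_speech, snr) for snr in (-6, -3, 3, 6)}
```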

Table 6 Performance (in %) of phrase recognition at different SNR levels

As mentioned earlier, the visual information is not affected by acoustic noise, so the performance of the proposed SWGD\(_{\text {SLPM}}\) visual feature remains unchanged at all noise levels. Including the SWGD\(_{\text {SLPM}}\) visual feature also helps improve the audio-visual performance: averaged over the SNR range from the lowest (-6 dB) to the highest (+6 dB), the performance improves from 90.10 to 91.88%, a relative improvement of about 2%. It is also observed that at the lowest SNR (-6 dB), i.e. the noisiest condition, the inclusion of the proposed SWGD\(_{\text {SLPM}}\) visual feature significantly improves the performance of audio-visual phrase recognition.

6 Conclusion and future scope

The proposed SWGD and its dimensionally reduced SWGD\(_{\text {SLPM}}\) visual features give better results than all twelve local spatio-temporal features in the context of phrase recognition tasks. The experiments were conducted with the internationally standard OuluVS database and a polynomial kernel based SVM classifier. The experimental results show that, among the twelve local spatio-temporal features, the GDP2-TOP and WLD-TOP features provide the best performances of 69.9% and 72.6%, respectively. These are lower than those of our proposed SWGD (73.9%) and low-dimensional SWGD\(_{\text {SLPM}}\) (81.30%) visual features. A comparative study with other state-of-the-art features shows that the TSRVFs feature provides a good performance of 70.6%, which is still lower than that of our proposed SWGD\(_{\text {SLPM}}\) visual feature. The better performance may be due to the effective representation of micro-texture lip movement patterns using both spatial and temporal information; the use of the reduced-dimension SWGD\(_{\text {SLPM}}\) visual feature is another reason for the further improvement in performance.

In our previous work, the MFCC, GMFCC, and combined (MFCC+GMFCC) acoustic representations were found to be effective in recognizing confusable phonemes such as 'p' and 'b' and the English letters 'P' and 'B'. Motivated by this, we use both acoustic features, the vocal-tract related feature (MFCC) and the glottal excitation source related feature (GMFCC), for phrase recognition. We first identified the most suitable GFD estimation method for extracting the GMFCC feature for the phrase recognition task. We observed that the GMFCC feature extracted via IAIF-based GFD estimation provides better classification of the phrases than the DYPSA, ZFR, SEDREAMS, YAGA, and DPI methods. This is because the IAIF approach does not require the locations of GCIs, whereas the other approaches do, and accurately estimating GCI locations from noisy speech is difficult. We therefore conclude that the IAIF method is suitable for glottal excitation source feature estimation in phrase recognition applications. The experimental results show that the acoustic phrase recognition system performs better than the proposed VPR system; however, when acoustic additive noise is included, the VPR performance remains unchanged while the audio-based performance degrades as the SNR decreases. By including the GMFCC excitation source feature and the proposed SWGD\(_{\text {SLPM}}\) visual feature, the best performance of the audio-based system is relatively increased by 3.6%. This shows the robustness of the proposed AVPR system against noise.

The proposed visual features could be used for a continuous audio-visual speech recognition system. The performance of our proposed visual feature could be improved with a larger audio-visual database and deep neural network modeling. This proposed visual feature may be suitable for other speech processing applications, such as audio-visual speaker recognition and language identification.