1 Introduction

Human speech perception is bimodal in nature: humans combine audio and visual information to decide what others are saying. The first AVSR system was reported in 1984 by Petajan [18], and during the last decade more than a hundred articles have appeared on AVSR [5, 6, 8, 9, 13, 17, 23, 25]. AVSR systems can enhance the performance of conventional ASR not only under noisy conditions but also in clean conditions when the talking face is visible [20, 26]. The major advantage of utilizing the acoustic and visual modalities for speech understanding comes from the “complementarity” [21] of the two modalities and from their “synergy”: audio-visual speech perception can outperform both acoustic-only and visual-only perception under diverse noise conditions [22]. Generally, in AVSR systems, integration can take place either before the two information sources are processed by a recognizer (early integration/feature fusion) or after they are classified independently (late integration/decision fusion). Some studies favor early integration [1, 6, 7, 13], while others prefer late integration [19, 24, 25]. Despite all these studies, which underline the fact that speech reading is part of speech recognition in humans, it is still not well understood when and how the acoustic and visual information are integrated. This paper exploits the practical implementation advantages of late integration to construct a robust AVSR system.

Commonly, the integration weight, which determines the contribution of each modality in a decision-fusion-based AVSR system, is calculated from the relative reliability of the two modalities [31]. The reliability measures proposed in [3, 32] use all classes of recognition hypotheses, whereas the method proposed in [5, 31] uses only the N (i.e., N = 4) best recognition hypotheses. In practice, however, neither of these methods yields performance improvements at very low SNR conditions. To solve this issue, this work proposes a genetic algorithm (GA) based reliability measure that uses an optimum number of best recognition hypotheses, rather than a fixed N best, to determine an appropriate integration weight. A further improvement in recognition accuracy is achieved by optimizing this integration weight with a genetic algorithm. The performance of the proposed integration weight estimation scheme using the GA-based reliability measure is demonstrated for isolated word recognition (covering commonly used mobile phone functions) via a multi-speaker database experiment. After the recognition tasks were carried out on the common audio-visual side-face speech database, the performance of the proposed system is compared with audio-only and visual-only unimodal systems and with some existing bimodal AVSR systems, namely the baseline reliability ratio-based system and the N best recognition hypotheses reliability ratio-based system, under various SNR conditions. An outline of the remainder of the paper is as follows. The following section reviews some existing methods for finding the integration weight based on the reliability measure of the modalities. Section 3 explains how a genetic algorithm can be used to measure the correct reliability of each modality and to optimize the integration weight. Section 4 discusses the database and the audio and visual features. Section 5 discusses the Hidden Markov Model (HMM) training and recognition results. The discussion, conclusion, and future directions of this work are outlined in the last section.

2 Review of Existing Integration Weight Estimation Schemes

The main focus of this work is the estimation of an appropriate integration weight based on the correct reliability measure of the audio and visual modalities. After the acoustic and visual subsystems perform recognition separately, their outputs are combined by a weighted sum rule to produce the final decision. For a given audio-visual speech test datum \((O_A, O_V)\), the recognized utterance \(C^{*}\) is given by [5],

$$ C^{*}=\arg \max _ i\left\{\gamma \log P\big(O_{\!A}/ \lambda_{A}^{i}\big)\!+\!(1\!-\!\gamma) \log P\big(O_{V}/\lambda_{V}^{i}\big)\right\} $$
(1)

where \(\lambda_{A}^{i}\) and \(\lambda_{V}^{i}\) are the acoustic and visual HMMs for the ith (1 ≤ i ≤ N) utterance class, respectively, N is the number of utterance classes used in the recognition experiment, and \(\log P(O_{\!A}/\lambda_{A}^{i})\) and \(\log P(O_{V}/\lambda_{V}^{i})\) are the corresponding log likelihoods for the ith class. The weighting factor γ (0 ≤ γ ≤ 1) determines the contribution of each modality to the final decision. If it is not estimated appropriately, we cannot expect complementarity [21] and synergy [22] of the two information sources; moreover, the combined recognition performance may even be inferior to that of either unimodal system, a phenomenon called “attenuating fusion” [25]. One simple solution to this problem is to assign a constant weight over various SNR conditions or to determine the weight manually [29]. In other work, the weight is determined from the SNR under the assumption that the SNR of the acoustic signal is known, which is not always feasible [4]. Some researchers determine the weight using additional adaptation data [30]. Finally, the most popular approach among such schemes is the reliability ratio (RR) based method, in which the integration weight is determined from the relative reliability of the two modalities [31]. Hence, in this section we briefly review this baseline reliability ratio (RR) based integration method and a related method, the N-best recognition hypotheses reliability ratio-based integration method [5, 31].
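For illustration, the weighted-sum rule of Eq. 1 amounts to a few lines of code once the per-class log likelihoods of the two subsystems are available (a minimal sketch in Python; the function name and array layout are our own, not the authors' implementation):

```python
import numpy as np

def fuse_decision(loglik_audio, loglik_visual, gamma):
    """Weighted-sum decision fusion of Eq. 1.

    loglik_audio, loglik_visual : arrays of shape (N,) holding
    log P(O_A | lambda_A^i) and log P(O_V | lambda_V^i) for each class i.
    gamma : integration weight in [0, 1].
    Returns the index of the recognized utterance class C*.
    """
    combined = gamma * np.asarray(loglik_audio) + (1.0 - gamma) * np.asarray(loglik_visual)
    return int(np.argmax(combined))
```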

2.1 Audio-Visual Decision Fusion Based on Baseline Reliability Ratio Method

The reliability of each modality can be measured from the outputs of the corresponding HMMs. When the acoustic speech is not corrupted by noise, the differences between the acoustic HMM outputs are large; otherwise the differences become small. Based on this observation, the reliability of a modality is defined as follows, which was found to be the most appropriate and best-performing measure in [2]:

$$ S_{m}=\frac{1}{N_c-1}\sum\limits_{i=1}^{N_c}{\big(\max _ j \log P\big(O/\lambda_m^{j}\big) - \log P\big(O/\lambda_m^{i}\big)\big)} $$
(2)

which is the average difference between the maximum log likelihood and the other ones, where \(N_c\) is the number of utterance classes considered when measuring the reliability of each modality \(m\in \left\{A,V\right\}\). In this method, all utterance-class recognition hypotheses are used to measure the reliability. The integration weight γ can then be calculated by [31]

$$ \gamma=\frac{S_{A}}{S_{A}+S_{V}} $$
(3)

where \(S_A\) and \(S_V\) are the reliability measures of the outputs of the acoustic and visual HMMs, respectively.
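A minimal sketch of this baseline reliability-ratio computation (Eqs. 2 and 3) is given below, assuming the per-class log-likelihood vectors are available from the HMM decoders:

```python
import numpy as np

def reliability_all_classes(logliks):
    """Eq. 2: average gap between the best log likelihood and the others."""
    logliks = np.asarray(logliks, dtype=float)
    n_c = logliks.size
    return float(np.sum(logliks.max() - logliks) / (n_c - 1))

def integration_weight(audio_logliks, visual_logliks):
    """Eq. 3: reliability-ratio integration weight gamma."""
    s_a = reliability_all_classes(audio_logliks)
    s_v = reliability_all_classes(visual_logliks)
    return s_a / (s_a + s_v)
```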

2.2 Audio-visual Decision Fusion Based on N-best Recognition Hypotheses Reliability Ratio Method

Adjoudani and Benoit [31] measured the reliability of each modality \(m\in \left\{A,V\right\}\) over the N best recognition hypotheses, which allows a satisfactory evaluation of certainty versus uncertainty [5]. Accordingly, the reliability of a modality is defined as

$$ S_{m}=\frac{2}{N(N-1)}\sum\limits_{i=1}^{N-1}\sum\limits_{j=i+1}^{N}{\left| \log P\big(O/\lambda_m^{i}\big) - \log P\big(O/\lambda_m^{j}\big)\right|} $$
(4)

which is the average absolute difference of the log likelihoods. In this method, only the N (= 4) best recognition hypotheses are used to measure the reliability of each modality, and the integration weight γ is then calculated as in Eq. 3.
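For comparison, the N-best variant of Eq. 4 can be sketched as follows (again assuming the full log-likelihood vector is available; N = 4 follows the text, and the function name is our own):

```python
import numpy as np
from itertools import combinations

def reliability_n_best(logliks, n_best=4):
    """Eq. 4: mean absolute pairwise difference over the N best hypotheses."""
    top = np.sort(np.asarray(logliks, dtype=float))[-n_best:]   # N best log likelihoods
    pairs = list(combinations(range(n_best), 2))                # N (N - 1) / 2 pairs
    return sum(abs(top[i] - top[j]) for i, j in pairs) / len(pairs)
```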

3 Audio-Visual Decision Fusion Based on Proposed Integration Weight Estimation Schemes

In this section, we explain our novel integration weight estimation scheme, which uses an optimum number of best recognition hypotheses to measure the correct reliability of each modality and, in turn, an appropriate integration weight. We then present a genetic algorithm based optimization scheme that further optimizes the integration weight obtained from the measured reliabilities.

3.1 Audio-visual Decision Fusion Based on GA Adaptive Reliability Measure Method (Proposal 1)

The reliability measures proposed in [3, 32] use all classes of recognition hypotheses, whereas the method proposed in [31] uses only the N (i.e., N = 4) best recognition hypotheses. The integration weight estimated from these measures shows “attenuating fusion” [25] for noisy speech data under certain SNR conditions. To solve this issue, this work proposes a GA-based scheme to select the optimum number of best recognition hypotheses for measuring the correct reliability of each modality, so as to increase the recognition accuracy at all SNR conditions.

The genetic algorithm is a method for solving both constrained and unconstrained optimization problems. It is built on the principles of evolution via natural selection: an initial population of individuals is created, and by iterative application of the genetic operators (selection, crossover, mutation) an optimal solution is reached according to the defined fitness function. In this work, the GA is used to obtain the correct reliability of each modality and, in turn, to maximize the recognition accuracy according to the defined fitness function. The problem is formulated as follows:

The optimum number of acoustic recognition hypotheses for measuring the correct reliability (\(S_A\)) is obtained by solving

$$ S_{A}=\arg \max\limits_{N_{A}}\left\{\frac{1}{N_A-1}\sum\limits_{i=1}^{N_A}\Big(\max\limits_{j} \log P\big(O_{\!A}/\lambda_A^{j}\big) - \log P\big(O_{\!A}/\lambda_A^{i}\big)\Big)\right\} $$
(5)

Similarly, the correct visual reliability (\(S_V\)) is obtained by solving

$$ S_{V}=\arg \max\limits_{N_{V}}\left\{\frac{1}{N_V-1}\sum\limits_{i=1}^{N_V}\Big(\max\limits_{j} \log P\big(O_{V}/\lambda_V^{j}\big) - \log P\big(O_{V}/\lambda_V^{i}\big)\Big)\right\} $$
(6)

subject to: \(1\:\leq N_A, N_V\:\leq N\).

Then, the integration weight (γ) is calculated as in Eq. 3. Finally, the fitness function to be optimized is given as

$$ \textnormal{Recognition Accuracy}=\frac{\sum diag(R)}{\sum \sum (R)}\times 100 $$
(7)

where R is the confusion matrix. The proposed GA-based Algorithm 1 for solving Eqs. 5, 6, and 7 is explained step by step in the following procedure:

  1. Step 1

    Initialization: Generate a random initial population of size [N × 2], for best acoustic and visual recognition hypotheses length to be considered to measure the correct reliability.

  2. Step 2

    Fitness Evaluation: The fitness of all solutions \(\{N_{A1}, N_{A2},\ldots,N_{AN}\}\) and \(\{N_{V1}, N_{V2},\ldots,N_{VN}\}\) in the population is evaluated. The steps for evaluating the fitness of a solution are given below (a code sketch follows the procedure):

    1. Step 2a:

      Initialize the confusion matrix R of size \([N_c \times N_c]\) with all zero values.

    2. Step 2b:

      For Class = 1 to the number of validation utterance classes:

    3. Step 2c:

      For Datum = 1 to the number of validation utterance data of the current class:

    4. Step 2d:

      Get the acoustic log likelihoods \(\log P(O_{\!A}/\lambda_{A}^{i}); (1\leq i \leq N_c)\) for the current Class and Datum. Each entry represents the log likelihood of the datum \(O_A\) against one of the acoustic classes.

    5. Step 2e:

      Find the maximum value among the acoustic log likelihoods; denote it amax.

    6. Step 2f:

      Compute the acoustic reliability S A as:

      $$ S_{A}=\frac{1}{N_{A}-1}\sum\limits_{i=1}^{N_{A}}\big(amax-\log P\big(O_{\!A}/\lambda_A^{i}\big)\big) $$
      (8)

      where \(N_A \in \{N_{A1}, N_{A2},\ldots,N_{AN}\}\) is the number of acoustic recognition hypotheses being considered to measure the correct acoustic reliability.

    7. Step 2g:

      Similarly, get the visual-subsystem log likelihoods \(\log P(O_{V}/\lambda_{V}^{i}); (1\leq i \leq N_c)\) for the current Class and Datum. Each entry represents the log likelihood of the datum \(O_V\) against one of the visual classes.

    8. Step 2h:

      Find the maximum value among the visual log likelihoods; denote it vmax.

    9. Step 2i:

      Compute the visual reliability S V as:

      $$ S_{V}=\frac{1}{N_{V}-1}\sum\limits_{i=1}^{N_{V}}\big(vmax-\log P\big(O_{V}/\lambda_V^{i}\big)\big) $$
      (9)

      where \(N_V \in \{N_{V1}, N_{V2},\ldots,N_{VN}\}\) is the number of visual recognition hypotheses being considered to measure the correct visual reliability.

    10. Step 2j:

      Estimate the integration weight γ as:

      $$ \gamma = \frac{S_{A}}{(S_{A}+S_{V})} $$
      (10)
    11. Step 2k:

      Integrate the log likelihoods as follows:

      $$ C1=\left\{\gamma \log P\big(O_{\!A}/\lambda_{A}^{i}\big)+(1-\gamma) \log P\big(O_{V}/\lambda_{V}^{i}\big)\right\} $$
      (11)

      using the integration weight value γ estimated in step 2j. Now C1 is an \(\left[N_c \times 1\right]\) vector giving the audio-visual combined recognition hypotheses.

    12. Step 2l:

      Find the maximum value of C1 and its corresponding position. The position represents the recognized utterance class.

    13. Step 2m:

      Update the confusion matrix R as follows

      $$ R (class, position) = R (class, position) + 1 $$
      (12)
    14. Step 2n:

      Return to step 2c until all data of the current class have been processed.

    15. Step 2o:

      Return to step 2b until all classes have been processed.

    16. Step 2p:

      The recognition accuracy or fitness value is calculated as

      $$ \textnormal{Recognition~ accuracy}=\frac{\sum diag(R)}{\sum \sum (R)}\times 100 $$
      (13)
  3. Step 3

    Updating Population: The two best solutions in the current population (parents) are carried over to the next generation unchanged (elite count); the remaining solutions in the new population are generated using the scattered crossover function and the Gaussian mutation function.

    The scattered crossover function creates a random binary vector and selects genes from the first parent where the vector is 1 and from the second parent where it is 0, combining them to form the next-generation individuals [16]. Similarly, the Gaussian mutation function adds to each entry of the current parents a random number drawn from a Gaussian distribution with zero mean and a user-defined variance [16]. The combination of the scattered crossover and Gaussian mutation functions converges quickly for the given fitness function (a code sketch of this update step appears at the end of Section 3.2).

  4. Step 4

    Termination: Repeat steps 2 to 3 until the algorithm reaches the maximum number of iterations.
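To make Step 2 concrete, the following sketch evaluates the fitness of one candidate solution (N_A, N_V) on a validation set, assuming the per-class log likelihoods of both subsystems have been precomputed; the variable names and data layout are our own, and the code illustrates the procedure rather than reproducing the authors' implementation.

```python
import numpy as np

def reliability(logliks, n_best):
    """Eqs. 8/9: average gap between the overall maximum and the n_best top
    log likelihoods (n_best >= 2 assumed, to avoid division by zero)."""
    top = np.sort(np.asarray(logliks, dtype=float))[-n_best:]
    return float(np.sum(top.max() - top) / (n_best - 1))

def fitness(n_a, n_v, audio_ll, visual_ll):
    """Steps 2a-2p: recognition accuracy on the validation set for one
    candidate solution (n_a, n_v).

    audio_ll, visual_ll : nested lists such that audio_ll[c][d] is the vector
    of log likelihoods of validation datum d of class c against all N_c
    acoustic (respectively visual) classes.
    """
    n_classes = len(audio_ll)
    R = np.zeros((n_classes, n_classes))                      # confusion matrix (step 2a)
    for c in range(n_classes):                                # class loop (step 2b)
        for a_ll, v_ll in zip(audio_ll[c], visual_ll[c]):     # datum loop (step 2c)
            s_a = reliability(a_ll, n_a)                      # Eq. 8
            s_v = reliability(v_ll, n_v)                      # Eq. 9
            gamma = s_a / (s_a + s_v)                         # Eq. 10
            c1 = gamma * np.asarray(a_ll) + (1 - gamma) * np.asarray(v_ll)  # Eq. 11
            R[c, np.argmax(c1)] += 1                          # Eq. 12
    return 100.0 * np.trace(R) / R.sum()                      # Eq. 13
```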

The final solution of Algorithm 1 gives the number of best acoustic and visual subsystem recognition hypotheses to be considered for measuring the correct reliability of each modality. The performance of the proposed method (Proposal 1) relative to the baseline reliability ratio-based and N best recognition hypotheses reliability ratio-based methods is shown in Table 1.

Table 1 Recognition performance comparison of AV baseline-RR, AV N best-RR, and AV GA adaptive-RR bimodal systems.

3.2 Audio-visual Decision Fusion Based on GA Adaptive Reliability Measure and Optimum Integration Weight Method (Proposal 2)

The GA adaptive reliability measure proposed in Section 3.1 improves the recognition accuracy over the baseline reliability ratio-based and N best recognition hypotheses reliability ratio-based methods; the performance comparison is shown in Table 1. However, attenuating fusion still occurs at very low SNR conditions for noisy speech data. To solve this issue, we propose a scheme that further optimizes the integration weight computed in Section 3.1, thereby improving the recognition accuracy without attenuating fusion at any SNR condition. The problem is formulated as follows:

Define the new integration weight \(\overline{\gamma}\) as

$$ \overline{\gamma} = \left[\frac{S_{A}}{(S_{A}+S_{V})}\right] \times x $$
(14)

i.e., \(\overline{\gamma}= \gamma \times x\). Then, for the given test datum \(O_A\) and \(O_V\), the recognized utterance \(C^{*}\) is obtained by solving

$$ C^{*}\!=\!\arg \max _ {i,x}\left\{\overline{\gamma} \log P\big(O_{\!A}/ \lambda_{A}^{i}\big)\!+\!(1\!-\!\overline{\gamma}) \log P\big(O_{V}/\lambda_{V}^{i}\big)\right\} $$
(15)

subject to : \(0\:\leq \overline{\gamma} \:\leq 1\)
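In code, the constrained weight can be realized, for example, by clipping (a sketch; the constraint could equally be enforced through the variable bounds handled by the GA, and the paper does not specify which mechanism is used):

```python
import numpy as np

def new_integration_weight(s_a, s_v, x):
    """Eq. 14, with the feasibility condition 0 <= gamma_bar <= 1 enforced by clipping."""
    gamma_bar = (s_a / (s_a + s_v)) * x
    return float(np.clip(gamma_bar, 0.0, 1.0))
```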

Finally, the objective function given in Eq. 7, based on this new integration weight, is optimized using the genetic algorithm. The procedure of the proposed Algorithm 2 for optimizing the objective function using the GA is as follows:

  1. Step 1

    Initialization: Generate a random initial population of size [N × 3], covering the best acoustic and visual recognition hypotheses lengths to be considered for measuring the correct reliability, and the integration weight multiplier (x).

  2. Step 2

    Fitness Evaluation: The fitness of all solutions \(\{N_{A1}, N_{A2},\ldots,N_{AN}\}\), \(\{N_{V1}, N_{V2},\ldots,N_{VN}\}\), and \(\{x_1, x_2,\ldots,x_N\}\) in the population is evaluated. The steps for evaluating the fitness of a solution are given below:

    1. Step 2a–i:

      Follow the same steps as in Section 3.1.

    2. Step 2j:

      Estimate the new integration weight \(\overline{\gamma}\) as:

      $$ \overline{\gamma} = x_i\times\left(\frac{S_{A}}{(S_{A}+S_{V})}\right) $$
      (16)

      based on the integration weight multiplier solution \(x_i\), where \(x_i \in \{x_1, x_2,\ldots,x_N\}\).

    3. Step 2k:

      Integrate the log likelihoods as follows:

      $$ C2=\left\{\overline{\gamma} \log P\big(O_{\!A}/\lambda_{A}^{i}\big)+(1-\overline{\gamma}) \log P\big(O_{V}/\lambda_{V}^{i}\big)\right\} $$
      (17)

      using the integration weight value \(\overline{\gamma}\) estimated in step 2j. Now C2 is an \(\left[N_c \times 1\right]\) vector giving the audio-visual combined recognition hypotheses.

    4. Step 2l:

      Find the maximum value of C2 and its corresponding position. The position represents the recognized utterance class.

    5. Step 2m:

      Update the confusion matrix R as follows:

      $$ R (class, position) = R (class, position) + 1 $$
      (18)
    6. Step 2n–p:

      Follow the same steps as in Section 3.1.

  3. Step 3

    Updating Population: As in step 3 of the algorithm in Section 3.1, the two best solutions in the current population are carried over to the next generation unchanged, and the remaining solutions in the new population are generated using the scattered crossover and Gaussian mutation functions (a code sketch of this update step is given at the end of this section).

  4. Step 4

    Termination: Repeat steps 2 to 3 until the algorithm reaches the maximum number of iterations.

The final solution of Algorithm 2 gives the number of best acoustic and visual recognition hypotheses to be considered for measuring the correct reliability of each modality, together with the optimum integration weight multiplier (x).
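For completeness, the population-update step (Step 3) shared by both algorithms can be sketched as follows; the parent-selection rule, integer handling, and bounds are our own simplifications of the elite-count, scattered-crossover, and Gaussian-mutation operators described above, which a GA toolbox would provide directly [16].

```python
import numpy as np

def next_generation(population, fitness_values, sigma=1.0, elite_count=2, rng=None):
    """Step 3: elitism + scattered crossover + Gaussian mutation.

    population : array of shape (N, n_genes); for Proposal 2 each row is
    [N_A, N_V, x].  fitness_values : array of shape (N,).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, n_genes = population.shape
    order = np.argsort(fitness_values)[::-1]                   # best solutions first
    new_pop = [population[i].copy() for i in order[:elite_count]]  # elites pass unchanged
    while len(new_pop) < n:
        # pick two parents from the better half of the population (our choice)
        p1, p2 = population[rng.choice(order[:max(2, n // 2)], size=2, replace=False)]
        mask = rng.integers(0, 2, size=n_genes)                # scattered crossover
        child = np.where(mask == 1, p1, p2).astype(float)
        child += rng.normal(0.0, sigma, size=n_genes)          # Gaussian mutation
        # in practice N_A and N_V would be rounded and clipped to [1, N] here
        new_pop.append(child)
    return np.vstack(new_pop)
```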

4 Experimental Database, Audio, and Visual Speech Features

This paper focuses on a slightly different type of AVSR system, which uses visual features extracted from side-face mouth-region images rather than frontal face images. Potamianos et al. have demonstrated that mouth videos captured from cameras attached to wearable headsets produce better results than full-face videos [27]. With reference to the above, and to make the system more practical, around 70 commonly used mobile functions (isolated words) were each recorded 30 times with a microphone and a web camera located near the mouth region of the speaker's right cheek. Samples of the recorded side-face videos are shown in Fig. 1. The advantage of this arrangement is that face detection, mouth location estimation, identification of the region of interest, etc. are no longer required, thereby reducing the computational complexity [10]. Most available audio-visual speech databases are recorded in an ideal studio environment with controlled lighting, or keep factors such as background, illumination, distance between the camera and the speaker's mouth, and camera view angle constant. In this work, however, the recording was done in an office environment on different days with different values for the above factors, to make the database suitable for real-life applications. The database also includes natural environmental noise such as fan noise, bird sounds, and occasionally other people speaking or shouting.

Figure 1 Example video frames of the multi-speaker side-face audio-visual speech database recorded in a typical office environment.

4.1 Acoustic Feature Extraction

The acoustic speech was recorded at a rate of 8 kHz with 16-bit resolution. The popular Mel-frequency cepstral coefficients (MFCCs) are extracted from the acoustic speech signal [13]. Frequency analysis is performed on each frame, segmented by a Hamming window of length 32 ms with an overlap of 12.5 ms. For each frame, we perform Fourier analysis, compute the logarithm of the Mel-scale filter bank energies, and apply the discrete cosine transform. The cepstral mean subtraction (CMS) method is applied to remove channel distortions in the speech data [15]. As a result we obtain 39 acoustic parameters: 12 MFCCs, 12 ΔMFCCs, 12 ΔΔMFCCs, log energy, Δ log energy, and ΔΔ log energy.
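A comparable 39-dimensional acoustic front end can be reproduced with standard tools; the sketch below uses the librosa library, and the exact frame shift, filter-bank settings, and CMS scope of the original system are assumptions on our part.

```python
import numpy as np
import librosa

def acoustic_features(wav_path, sr=8000):
    """12 MFCCs + log energy with deltas and delta-deltas, after CMS: 39 dims."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.032 * sr)                      # 32 ms Hamming window
    hop = int(0.0125 * sr)                       # frame step (the paper states a 12.5 ms overlap)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop, window='hamming')
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    log_e = np.log(np.sum(frames ** 2, axis=0) + 1e-10)        # per-frame log energy
    n = min(mfcc.shape[1], log_e.shape[0])                     # align frame counts
    ceps = mfcc[1:13, :n]                                      # drop c0, keep 12 coefficients
    ceps = ceps - ceps.mean(axis=1, keepdims=True)             # cepstral mean subtraction
    base = np.vstack([ceps, log_e[np.newaxis, :n]])            # 13 static parameters
    feats = np.vstack([base,
                       librosa.feature.delta(base),            # delta
                       librosa.feature.delta(base, order=2)])  # delta-delta
    return feats.T                                             # shape: (frames, 39)
```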

4.2 Visual Feature Extraction

Visual features proposed in the AVSR literature can be categorized into shape-based, pixel-based, and motion-based features [28]. Pixel-based and shape-based features are extracted from static frames and are hence viewed as static features, whereas motion-based features directly utilize the dynamics of speech [11, 12]. Dynamic features are better at representing distinct facial movements, while static features are better at representing the oral cavity, which cannot be captured by either lip contours or motion-based features [10]. This work exploits the relative benefits of both static and dynamic features for improved AVSR recognition.

4.2.1 DCT Based Static Feature Extraction

Potamianos et al. [13] reported that intensity-based features using the discrete cosine transform (DCT) outperform model-based features; hence the DCT is employed in this work to obtain the static features. Each side-face mouth-region video is recorded at a frame rate of 30 frames/s with [240 × 320] pixel resolution. Prior to the image transform, each recorded video frame is converted to an equivalent RGB image, which is then converted to the YUV color space; only the luminance component (Y) is kept, since it retains the image data least affected by video compression [14]. The resulting Y-image is subsampled to [16 × 16] pixels and passed as input to the DCT. The DCT returns a 2D matrix of coefficients, and triangle-region feature selection outperforms square-region selection because it includes more of the coefficients corresponding to low frequencies [14]. Hence, in this work, the [6 × 6] triangle-region DCT coefficients, excluding the DC component, are taken as the 20 static features of a frame.
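The static-feature computation described above can be sketched as follows; OpenCV and SciPy are used here purely for illustration, and YCrCb is used as a stand-in for the YUV luminance extraction, since the paper does not specify the implementation.

```python
import numpy as np
import cv2
from scipy.fft import dctn

def static_visual_features(frame_bgr):
    """20 static DCT features from one side-face mouth-region video frame."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    y = ycrcb[:, :, 0]                                   # keep only the luminance plane
    y16 = cv2.resize(y, (16, 16), interpolation=cv2.INTER_AREA)   # subsample to 16 x 16
    coeffs = dctn(y16.astype(np.float64), norm='ortho')  # 2-D DCT
    # upper-left [6 x 6] triangle (low frequencies), excluding the DC term: 20 values
    feats = [coeffs[i, j] for i in range(6) for j in range(6)
             if i + j <= 5 and (i, j) != (0, 0)]
    return np.asarray(feats)
```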

4.2.2 Motion Segmentation Based Dynamic Feature Extraction

In this work, dynamic visual speech features, which capture the side-face mouth-region movements of the speaker, are segmented from the video using an approach called motion history images (MHI) [11]. The MHI is a gray-scale image that shows where and when movements of the speech articulators occur in the image sequence. It is defined as

$$ \emph{MHI}= Max\bigcup\limits_{t=1}^{N-1}{\emph{DOF}_{t}(m,n) \times t} $$
(19)

where N represents the number of frames used to capture the side-face mouth-region motion and DOF is the difference image binarized with a threshold, which is optimized experimentally. In Eq. 19, to give recent movements brighter values, the binarized DOF is multiplied by a ramp of time and integrated temporally. Next, the DCT is applied to the MHI to obtain the transformed coefficients. As with the static features, only the [6 × 6] triangle-region DCT coefficients, excluding the DC component, are taken as the dynamic features. Finally, the static and dynamic features are concatenated to represent the visual speech.
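A minimal sketch of the MHI computation of Eq. 19 is given below (the threshold value is a placeholder; the paper optimizes it experimentally). The triangle-region DCT features of the resulting MHI can then be extracted exactly as in the static case.

```python
import numpy as np

def motion_history_image(gray_frames, threshold=15):
    """Eq. 19: motion history image of a mouth-region frame sequence.

    gray_frames : sequence of N equally sized grayscale frames.
    threshold   : binarization threshold for the difference of frames (DOF).
    """
    frames = np.asarray(gray_frames, dtype=np.float64)
    mhi = np.zeros_like(frames[0])
    for t in range(1, len(frames)):                            # t = 1 ... N-1
        dof = (np.abs(frames[t] - frames[t - 1]) > threshold).astype(np.float64)
        mhi = np.maximum(mhi, dof * t)                         # recent motion appears brighter
    return mhi
```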

5 HMM Training and Recognition Results

The HMM is a commonly used classifier in speech recognition, since it can readily model the time-varying speech signal [15]. This work adopts left-right continuous HMMs with Gaussian mixture models (GMMs) in each state. The whole-word model, a standard approach for small-vocabulary speech recognition tasks, was used. The number of states in each HMM and the number of Gaussian components in each GMM are set to 10 and 6, respectively, determined experimentally. The initial parameters of the HMMs are obtained by uniform segmentation of the training data onto the HMM states followed by iterative application of the segmental k-means algorithm and Viterbi alignment. The HMMs are trained with the standard Baum-Welch algorithm [15]. Training is terminated when the relative change of the log-likelihood value is less than 0.001 or the maximum number of iterations, set to 25, is reached.
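The model topology described above can be approximated with an off-the-shelf toolkit; the sketch below uses hmmlearn's GMMHMM and substitutes the library's default parameter initialization for the segmental k-means initialization used in the paper, so it should be read as an illustration of the configuration rather than the authors' training setup.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_model(train_sequences, n_states=10, n_mix=6):
    """Train one left-right whole-word GMM-HMM on a list of (frames, dim) arrays."""
    # left-right topology: start in state 0, allow only self-loops and forward moves
    startprob = np.zeros(n_states)
    startprob[0] = 1.0
    transmat = np.zeros((n_states, n_states))
    for s in range(n_states):
        transmat[s, s] = 0.5
        transmat[s, min(s + 1, n_states - 1)] += 0.5
    model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type='diag',
                   n_iter=25, tol=1e-3, init_params='mcw', params='stmcw')
    model.startprob_ = startprob
    model.transmat_ = transmat
    X = np.vstack(train_sequences)
    lengths = [len(seq) for seq in train_sequences]
    model.fit(X, lengths)                     # Baum-Welch training
    return model
```

At test time, `model.score(sequence)` returns the utterance log likelihood \(\log P(O/\lambda)\) used in the fusion equations of Sections 2 and 3.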

5.1 Recognition Results

The bimodal decision-fusion speech recognition system using side-face mouth-region images is shown in Fig. 2. The dataset was recorded in an office environment with background noise. Each word was recorded 30 times per speaker, giving a total of 90 samples per word. Of these 90 samples, 60 were taken randomly for training the HMMs and 15 were used as validation data to estimate the best acoustic and visual recognition hypothesis lengths \(N_A\) and \(N_V\) for measuring the correct reliabilities using the proposed Algorithm 1. The same set of samples was used to estimate the optimum integration weight using the proposed Algorithm 2. The remaining 15 samples were artificially degraded with additive white Gaussian noise at SNRs of 20, 10, 5, 0, and −5 dB; these noisy samples were used as test data to compute the recognition accuracy. The experiment was conducted three times for each SNR; in each trial, 60 samples were taken randomly for training and 15 for testing. Finally, the average over the three trials is reported as the recognition accuracy.

Figure 2 Block diagram of the proposed audio-visual decision fusion speech recognition system using mouth-region side-face images.

Table 2 shows the recognition accuracies obtained by the audio-only, visual-only, audio-visual baseline reliability ratio (AV baseline-RR), audio-visual N best recognition hypotheses reliability ratio (AV N-best RR), and the proposed GA adaptive reliability ratio (AV-GA adaptive-RR) and GA adaptive reliability and optimized (AV-GA adaptive RR & GA optimized) bimodal systems at various SNR conditions. Figure 3 likewise compares the recognition performance of the unimodal and bimodal systems. In Table 2, “Clean” means the recorded speech samples without any additional white Gaussian noise. From the results (Table 2), the following observations were made:

  1. The acoustic-only recognition accuracy is nearly 77% for the recorded speech but degrades sharply as artificially added white Gaussian noise increases; it drops below 2% at −5 dB SNR. Since the maximum accuracy for the recorded speech is only 77%, the recorded speech itself is evidently quite noisy.

  2. The average recognition accuracy of the visual-only system is 62.57%, which remains essentially constant regardless of the acoustic noise condition.

  3. The baseline reliability ratio-based and N best recognition hypotheses reliability ratio-based bimodal systems improve on the acoustic-only and visual-only systems only under Clean and 20 dB SNR conditions; under the remaining SNR conditions (−5 dB to 10 dB) their performance is inferior to that of the visual-only system, i.e., they exhibit attenuating fusion.

  4. The proposed GA adaptive reliability ratio-based system (Proposal 1) improves the recognition accuracy over the baseline reliability ratio and N best recognition hypotheses reliability ratio-based systems under most SNR conditions.

  5. The proposed GA adaptive reliability ratio and optimized bimodal system (Proposal 2) further improves the recognition accuracy and outperforms all other systems at all SNR conditions. The improvement is largest at low SNRs (−5 dB to 10 dB), demonstrating that the proposed system achieves noise-robust recognition for the recorded noisy audio-visual speech recognition task.

Table 2 AVSR performance in recognition accuracy (%) of the audio-only and visual-only unimodal systems and the AV baseline-RR, AV N-best RR, and proposed (Proposal 1 and Proposal 2) bimodal systems.
Figure 3 Performance of the existing unimodal and bimodal systems and the proposed AV-GA adaptive-RR and AV-GA adaptive RR & GA optimized bimodal systems in recognition accuracy (%).

6 Discussion and Conclusion

In this paper, the influence of the reliability measure on integration weight estimation is demonstrated on an audio-visual speech database of 70 commonly used mobile-function isolated words from three speakers. The proposed systems use an audio-visual speech database developed by us, in which visual features are extracted from side-face mouth-region images rather than frontal-face images. Generally, dynamic visual speech features are obtained as derivatives of the static features [14]; in this work, however, the dynamic features are obtained via the MHI approach and concatenated with the static features to represent the visual speech. For evaluating the proposed systems, the recognition accuracy is compared with that of two related methods, namely the reliability ratio-based method and the N-best recognition hypotheses reliability ratio-based method. The results in Table 2 clearly show an overall performance improvement by the proposed systems over the existing unimodal and bimodal systems on the noisy side-face audio-visual speech database. At low SNRs in particular, all the reference methods show very poor recognition accuracy, whereas the proposed methods improve the recognition accuracy considerably.

In the proposed systems, fusion happens at the end of each utterance based on a single reliability measure for the whole utterance. This cannot effectively account for time-varying noise conditions, where the reliability also varies over the duration of the utterance. The problem can be mitigated to some extent by measuring the reliability of the acoustic and visual signals frame by frame and fusing the decisions to find the correct utterance, but doing so makes the complexity of the algorithms very high even for an isolated word recognition task. This is one drawback of the proposed algorithms with respect to handling time-varying noise in the acoustic signal. In addition, the amount of training data used to model the HMMs is small, which leads to very poor recognition accuracy at low SNRs. In future work, we will record more samples from additional speakers and build more reliable HMMs to obtain reasonable accuracy at low SNRs.

The proposed systems consider the effect of noise only in the audio signal. Since the video signal was recorded very close to the speaker's mouth region, the possibility of video distortion is low, but this assumption may not always hold. In future work, we will therefore include the effect of noise in the recorded video signal and evaluate the performance of the proposed algorithms under such conditions.

Moreover, the proposed algorithms in their current form are suitable for isolated word recognition, because the combined likelihood can easily be calculated from a single reliability measure. Extending these algorithms to continuous speech recognition is very challenging, because an unmanageably large number of possible word or phoneme sequence hypotheses must be considered when calculating the combined likelihood. With these considerations, further investigations on applying the proposed algorithms to complex tasks such as continuous speech recognition are in progress.