1 Introduction

HAL, the computer in the movie “2001: A Space Odyssey,” tells a crew member: “Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.” The scene illustrates how speech is understood multimodally: each perceptual function can, by itself, distinguish the utterance. When the auditory and the visual perceptual functions are active jointly, they fuse separate recognition results at a higher level to yield optimal speech understanding. Emotion also contributes to this fusion process, because the muscles involved in articulation move differently depending on mood and prosodic features.

We know that speech is more intelligible when the listener can see the speaker’s face. This is due primarily to lip movement. Some phonemes may be confused in the audio domain (e.g., /m/ and /n/) but not in the visual domain (Fig. 1), where they correspond to distinct visemes. Phoneme-to-viseme mapping highlights this peculiarity in the human understanding of speech. Because there are fewer visemes than phonemes, the mapping between phoneme subsets and a given viseme is many-to-one. As a result, the visual interpretation of speech, to be effective, requires acoustic understanding as well. Conversely, audio interpretation alone proves inadequate for distinguishing similar sounds whose meanings differ, even when those sounds have distinct visemes.

Fig. 1

Phonemes that may be confused in the audio domain (e.g., /m/ and /n/) might correspond to distinct visemes in the visual domain
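To make the many-to-one character of this mapping concrete, the following minimal Python sketch builds an illustrative phoneme-to-viseme lookup; the viseme class names and phoneme groupings are assumptions for illustration only and do not reproduce the coding of Table 1.

```python
# Illustrative phoneme-to-viseme lookup (assumed groupings, not Table 1).
# Several phonemes share one viseme class, so the map is many-to-one.
VISEME_OF_PHONEME = {
    "m": "bilabial",    "p": "bilabial",    "b": "bilabial",
    "n": "alveolar",    "t": "alveolar",    "d": "alveolar",
    "f": "labiodental", "v": "labiodental",
}

def viseme_for(phoneme: str) -> str:
    """Return the viseme class for a phoneme (many-to-one mapping)."""
    return VISEME_OF_PHONEME[phoneme]

# /m/ and /n/ are easily confused acoustically but map to distinct visemes:
assert viseme_for("m") != viseme_for("n")
# whereas /m/, /p/ and /b/ collapse onto the same viseme:
assert viseme_for("m") == viseme_for("p") == viseme_for("b")
```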

There are various application contexts for automatic speech recognition (ASR): speech-to-text transcription in harsh, noisy conditions, natural interfaces in automobiles, access to systems for the visually or hearing impaired, automatic captioning of audiovisual streams, audiovisual data mining, natural user interfaces in handheld and wearable devices, etc. Such applications demand a multimodal approach to ASR but need implementations whose computational paradigms offer flexibility and adaptability.

Speech is the most effective way for human beings to communicate. It is also ideal between a human being and a machine. However, the performance gap between natural, human speech recognition and artificial, machine speech recognition is still huge, especially in noisy environments. Personal communication and information systems are rapidly evolving toward wearable systems. Consequently, the human–machine interface and the interaction paradigms need to feel natural and reliable, given the harsh operating conditions of these new, deeply embedded computing systems. Multimodal, speech-recognition systems will be one of the most important enabling technologies for such soon-to-come, deeply embedded systems (Kölsch et al. 2006; Marshall and Tennent 2013).

2 Related work for AVSR

The AVSR approach to ASR systems, dating from Petajan’s work (Petajan 1984), is still an active research field. Its hot topics concern lip localization, feature extraction, and methods for fusing audio and visual information. Hidden Markov models (HMMs) were proposed first, followed by soft computing paradigms, mainly artificial neural networks (ANNs). A combination of HMMs and ANNs was applied by Noda (Noda et al. 2015), who experimented with a connectionist-hidden Markov model system for noise-robust AVSR. Dupont (Dupont and Luettin 2000) developed a sensor-fusion module, based on multistream hidden Markov models (MHMMs), that jointly models acoustic and visual feature streams through time. Kasabov (Kasabov 1998; Watts 2009) demonstrated that the evolving connectionist paradigm is well suited to the challenge of fusing auditory and visual information at the decision stage (Kasabov et al. 2000).

Recognizing speech presents challenges because speech signals vary greatly due to many factors related to how the utterance is produced (different speakers, different ways of speaking, different acoustics, different emotional states, different physical states, different ages, physiology changed by aging, etc.). Speech is more than mere acoustics; it involves collateral visual communication (moving lips, facial expressions, and some body language), as well as explicit gesture (motion and movement of hands and arms). Two main challenges need to be addressed: adapting to variability and fusing multimodal data.

Multimodal speech recognition is founded on human beings’ natural ability to communicate by integrating various sensory signals, while using the context of the situation to decide what utterances they are hearing. For example, the McGurk and MacDonald experiment (McGurk and MacDonald 1976) showed that human beings combine auditory and visual information during interpretation (Wright and Wareham 2005). As a result, decisions about the meaning of a speech sound may differ according to the situational audiovisual context. This experiment shows that audiovisual speech processing at recognition time enables two embedded capabilities, one for merging and one for combining. Merging and combining at the phoneme-recognition stage are powerful abilities that enable the AVSR to fuzzily remedy ambiguities in deciding on the most probable phoneme-to-grapheme transcription of the uttered speech. Implementing such fusion through soft computing proved effective, especially when the AVSR system uses two stages, a lower stage to extract features and an upper stage to fuse decisions (Noda et al. 2015; Malcangi et al. 2013; Patel et al. 2005; Stork et al. 1992).

Audiovisual speech recognition (AVSR) (Basu et al. 1999; Massaro 1996; Benoît et al. 1996; Salama et al. 2014; Bernstein and Auer 1996) enhances speech recognition by combining it with image recognition so that, e.g., heard utterances are supplemented by lip reading, which is especially helpful in harsh audio environments. The common approach to implementing AVSR systems is to merge audio features and video features into a single pattern-matching framework. This approach leads to highly complex systems that are hard to tune when they are actually up and running (Kaucic et al. 1996; Yang and Waibel 1996; Steifelhagen et al. 1997; Malcangi and de Tintis 2004). An alternative is to run the utterance-recognition system and the lip-reading system independently at the feature-matching stage and then fuse the decision at a later stage. Using such two-stage AVSR frameworks offers several advantages. First, it makes for more flexible and reliable AVSR. Second, it enables us to follow soft computing paradigms. This not only lowers complexity but also enhances performance under harsh conditions, since it relies on fuzzy logic and neural networks.

A fuzzy-logic approach to AVSR was proposed by Badura et al. (2014), who found it effective. Fuzzy logic has some drawbacks, however, in terms of how the model is built. For example, we need to establish how knowledge is to be developed (the rule set and membership functions), and we need a method for containing rule explosion. Various approaches have attempted to address these issues (Joo 2003), and both knowledge development and rule explosion have been efficiently handled under the evolving paradigm (Kasabov 1998).

Methods from computational intelligence and techniques for adaptive machine learning have been successfully applied to AVSR, but certain problems with the evolving nature of the speech-recognition process remain open. One concerns the best choice of architecture to guarantee lifelong learning. Excessive training time is another important issue, given the real-time nature of the AVSR task.

Approaches that rely on an evolving connectionist system (ECoS) show promise for developing AVSR suited to highly variable phenomena. The simple evolving connectionist system (SECoS), a minimal implementation of ECoS, did a reasonably good job of recognizing isolated phonemes (Watts and Kasabov 2000). Its ability to learn and generalize was tested on the Otago Speech Corpus (Sinclair and Watson 1995), a body of segmented words representing 45 phonemes. On the same datasets, the SECoS model’s performance was compared with that of a more traditional connectionist structure, the multilayer perceptron (MLP), a model widely adopted in deployed ASRs and AVSRs. SECoS outperformed MLP, showing good data recall and good adaptability to new data. The cost of this performance is the large number of nodes in its hidden layer.

Because fuzzy neural networks are an optimal connectionist paradigm for modeling linguistic rules through the behavior of a process, we applied the evolving fuzzy neural-network (EFuNN) paradigm (Kasabov 2001) to implement the decision layer for a previously developed fuzzy-based AVSR (Malcangi et al. 2013). The purpose of that research was to develop an intermediate stage between the stage that matches phonemes to visemes and the stage that transcribes speech to text. This enables merger and combination to be completed before matching errors caused by noise are propagated to the stage that transcribes phonemes to graphemes.

The remainder of the paper is organized as follows. Section 2 describes related work on AVSRs. Section 3 presents the framework, the proposed evolving adaptive AVSR system architecture and the feature-extraction units. Section 4 discusses the fusion method, i.e., applying the EFuNN evolving architecture to fuse phoneme-viseme classification and predict phoneme occurrence. Section 5 describes experimental simulations and performance evaluation. Finally, Sect. 6 gives conclusions and future development.

3 Framework

Unlike existing work, we propose a new framework for adaptive speech recognition, defined and set up according to the model for evolving intelligent systems (EIS). Among ECoS paradigms, we opted for EFuNN because of its ability to generate evolving rules that can be deployed in a fuzzy logic engine. Adaptation is driven by an analysis module that acts on the feature-decision layer, evaluating output from the decision-fusion layer and evolving in response to changes in the surrounding context.

3.1 The proposed adaptive AVSR system architecture

The architecture of the proposed AVSR system consists of three feed-forward stages, with feedback that enables its evolving functionality (Fig. 2). The first stage has three parallel operating units devoted to extracting and classifying features. This hard computing stage is based on digital signal-processing (DSP) algorithms for audio and video. The second stage, a soft computing fusion-and-combination unit, is based on a fuzzy logic engine (FLE). The third stage is a speech-to-text transcription unit based on artificial intelligence (AI). The feedback goes through a transversal layer that exploits the EFuNN’s capacity to quickly generate a set of evolved rules, which are applied to the fuzzy engine at runtime.

Fig. 2

Audiovisual speech-recognition system with decision layer based on fuzzy logic engine fed with rules tuned using EFuNN paradigm

System input consists of audio and video streams. Audio is captured by an array microphone (STMicroelectronics MEMS microphones), then conditioned and digitized at 16 kHz/16 bit. Video is captured by a camera recording at 24 fps. Application-specific software was developed to capture audio and video jointly on a frame-by-frame basis. The hardware and software setup for the audio-visual front end is shown in Fig. 3. A MATLAB-based graphical user interface was developed to support system development and testing.

Fig. 3

Audio-visual front-end hardware setup
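As a rough cross-check of the capture rates above, the following snippet (an illustrative calculation, not part of the system software) relates the video frame duration to the audio analysis window; the half-frame value comes to roughly 20.8 ms, in line with the 20.85 ms window adopted in Sect. 4.

```python
# Audio/video timing of the front end described above (illustrative calculation).
AUDIO_RATE_HZ = 16_000   # 16 kHz, 16-bit samples
VIDEO_FPS = 24           # camera frame rate

video_frame_s = 1.0 / VIDEO_FPS                # ~0.0417 s (41.7 ms)
audio_window_s = video_frame_s / 2.0           # ~0.0208 s, i.e. half a video frame
samples_per_window = round(AUDIO_RATE_HZ * audio_window_s)   # ~333 samples

print(f"video frame: {video_frame_s * 1e3:.2f} ms, "
      f"audio window: {audio_window_s * 1e3:.2f} ms, "
      f"{samples_per_window} samples per window")
```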

3.2 The feature-extraction unit

Phonemes and visemes (Table 1) are matched, scored, and encoded at the lower stage. Matching is based on signal-processing methods so as to classify, score, and store the utterance and its related visual sequence, frame by frame.

Table 1 Viseme-class coding and the corresponding phoneme symbols

The feature-extraction unit consists of three distinct subsystems, one that processes the utterance (Fig. 4a) to match phonemes, a second that processes video frames (Fig. 4b) to match visemes, and a third that measures the similarity of the matched phonemes and visemes. Phoneme and viseme matching units are independent systems. The similarity-scoring subsystem depends on the matching and scoring subsystems for phonemes and visemes.

Fig. 4

Utterance of the word menu, its phonemic transcription, and corresponding visemes

The phoneme-extraction unit segments the audio stream into short intervals (20.85 ms), measures the features (pitch, formants, and intensity), and yields a classification (phoneme: score, duration).

The following features were used:

Root mean square (RMS):

$${RMS_{j} = \sqrt {\frac{1}{N}\sum\limits_{m = 0}^{N - 1} {s_{j}^{2} (m)} } }$$
(1)

m: sample number

N: total samples in a frame

s: audio samples in the frame

j: frame index

Zero-crossing rate (ZCR):

$$ZCR_{j} = \frac{1}{2}\sum\limits_{m = 1}^{N - 1} {\left| {sign(s_{j} (m)) - sign(s_{j} (m - 1))} \right|}$$
(2)

Auto correlation (AC):

$$AC_{j} = \sum\limits_{i = 1}^{N} {\sum\limits_{m = 1}^{N + 1 - i} {s_{j} (m)s_{j} (m + i - 1)} }$$
(3)

Cepstral linear prediction coefficients (CLPC):

$$CLPC_{j} (m) = a_{m} + \sum\limits_{k = 1}^{m - 1} {\left( {\frac{k}{m}} \right)CLPC_{j} (k)\,a_{m - k} }$$
(4)

a: linear prediction coefficients of frame j

m, k: coefficient indices

Frame by frame, the uttered speech is encoded in feature vectors that are matched against a set of phoneme templates to classify and score the jth frame. A Euclidean distance metric is applied to match each frame. Phoneme duration is measured as the number of contiguous audio frames (windows) that match the current phoneme.
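As an illustration of this per-frame processing, the minimal Python sketch below computes two of the features above (RMS and zero-crossing rate) for one frame and matches them against phoneme templates by Euclidean distance; the template values, the feature subset, and all numbers are placeholders, not those used by the system.

```python
import numpy as np

def frame_features(s: np.ndarray) -> np.ndarray:
    """RMS (Eq. 1) and zero-crossing rate (Eq. 2) for one audio frame."""
    rms = np.sqrt(np.mean(s ** 2))
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(s))))
    return np.array([rms, zcr])

def match_phoneme(features: np.ndarray, templates: dict[str, np.ndarray]):
    """Return (best phoneme, score) by Euclidean distance to the templates."""
    distances = {p: float(np.linalg.norm(features - t)) for p, t in templates.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]

# Placeholder templates (illustrative values only).
templates = {"m": np.array([0.10, 12.0]), "n": np.array([0.12, 18.0])}
frame = np.random.randn(333) * 0.1        # one ~20.85 ms frame at 16 kHz
phoneme, score = match_phoneme(frame_features(frame), templates)
```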

The viseme-extraction unit measures lip features (height, width, and duration) on each video frame (1/24 s) and yields a classification (viseme: score, duration). Visemes are identified by the height-to-width ratio of the lip contour. To measure lip features, the mouth contour is located after the face has been detected, whereupon lip position is determined. Four key lip points (Fig. 5) are pinpointed, two vertical and two horizontal, delimiting the effective lip contour. Height and width are measured to create a relative index from the ratio of height to width. The viseme is then identified and scored, employing a matching method based on a set of templates and a Euclidean distance metric. Viseme duration is then measured as the number of contiguous video frames that match the viseme.

Fig. 5

After the four key lip points are tagged, height and width are measured and indexed by the ratio of height to width
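The following sketch illustrates the viseme-classification step, assuming the four key lip points have already been located; the coordinates and class templates are invented for illustration and are not the system's actual templates.

```python
import numpy as np

def lip_ratio(top, bottom, left, right) -> float:
    """Height-to-width ratio of the lip contour from four (x, y) key points."""
    height = abs(top[1] - bottom[1])
    width = abs(right[0] - left[0])
    return height / width

def match_viseme(ratio: float, templates: dict[str, float]):
    """Nearest viseme class by distance between height/width ratios."""
    best = min(templates, key=lambda v: abs(templates[v] - ratio))
    return best, abs(templates[best] - ratio)

# Illustrative templates: rounded lips give high ratios, spread lips low ones.
viseme_templates = {"closed": 0.05, "spread": 0.25, "rounded": 0.60}
r = lip_ratio(top=(80, 40), bottom=(80, 58), left=(55, 50), right=(105, 50))
viseme, score = match_viseme(r, viseme_templates)
```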

The phoneme-viseme-comparison unit is a lookup table that scores how phonetically similar the current phoneme is to the current viseme (Cappelletta and Harte 2012). The merge-and-combine unit predicts the phoneme on a window-by-window basis, running a fuzzy logic engine tuned according to the EFuNN paradigm. The output of this unit is a stream of phonemes (one phoneme per window) ready for phoneme-to-grapheme transcription. The phoneme-to-text-transcription unit applies a ruleset to transform each phoneme into the corresponding alphabetic representation, yielding the final text transcription of the uttered word (or a homophone). EFuNN-based feedback updates the fuzzy engine’s rule set when the context changes (e.g., noise increases, a new speaker appears) or errors are found at the higher transcription layer.
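The sketch below illustrates the two table-driven steps just described, a phoneme-viseme similarity lookup and a phoneme-to-grapheme ruleset; the scores, viseme classes, and rules are placeholders and do not reproduce the tables actually used.

```python
# Illustrative phoneme-viseme similarity lookup (all values are placeholders).
# 1.0 means the viseme is fully consistent with the phoneme, 0.0 not at all.
PV_SIMILARITY = {
    ("m", "bilabial"): 1.0, ("m", "alveolar"): 0.1,
    ("n", "bilabial"): 0.1, ("n", "alveolar"): 1.0,
}

# Illustrative phoneme-to-grapheme ruleset (a real ruleset is language-specific).
PHONEME_TO_GRAPHEME = {"m": "m", "n": "n", "e": "e", "u": "u"}

def pv_similarity(phoneme: str, viseme: str) -> float:
    """Score how phonetically consistent the current viseme is with the phoneme."""
    return PV_SIMILARITY.get((phoneme, viseme), 0.0)

def transcribe(phoneme_stream: list[str]) -> str:
    """Map the fused phoneme stream to its alphabetic representation."""
    return "".join(PHONEME_TO_GRAPHEME.get(p, "?") for p in phoneme_stream)

print(pv_similarity("m", "alveolar"))    # a low score flags an audio/visual conflict
print(transcribe(["m", "e", "n", "u"]))  # -> "menu"
```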

4 Evolving fuzzy modeling of AVSR

The fuzzy logic-based inference paradigm was applied to draw inferences about phonemes from a set of audio and visual features, handling uncertainty due to noise and great variation in both audio and visual information. The main issue in designing the fuzzy logic engine was setting the rules. This was accomplished by applying the evolving neuro-fuzzy EFuNN paradigm, a neuro-fuzzy structure that evolves by creating and modifying its nodes and connections after learning from input.

The EFuNN paradigm uses a feed-forward architecture of five layers of neurons and can be trained with neural-network methods (Kasabov 1998, 2001). Because it evolves, it obviates the need for an a priori fixed architecture: it starts with a minimal set of initial nodes and then grows or shrinks during training and learning, depending on its input data. This strategy avoids the problem of catastrophic forgetting and enables the network to be trained further with new data while retaining the effects of previous learning, because new nodes are created without removing the old ones, thus preserving previous knowledge. Pruning and aggregation at training time avoid overtraining by removing weak connections and their nodes.

The EFuNN is a five-layered, feed-forward, artificial neural network, in which each layer performs one specialized function in the fuzzy logic engine: input, condition, rule, action, and output (Fig. 6). The input layer (layer 1) represents the (crisp) input variables that are presented to the nodes of the condition layer. The nodes of the condition layer (layer 2) are fuzzy membership functions that fuzzify the crisp input. The rule layer (layer 3) is the evolving layer, which can create and aggregate nodes, adapting them to changes in the fuzzified input data. The nodes in this layer shape the rules that embed the mapping from input to output. The action layer (layer 4) consists of fixed-shape fuzzy membership functions that fuzzily quantify output values. This layer computes the degree to which an output vector belongs to an output membership function (MF). The output layer (layer 5) defuzzifies the action output.

Fig. 6

EFuNN evolving architecture applied to fuse phoneme-viseme classification and to predict phoneme occurrence

The layers perform their functions as follows:

  • Layer 1: input (crisp values).

  • Layer 2: condition (input membership functions).

  • Layer 3: association (rules).

  • Layer 4: action (output membership functions).

  • Layer 5: output (crisp values).

The learning algorithm consists mainly of a few key actions: updating connections, aggregating nodes, pruning nodes, and extracting rules. At layer 3, the rule nodes cluster input–output data associations. Two connection weights, W1 and W2, are adjusted: W1 relates a rule node to the fuzzified input vector and W2 relates it to the corresponding output vector. W1 is adjusted by supervised learning based on the output error; W2 is adjusted by similarity, using a clustering method.
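The following is a minimal sketch in the spirit of this one-pass, node-growing learning scheme. It is not Kasabov's full EFuNN: node ageing, pruning, aggregation, and rule extraction are omitted, and the membership functions and similarity measure are simplified; names and defaults are illustrative only.

```python
import numpy as np

def triangular_mf(x: np.ndarray, n_mf: int = 3) -> np.ndarray:
    """Fuzzify crisp inputs in [0, 1] with n_mf evenly spaced triangular MFs."""
    centres = np.linspace(0.0, 1.0, n_mf)
    width = centres[1] - centres[0]
    return np.clip(1.0 - np.abs(x[:, None] - centres) / width, 0.0, 1.0).ravel()

class SimplifiedEFuNN:
    """Sketch of an EFuNN-like one-pass learner: each rule node stores a
    fuzzified-input prototype (W1) and an output prototype (W2)."""

    def __init__(self, sensitivity=0.99, err_thr=0.01, lr1=0.1, lr2=0.1):
        self.sensitivity, self.err_thr = sensitivity, err_thr
        self.lr1, self.lr2 = lr1, lr2
        self.W1, self.W2 = [], []          # one entry per rule node

    def _similarity(self, xf: np.ndarray) -> np.ndarray:
        d = np.array([np.linalg.norm(xf - w) for w in self.W1])
        return 1.0 - d / (np.linalg.norm(xf) + 1e-9)

    def learn_one(self, x: np.ndarray, y: float) -> None:
        xf = triangular_mf(x)
        if not self.W1:                    # first example: create the first node
            self.W1.append(xf.copy())
            self.W2.append(y)
            return
        sim = self._similarity(xf)
        win = int(np.argmax(sim))
        err = abs(self.W2[win] - y)
        if sim[win] < self.sensitivity or err > self.err_thr:
            self.W1.append(xf.copy())      # grow a new rule node
            self.W2.append(y)
        else:                              # refine the winning node
            self.W1[win] += self.lr1 * (xf - self.W1[win])
            self.W2[win] += self.lr2 * (y - self.W2[win])

    def predict(self, x: np.ndarray) -> float:
        return self.W2[int(np.argmax(self._similarity(triangular_mf(x))))]
```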

To train and test the fuzzy engine, a sequence of pattern data was recorded from the output of the phoneme-extractor and viseme-extractor units. Data vectors x(n), each pairing the inputs with the corresponding output, were assembled to train and test the EFuNN. The data vector consists of six input measurements and one output. The six input measurements are: phoneme recognized (PR), phoneme score (PS), phoneme duration (PD), viseme recognized (VR), viseme score (VS), and phoneme-viseme similarity (PVS). The output item is the predicted phoneme (PP). Thus:

$$x\left( n \right) = \left[ {PR, \, PS, \, PD, \, VR, \, VS, \, PVS, \, PP} \right]$$
(5)
$$n = t/T_{w}$$

The vector, indexed by time-window number n, is compiled throughout the utterance. The window size Tw must be compatible with the quasi-stationary characterization of speech (from 20 to 40 ms) and also with the duration of a video frame (1/24 s, i.e., 41.7 ms). Hence, the time window was set to 20.85 ms, half the duration of a video frame. The dataset was generated from basic uttered phoneme sequences (syllables), with and without added background noise. It consists of 520 patterns x(n), one per window, each fully describing the association between input and output. The dataset was split randomly into two subsets, one with 80 % of the vectors for training, the other with 20 % for testing. The NeuCom (2016) environment was used to model and simulate the EFuNN with the following setup:

  • Sensitivity threshold: 0.99

  • Error threshold: 0.01

  • Number of membership functions: 3

  • Learning rate for W1: 0.1

  • Learning rate for W2: 0.1

  • Node age: 60

The sensitivity and error thresholds govern the generation of new rule nodes. As the sensitivity threshold increases, the network is more likely to create new rule nodes. As the threshold on the error between actual and desired output decreases, the network is likewise more likely to add rule nodes. If pruning is on, i.e., if the network is allowed to remove connections between layers while maintaining its original training performance, the network is less likely to reproduce a rule node that was pruned previously. The learning rates influence the training process: as a learning rate increases, the node saturates faster, reducing its capacity to generalize. As the age threshold increases, the network’s ability to retain what it has learned over time increases. If aggregation is on, the network tries to aggregate the rules into global behavior descriptions, thus avoiding growth that would make it unwieldy. The number and shape of the membership functions depend directly on the dynamics of the input and output data and on how the functions measure data in the crisp domain: the more functions there are, the more interconnections there are at the input and output layers.
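To make the experimental setup concrete, the sketch below assembles illustrative pattern vectors, splits them 80/20 at random, and trains the SimplifiedEFuNN sketched in Sect. 4 with the parameter values listed above; the data and the random seed are placeholders, not the recorded dataset, and node age, pruning, and aggregation are not modeled by the simplified sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each pattern x(n): [PR, PS, PD, VR, VS, PVS] as inputs, PP as output,
# one pattern per 20.85 ms window (illustrative random data, 520 patterns).
N_PATTERNS = 520
X = rng.random((N_PATTERNS, 6))     # normalized input measurements
y = rng.random(N_PATTERNS)          # encoded predicted-phoneme target

# Random 80/20 split, as in the experiments.
idx = rng.permutation(N_PATTERNS)
split = int(0.8 * N_PATTERNS)
train_idx, test_idx = idx[:split], idx[split:]

# Parameter values mirroring the NeuCom setup listed above.
model = SimplifiedEFuNN(sensitivity=0.99, err_thr=0.01, lr1=0.1, lr2=0.1)
for i in train_idx:
    model.learn_one(X[i], y[i])

errors = [abs(model.predict(X[i]) - y[i]) for i in test_idx]
print(f"rule nodes: {len(model.W1)}, mean test error: {np.mean(errors):.3f}")
```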

5 Performance evaluation

The adaptability of the experimental AVSR setup was evaluated on the basis of the evolving functionality provided by the EFuNN. Two sets of tests were conducted, the first to check the AVSR’s ability to fuse the audio-visual decision and the second to check adaptation through the evolving method. To run the first test, the word menu was first entered with the right phoneme sequence; then the same word was entered with the phonemes /m/ and /n/ swapped. To run the second test, environmental conditions for the utterance of the word menu were altered by adding audio noise. The EFuNN’s performance was tested by checking its ability to recover from having confused similar phonemes in the two conditions, noise-free and noisy. The word was uttered and fed into the AVSR forty times. The audio and visual scores were collected to train the EFuNN, tuning the rules to be loaded into the fuzzy engine; then a new set of forty utterances of the word menu was sent to the AVSR system. The fused decisions were presented graphically, grouping the forty utterances of the word menu by phoneme class.

5.1 Decision-fusion test

One hundred utterances of the word menu were recorded in noise-free conditions. The EFuNN was trained from scratch with the output from the audio-visual scoring layer, then tested. The results (Fig. 7) showed that the EFuNN’s self-teaching ability is adequate to learn from data, recovering from confusion over similar phonemes (e.g., /m/ and /n/ in the uttered word menu). The fuzzy engine was trained once (Fig. 7a) and twice (Fig. 7b) without any knowledge of the /m/-/n/ phoneme mismatch. Then (Fig. 7c), the /m/-/n/ phoneme mismatch was tested. The fuzzy engine learned how to fuse and combine phonemes and visemes recognized by independent audio and visual units.

Fig. 7

Desired and actual output values of the EFuNN trained on noise-free utterances of the word menu: the fuzzy engine was trained (a) once and (b) twice without knowledge of any /m/-/n/ mismatch; then (c) the /m/-/n/ phoneme mismatch was tested

The previously trained AVSR evolved by acquiring more knowledge about fusing audio and visual data. Its ability to recognize the right /m/-/n/ phoneme sequence (Fig. 8a) and to recover from the mismatched one (Fig. 8b) improved.

Fig. 8

The evolved AVSR system performed better at recognizing the right /m/-/n/ phoneme sequence (a) and at recovering from the mismatched one (b)

5.2 Adaptation test

Additive noise (24 dB) was mixed linearly into the utterance of the word menu and fed as input to the AVSR. No special noise-recovery strategy was applied at the (hard computing) lower stage. The EFuNN was allowed to evolve incrementally with the new set of noisy decisions, and the test sequence was then executed to assess the evolved ruleset for decision fusion. Before evolving, the AVSR mismatched the /m/-/n/ phoneme sequence (Fig. 9a, b). After evolving with new knowledge about the noisy conditions (24 dB noise), its recognition rate was quite good: it never confused the two similar phonemes /m/ and /n/ at the 0 dB noise level (Fig. 10a, b) and performed only slightly less well at 24 dB noise (Fig. 10c, d), thus demonstrating its ability to evolve without forgetting.

Fig. 9

Under noisy conditions, the AVSR performs badly at the feature and scoring layer

Fig. 10

After a first evolutionary step based only on the right /m/-/n/ phoneme sequence under noisy conditions (24 dB noise), the AVSR system performs very well (a, b) with 0 dB noise and well (c, d) with 24 dB noise

6 Conclusion and future development

We proposed a new framework for adaptive speech recognition based on the model for evolving intelligent systems (EIS). We opted for EFuNN because of its ability to generate evolving rules that can be deployed in a fuzzy logic engine. Adaptation is driven by evaluating output from the decision-fusion layer and evolving in response to changes in surrounding context.

The fuzzy logic-based inference paradigm was applied to draw inferences about phonemes from a set of audio and visual features, handling uncertainty due to noise and great variation in both audio and visual information. The evolving EFuNN paradigm, a neuro-fuzzy structure that evolves by creating and modifying its nodes and connections as it learns from input, was applied to tune the inference rules.

These results show that the evolving fuzzy neural-network (EFuNN) paradigm can be successfully applied to develop a fuzzy logic-based inference engine for merging and combining phonemes and visemes at the intermediate stages of a layered AVSR system. Several advantages were found, mostly in performance, including increased reliability owing to reduced system complexity.

Future development will focus on extending the evolving and adapting capabilities of the ECoS paradigm to the upper and lower stages of the AVSR system. One remaining issue is how to integrate the dynamic, evolving fuzzy neural-network paradigm into the AVSR in a proactive fashion. This would allow evolving capabilities to be embedded in the system. Another issue is how to apply the EFuNN paradigm to the system’s lower layer, scaling it according to that layer’s pattern-matching nature, and husbanding hard computing power for the important task of conditioning and feature extraction. Disambiguation is also a key issue that will affect the AVSR’s upper layer at the phoneme-to-grapheme transcription stage and the syntactic transcription of the utterance.