
1 Introduction

Demographic change and ageing in developed countries are challenging society's efforts to improve the well-being of its elderly and frail inhabitants. The evolution of Information and Communication Technologies has led to the emergence of Smart Homes equipped with ambient intelligence technology, which provides a high capacity for man-machine interaction [1]. However, the technical solutions implemented in such Smart Homes must suit the needs and capabilities of their users in the context of Ambient Assisted Living. Under some circumstances, classic tactile commands (e.g., a lamp switch) may not suit an aged population that has difficulty moving or seeing. Tactile commands can therefore be complemented by speech-based solutions that provide voice command and make it easier for the person to interact with relatives or professional carers (notably in distress situations) [2]. Moreover, analysis of the sounds emitted in a person's home may be useful for activity monitoring and context awareness.

The Sweet-Home project was set up to integrate sound-based technology within smart homes to provide natural interaction with the home automation system at any time and from anywhere in the house. As emphasized by Vacher et al. [3], major issues still need to be overcome. For instance, the presence of uncontrolled noise is a real obstacle for distant speech recognition and for the identification of voice commands in continuous audio recording conditions, when the person is moving and acting in the flat. Indeed, it is not always possible to force the user to take up a position at a short distance from and in front of a microphone, or to handle a specific device such as a remote control. Therefore, microphones are set in the ceiling so that they are available without any action from the user.

This paper presents preliminary results of speech recognition techniques evaluated on data recorded in a flat by several persons in a daily living context. A glossary is given in Sect. 2 in order to define the specific terms used in this chapter. The background, the state of the art and the challenges to tackle are given in Sect. 3. The data recording and the corpus are presented in Sect. 4, and the monosource ASR baseline in Sect. 5. In Sect. 6, several techniques of multisource speech recognition are detailed and evaluated; Sect. 6.5 is devoted to the word spotting needed to recognize voice commands in sentences. The chapter finishes with Sect. 7, which reviews the open problems regarding the application of speech processing to Assistive Technologies, and with Sect. 8, which outlines the future work and studies necessary to design a usable system in the real world.

2 Glossary

Activities of daily living (ADL) are, as defined by the medical community, the things we normally do in daily living, including any daily activity we perform for self-care (such as feeding ourselves, bathing, dressing, grooming), work, and leisure. Health professionals routinely refer to the ability or inability to perform ADLs as a measurement of the functional status of a person, particularly in regard to people with disabilities and the elderly. A well-known scale for ADL was defined by Katz and Akpom [4].

Ambient Assisted Living (AAL) aims to help seniors to continue to manage their daily activities at home thanks to ICT solutions for active and healthy ageing.

Automatic Speech Recognition (ASR) is the translation of spoken words into text by an automatic analysis system.

Blind Source Separation (BSS) is the separation of a set of source signals from a set of mixed signals, without the aid of additional information (or with very little information) about the source signals or the mixing process.

Driven Decoding Algorithm (DDA) is a method in which the search of a primary ASR system is driven by the one-best hypotheses and the word posteriors gathered from a secondary system, in order to improve recognition performance.

Distant Speech Recognition is the particular case of Automatic Speech Recognition in which the microphone is moved away from the mouth of the speaker. A broad variety of effects, such as background noise, overlapping speech from other speakers and reverberation, are responsible for the severe degradation of conventional ASR performance in this configuration.

Hidden Markov Model (HMM) is a statistical Markov model in which the system being modelled is assumed to be a Markov process with unobserved (hidden) states.

Home Automation is the residential extension of building automation. Home automation may include centralized control of lighting, appliances and other systems, to provide improved convenience, comfort, energy efficiency and security.

Home Automation Network is a network specially designed to ensure the link between sensors, actuators and services.

KNX (KoNneX) is a worldwide ISO standard (ISO/IEC 14543) for home and building control.

Maximum A Posteriori (MAP) estimation, like the maximum likelihood method, can be used to estimate unknown parameters, such as the parameters of a probability density, from a given sample. It is related to maximum likelihood but differs in its ability to take into account a non-uniform prior on the parameters to be estimated.
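
In generic notation (not taken from this chapter), the two estimators can be contrasted as follows, where \(x\) denotes the observed sample, \(\theta \) the parameters and \(p(\theta )\) the prior:

$$\begin{aligned} \hat{\theta }_{ML} = \arg \max _{\theta } \, p(x|\theta ), \qquad \hat{\theta }_{MAP} = \arg \max _{\theta } \, p(x|\theta )\,p(\theta ) \end{aligned}$$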

Maximum Likelihood Linear Regression (MLLR) is an adaptation technique that uses small amounts of data to train a linear transform which, in case of Gaussian distribution, warps the Gaussian means so as to maximize the likelihood of the data.

Recognizer Output Voting Error Reduction (ROVER) is based on a ‘voting’ or re-scoring process to reconcile differences in ASR system outputs. It is a post-recognition process which models the output generated by multiple ASR systems as independent knowledge sources that can be combined and used to generate an output with reduced error rate.

Smart Home is a house that is specially equipped with devices giving it the ability to anticipate the needs of its inhabitants while maintaining their safety and comfort.

Signal to Noise Ratio (SNR) is a measure that compares the level of a desired signal to the level of a reference or to background noise: \(SNR=\frac{P_{\textit{signal}}}{P_{\textit{reference}}}\). The signal and the noise are usually measured across the same impedance and the SNR is generally expressed on a dB scale: \(SNR_{dB}=10.\log _{10}\left( \frac{P_{\textit{signal}}}{P_\textit{reference}}\right) =20.\log _{10}\left( \frac{A_{\textit{signal}}}{A_{\textit{reference}}}\right) \), where \(P\) and \(A\) denote respectively the power and the amplitude of the signal or reference.

Wizard of Oz is an interaction method in which the user is not informed that the reactions of a device are actually controlled by a human (the ‘wizard’). The name refers to the 1939 American musical fantasy film “The Wizard of Oz”.

Word Error Rate (WER) is a common metric of the performance of a speech recognition or machine translation system. \(WER=\frac{S+D+I}{N}\), where \(S\) is the number of substitutions, \(D\) the number of deletions, \(I\) the number of insertions, \(N\) the number of words in the reference.
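
As an illustration only (not part of the original chapter), a minimal Python sketch of this metric, computed with a word-level edit distance:

```python
def wer(reference, hypothesis):
    """Word Error Rate = (S + D + I) / N, computed by edit distance on words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimal number of edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("allumez la lumiere", "allumer la lumiere"))  # 1 substitution out of 3 words
```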

Word Spotting refers to the search for and retrieval of a given word in an audio stream.

3 Background and State of the Art

As reported in Sect. 1, Smart Homes have been designed with the aim of allowing seniors to keep control of their environment and to improve their autonomy. Despite the fact that audio technology has a great potential to become one of the major interaction modalities in Smart Homes, this modality is seldom taken into consideration [5–8]. The most important reason is that audio technology has not reached a sufficient stage of maturity and that there are still challenges to overcome [3]. The Sweet-Home project presented in Sect. 3.1 aims at designing an audio analysis system running in real-time for voice command recognition in a realistic home automation context. The state of the art and the challenges to tackle are developed in Sect. 3.2 while Sect. 3.3 focuses on keyword spotting.

3.1 The Sweet-Home Project

Main Goals. The Sweet-Home project is a French nationally funded research project (http://sweet-home.imag.fr/). It aims at designing a new smart home system by focusing on three main aspects: providing assistance via natural man-machine interaction (voice and tactile command), easing social inclusion, and providing security reassurance by detecting situations of distress. If these aims are achieved, the person will be able to pilot his or her environment at any time in the most natural way possible [9].

Acceptance of the system is definitely a big issue in our approach; therefore, a qualitative user evaluation was performed to assess the acceptance of vocal technology in smart homes [10] at the beginning of the project and before the study presented in Sect. 4. Eight healthy persons between 71 and 88 years old, seven relatives (child, grand-child or friend) and three professional carers were questioned in co-discovery in a fully equipped smart home, alternating between interview and Wizard of Oz periods. Important aspects of the project were evaluated: voice command, communication with the outside world, the domotic system interrupting a person’s activity, and an electronic agenda. In each case, the voice-based solution was far better accepted than more intrusive solutions. Thus, in accordance with other user studies [11, 12], audio technology seems to have a great potential to ease daily living for elderly and frail persons. To respect privacy, it must be emphasized that the adopted solution analyses the audio information on the fly and is not designed to store the raw audio signal. Moreover, the speech recognizer must be made to recognize only a limited set of predefined sentences in order to prevent recognition of intimate conversations.

Fig. 1. The general organisation of the Sweet-Home system

Sweet-Home Technical Framework. The Sweet-Home system is depicted in Fig. 1. The input of the system is composed of the information from the domotic system transmitted via a local network and the information from the microphones transmitted through radio frequency channels. While the domotic system provides symbolic information, the raw audio signals must be processed to extract information from speech and sound. This extraction is based on our experience in developing the AuditHIS system [13], a real-time multi-threaded audio processing system for ubiquitous environments. The extracted information is analysed and either the system reacts to an order given by the user or the system acts pro-actively by modifying the environment without an order (e.g., turns off the light when nobody is in the room). The output of the system thus includes domotic orders, but also interaction with the user, for instance when a vocal order has not been understood or in case of alert messages (e.g., turn off the gas, remind the person of an appointment). The system can also make it easier for the user to connect with her relative, physician or caregiver by using the e-lio or Visage systems. In order for the user to be in full control of the system, and also in order to adapt to the users’ preferences, three ways of commanding the system are possible: voice order, PDA or classic tactile interface (e.g., switch).

The project does not include the definition of new communication protocols between devices. Rather than building communication buses and purpose-designed hardware from scratch, the project tries to make use of already standardised technologies and applications. As emphasized in [14], standards ensure compatibility between devices, ease maintenance and orient smart home design toward cheaper solutions. The interoperability of ubiquitous computing elements is a well-known challenge to address [15]. Another example of this approach is that Sweet-Home includes systems which are already specialised to handle the social inclusion part. We believe this strategy is the most realistic one given the large spectrum of skills required to build a complete smart home system.

3.2 Automatic Speech Recognition in Smart Homes

Automatic Speech Recognition (ASR) systems perform especially well with close-talking microphones (e.g., head-sets), but performance is significantly lower when the microphone is far from the mouth of the speaker, as in smart homes where microphones are often set in the ceiling. This deterioration is due to a broad variety of effects including reverberation and the presence of undetermined background noise. All these problems remain unsolved and must be taken into account in the home context.

Reverberation. Distorted signals can be treated in ASR either at the acoustic model level or at the input (feature) level [16]. Deng et al. [17] showed that feature adaptation methods provide better performance than systems trained on data with the same distortion as the target environment (e.g., acoustic models learned with distorted data), for both stationary and non-stationary noise conditions. Moreover, when the reverberation time is above 500 ms, ASR performance is not significantly improved when the acoustic models are trained on distorted data [18]. In the home involved in the study, the only glazed areas that are not on the same wall are at right angles, thus the reverberation is minimal. Given this and the small dimensions of the flat, we can assume that the reverberation time stays below 500 ms. Therefore, only classic ASR techniques with adaptation using data recorded in the test environment are considered in this study.

Background Noise. When the noise source perturbing the signal of interest is known, various noise removal techniques can be employed [19]. It is then possible to dedicate a microphone to recording the noise source and to estimate the impulse response of the room acoustics in order to cancel the noise [20]. This impulse response can be estimated through Least Mean Square or Recursive Least Square methods. In a previous experiment, these methods showed promising results when the noise is composed of speech or classical music [21]. However, in the case of unknown noise sources, such as washing machine or blender noise, Blind Source Separation (BSS) techniques seem more suited. The audio signals captured by the microphones are composed of a mixture of speech and noise sources. Independent Component Analysis is a subcategory of BSS which attempts to separate the different sources through their statistical properties (i.e., it is purely data driven). This method is particularly efficient for non-Gaussian signals (such as speech) and does not need to take into account the position of the emitter or of the microphones, but it assumes that signal and noise are linearly mixed, a hypothesis which does not seem to hold in realistic recordings. Therefore, despite the important effort of the community, noise separation in realistic smart home conditions remains an open challenge.

3.3 Word Spotting

Spoken word detection has been extensively studied in the last decades, especially in the context of spoken term detection in large speech databases and in continuous speech streams. Performances reported in the literature are good in clean conditions, especially with broadcast news data; however, when experiments are undertaken in users’ home conditions, such as with noisy or spontaneous speech, performance decreases dramatically [22]. In [23], an Interactive Voice Response system was set up to support elderly people in dealing with their medication. Of the 300 persons recruited, a third stopped the experiment because they complained about the system and only 38 persons completed the experiment.

In this study, aspects of both word spotting and Large Vocabulary Continuous Speech Recognition are considered. A Large Vocabulary Continuous Speech Recognition system was used in the approach to increase the recognition robustness. Language and acoustic model adaptation and multisource-based recognition were investigated. Finally, we designed an original approach which integrates word matching directly inside the ASR system to improve the detection rate of domotic orders; this is described in Sect. 6.5.

4 Recorded Corpus and Experimental Framework

One experiment was conducted to acquire a multimodal corpus by recording individuals performing activities of daily living in a smart home. The speech part of the corpus, called the Sweet-Home speech corpus, is composed of utterances of domotic orders, distress calls and everyday sentences in French, recorded using several microphones set in the ceiling of the smart home. This corpus was used to tune and to test a classic ASR system in different configurations. This section briefly introduces the smart home and the Sweet-Home speech corpus. The monosource ASR system is described in Sect. 5.

Fig. 2. The Domus Smart Home used during the Sweet-Home project

Fig. 3. The Domus Smart Home and the position of the sensors

4.1 Data Acquisition in the Smart Home

The Domus smart home. The Sweet-Home speech corpus was acquired in realistic conditions, i.e., in distant speech conditions inside the Domus smart home. This smart home was designed and set up by the Multicom team of the Laboratory of Informatics of Grenoble to observe users interacting with the ambient intelligence of the environment. Figure 2 shows the details of the flat. It is a thirty-five square metre flat including a bathroom, a kitchen, a bedroom and a study, all equipped with sensors and effectors.

More than 150 sensors, actuators and information providers are managed in the flat. The flat is fully usable and can accommodate a dweller for several days, so that it is possible to act on the sensory ambiance depending on the context and the user’s habits. The technical architecture of Domus is based on the KNX bus system (KoNneX), a worldwide ISO standard (ISO/IEC 14543) for home and building control. The flat has also been equipped with 7 radio microphones for the needs of the Sweet-Home project; the microphones are set into the ceiling (2 per room except for the bathroom). Audio data can be recorded in real-time thanks to a dedicated PC embedding an 8-channel input audio card [13]. The sample rate is 16 kHz and the bandwidth 8 kHz. It must be noticed that the distance between the speaker and the closest microphone is about 2 m when he is standing and about 3 m when he is sitting. Figure 3 shows the position of the microphones and of some sensors in the flat.

Corpus Recording. 21 persons (including 7 women) participated in a 2-phase experiment to record, among other data, a speech corpus in the Domus smart home. To make sure that the audio data acquired would be as close as possible to real daily living sounds, the participants performed several daily living activities. Each experimental session lasted about 2 h. The average age of the participants was \(38.5 \pm 13\) years (22–63, min-max). No instruction was given to any participant about how they should speak or in which direction. Consequently, no participant uttered sentences while directing their voice toward a particular microphone.

Before the first phase of the experiment, a visit was organized so that the participants could become accustomed to the home and perform the experiment smoothly. During the first phase, participants uttered forty predefined French casual sentences on the phone, such as “Allo” (Hello) or “J’ai eu du mal à dormir” (I slept badly), but were also free to utter any sentence they wanted (some did speak to themselves aloud). The rest of this phase consisted in following a scenario of activities without constraints on the time spent or the manner of achieving them (having breakfast, listening to music, getting some sleep, cleaning up the flat using the vacuum cleaner, etc.). Note that the microphone of the telephone was not recorded; only the 7 microphones set in the ceiling were used.

The second phase consisted in reading aloud a list of 44 sentences:

  • 9 distress sentences such as “A l’aide” (Help), “Appelez un docteur” (call a doctor);

  • 3 orders such as “Allumez la lumière” (turn on the light);

  • 32 colloquial sentences such as “Le café est très chaud” (The coffee is very hot).

This list was read in 3 rooms (study, bedroom, and kitchen) under three conditions: no background noise, vacuum cleaner on, or radio on. 396 sentences were recorded, but only those in the clean condition were used in this paper; the noisy-condition recordings were designed for other experiments.

Table 1. Sweet-Home speech corpus description

4.2 The Sweet-Home French Speech Corpus

Only the sentences uttered in the study during the phone conversation of phase 1 were considered. For the phase 2 recording, only the sentences uttered in the kitchen without additional noise (vacuum cleaner or radio) were considered. The speakers did not strictly follow the instructions given at the beginning of the experiment; therefore this corpus was indexed manually. Some hesitations and word repetitions occurred throughout the recordings. Moreover, when two sentences were uttered without a sufficient silence between them, they were considered as one sentence. A complete description of the corpus for each speaker is given in Table 1. The Sweet-Home speech corpus is made of 862 sentences uttered by 21 persons in the first phase and 917 sentences in the second phase; for each channel it lasts 38 min 46 s in the case of the first phase and 40 min 27 s in the case of the second phase. The SNR (Signal-to-Noise Ratio) is an important parameter which was used for the combination of several sources. For Phase 1 (when the speaker was in the study) the mean SNR was 21.8 dB/20.0 dB (channels 6 and 7); for Phase 2 (when the speaker was in the bedroom) the mean SNR was 22.1 dB/22.1 dB (channels 4 and 5).

The databases recorded in the course of the Sweet-Home project are devoted to voice-controlled home automation; they will be distributed for academic and research use only [24].

5 Monosource ASR Techniques

The architecture of an ASR system is depicted in Fig. 4. The first stage is the audio interface, in charge of acoustic feature extraction over consecutive frames. The next three stages, working together, are:

  • the phoneme recognizer stage;

  • the word recognition stage constructing the graph of phonemes; and

  • the sentence recognition stage constructing the graph of words.

The data associated with these stages are respectively the acoustic models, the phonetic dictionary and the language models. The output of the recognizer is made of the best hypothesis lattices.

Fig. 4. General organisation of an ASR

5.1 The Speeral ASR System

The ASR system used in the study is Speeral [25]. The LIA (Laboratoire d’Informatique d’Avignon) speech recognition engine relies on an \(A^{*}\) decoder with HMM-based context-dependent acoustic models and trigram language models. HMMs are classic three-state left-right models and state tying is achieved by using decision trees. The acoustic features, for each 30 ms frame with 20 ms overlap (10 ms time shift), were composed of 12 Perceptual Linear Predictive coefficients, the energy, and the first and second order derivatives of these 13 parameters, that is 39 parameters in total. The acoustic models were trained on about 80 h of annotated French speech. Had the participants been elderly people, the use of adapted data would have been required [26], but this was not the case for this study. Given the targeted application of Sweet-Home, the computation time should not prevent real-time use. Thus, the 1 \(\times \) RT Speeral configuration was used. In this configuration, by using a strict pruning scheme, the time spent by the system to decode one hour of speech signal is about one hour (real time).
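
Speeral’s exact feature extraction is not reproduced here, but the following minimal Python sketch (an assumption, not the authors’ code) illustrates how 13 static coefficients per frame are expanded into the 39 parameters mentioned above by appending first and second order derivatives, approximated here by simple frame-to-frame differences:

```python
import numpy as np

def add_derivatives(static):
    """static: (n_frames, 13) array of 12 PLP coefficients + energy per frame.
    Returns a (n_frames, 39) array: static features plus first and second
    order derivatives (simple difference approximation)."""
    delta = np.gradient(static, axis=0)    # first order derivative
    delta2 = np.gradient(delta, axis=0)    # second order derivative
    return np.hstack([static, delta, delta2])

# Example: 100 frames (30 ms windows, 10 ms shift) of dummy static features.
features = add_derivatives(np.random.randn(100, 13))
print(features.shape)  # (100, 39)
```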

Language Models. Two language models were built: a generic and a specialized model. The specialized language model was estimated from the sentences that the 21 participants had to read during the experiment (domotic orders, casual phrases, etc.). The generic language model was estimated on about 1,000 million words from the French newspaper Le Monde and the French Gigaword corpus.

5.2 Baseline System

In order to propose a baseline system, the adaptation of both acoustic and language models was tested. Then, to improve the robustness of the recognition, multi-stream ASR was tested. Finally, a new variant of a driven decoding algorithm was used in order to take into account a priori information and several audio channels for each speaker.

Phase 1 of the corpus was used for development and for acoustic model adaptation to the speaker, while phase 2 was used for performance estimation. Results obtained on phase 2 of the corpus were analysed using two measures: the Word Error Rate (WER) and the Classification Error Rate (CER). The WER is a good measure of the robustness, while the CER corresponds to the main goal of our research (i.e., detection of predefined sentences).

Acoustic Model Adaptation: MAP Versus MLLR. Acoustic models were adapted for each speaker with two methods, Maximum A Posteriori (MAP) and Maximum Likelihood Linear Regression (MLLR), by using the data of the first phase. These data were perfectly annotated, allowing correct targeted speaker adaptation to be performed.

Maximum Likelihood Linear Regression (MLLR) is used when a limited amount of data per class is available. MLLR is an adaptation technique that uses small amounts of data to train a linear transform which warps the Gaussian means so as to maximize the likelihood of the data: acoustically close classes are grouped and transformed together. In the case of the Maximum A Posteriori (MAP) approach, the initial models are used as informative priors for the adaptation.
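
For reference (generic notation, not taken from this chapter), the MLLR mean transform replaces each Gaussian mean \(\mu \) of a regression class by an affine transform whose parameters \(A\) and \(b\) are shared by the whole class and estimated to maximize the likelihood of the adaptation data:

$$\begin{aligned} \hat{\mu } = A\mu + b \end{aligned}$$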

Table 2 shows the different results with and without acoustic model adaptation. Results are presented for the two best streams (highest SNR). Experiments were carried out with the generic language model (GLM) lightly interpolated with the predefined sentences (PS) presented in the next section. Without acoustic adaptation, the best average WER is about 36 %. The results show that MAP does not perform well in this case: with MAP, the WER is about 27 %. The best average WER is about 18 % with MLLR adaptation, which is the best choice for sparse and noisy data whatever the channel.

Two aspects explain the MAP performance:

  • MAP adaptation is not well suited to the noisy environment [27].

  • The lack of parameter tying in the standard MAP algorithm implies that the adaptation is not robust.

Linguistic Variability. Large vocabulary language models, such as the generic language model, are known to perform poorly on specific tasks because of the large number of equi-probable hypotheses. Better recognition can be obtained by reducing the overall linguistic space, i.e., by estimating a language model on the expected sentences, as with the specialized language model. However, such a language model would probably be too specific when the speaker deviates from the original transcript. To benefit from the two language models, we propose a linear interpolation scheme in which specific weights are tested on the specialized and generic language models. The reduction of the linguistic variability thanks to the contribution of known predefined sentences is explored. Therefore, we interpolated the specialized model with the generic large vocabulary language model.

Two schemes of linear interpolation were considered: in the first one, the generic model had a strong weight, while in the second one the impact of the generic model was low. The ASR systems were assessed after MLLR adaptation using the data of phase 1 of the corpus. Table 2 presents the WER with the generic language model (Baseline). As expected, the baseline language model obtained poor results: about 74 %. Without reliable information, the ASR system, in noisy, speaker-independent and large-vocabulary conditions, is unable to perform good recognition.

Table 2. Average WER according to different configurations by using monosource techniques

With the specialised language model the system is able to detect more predefined sentences. However, when the speaker deviates from the scenario, the language model is unable to find the correct uttered sentence. The specialised language model was thus too specific.

Finally, a lightly (10 %) interpolated language model led to the best results. This model combined the generic language model (with a 10 % weight) and the specialised model (with a 90 % weight). These results show that a decoding based on a language model mainly learnt from the predefined sentences significantly improves the WER. The best WER is obtained when a generic language model is also considered: when the speaker deviates, the generic language model makes it possible to correctly recognise the pronounced sentences.
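
With the weights reported above, the interpolated model can be written as follows (the exact smoothing of each component model is not specified in the chapter):

$$\begin{aligned} P_{\textit{interp}}(w_i|w_{i-1},w_{i-2}) = 0.9\,P_{\textit{spec}}(w_i|w_{i-1},w_{i-2}) + 0.1\,P_{\textit{gen}}(w_i|w_{i-1},w_{i-2}) \end{aligned}$$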

5.3 Conclusion About Monosource ASR

The Speeral ASR system was evaluated taking into account realistic distant-speech conditions in the context of a home automation application (voice command). The system had to perform ASR under several constraints and challenges. Indeed, because of the noisy distant-speech conditions, speaker-independent recognition, continuous analysis and real-time requirements, the analysis system must operate in much more difficult conditions than with a classic head-set. The obtained results are therefore clearly insufficient and must be improved; multichannel analysis is an avenue worth exploring.

The application conditions also make it possible for the ASR system to benefit from multiple audio channels, from a reduced vocabulary and from the hypothesis that only one speaker should utter voice commands. The lightly interpolated language model and MLLR acoustic adaptation did significantly improve the ASR system performance. In the next section, we propose several techniques based on this baseline in order to perform multisource ASR.

6 Techniques for Multisource Speech Recognition and Sentence Detection

Multisource ASR can improve recognition performance thanks to information extracted from more than one channel. The ROVER method presented in Sect. 6.1 analyses the outputs of ASR performed on all channels separately. In the DDA method presented in Sect. 6.2, the information of one channel is used to guide the analysis of another channel. We also present an improved DDA method in Sect. 6.3 where a priori information about the task is taken into account.

6.1 ROVER

At the ASR combination level, a ROVER [28] was applied. ROVER is expected to improve the recognition results by providing the best agreement between the most reliable sources. It combines the system outputs into a single word transition network. Then, each branching point is evaluated with a voting scheme, and the words with the best score (number of votes weighted by confidence measures) are selected, as illustrated by the sketch below. However, this approach requires high computational resources when several sources need to be combined and real time is needed (in our case, 7 ASR systems must operate concurrently).
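
A simplified illustration of the voting step is sketched below in Python (illustration only: real ROVER first builds the word transition network by aligning the outputs; here the outputs are assumed to be already aligned slot by slot, and the confidence weights could be the SNR-based scores of Eq. (1)):

```python
from collections import Counter

def rover_vote(aligned_outputs, confidences):
    """Pick, for each word slot, the word with the highest weighted vote."""
    result = []
    for slot in zip(*aligned_outputs):
        scores = Counter()
        for word, conf in zip(slot, confidences):
            scores[word] += conf          # vote weighted by the system confidence
        result.append(scores.most_common(1)[0][0])
    return result

outputs = [["allumez", "la", "lumiere"],
           ["allumer", "la", "lumiere"],
           ["allumez", "le", "lumiere"]]
print(rover_vote(outputs, [0.5, 0.3, 0.2]))  # ['allumez', 'la', 'lumiere']
```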

A baseline ROVER was tested using all available channels without a priori knowledge. In a second step, an a priori confidence measure based on the SNR was used: for each decoded segment \(s_i\) generated by the \(i^{th}\) ASR system, the associated confidence score \(\phi (s_i)\) was computed according to Eq. 1, where \(R()\) is the function computing the SNR of a segment:

$$\begin{aligned} \phi (s_i)={2^{R(s_i)}}/{\sum _{j=1}^72^{R(s_j)}} \end{aligned}$$
(1)

For each annotated sentence a silence period \(I_{\textit{sil}}\) at the beginning and the end is taken around the speech signal period \(I_{\textit{speech}}\). The SNR is thus evaluated through the function \(R()\) according to Eq. 2.

$$\begin{aligned} R(S) = 10\log _{10}\left( \frac{\frac{1}{|I_{\textit{speech}}|}\sum _{n \in I_{\textit{speech}}} S[n]^2}{\frac{1}{|I_{\textit{sil}}|}\sum _{n \in I_{\textit{sil}}} S[n]^2}\right) \end{aligned}$$
(2)
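
A minimal Python sketch of Eqs. (1) and (2) follows (illustration only; the speech and silence sample indices come from the manual annotation, which is assumed to be available):

```python
import numpy as np

def snr_db(signal, speech_idx, sil_idx):
    """Eq. (2): SNR of a segment, from its speech samples and surrounding silence."""
    p_speech = np.mean(signal[speech_idx] ** 2)
    p_sil = np.mean(signal[sil_idx] ** 2)
    return 10.0 * np.log10(p_speech / p_sil)

def channel_confidences(snrs):
    """Eq. (1): confidence of each of the 7 channels for one decoded segment."""
    weights = 2.0 ** np.asarray(snrs, dtype=float)
    return weights / weights.sum()

# Hypothetical per-channel SNR values (dB) for one segment.
print(channel_confidences([21.8, 20.0, 12.3, 9.7, 15.1, 18.4, 11.0]))
```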

Finally, a ROVER using only the two best channels overall was tested in order to check whether the other channels contain redundant information and whether good results can be reached at a low computational cost.

The ROVER combination led to great improvements. The results show that the ROVER made ASR more robust, with an average WER of 13.0 %, which shows the complementarity of the streams. However, the ROVER stage increases the computation time proportionally to the number of ASR systems used. Given that the objective of the project is to build a real-time and affordable solution, computational resources are limited. Moreover, ROVER combination of two streams reduces the problem to picking the word with the highest confidence when the two systems disagree. Thus, when the recogniser confidence scores are not reliable, a ROVER between two streams does not perform well and the final performance is likely to be similar to that of a single system. We therefore propose in the next section a method allowing low-cost computation with only two streams, based on the Driven Decoding Algorithm. In the following, the ROVER results are used as the baseline.

Fig. 5. Driven Decoding Algorithm used with two streams: the first stream drives the second stream

6.2 Driven Decoding Algorithm

The Driven Decoding Algorithm (DDA) [29, 30] is able to simultaneously align and correct imperfect ASR outputs [31]. DDA has been implemented within Speeral: the ASR system generates hypotheses as it walks the phoneme lattice. At each new step, the current hypothesis is aligned with the approximated hypothesis provided by the auxiliary transcript. Then, a matching score \(\alpha \) is computed and integrated within the language model:

$$\begin{aligned} \tilde{P}(w_i|w_{i-1},w_{i-2}) = P^{1-\alpha }(w_i|w_{i-1},w_{i-2}) \end{aligned}$$
(3)

where \(\tilde{P}(w_i|w_{i-1},w_{i-2})\) is the updated trigram probability of the word \(w_i\) given the history \(w_{i-1},w_{i-2}\), and \(P(w_i|w_{i-1},w_{i-2})\) is the initial probability of the trigram. When the trigram is aligned, \(\alpha \) is at its maximum, and it decreases according to the misalignments of the history (values of \(\alpha \) must be determined empirically using a development corpus).
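As an illustration only, a small Python sketch of the rescoring in Eq. (3) (normalisation of the boosted probabilities is omitted, and the alignment-based computation of \(\alpha \) is assumed to be done elsewhere):

```python
def rescore_trigram(p_lm, alpha):
    """Eq. (3): boost a trigram probability when the hypothesis matches the
    driving (auxiliary) transcript. alpha is close to 1 for a well-aligned
    trigram and close to 0 for a misaligned history."""
    return p_lm ** (1.0 - alpha)

# A matched trigram keeps a probability close to 1, an unmatched one
# falls back to the original language model probability.
print(rescore_trigram(0.01, 0.8))  # ~0.398
print(rescore_trigram(0.01, 0.0))  # 0.01
```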

Table 3. ASR system recognition WER obtained by using multisource techniques

In the Domus smart home, uttered sentences were recorded using two microphones per room. Thus, two microphones can be used as input to DDA in order to increase the robustness of the ASR systems, as presented in Fig. 5. We propose to use a variant of the DDA where the output of the first microphone is used to drive the decoding of the second one. This approach presents two main benefits:

  • The speed of the second ASR system is boosted by the approximated transcript (only 0.1 \(\times \) RT);

  • While ROVER does not allow two systems to be combined efficiently without confidence scores, DDA combines the information easily.

Figure 5 illustrates the driven decoding solution: a first Speeral pass on stream 1 is used to drive a second pass on stream 2, allowing the information of the two streams to be combined.

Results using the 2-stream DDA are presented in Table 3. In most cases, DDA generated hypotheses that led either to the average WER of the two initial streams or to a better WER. The average WER is 11.4 %. We propose to extend this approach in the next section by driving the ASR system with a priori sentences selected on the first stream.

Fig. 6. Driven Decoding Algorithm used with two streams and a priori sentences: the first stream drives the second stream according to a refined selection of spotted sentences

6.3 Two Level DDA

In the previous approach, the decoding of the first stream was used to drive the second one: DDA aims to refine the decoding achieved during the first pass. Word spotting using ASR systems is known to focus on accuracy, since the prior probability of having the targeted terms in a transcription is low. On the other hand, transcription errors may introduce mistakes and lead to misses of correct utterances, especially for long requests: the longer the searched term, the higher the probability of encountering an erroneous word. In order to limit this risk, we introduced a two-level DDA: the speech segments of the first pass are projected onto the 3-best spotted sentences and injected via DDA into the ASR system for the second decoding pass. The first decoding pass generates hypotheses. By using the edit distance explained in Sect. 6.5, close spotted sentences are selected and used as input for the fast second pass, as presented in Fig. 6. In this configuration, the first pass is used to select the sentences that drive the second pass. In Fig. 6, the first system outputs “Allumer la lumière” (Turn on the light). The edit distance finds two close sentences: “Allumez la lumière” and “Allumez la télévision” (Turn on the TV). These sentences drive the second pass and allow the correct output “Allumez la lumière” to be found.

Results using this approach are shown in Table 3. According to the WER, this approach significantly improved the ASR system quality by taking advantage of the a priori information conveyed by the predefined spotted sentences. The WER is improved significantly for all speakers: the mean WER is 8.8 %. By using the two available streams, the ASR system is able to combine them efficiently. The best results are obtained with the two-level approach, where the ASR system is driven by both the first stream and the potential spotted sentences. The next section investigates the impact of each proposed method on the detection of pronounced sentences.

6.4 Multisource Speech Recognition: Results

For each approach, the presented results are averaged over the 21 speakers (plus the standard deviation for the WER). For the sake of comparison, the results of a baseline and an oracle baseline system are provided. The baseline system outputs the best decoding amongst the 7 ASR systems according to the highest SNR. The oracle baseline is computed by selecting the best WER for each speaker. The best results are achieved with DDA because the search for the best hypothesis in the lattice uses data from several channels and has more information than the decoding of a single channel.

6.5 Detection of Predefined Sentences

In order to spot sentences in an automatic transcript \(T\) of size \(m\), each sentence of size \(n\) from the set of predefined sentences \(H\) was aligned with \(T\) by using a Dynamic Time Warping (DTW) algorithm at the letter level [32]. Sequences were aligned by constructing an \(n\)-by-\(m\) matrix where the \((i^{th},j^{th})\) element of the matrix contains the distance between the two symbols \(T_i\) and \(H_j\) using the distance function defined below.

$$\begin{aligned} \begin{array}{ll} d(T_i, H_j) = 0 \text{ if } T_i = H_j \\ d(T_i, H_j) = 3 \text{ in the insertion case} \\ d(T_i, H_j) = 3 \text{ in the deletion case} \\ d(T_i, H_j) = 6 \text{ in the substitution case} \\ \end{array} \end{aligned}$$
(4)

The deletion, insertion and substitution costs were set empirically. The cumulative distance \(\gamma (i,j)\) between \(H_j\) and \(T_i\) is computed as:

$$\begin{aligned} \gamma (i,j) = d(T_i, H_j)+min\{\gamma (i-1,j-1),\gamma (i-1,j),\gamma (i,j-1)\} \end{aligned}$$
(5)

Each predefined sentence is aligned and associated with an alignment score: the percentage of well aligned symbols (here letters). The sentence with the best score is then selected as best hypothesis.
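
A minimal Python sketch of this letter-level alignment follows (illustration only; for brevity the score is a normalised cumulative distance rather than the exact percentage of well-aligned letters, which would require backtracking along the best path):

```python
def align_score(hyp, target, ins=3, dele=3, sub=6):
    """Align two character sequences with the costs of Eq. (4) and return a
    matching score derived from the cumulative distance of Eq. (5); 1.0 means
    a perfect match."""
    n, m = len(target), len(hyp)
    gamma = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                d = 0 if target[i - 1] == hyp[j - 1] else sub
                candidates.append(gamma[i - 1][j - 1] + d)   # match / substitution
            if i > 0:
                candidates.append(gamma[i - 1][j] + dele)    # deletion
            if j > 0:
                candidates.append(gamma[i][j - 1] + ins)     # insertion
            gamma[i][j] = min(candidates)
    return 1.0 - gamma[n][m] / (sub * max(n, m))

transcript = "allumer la lumiere"          # ASR output (accents dropped for simplicity)
predefined = ["allumez la lumiere", "allumez la television", "fermez les volets"]
print(max(predefined, key=lambda s: align_score(transcript, s)))  # allumez la lumiere
```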

This approach takes into account some recognition errors such as word declension or slight variations (téléviseur, télévision, etc.). Moreover, a mis-decoded word is often orthographically close to the correct one (due to their similar pronunciation).

To test the detection of a priori pronounced sentences, such as domotic orders (e.g., “allume la lumière”), the detection methods were applied in the following ASR configurations:

  • Baseline: Speeral system with acoustic and language model adaptation.

  • ROVER: Consensus vote between all streams.

  • DDA1: DDA driven by the first stream.

  • DDA2: DDA driven by the first stream and the spotted sentences.

The three systems based on ROVER and DDA gave the best performances, with respectively 88.2 %, 87.4 % and 92.5 % of correct classifications, while the baseline system obtains 85 % of correct classifications. It can be observed that the 2-level DDA based ASR system was able to detect more spotted sentences with less computational time and with more accuracy than the ROVER based one.

Sentence Detection: Results. In all the best configurations, predefined sentence recognition showed good accuracy: the baseline recognition gave 85 %. It can be observed that in the other configurations the spotting task correlated well with the WER. Thereby, ROVER and the two DDA configurations led to a significant improvement over the baseline. The best configuration, based on the two-level DDA, gave 92.5 % of correct classifications.

6.6 Discussion and Future Works

The goal of this study is to provide a path for improving vocal command recognition, with a focus on two aspects: distant speech recognition and sentence spotting. A distant speech French corpus was recorded with 21 speakers playing scenarios of activities of daily living in a real 4-room flat equipped with microphones set in the ceiling; this corpus is made of colloquial sentences, vocal commands and distress sentences. Several ASR techniques were evaluated, including the proposed variants of the Driven Decoding Algorithm (DDA). They gave better results than the baseline and the other approaches. Indeed, they analyse the signal on the two best SNR channels, and the use of a priori knowledge (specified vocal commands and distress sentences) increases the recognition rate in the case of true positive sentences and does not introduce false positives.

Evaluation in Real Conditions. The technology developed in this study was then tested in two other experiments in an Ambient Assisted Living context at the end of the Sweet-Home project. These experiments involved 16 non-aged participants for the first one and 11 aged or visually impaired people for the second one [33]. Each participant followed a scenario including various situations and activities of daily life. The objective of these experiments was to evaluate the use of voice commands for home automation in distant speech conditions, in real-time and in context-aware conditions [34]. Unfortunately, we were not able to integrate the DDA method into the real-time analysis software PATSH before the beginning of these experiments. Therefore, the performance of the system was still low (the Home Automation Command Error Rate was about 38 % [33]), but the results showed there is room for improvement. Although the participants sometimes had to repeat the voice command, up to three times, they were overall very excited about commanding their own home by voice. These results highlight the interest of the methods discussed above, and especially DDA2, which chooses among the available channels those that have the best SNR in order to refine the analysis. The biggest problems were the response time, which was unsatisfactory for 6 participants out of 16, and misunderstandings by the system, which forced the user to repeat the order (8/16). These technical limitations were reduced when we improved the ASR memory management and reduced the search space. After this improvement, only one participant with special needs complained about the response time.

Interest of the Recorded Corpus. During these experiments, all data were recorded. The acquired corpus was used to evaluate the performance of the audio analysis methods presented in this chapter, and it constitutes a precious resource for future work. Indeed, one of the main problems that impede research in this domain is the need for a large amount of annotated data (for analysis, machine learning and benchmarking). It is quite obvious that the acquisition of such datasets is highly expensive both in terms of material and of human resources. For instance, in the experiment presented in Sect. 4, the acquisition and the annotation of the 33-hour corpus cost approximately 70 k€.

The Sweet-Home multimodal corpus is thus a dataset recorded in realistic conditions in Domus, the Smart Home fully equipped with microphones and home automation sensors presented in Sect. 4.1; it will be made available to the research community [24]. This corpus was recorded thanks to participants who performed Activities of Daily Living (ADL). It is made of a multimodal subset, a French home automation speech subset recorded in distant speech conditions, and two interaction subsets, the first one recorded by 16 persons without disabilities and the second one by 6 seniors and 5 visually impaired people. This corpus was used in studies related to ADL recognition, context-aware interaction and distant speech recognition applied to home automation controlled through voice.

Future Projects. Our future project aims to develop a system capable of operating under the conditions encountered in an apartment. For this, we must first integrate BSS techniques to reduce the noise present in everyday life contexts, and second improve the DDA2 method to detect and recognize voice commands as well as distress calls.

7 Application of Speech Processing for Assistive Technologies

The applications of speech processing may present a great benefit for smart homes and Ambient Assisted Living (see Sect. 7.1), while Augmentative and Alternative Communication (AAC) retains the involvement of a broad community of researchers (see Sect. 7.2).

7.1 Smart Home and AAL

Anticipating and responding to the needs of persons with loss of autonomy with ICT is known as Ambient Assisted Living (AAL). ICT can contribute to the prevention and/or compensation of impairments and disabilities, and improve the quality of life, safety, communication and social inclusion of end users. It must also relieve isolation and the caregiver burden. ICT also participates in the modernization of health and social services by facilitating the home or institutional organization of professional care, its implementation, tolerance and performance [35]. In this domain, the development of smart homes is seen as a promising way of achieving in-home daily assistance [1]. Health smart homes have been designed to provide daily living support to compensate for some disabilities (e.g., memory help), to provide training (e.g., guided muscular exercise) or to detect potentially harmful situations (e.g., a fall, the gas not turned off). Basically, a health smart home contains sensors used to monitor the activity of the inhabitant. Sensor data are analyzed to detect the current situation and to execute the appropriate feedback or assistance.

A rising number of studies about audio technology in smart homes has been conducted. This includes speech recognition [36–39], sound recognition [3, 40, 41], speech synthesis [42] and dialogue [7, 8, 11, 43]. These systems are either embedded into the home automation system, or in a smart companion (mobile or not), or both, as in the Companions [44] or CompanionAble [41] projects.

However, given the diverse profiles of the users (e.g., low/high technical skill, disabilities, etc.), complex interfaces should be avoided. Nowadays, one of the best interfaces is the Voice-User Interface (VUI), whose technology has reached a stage of maturity and which provides interaction using natural language, so that the user does not have to learn complex computing procedures [10]. Moreover, it is well adapted to people with reduced mobility and to some emergency situations (hands-free and distant interaction). Indeed, a home automation system based on voice command will be able to improve the support and well-being of people in loss of autonomy. But, despite the interest presented by sound analysis techniques, the use of ASR for voice command in home automation in a real environment is still an open challenge.

Voice-User Interfaces in domestic environments have recently gained interest in the speech processing community, as exemplified by the rising number of smart home projects that consider Automatic Speech Recognition (ASR) in their design [5, 6, 8, 37–39, 45–49]. However, though VUIs are frequently employed in closed domains (e.g., smart phones), there are still important challenges to overcome [3]. Indeed, the task imposes several constraints on the speech technology:

  • distant speech conditions [16],

  • hand free interaction,

  • adaptation to potential users (elderly),

  • affordable for people who may have limited financial resources,

  • noise conditions in the home,

  • real-time,

  • respect for privacy.

In recent years, the research community has shown an increased interest in the analysis of the speech signal in noisy conditions, as the organization of the CHiME challenges shows. The first CHiME Challenge, held in 2011, was the first concerted evaluation of ASR systems in a real-world domestic environment involving both reverberation and highly dynamic background noise made up of multiple sound sources [50]. The second CHiME Challenge, in 2013, was supported by the IEEE AASP, MLSP and SL Technical Committees [51]. The configuration considered by this challenge was that of speech from a single target speaker being binaurally recorded in a domestic environment involving multisource background noise. The conditions of these challenges are still not close enough to real conditions, and future editions will attempt to move closer to realistic conditions.

Ageing has effects on the voice and movement of the person; thereby, aged voice is characterized by some specific features such as imprecise production of consonants, tremors, hesitations and slower articulation [52]. Some studies have shown age-related degeneration with atrophy of the vocal cords, calcification of the laryngeal cartilages, and changes in the muscles of the larynx [53, 54]. For these reasons, some authors highlight that ASR performance decreases with elderly voices. This phenomenon has been observed in the case of English, European Portuguese, Japanese and French [26, 55–57]. Vipperla et al. [58] made a very useful and interesting longitudinal study by using records of speeches delivered in the Supreme Court of the United States over a decade by the same judges. This study showed that adaptation to each speaker can bring performance closer to the scores of non-aged speakers, but this implies that the ASR must be adapted to each speaker. Nevertheless, some authors established that many other effects can also be responsible for ASR performance degradation, such as decline in cognitive and perceptual abilities [59, 60].

Moreover, since smart home systems for AAL often concern distress situations, it is unclear whether distressed voices will challenge the applicability of these systems. The speech signal contains linguistic information, but it may also be influenced by health, social status and emotional state [61, 62]. Recent studies suggest that ASR performance decreases in the case of emotional speech [63, 64]; however, it is still an under-researched area. In their study, Vlasenko et al. [63] demonstrated that acoustic models trained on read speech samples and adapted to acted emotional speech could provide better performance for spontaneous emotional speech recognition.

Moreover, such technology must be validated in real smart homes and with potential users. At this time, validation studies in such realistic conditions are rare [33]. In the same way, there are few user studies reported in the literature related to speech technology applications [10]; most are related to ICT in general [65].

7.2 Assistive Technologies

The field of Augmentative and Alternative Communication (AAC) is multidisciplinary and vast; its focus is to develop methods and technologies to aid communication for people with complex communication needs [66]. Potential users are the elderly and all people who may acquire a disability or have a degenerative condition which affects communication; this disability can result from both motor and cognitive impairments (e.g., paralysis, hearing or visual impairment, brain injury, Alzheimer's disease, etc.).

Speech and language processing play a major role in improving function for people with communication difficulties [67]. This is highlighted by the publication of special issues of journals and by the regular organisation of workshops and conferences on this topic. In 2009, the third issue of the ACM Transactions on Accessible Computing was devoted to AAC (Volume 1, Issue 3). In 2011, the relationship between assistive technology and computational linguistics was formalized with the formation of an ACL Special Interest Group on Speech and Language Processing for Assistive Technologies (SIG-SLPAT), which gained SIG status from the International Speech Communication Association (ISCA). The last SIG-SLPAT workshops, bringing together Computational Linguistics, Speech Processing and Assistive Technologies, took place in Montreal, Quebec (2012), in Grenoble, France (2013) and in Baltimore, U.S. (2014). In the same way, a special session of Interspeech, “Speech technologies for Ambient Assisted Living”, is organized in 2014. This special session aims at bringing together researchers in speech and audio technologies with people from the ambient assisted living and assistive technologies communities, to foster awareness between members of either community, discuss problems, techniques and datasets, and perhaps initiate common projects.

Regarding speech recognition, the most important challenges are related to the recognition of speech uttered by elderly, dysarthric or cognitively impaired speakers.

8 Future Outlook

Future challenges have been outlined in Sect. 7 above. These challenges are essentially related to scientific and technological problems to solve, but the human aspect must not be neglected.

8.1 Scientific and Technical Challenges

In a real home environment the audio signal is often perturbed by various and undetermined noises (e.g., devices, TV, music, roadwork, etc.). This shows the challenge of obtaining a usable system that will be set up not in lab conditions but in varied and noisy ones. Of course, in the future, smart homes could be designed specifically to limit these effects, but current smart home development cannot be successful if we are not able to handle these issues when equipping old-fashioned or poorly insulated homes. Finally, one of the most difficult problems is blind source separation. Some techniques developed in other areas of signal processing may be considered to analyze speech captured with far-field sensors and to develop a Distant Speech Recogniser (DSR), such as blind source separation, independent component analysis (ICA), beam-forming and channel selection.

Two main categories of audio analysis are generally targeted: daily living sounds and speech. These categories represent completely different semantic information and the techniques involved in processing these two kinds of signal are quite distinct. However, the distinction can be seen as artificial and there is high confusion between speech and sounds with overlapping spectra. For instance, one problem is to know whether a scream or a sigh must be classified as speech or sound.

Moreover, the system must react as quickly as possible to a vocal order. For example, if the user says “Nestor allume la lumière” (Nestor, turn on the light), the sentence duration is about 1 s, and the processing time generally lasts between 1.5 and 2 s. This delay may seem low, but not in real conditions when the user is waiting in the dark for the light to turn on. Thus, optimisations are needed to obtain fast recognizers.

8.2 Human Aspect

One of the main challenges to overcome for the successful integration of VUIs in AAL is the adaptation of the system to elderly users. Indeed, the ageing process is characterised by a decay of the main bio-physiological functions, affecting the social role and the integration of the ageing person in society. Overall, elderly people will be less inclined to adapt to a technology and its limitations (e.g., the constraint to pronounce words in a certain way) than younger adults, and they present a very diverse set of profiles that makes this population very difficult to design for.

For the elderly, there is a balance between the benefit of monitoring through sensors and the corresponding intrusion into privacy. The system has to be protected against intrusion and has to make sure that the information reaches only the right people or cannot leave the smart home.

This is the most important aspect because if the system is not accepted by its potential users, it will never be used in practice.