
1 Introduction

Ambient Assisted Living (www.aal-europe.eu) [1–3] is a new special area of assistive information technologies [4–6] focused on designing smart spaces, rooms, homes and intelligent environments to support and care for disabled and elderly people. At present, the AAL domain includes several international projects, for example, DOMEO, HAPPY AGEING, HOPE, SOFTCARE, Sweet Home, homeService, WeCare, etc. Arrays of microphones, video cameras and other sensors are often installed in assistive smart spaces [7].

In this study, we also analyze audio and video modalities for automatic monitoring of the activities and behavior of single elderly persons and people with physical, sensory or mental disabilities. In our recent work [8], we presented a prototype of a multimodal AAL environment with the main focus on video-based methods for monitoring the space and user behavior with omni-directional (fish-eye) cameras, mainly for detecting accidental user falls. In this article, we focus on audio-based techniques for monitoring the assistive smart space and recognizing speech and non-speech acoustic events, in order to automatically analyze human activities and detect possible emergency situations (when the person needs help). Using audio-based processing in addition to video analysis makes many multimodal systems more accurate and robust [9, 10]. Acoustic events in AAL environments include human speech/commands as well as sounds produced directly or indirectly by a human being (for example, cough, cry, chair movements, knocking at the door, steps, etc.). Spoken language is the most meaningful acoustic information; however, other auditory events also provide much useful information. In the scientific literature, there are some recent publications on automatic detection of individual acoustic events such as cough, sounds of a human fall, cry, scream, distress calls or other events, for example in works [11–15].

In our research, the AAL environment is a room of over 60 square meters. We developed a software-hardware complex for audio-based monitoring of this AAL environment during the Summer Workshop on Multimodal Interfaces eNTERFACE in Pilsen. The room (a physical model of the AAL environment) has 2 tables, 2 chairs and a sink, and is equipped with 2 omni-directional video cameras and 4 stationary microphones arranged in a grid. One Mobotix camera is placed on the ceiling and the second one on the side wall; the cameras' frame resolution is 640 × 480 pixels at a rate of 8 fps. The 4 Oktava MK-012 condenser microphones of the smart environment are connected to a multichannel external sound board M-Audio ProFire 2626. Each microphone has a cardioid polar pattern and can capture audio signals in a wide sector below the microphone with almost equal amplification. All the microphones are placed on the ceiling (about 2.5 m above the floor) in selected locations. The scheme of the physical model of our AAL environment is shown in Fig. 1.

Fig. 1. Scheme of the physical model of the Ambient Assisted Living environment
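A minimal sketch of how a four-channel audio stream from such a setup could be captured in Python, assuming the sound board is exposed as a standard multichannel input device; the sampling rate, recording duration and file names are assumptions, not the exact settings of our complex:

```python
import sounddevice as sd
import soundfile as sf

FS = 44100        # assumed sampling rate of the external sound board
CHANNELS = 4      # one channel per ceiling microphone
DURATION = 10     # seconds to record in this illustrative example

# Record a synchronous multichannel block from the default input device;
# in a real deployment the M-Audio interface would be selected explicitly
# via sd.default.device or the `device` argument.
audio = sd.rec(int(DURATION * FS), samplerate=FS, channels=CHANNELS, dtype="float32")
sd.wait()  # block until the recording is finished

# Store each microphone channel as a separate mono WAV file for later processing.
for ch in range(CHANNELS):
    sf.write(f"mic_{ch + 1}.wav", audio[:, ch], FS)
```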

The remainder of the paper describes the architecture and implementation of the automatic recognition system in Sect. 2 and presents experimental results and an analysis of its evaluation in Sect. 3.

2 Architecture of the Automatic Recognition System

The recognition vocabulary includes 12 non-speech acoustic events for different types of human activities plus 5 possible spoken commands. We also defined a set of alarm audio events X = {“Cough”, “Cry”, “Fall” (a human being), “Key drop” (a metal object), “Help”, “Problem” (commands)}, which can signal an emergency situation for the user inside the AAL environment. Figure 2 presents a tree classification of the audio signals in the AAL environment, including the speech commands and acoustic events modelled in the automatic system.

Fig. 2. A tree classification of audio signals of the AAL environment model
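As a simple illustration of how the alarm set X can be used at run time, the sketch below checks each recognition hypothesis against the alarm labels; the label spellings mirror the event names above, while the notification and logging hooks are hypothetical placeholders:

```python
# Alarm events from the set X defined above; label spellings are assumptions.
ALARM_EVENTS = {"Cough", "Cry", "Fall", "Key drop", "Help", "Problem"}

def handle_hypothesis(label: str) -> None:
    """React to a single recognition hypothesis produced by the decoder."""
    if label in ALARM_EVENTS:
        notify_caregiver(label)  # hypothetical emergency-notification hook
    else:
        log_activity(label)      # hypothetical activity-logging hook

def notify_caregiver(label: str) -> None:
    print(f"ALARM: possible emergency event '{label}' detected")

def log_activity(label: str) -> None:
    print(f"activity: {label}")
```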

Figure 3 shows the software-hardware architecture of the automatic system for recognition of speech and non-speech acoustic events. Acoustic modeling in the system is based on first-order Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs), as in many modern automatic speech recognition systems [16]. The system extracts feature vectors consisting of 13 Mel-frequency cepstral coefficients (MFCC) with deltas and double deltas from the multichannel audio signals.

Fig. 3. Architecture of the audio event/speech recognition system
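A minimal sketch of the 39-dimensional feature extraction described above (13 MFCCs plus deltas and double deltas), using librosa as an assumed implementation; the frame and window settings are illustrative, not the exact values of the system:

```python
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (num_frames, 39) matrix of MFCC + delta + delta-delta features."""
    signal, sr = librosa.load(wav_path, sr=sr)               # mono channel of one microphone
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)   # 25 ms window, 10 ms shift (assumed)
    delta = librosa.feature.delta(mfcc)                      # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)            # second-order derivatives
    return np.vstack([mfcc, delta, delta2]).T                # frames as rows
```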

The system uses an HMM-based Viterbi decoding algorithm to find the optimal model for an input audio signal (Fig. 4). All acoustic events and the phonemes of the speech commands are represented by HMMs with a common topology but different numbers of states (from 1 to 6 states per model), depending on the duration of the modelled unit.

Fig. 4. HMM-based method for automatic recognition of audio signals
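The decoding idea can be illustrated with a bank of per-class GMM-HMMs, where an unknown segment is assigned to the model with the highest Viterbi log-probability. The sketch below uses hmmlearn; the state counts and mixture sizes are assumptions, and the real system decodes against a grammar rather than scoring isolated segments:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_event_models(train_data: dict, n_states: int = 3, n_mix: int = 4) -> dict:
    """Train one GMM-HMM per event class.

    train_data maps a class label to a list of (num_frames, 39) feature matrices.
    """
    models = {}
    for label, sequences in train_data.items():
        X = np.vstack(sequences)                    # stack all training sequences
        lengths = [len(seq) for seq in sequences]   # per-sequence frame counts
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify(models: dict, features: np.ndarray) -> str:
    """Pick the class whose model gives the best Viterbi log-probability."""
    def viterbi_logprob(label: str) -> float:
        logprob, _ = models[label].decode(features, algorithm="viterbi")
        return logprob
    return max(models, key=viterbi_logprob)
```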

The recognition process runs in on-line mode (the real-time factor is below 0.1), and a recognition hypothesis is issued almost immediately after energy-based voice/audio activity detection. We apply a speaker-dependent recognizer [17] because of the system's purpose and usability considerations. All speech commands, audio events, as well as garbage (any unknown acoustic events) and silence models are described by a grammar that allows the system to output only one recognition hypothesis at a time. The grammar also imposes some restrictions; for example, two or more “Fall” events cannot follow each other, unlike “Step” events. The developed ASR system is bilingual and able to recognize and interpret speech commands both in English and in Russian.
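A minimal sketch of the kind of energy-based activity detection that can trigger decoding, assuming a simple frame-energy threshold above an estimated noise floor; the frame size and threshold values are assumptions:

```python
import numpy as np

def detect_activity(signal: np.ndarray, frame_len: int = 400,
                    threshold_db: float = 10.0) -> list:
    """Return (start_frame, end_frame) pairs of segments whose frame energy
    exceeds the estimated noise floor by `threshold_db` decibels."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    noise_floor = np.percentile(energy_db, 10)        # rough noise-floor estimate
    active = energy_db > noise_floor + threshold_db

    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                                  # segment begins
        elif not is_active and start is not None:
            segments.append((start, i))                # segment ends
            start = None
    if start is not None:
        segments.append((start, n_frames))
    return segments
```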

In order to train the probabilistic HMM-based acoustic models of the recognizer, a new audio corpus was collected in normal room conditions with an acceptable level of background noise (SNR > 20 dB). In total, we recorded over 1.3 h of audio data from several potential users performing certain scenarios. Over two thousand audio files were recorded, almost half of which contain non-speech audio events. Approximately 2/3 of each subject's audio data were used for training and development, and the remaining data were employed for evaluation.

3 Results of the Experiments with the AAL Model

We developed several scenarios modeling basic actions performed by people in a living room (studio). The first scenario involves one person and simulates an emergency situation (a metal object is dropped and the person falls on the floor at the end). The second scenario involves up to 3 subjects, who can interact with each other and may occlude each other in some frames; it is used for testing the video-based user monitoring sub-system [8]. The main scenario involving audio-visual data consists of the following actions of a tester:

  1. Enter the room from the door side (open & close the door).
  2. Walk to the table 1.
  3. Take a glass with water from the table 1.
  4. Walk to the chair 1.
  5. Sit on the chair 1.
  6. Drink water.
  7. Long cough after drinking.
  8. Stand up.
  9. Walk to the table 1.
  10. Release the glass (it drops).
  11. Walk to the sink.
  12. Wash hands in the sink.
  13. Exit the room (open & close the door).
  14. Enter the room again.
  15. Walk to the chair 2.
  16. Sit on the chair 2.
  17. The telephone rings on the table 2.
  18. Say “Answer phone”.
  19. Talk on the telephone.
  20. Say “Hello”.
  21. Say “I’m fine”.
  22. Say “Don’t worry”.
  23. Say “Good bye”.
  24. Stand up.
  25. Walk to the table 1.
  26. Take a metallic cup from the table.
  27. Walk freely (several steps).
  28. Drop the cup on the floor.
  29. Make a step.
  30. Fall on the floor.
  31. Cry.
  32. Ask for “Help”.

During the multimodal database collection, we recorded audio-visual samples of the first scenario from 5 different subjects (potential users). They were free to perform the fall (on the hard floor, which produces a sound) in whatever way was comfortable for them. The training part of the audio database was recorded in the same room; no new people entered the room during the recording sessions, which removed most external noises. Five checkpoints were defined in the room for collecting the training audio data: four of them were located on the floor under each microphone and the last one in the centre of the room. Each of the 5 testers performed the following sequence of actions:

  1. Come to a checkpoint.
  2. Give a speech command or simulate a non-speech audio event.
  3. Move to the following checkpoint and repeat from step (1).

All the speech commands and acoustic events were simulated many times by different testers. In total, we recorded over 2800 audio files (in the PCM WAV format); 44 % of them contain non-speech events and the rest contain speech commands. 70 % of the recordings of each subject were used for training the system and the rest of the data were used in the experiments. This corpus (SARGAS DB) has been registered with RosPatent (№ 2013613086, 25/03/2013).
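The 70/30 per-subject partition described above can be expressed as a simple file-level split; a hedged sketch, with the corpus layout (a mapping from subject to file paths) and the random seed as assumptions:

```python
import random
from collections import defaultdict

def split_corpus(files_by_subject: dict, train_ratio: float = 0.7, seed: int = 0):
    """Split each subject's recordings into training and test subsets.

    files_by_subject maps a subject id to a list of WAV file paths.
    """
    rng = random.Random(seed)
    train, test = defaultdict(list), defaultdict(list)
    for subject, files in files_by_subject.items():
        files = list(files)
        rng.shuffle(files)                       # randomize before cutting
        cut = int(len(files) * train_ratio)      # 70 % of this subject's files
        train[subject] = files[:cut]
        test[subject] = files[cut:]
    return train, test
```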

Table 1. Confusion matrix for speech command recognition (accuracy, %)

The automatic system for recognition of speech commands and audio events was evaluated on audio recordings, including the part of the corpus containing data of the first scenario recorded in the ambient assisted living environment. Table 1 shows the confusion matrix with the accuracy rates (in %) for the recognized speech commands. The results show that most speech commands were recognized with a high accuracy of over 90 %; however, some recognition errors still occurred during the test scenarios.

Table 2. Confusion matrix for acoustic event recognition (accuracy, %)

Table 2 shows another confusion matrix with the accuracy rates (in %) for the recognized acoustic events in the AAL environment model. The presented results show that the lowest accuracy was observed for the non-speech audio event “Fall”; in a third of such cases this acoustic event was recognized as “Step”.

On average, the recognition accuracy was 93.8 % for acoustic events and 96.5 % for speech commands.
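For reference, per-class accuracies and their averages of this kind can be derived from a confusion matrix of reference labels versus recognized labels; a minimal sketch:

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs):
    """pairs is an iterable of (reference_label, recognized_label) tuples."""
    matrix = defaultdict(Counter)
    for ref, hyp in pairs:
        matrix[ref][hyp] += 1
    return matrix

def per_class_accuracy(matrix):
    """Percentage of correctly recognized instances for each reference class."""
    return {ref: 100.0 * row[ref] / sum(row.values()) for ref, row in matrix.items()}

def average_accuracy(matrix):
    """Unweighted mean of the per-class accuracies."""
    acc = per_class_accuracy(matrix)
    return sum(acc.values()) / len(acc)
```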

4 Conclusion

We presented a software-hardware complex for audio-based monitoring of the AAL environment model. Audio signals in AAL environments include human speech/commands and sounds produced by a human directly or indirectly (for example, chair movements, cough, cry, knocking at the door, steps, etc.). The system uses a Viterbi-based algorithm to find the optimal model for an input audio signal. HMMs with a common topology but different numbers of states (1–6) model each acoustic event and command depending on its duration. Recognition is performed in on-line mode (real-time factor < 0.1), and a hypothesis is issued almost immediately after energy-based voice/audio activity detection. We apply a speaker-dependent recognizer because of the system's purpose and usability considerations. The vocabulary includes 12 non-speech acoustic events for different types of human activities plus 5 spoken user commands, including a set of alarm events that can signal an emergency situation inside the AAL environment. To train the acoustic models, we collected an audio corpus in quiet room conditions with a low level of background noise (SNR > 20 dB). Over two thousand audio files were recorded, almost half of which contain non-speech audio events. Approximately 2/3 of each subject's audio data were used for training and development, and the remaining data were employed for evaluation. In the experiments, the recognition accuracy was 93.8 % on average for acoustic events and 96.5 % for speech commands.