1 Introduction

This paper aims to contribute towards creating ambient intelligent environments that understand their users through action and gesture recognition, so that the needed services can be provided automatically and instantly to maximize user comfort and safety while minimizing energy use. To realize such environments, the user behavior must first be recognized automatically so that the best environment action can be taken to satisfy the environment objectives.

This paper addresses the problem of human behavior recognition from video sequences using fuzzy-based systems. We aim to develop systems that can effectively recognize fundamental types of behavior such as handwaving, walking and running from input image sequences. A robust solution to this problem would make it possible to automatically recognize gestures and actions in human–computer interaction, detect abnormal events in video surveillance, summarize relevant visual context in video indexing/retrieval, and analyze important events in an intelligent space.

The problem of human behavior recognition has been widely studied in the computer vision literature (Weinland et al. 2010; Aggarwal and Ryoo 2011). In most existing methods, the first step is feature extraction, which describes the characteristics of the subject. These methods can be roughly classified into four categories: motion-based (Efros et al. 2003; Fathi and Mori 2008; Wang and Mori 2008; Wang et al. 2007), appearance-based (Elgammal et al. 2003; Thurau and Hlavac 2008), space-time volume-based (Blank et al. 2007; Ke et al. 2007; Laptev and Perez 2007), and space-time interest point and local feature-based (Dollar et al. 2005; Laptev et al. 2008; Niebles and Fei-Fei 2007; Nowozin et al. 2007; Schuldt et al. 2004). Behavior recognition approaches are mostly based on machine learning techniques from the pattern recognition literature. These include K-Nearest Neighbor (\(k\)-NN) (Efros et al. 2003; Wang et al. 2007; Niebles and Fei-Fei 2007; Blank et al. 2007; Thurau and Hlavac 2008; Weinland and Boyer 2008), Support Vector Machine (SVM) (Dollar et al. 2005; Laptev et al. 2008; Schuldt et al. 2004; Jhuang et al. 2007; Liu and Shah 2008; Schindler and Gool 2008), boosting-based classifiers (Nowozin et al. 2007; Fathi and Mori 2008; Laptev and Perez 2007), and Hidden Markov Model (HMM) (Elgammal et al. 2003; Vezzani et al. 2010; Yamato et al. 1992). Blank et al. (2007) proposed a feature set of pose primitives for behavior representation and \(n\)-Gram models which were utilized for behavior matching and recognition. Wang et al. (2007) developed a feature set modeling behavior as a set of lowest distances from exemplars to behavior images in an exemplar-based space. Acampora et al. (2012) used type-1 fuzzy logic to analyze human trajectories in order to recognize activities and behavior for abnormal event detection. Yamato et al. (1992) used discrete HMMs to recognize image sequences of six different tennis strokes among three subjects. Vezzani et al. (2010) proposed an HMM-based action recognition method using a model-based feature set. However, automatic behavior classification remains a challenging problem due to the high complexity and uncertainty of dynamic environments, such as complicated scene backgrounds, occlusion, and the varying posture and size of moving objects.

In this paper, we aim to introduce a computationally efficient robust system for behavior recognition based on fuzzy logic systems (Mendel 2001; Doctor et al. 2005; Hagras et al. 2003), which have an inherent representational power to deal with uncertainty in real-world problems. In the proposed system, we have a set of fuzzy rules with the antecedents denoting the conjoint occurrence of fuzzy propositions concerning behavioral features. The consequents of the proposed fuzzy systems refer to the possibility of a fuzzy behavior class. The membership functions for the inputs to the fuzzy system were obtained using the fuzzy C-means (FCM) clustering approach (Pal and Bezdek 1995).

We have successfully tested our system on the publicly available Weizmann human action dataset (Blank 2005), where our fuzzy-based system produced an average recognition accuracy of 94.03 %, outperforming the traditional non-fuzzy systems based on hidden Markov models and other state-of-the-art approaches applied to the Weizmann dataset.

2 Related works

Achieving robust behavior and activity recognition in real-world environments is highly challenging due to the high levels of behavior uncertainty, motion ambiguity, and uncertain factors of the subject such as position, orientation and speed. Even the behavior features of different subjects that are representative of the same action class have a wide variance. To make matters worse, a given subject who performs multiple instances of the same action category does not perform them identically. Thus, there are wide intra- and inter-subject variations in behavioral characteristics, which cause high levels of uncertainty and ambiguity in behavior recognition.

Some previous approaches combine computer vision and fuzzy logic to recognize extracted representations of behavior patterns. In this field, fuzzy logic has been shown to be a powerful tool for recognizing human behavior and dealing with uncertainty. In Chang et al. (2009), a fuzzy rule-based human activity recognition system for e-health was introduced and achieved an accuracy of about 90 %. In Medjahed et al. (2009), a system for recognizing activities of daily living using hybrid sensors and a fuzzy logic system was proposed, and the analysis result was robust. Ioannidou et al. (2006) reported an interactive computer graphics environment that encompasses a set of fuzzy logic analysis tools and a fuzzy inference model. In Gokmen et al. (2010), fuzzy logic was employed to recognize students’ behavior so as to evaluate their performance in a control course laboratory. However, most of these approaches use complicated feature models, which increases the complexity of constructing the fuzzy logic system. In our proposed method, the use of fuzzy logic with a simplified feature model gives a more flexible representation of human behavior and a higher recognition speed.

3 The proposed fuzzy logic-based system for the automation of human behavior recognition

It is worth pointing out that assigning a human behavior to one of several behavior classes falls under the generic class of pattern recognition problems, which aim to determine the mapping between a behavioral feature space and action categories. Unfortunately, behavior features of different subjects that are representative of the same action class have a wide variance. Fuzzy logic systems are an established approach for handling uncertainties in complicated real-world problems. In this paper, we employ fuzzy logic systems to handle the uncertainties associated with human behavior recognition in intelligent spaces.

Figure 1 shows an overview of the proposed system. In the training stage, the human silhouette is detected and extracted using our previous work based on an interval type-2 fuzzy logic system (IT2FLS) (Yao et al. 2012). From the extracted silhouette, input feature vectors for our fuzzy-based recognition method are then computed based on a model-based feature set (to be discussed in Sect. 3.2) that describes the shape and motion characteristics. The fuzzy membership functions of the inputs to the fuzzy systems are then learned via FCM clustering. During the testing stage, human subjects are first detected and then tracked to extract the silhouette image, from which the input shape-motion features are computed and used as input values for the fuzzy-based recognition system. In our fuzzy system, each membership function corresponds to a behavior model, and each output degree represents the likelihood that the behavior in the current frame matches the trained behavior model in the knowledge base. The behavior in the current frame is then classified and recognized by selecting the candidate model with the highest output degree. In the following subsections, we present the various components of the proposed system.

Fig. 1 Overview of our proposed system

3.1 Silhouette extraction

Accurate human silhouette segmentation from a video sequence is important and fundamental for advanced video procedures such as pedestrian detection, human activity detection and behavior recognition. To obtain robust silhouette segmentation, Gaussian Mixture Models (GMM) were proposed in Katz et al. (2003) to extract the foreground images. However, it is unreasonable to simply treat the GMM foreground as the human silhouette in real-world environments because of noise factors including varying light conditions, reflections and shadows, and moving objects attached to the human silhouette. To handle these problems, a Type-1 Fuzzy Logic System (T1FLS) was proposed in Chen et al. (2006). This T1FLS is capable of handling the uncertainties mentioned above to an extent; however, the extracted silhouette is degraded by misclassification. Hence, in Yao et al. (2012), we proposed an IT2FLS which was able to handle the high uncertainty levels present in real-world dynamic environments while also effectively reducing the misclassification of the extracted silhouette. With our proposed IT2FLS, the average accuracy for silhouette extraction improved to 99.94 %, which was 8.26 % higher than the accuracy achieved by the T1FLS employed in Chen et al. (2006); meanwhile, the average misclassification of our proposed IT2FLS was reduced to 5.71 pixels, which was 446.26 pixels lower than the misclassification of the T1FLS in Chen et al. (2006).

3.2 Feature representation

Our approach uses an efficient feature set (Vezzani et al. 2010) with low computational complexity, based on a multi-feature model including movement speed and appearance shape. Based on the extracted silhouette image, as shown in Fig. 2, the silhouette region is separated into five slices \(S_1\), \(S_2\), \(S_3\), \(S_4\), \(S_5\) according to a polar coordinate partitioning centered at the gravity center \((x_c(t), y_c(t))\). Ideally, the slices should be located at the areas of the head, the arms and the legs of the human silhouette. Suppose that we are working on frame \(t\), and the human silhouette image of frame \(t\) is extracted by the proposed algorithm based on IT2FLS. Then, in the obtained silhouette image, as shown in Fig. 2c, the area of the entire human silhouette is denoted by \(A_t\), while the areas of the silhouette slices are denoted by \(\{A_t^i\}_{i=1\ldots 5}\). Based on these values, the 7-dimensional feature set for our fuzzy-based recognition system is constructed. It is similar to the 17-dimensional feature set in Vezzani et al. (2010), but uses fewer input feature categories and has lower computational complexity. The seven input features are motion speed in the horizontal direction (\(O^1\)), motion speed in the vertical direction (\(O^2\)), area ratio of the head silhouette (\(O^3\)), area ratio of the right hand silhouette (\(O^4\)), area ratio of the right leg silhouette (\(O^5\)), area ratio of the left hand silhouette (\(O^6\)), and area ratio of the left leg silhouette (\(O^7\)). Thus the feature set contains both motion information (\(O^1\), \(O^2\)) and shape description (\(O^3 \ldots O^7\)). The 7-dimensional feature set for the fuzzy system is obtained as follows:

$$\begin{aligned} O_{t}&= \{ O_{t}^{1}, \ldots , O_{t}^{7} \}, \\ O_{t}^{1}&= \frac{1}{3}\sum \limits _{i = 0}^{2} |x_{c} (t - i) - x_{c} (t - i - 1)|, \\ O_{t}^{2}&= \frac{1}{3}\sum \limits _{i = 0}^{2} |y_{c} (t - i) - y_{c} (t - i - 1)|, \\ O_{t}^{k+2}&= \frac{A_{t}^{k}}{A_{t}}, \quad k = 1, \ldots , 5. \end{aligned} \tag{1}$$
Fig. 2 a Original image, b silhouette image extracted by our proposed method based on IT2FLS, c silhouette slice partitions
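The feature computation in Eq. (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the list-of-rows silhouette format and, in particular, the use of five equal-width angular sectors are assumptions made for illustration only, since the text does not give the actual slice boundaries.

```python
import math


def centroid(mask):
    """Return the gravity center (x_c, y_c) of a binary silhouette (rows of 0/1)."""
    xs = ys = n = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                xs += x
                ys += y
                n += 1
    if n == 0:
        raise ValueError("empty silhouette")
    return xs / n, ys / n


def slice_areas(mask, n_slices=5):
    """Count silhouette pixels per angular sector around the centroid.

    ASSUMPTION: equal-width sectors; the paper's real partition may differ.
    """
    xc, yc = centroid(mask)
    width = 2 * math.pi / n_slices
    areas = [0] * n_slices
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                angle = math.atan2(y - yc, x - xc) % (2 * math.pi)
                areas[min(int(angle / width), n_slices - 1)] += 1
    return areas


def feature_vector(centroids, mask):
    """Build O_t from the last four centroids (oldest first) and the current mask."""
    if len(centroids) < 4:
        raise ValueError("need centroids for frames t-3 .. t")
    c = centroids[-4:]
    # O^1, O^2: mean absolute centroid displacement over three frame steps.
    o1 = sum(abs(c[j + 1][0] - c[j][0]) for j in range(3)) / 3
    o2 = sum(abs(c[j + 1][1] - c[j][1]) for j in range(3)) / 3
    # O^3..O^7: area of each slice divided by the whole silhouette area.
    areas = slice_areas(mask)
    total = sum(areas)
    return [o1, o2] + [a / total for a in areas]
```

Because every silhouette pixel falls in exactly one sector, the five area ratios sum to one, which is a convenient sanity check on any concrete partition.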

3.3 The proposed fuzzy system for behavior recognition

In our fuzzy system, the antecedents are seven linguistic variables: motion speed of the centroid of the human silhouette in the horizontal direction (\(O^1\)) and in the vertical direction (\(O^2\)), which together describe the movement speed of the human silhouette; area ratio of the head silhouette (\(O^3\)), which is the head silhouette pixel count as a percentage of the entire human silhouette pixel count; and, similarly, area ratio of the right hand silhouette (\(O^4\)), area ratio of the right leg silhouette (\(O^5\)), area ratio of the left hand silhouette (\(O^6\)), and area ratio of the left leg silhouette (\(O^7\)). Using these linguistic variables, we can effectively model the movement information and the general appearance features of a generic subject. Additionally, the complexity of this 7-dimensional feature set is low enough to construct a computationally efficient type-2 fuzzy logic system that obtains reasonable recognition accuracy. Each of these antecedents is represented by four fuzzy sets: VERY LOW, LOW, MEDIUM, and HIGH. The output of the fuzzy system is the behavior possibility, which is represented by two fuzzy sets: LOW and HIGH. The fuzzy membership functions (MFs) shown in Fig. 3 have been obtained via FCM clustering.

Fig. 3 Membership functions from FCM for our fuzzy-based recognition system
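As an illustration of how FCM clustering can yield membership degrees for one input feature, the following Python sketch runs standard one-dimensional fuzzy C-means with four clusters. It is a sketch under stated assumptions, not the authors' procedure: the fuzzifier \(m = 2\), the evenly spaced initial centers, the stopping tolerance and the function names are all illustrative choices, and the text does not specify how the final MF shapes in Fig. 3 were derived from the clustering result.

```python
def fcm_memberships(x, centers, m=2.0):
    """Standard FCM membership of scalar x in each cluster center."""
    d = [abs(x - c) for c in centers]
    if any(dj == 0 for dj in d):
        # x sits exactly on one or more centers: share full membership among them.
        hits = [1.0 if dj == 0 else 0.0 for dj in d]
        s = sum(hits)
        return [h / s for h in hits]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** p for dk in d) for dj in d]


def fcm_1d(data, n_clusters=4, m=2.0, iters=100, tol=1e-6):
    """Cluster scalar samples; returns the sorted cluster centers.

    ASSUMPTION: deterministic, evenly spaced initial centers over the data range.
    """
    lo, hi = min(data), max(data)
    centers = [lo + (hi - lo) * (j + 0.5) / n_clusters for j in range(n_clusters)]
    for _ in range(iters):
        u = [fcm_memberships(x, centers, m) for x in data]
        new_centers = []
        for j in range(n_clusters):
            w = [ui[j] ** m for ui in u]
            total = sum(w)
            # Weighted mean of the data; keep the old center if no weight is assigned.
            new_centers.append(
                sum(wi * x for wi, x in zip(w, data)) / total if total > 0 else centers[j]
            )
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:
            break
    return sorted(centers)
```

With four clusters per feature, the sorted centers could be read as the VERY LOW, LOW, MEDIUM and HIGH sets, and `fcm_memberships` then gives the degree to which a new measurement belongs to each set.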

Suppose we measure \(\{O^1\ldots O^7\}\) on silhouette images in order to express the possibilities of the candidate behavior classes: running, walking, jumping-in-place, jumping-jack, jumping-forward, galloping-sideways, waving-two-hands, skipping, bending, waving-one-hand. The mapping between measurements and behavior classes is accomplished by fuzzy rules. In our system, the size of the rule base is \(191\), and the rule base is constructed via learning from the input/output data by expert engineers. One illustrative fuzzy rule from our rule base could be written as follows:

Rule \(i\): IF (horizontal-motion-speed is HIGH) & (head-silhouette-ratio is MEDIUM) &
(vertical-motion-speed is VERY LOW) & (leftHand-silhouette-ratio is HIGH) &
(rightHand-silhouette-ratio is LOW) & (leftLeg-silhouette-ratio is LOW) &
(rightLeg-silhouette-ratio is LOW) THEN
(running-possibility is HIGH) & (otherBehaviours-possibility is LOW)

Each behavior class uses the same output membership function, which is shown in Fig. 3h. In our system, we use the product t-norm to represent the AND logical connective and the implication operation. Behavior recognition is done by selecting the candidate behavior class with the highest firing strength as the recognized behavior type. However, if two different candidate behavior classes are assigned the same output degree, these two classes have significantly high behavioral similarity and cannot be distinguished effectively in the current frame.
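The product t-norm and the highest-firing-strength selection described above can be sketched in Python as follows. This is a simplified illustration rather than the system's actual inference engine: the rule and input data structures are hypothetical, only two antecedents are shown for brevity, and taking the maximum firing strength over the rules of a class is an assumed aggregation that the text does not state.

```python
from math import prod


def firing_strength(rule, fuzzified):
    """Product t-norm over the membership degrees of all antecedents of a rule."""
    return prod(fuzzified[feature][label] for feature, label in rule["if"].items())


def classify(rules, fuzzified):
    """Return (winning classes, per-class scores).

    ASSUMPTION: a class is scored by its strongest rule. More than one winner
    means the frame is ambiguous between those behavior classes.
    """
    scores = {}
    for rule in rules:
        s = firing_strength(rule, fuzzified)
        scores[rule["then"]] = max(s, scores.get(rule["then"], 0.0))
    best = max(scores.values())
    winners = sorted(c for c, s in scores.items() if s == best)
    return winners, scores


# Hypothetical membership degrees for one frame (two features for brevity).
frame = {
    "horizontal-motion-speed": {"LOW": 0.1, "HIGH": 0.8},
    "vertical-motion-speed": {"VERY LOW": 0.5, "HIGH": 0.2},
}
# Hypothetical two-rule base; the real rule base has 191 rules over seven inputs.
rule_base = [
    {"if": {"horizontal-motion-speed": "HIGH", "vertical-motion-speed": "VERY LOW"},
     "then": "running"},
    {"if": {"horizontal-motion-speed": "LOW", "vertical-motion-speed": "HIGH"},
     "then": "pjump"},
]
winners, scores = classify(rule_base, frame)
```

Returning a list of winners rather than a single label keeps the tie case explicit, so a caller can flag the frame as indistinguishable instead of picking a class arbitrarily.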

4 Experiments and results

We have tested the proposed system on a widely used benchmark dataset for human behavior recognition, the Weizmann human action dataset (Blank 2005). The Weizmann actions dataset consists of 5,687 frames and \(10\) different categories of behavior classes: running, walking, jumping-in-place-on-two-legs (pjump), jumping-forward (jump), bending, jumping-jack (jack), galloping-sideways (side), skipping, waving-two-hands (wave2), waving-one-hand (wave1). Video sequences in this dataset are captured with a stationary camera and a simple background. Nevertheless, it provides a good experimental environment for investigating the recognition accuracy of the proposed method when the number of behavior categories is large. Example frames of the behavior categories are shown in Fig. 4. Each behavior category is performed once or sometimes twice per video by nine different people (subjects), resulting in \(93\) video sequences in total. Similar to Vezzani et al. (2010), our experiments are performed by leave-one-out cross-validation as suggested by Scovanner et al. (2007). During the testing stage, our model was evaluated for per-frame and per-video recognition. Specifically, per-frame recognition means running the analysis algorithm on each frame and obtaining a recognition result for each individual frame, while per-video recognition means obtaining a global recognition result for the entire video sequence. The average accuracy of our method and a comparison with other traditional non-fuzzy-based algorithms are reported in Tables 1 and 2.
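The relationship between the two evaluation modes can be illustrated with a short Python sketch. The text does not state how the per-video result is obtained from the per-frame results, so the majority vote used here is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def per_video_label(frame_labels):
    """ASSUMPTION: the video label is the most frequent per-frame label."""
    if not frame_labels:
        raise ValueError("video has no frames")
    return Counter(frame_labels).most_common(1)[0][0]


def accuracy(predicted, truth):
    """Fraction of predictions that match the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("length mismatch")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical per-frame outputs for one 'pjump' video.
frames = ["pjump", "pjump", "jack", "pjump"]
frame_acc = accuracy(frames, ["pjump"] * len(frames))
video_label = per_video_label(frames)
```

Under this assumed aggregation, a video can be labelled correctly even when some of its frames are misclassified, which is consistent with per-video accuracy being higher than per-frame accuracy.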

Fig. 4 Weizmann dataset for human behavior recognition

Table 1 Confusion matrix for per-frame average accuracy of the Weizmann human action dataset (the overall average accuracy is 94.03 %), horizontal rows are ground truths, and vertical columns are recognition results
Table 2 Comparison of overall average recognition accuracy with previous traditional non-fuzzy methods on the Weizmann dataset

Table 1 shows the confusion matrix for the average accuracy of per-frame recognition using our fuzzy-based method for the \(10\) behavior classes on the Weizmann dataset. We can see that the proposed system correctly recognizes most actions at 100 % average accuracy, including the behaviors “skipping”, “running”, “side-galloping”, “waving one hand” and “waving two hands”; in this dataset, “skipping” is commonly considered one of the most challenging behavior categories to recognize. In our system, the recognition accuracy for “pjump” is 94.23 %; the 5.77 % misclassification rate is due to the behavioral similarity between “pjump” and “jack”. It is worth mentioning that we do not present the confusion matrix for per-video recognition since it is simply a perfect diagonal matrix, and the per-video recognition accuracies for the behavior classes are all 100 %. Table 2 shows that the proposed system outperforms the traditional non-fuzzy-based recognition method which used hidden Markov models (HMM) (Vezzani et al. 2010), and is comparable to other state-of-the-art approaches on the Weizmann dataset. In order to make a fair comparison with Vezzani et al. (2010), a similar input feature set has been used, namely \(7\) of the \(17\) feature categories proposed in Vezzani et al. (2010). In our experiments, we selected several traditional non-fuzzy behavior recognition algorithms for performance comparison. For a fair comparison, the selected methods were also tested on the same dataset as our proposed fuzzy-based approach. Additionally, these traditional algorithms are highly cited and were reported in high-quality journals and conferences. As shown in Table 2, our fuzzy-based method achieves \(7.33\) % higher average per-frame accuracy than the traditional HMM-based algorithm while using \(10\) fewer input feature categories. Our approach outperformed the per-frame recognition accuracy of another state-of-the-art method based on the hidden conditional random field (hCRF) (Wang and Mori 2008) by \(3.74\) %, and it also outperformed the codebook-based algorithm (Niebles and Fei-Fei 2007), which was also applied to the Weizmann dataset, by \(39.03\) %. The per-video recognition accuracy of our proposed method is \(100\) %, outperforming the traditional non-fuzzy approaches including the hCRF-based method (Wang and Mori 2008), the SVM-based approach (Jhuang et al. 2007) and the codebook-based algorithm (Niebles and Fei-Fei 2007) by \(2.78\), \(1.20\) and \(27.20\) %, respectively.
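For reference, the baseline accuracies implied by the differences quoted above can be recovered by subtraction, under the assumption that the quoted margins are absolute differences in percentage points:

$$\begin{aligned} \text{HMM, per-frame:}&\quad 94.03 - 7.33 = 86.70\,\% \\ \text{hCRF, per-frame:}&\quad 94.03 - 3.74 = 90.29\,\% \\ \text{codebook, per-frame:}&\quad 94.03 - 39.03 = 55.00\,\% \\ \text{hCRF, per-video:}&\quad 100 - 2.78 = 97.22\,\% \\ \text{SVM, per-video:}&\quad 100 - 1.20 = 98.80\,\% \\ \text{codebook, per-video:}&\quad 100 - 27.20 = 72.80\,\% \end{aligned}$$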

It should also be pointed out that our approach is computationally efficient: learning the model usually takes less than 1 min, and the fuzzy-based recognition system (including background subtraction and updating, human silhouette extraction, multi-target tracking, feature extraction and behavior recognition) runs in real time, processing \(30\) frames per second, which is the maximum. As shown in Table 2, our method is \(100\) % faster than the conventional HMM-based algorithm (Vezzani et al. 2010), which has a frame rate of \(15\) frames per second. We have also outperformed the computation speed of the SVM-based approach in Jhuang et al. (2007) by 6,876.74 %, as the SVM approach can only process \(0.43\) frames per second.
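The quoted speed improvements follow from the relative increase in frame rate, \((f_{\text{ours}} - f_{\text{baseline}})/f_{\text{baseline}}\), using the frame rates stated above:

$$\frac{30 - 15}{15} \times 100\,\% = 100\,\%, \qquad \frac{30 - 0.43}{0.43} \times 100\,\% \approx 6{,}876.74\,\%.$$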

5 Conclusions and future work

In this paper, we have presented a computationally efficient fuzzy logic-based system for the automatic recognition of human behavior using machine vision for applications in intelligent environments. It is hoped that the proposed method will be an enabling step towards the realization of ambient intelligent environments which can automatically detect human behavior and adapt the user environment accordingly.

To the authors’ knowledge, this is the first paper applying fuzzy logic systems to vision-based human behavior recognition. In the proposed approach, the original images are first captured from the input video sequences and the human silhouettes are extracted using our proposed method based on IT2FLS. After that, the input features are computed from the extracted silhouette images using a 7-dimensional model-based feature set including motion information and shape descriptors. Finally, human behaviors are recognized from the input feature set by the proposed fuzzy-based recognition method.

We have successfully tested our system on the publicly available Weizmann human action dataset (Blank 2005), where our fuzzy-based system produced an average recognition accuracy of \(94.03\) %. This outperformed the traditional non-fuzzy systems based on hidden Markov models by \(7.33\) %, and outperformed other state-of-the-art approaches applied to the Weizmann dataset, including the hCRF-based method and the codebook-based algorithm, by \(3.74\) and \(39.03\) %, respectively. Moreover, our system provides a relatively computationally efficient and robust response: our method can process \(30\) frames per second, which is \(100\) % faster than the HMM-based algorithm in Vezzani et al. (2010), which can only process \(15\) frames per second. We have also outperformed the computation speed of the SVM-based approach in Jhuang et al. (2007) by \(6,876.74\) %, as the SVM approach can only process \(0.43\) frames per second.

As future research, we intend to extend the proposed system to employ type-2 fuzzy logic systems to handle the high uncertainty levels present in dynamic real-world environments, such as behavioral similarities, occlusion, illumination and shadow problems in real-life environments or public datasets (Ahad et al. 2011) with more challenging behavior categories, for example, pointing, boxing, digging and hand clapping. We also aim to proceed with automatic learning, which will make the system more robust in dynamic environments and allow the system parameters to adapt to the given environment conditions. This system will find applications in various domains of major impact on the future of scaled-up intelligent spaces.