
1 Introduction

Gestures are components of human movement that carry information. They can convey meanings (as in sign language), issue commands, or point to objects in the surroundings. From a computational point of view they can be thought of as spatiotemporal signals carrying information that is vital for robots or machines interacting with their environments. A great deal of research has been carried out in recent years to develop techniques for gesture recognition [1]. From computer gaming to touch pads and other such devices, we see many applications of gesture recognition resulting from this research.

As the computational power of machines rises with advances in IC manufacturing technology, we see a great deal of research activity in the field of Human Robot Interaction (HRI). The dream of making robots a household commodity suddenly looks realizable in the near future. Recent research aims at making human-robot interaction natural and subject to as few constraints as possible. Humans interact with each other naturally, robustly identifying gestures as well. Robots have come a long way in understanding their environment, being now able to see, hear and feel to a limited extent, but much more remains to be achieved. Understanding gestures robustly and naturally is a task that remains an open area of research.

In this paper we present a gesture recognition technique that can robustly detect gestures without requiring any prior temporal segmentation. The proposed method performs the higher level tasks of gesture spotting, i.e., determining the start and end frames of the gestures (temporal segmentation), and classification, as well as the low level task of spatially segmenting the hand in each frame.

The algorithm starts by spatially segmenting the hand in every incoming frame by combining skin detection and motion calculation. This information is then sent to the Time Proportionate Intensity Accumulator unit, which stores it as an intensity image (the TPI image). A high intensity on the TPI image corresponds to a recent impression and lower intensities correspond to older impressions. After preprocessing, the information from each new frame is sent to the classifier unit, which in turn decides whether or not a gesture has just been performed. If a gesture has been performed, it is classified as one of the gestures in the system vocabulary. Figure 40.1 is a block level representation of the proposed gesture recognition system.

Fig. 40.1 Block diagram of the proposed gesture recognition system

A key aspect of the proposed approach is its very low computational cost. The algorithm can classify gestures in real time mainly because it does not model gestures as Markov Chains. Thus it does not have to match a Hidden Markov Model against a video query comprising a series of frames.

The rest of the paper is organized as follows. Section 40.2 describes recent research work in gesture recognition and how it relates to our work. Section 40.3 discusses the preprocessing steps of spatial segmentation of the hand and temporal projection. Section 40.4 is about the nonlinear classifiers used for gesture recognition and their offline training. In Sect. 40.5 we discuss the experimental setup, the video sets used and the classification results, concluding with recommendations and remarks on future work in Sect. 40.6.

2 Related Work

An important feature discriminating the proposed algorithm from existing gesture recognition algorithms is that it does not require temporal segmentation of the gesture as preprocessing, unlike other dynamic approaches [2–5]. In many gesture recognition algorithms [6, 7], spatial and temporal segmentation is done at a lower level and certain shape and velocity features are extracted as preprocessing steps. These features are then passed to the classifier for recognition [6]. Recognition results deteriorate if these preprocessing segmentations fail. Although we perform a preprocessing step involving spatial segmentation of the hand, its failure in some frames does not cause the subsequent stages of the algorithm to fail. Ambiguities in hand segmentation cause some background noise in our algorithm, but the nonlinear classifiers can handle such noise.

Pavlovic et al. [8] discuss template matching based segmentation of the hand, which is useful for determining hand posture. They trained a hand shape classifier to detect hand posture and used a skin color based classifier to spatially segment the hand. Aaron and James [9] discuss temporal templates for learning the history of motion. The proposed method is similar in the sense that it also captures the history of motion onto a spatial plane, but differs in the linear fading concept that we introduce.

Some algorithms [3, 10, 11] extract global features from each frame, such as motion fields, or use a transformed set of images such as intensity thresholded images or difference images as inputs to gesture recognition modules. These algorithms do not incorporate tolerance to background movements, which can cause them to fail in noisy environments. We train our nonlinear classifiers on noisy environments as well, which allows our method to outperform these algorithms in such conditions.

Most gesture recognition algorithms model gestures as Markov Chains [12–14], with fixed or variable transition probabilities. Recognizing gestures in these algorithms becomes a problem of aligning the query video sequence state by state with the gesture model. This involves computations that increase exponentially with the number of gesture models in the vocabulary. To overcome this time complexity, these algorithms devise mechanisms to prune out hypotheses and rely on Dynamic Programming (DP) to reduce the computations. As we do not model gestures as Markov Chains, our algorithm has to perform significantly fewer computations for classification.

The proposed method is similar to algorithms such as [11] and [15], which model gestures as rigid 3D patterns. These algorithms do not perform well if the gesturing speed changes. In contrast, we learn the tolerance to gesturing speed from the training data and do not face this problem.

Finding the start and end frames of a gesture is referred to as gesture spotting. Algorithms can be divided into two categories based on the mechanisms they adopt for gesture spotting. One approach is to temporally segment the gestures before classification, usually by inserting intervals between the gestures [16, 17]. The other approach performs gesture segmentation indirectly, based on the results of certain cost functions over a time window that slides along the temporal axis of the incoming video stream [7, 18, 19]. Our approach lies in this second category, as we find the start and end frames of a gesture during classification.

Then there is the problem of sub gestures, i.e., gestures that are part of other, longer gestures. Classifiers usually either ignore this possibility by imposing limitations on the gestures themselves [19] or require additional looping over all the gestures in the vocabulary to determine these sub gestures. The proposed algorithm not only classifies a gesture but also outputs a confidence measure and the expectancy of a super gesture.

3 Preprocessing and Hand Segmentation

In this section we describe the preprocessing steps performed on each frame before it is passed on to the classifier stage. First, the spatial segmentation techniques used to obtain the most probable hand locations are presented. Next, we discuss methods to incorporate multiple hand candidates and to remove background noise. At the end of this section we describe in detail the working of the Time Proportionate Intensity Accumulator unit, which projects the temporal axis onto the spatial plane (the TPI image) and in turn allows us to use existing feature based classifiers to recognize the gestures.

3.1 Hand Segmentation Based on Skin Pixel Estimation

First, a skin likelihood image is computed using the point operation of Eq. (40.1). The mean \( \mu_{s} \) and covariance \( \Sigma_{s} \) from the generic skin model of [20] are used. Next, a motion mask is calculated by taking the difference of the current frame and the previous frame. The hand likelihood image is obtained by applying the motion mask to the skin likelihood image.

$$ p(x\,|\,skin\;pixel) = \frac{1}{\left( 2\pi \right)^{\frac{1}{2}} \left| \Sigma_{s} \right|^{\frac{1}{2}}} \exp\left( -\frac{1}{2}\left( x - \mu_{s} \right)^{T} \Sigma_{s}^{-1} \left( x - \mu_{s} \right) \right) $$
(40.1)

The hand likelihood image is filtered with a 3 × 3 order statistic median filter to remove background noise. This hand likelihood image is then passed onto the next stage of k-means clustering.
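To make this preprocessing step concrete, the following Python sketch implements the skin likelihood, motion mask and median filtering described above. The skin mean and covariance below are illustrative placeholders rather than the actual generic skin model of [20], and the likelihood and motion thresholds are assumed parameters that would need tuning.

```python
# Sketch of the skin-likelihood / motion-mask preprocessing (Sect. 40.3.1).
# mu_s and sigma_s are hypothetical values; the paper uses the generic
# skin model of [20]. An RGB color space is assumed here.
import numpy as np
import cv2

mu_s = np.array([180.0, 120.0, 110.0])                  # placeholder skin mean (R, G, B)
sigma_s = np.diag([400.0, 300.0, 300.0])                # placeholder skin covariance
sigma_inv = np.linalg.inv(sigma_s)
# Normalization constant for a 3-D Gaussian (the constant does not affect thresholding)
norm = 1.0 / (((2 * np.pi) ** 1.5) * np.sqrt(np.linalg.det(sigma_s)))

def hand_likelihood(frame_bgr, prev_bgr, motion_thresh=15, skin_thresh=1e-7):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float64)
    diff = rgb - mu_s
    # Mahalanobis term of Eq. (40.1), evaluated for every pixel at once
    maha = np.einsum('...i,ij,...j->...', diff, sigma_inv, diff)
    skin = norm * np.exp(-0.5 * maha)                   # skin likelihood image
    # Motion mask: difference between current and previous frame
    gray_cur = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray_prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    motion = cv2.absdiff(gray_cur, gray_prev) > motion_thresh
    hand = (skin > skin_thresh) & motion                # binary hand likelihood image
    # 3 x 3 median filter to suppress isolated background noise
    return cv2.medianBlur(hand.astype(np.uint8) * 255, 3)
```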

3.2 Hand Localization Using k-means Clustering

This module takes as input the binary hand likelihood image, in which white pixels correspond to a high likelihood of hand presence. The indices of all non-zero pixels are extracted from this image and used to find K clusters, each corresponding to a probable hand location. The number of member points of each cluster determines the size of the impression made by that hand hypothesis on the TPI image in the next stage. This approach is similar to the one used in [21], except that instead of using an integral image and a moving window, the k-means algorithm is used for more accurate hand localization.
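A minimal sketch of this clustering step, using OpenCV's k-means on the non-zero pixel coordinates, is given below. The value of K, the termination criteria and the attempt count are illustrative choices, not the values used in our implementation.

```python
# Sketch of hand localization by k-means over the non-zero pixels of the
# binary hand-likelihood image (Sect. 40.3.2).
import numpy as np
import cv2

def localize_hands(hand_mask, K=1):
    ys, xs = np.nonzero(hand_mask)                      # coordinates of high-likelihood pixels
    if len(xs) < K:
        return []                                       # not enough evidence for K clusters
    pts = np.column_stack([xs, ys]).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pts, K, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    # Each cluster centre is a hand hypothesis; its member count sets the
    # size of the impression it will make on the TPI image.
    return [(tuple(c), int(np.sum(labels == k))) for k, c in enumerate(centers)]
```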

The algorithm can be extended, for example, to accommodate two handed gestures and to improve robustness. This also helps reduce the background noise caused by moving distracters in the background. Figure 40.2 shows the results of outlier rejected hand segmentation.

Fig. 40.2 Preprocessing of an incoming frame and corresponding impression onto the time proportionate intensity accumulator

3.3 Time Proportionate Intensity Projections

We obtain K candidate hand locations from the clustering unit and place K blobs at these spatial locations on the Time Proportionate Intensity Accumulator, assigning them maximum intensity. Upon processing the next frame we receive a further K blobs to be placed on the accumulator. Before placing them, we first decrement the accumulator by the fading factor α. In this way the TPI image gradually forgets, or fades away, the impressions made by previous frames (Fig. 40.3).

Fig. 40.3 Example time proportionate intensity (TPI) accumulators from training sample videos (K = 1)

We have used a linear fade where the intensity of a pixel fades in the accumulator by a constant α after each new frame.

$$ I_{t} = I_{t - 1} - \alpha $$
(40.2)

To obtain training data, the TPI image is thresholded to zero (Eq. 40.3) to forget all information older than n frames, thus temporally segmenting the gesture. The gesture length of each gesture in the training set videos is available in the ground truth files. The average gesture length is also learned from the training data for each gesture class.

$$ threshold = max(I) - n\alpha $$
(40.3)
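The accumulator update of Eqs. (40.2) and (40.3) can be sketched as follows. The blob radius, its scaling with cluster size, the maximum intensity and the fading factor α are illustrative assumptions; only the fade-and-stamp logic and the thresholding follow the text.

```python
# Sketch of the Time Proportionate Intensity (TPI) accumulator (Sect. 40.3.3):
# linear fade by alpha per frame (Eq. 40.2), blob placement at maximum
# intensity, and thresholding of old impressions (Eq. 40.3).
import numpy as np
import cv2

MAX_I = 255.0

def update_tpi(tpi, hand_hypotheses, alpha=8.0, base_radius=6):
    # Eq. (40.2): I_t = I_{t-1} - alpha, clipped at zero
    tpi = np.clip(tpi - alpha, 0.0, MAX_I).astype(np.float32)
    for (cx, cy), size in hand_hypotheses:
        radius = base_radius + int(np.sqrt(size) / 4)   # larger clusters leave larger blobs
        cv2.circle(tpi, (int(cx), int(cy)), radius, MAX_I, -1)
    return tpi

def temporal_window(tpi, n, alpha=8.0):
    # Eq. (40.3): forget everything older than n frames
    out = tpi.copy()
    out[out < tpi.max() - n * alpha] = 0.0
    return out
```

The accumulator would be initialized once per video, e.g. `tpi = np.zeros((height, width), np.float32)`, and updated with the hand hypotheses returned by the clustering stage for every frame.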

4 Spatiotemporal Matching

In this section the implementation of our gesture class learning method using nonlinear classifiers (Neural Networks and Support Vector Machines) is presented first, and then we discuss the classification mechanism used.

4.1 Model Learning and Nonlinear Classifiers

We have used the video data sets of Athitsos [18] for experimentation. These video sets include a training set, an easy data set and a difficult data set. Each set contains 30 videos in which signers make the signs of the Palm's Graffiti digits 0 to 9 (Fig. 40.4). In each set there are 10 different signers with 3 videos from each signer. The signers wear colored gloves in the training data set only. All the videos are accompanied by ground truth text files containing the temporal segmentation of the gestures, i.e., the start and end frames of all the gestures in each video. These data are useful for training and cross validation purposes.

Fig. 40.4 Palm's Graffiti digits

4.2 Classification Using Support Vector Machines

The model learning phase consists of feeding the state of the TPI image just after a gesture is completed to the SVM trainer. The TPI image is thresholded to forget all information from before the start of the current gesture. The m × n TPI image is down-sampled to a 25 × 25 image and then reshaped into a 1 × 625 feature vector which, along with the gesture label, is used to train the SVM parameter vector θ. The optimization objective for SVMs is the minimization problem given in Eq. (40.4). The kernel function we have used is the Gaussian kernel of Eq. (40.5).

$$ \min_{\theta} \; C\sum_{i = 1}^{m} \left[ y^{(i)} \, cost_{1}\left( \theta^{T} f^{(i)} \right) + \left( 1 - y^{(i)} \right) cost_{0}\left( \theta^{T} f^{(i)} \right) \right] + \frac{1}{2}\sum_{j = 1}^{n} \theta_{j}^{2} $$
(40.4)
$$ f^{(i)} = k\left( x^{(i)}, l^{(i)} \right) = \exp\left( -\gamma \left\| x^{(i)} - l^{(i)} \right\|^{2} \right),\quad \gamma = \frac{1}{2\sigma^{2}}, \; \gamma > 0 $$
(40.5)

The training data is divided into 11 classes: 10 classes for the gestures 0 to 9 and one class for training examples in which no gesture has been performed. Incomplete and partially observed gestures are placed in this class. For the implementation of the SVMs we have used the library package LIBSVM [22].
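The sketch below illustrates this training and classification pipeline. It uses scikit-learn's SVC, which wraps LIBSVM, as a stand-in for calling the LIBSVM package [22] directly; the values of C and γ are illustrative and would be tuned on the cross-validation set.

```python
# Sketch of TPI feature extraction and SVM training/classification (Sect. 40.4.2).
# Class label 10 denotes the "no gesture" class.
import numpy as np
import cv2
from sklearn.svm import SVC

def tpi_to_feature(tpi):
    # Down-sample the m x n TPI image to 25 x 25 and flatten to a 1 x 625 vector
    small = cv2.resize(tpi.astype(np.float32), (25, 25), interpolation=cv2.INTER_AREA)
    return small.reshape(1, 625) / 255.0

def train_svm(tpi_images, labels):
    X = np.vstack([tpi_to_feature(t) for t in tpi_images])
    y = np.asarray(labels)                       # 0-9 for digits, 10 for "no gesture"
    clf = SVC(C=10.0, kernel='rbf', gamma=0.01)  # Gaussian kernel of Eq. (40.5)
    clf.fit(X, y)
    return clf

def classify_frame(clf, tpi):
    label = int(clf.predict(tpi_to_feature(tpi))[0])
    return None if label == 10 else label        # None means no gesture spotted yet
```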

For each incoming frame we get the current state of the TPI image as described in the previous sections. A window of this TPI image is passed to the classifier after each frame.

We use SVMs trained as multi-class classifiers. LIBSVM implements multi-class classification using the one-against-one approach, so the TPI image is matched against all the models in the vocabulary. A positive match (based on majority voting) gives the gesture class as well as the temporal segmentation of the gesture, since we now know the end frame of the gesture and the average duration (in frames) of that gesture class was already learned in the model learning stage.

Temporal segmentation can be improved if, after the initial classification, we slide the time window and recalculate the cost, taking the frame corresponding to the minimum cost as the start frame. This, of course, assumes that the cost function is convex with a single minimum.
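Putting the pieces together, a per-frame spotting step might look like the sketch below. It reuses temporal_window and classify_frame from the earlier sketches, and assumes avg_len is a mapping from gesture class to the average gesture length (in frames) learned during training; the window length is an illustrative parameter.

```python
# Sketch of on-line gesture spotting: at every frame the current TPI window
# is classified; a positive match yields the gesture class and end frame,
# and the start frame is estimated from the learned average gesture length.
def spot_gesture(clf, tpi, frame_idx, avg_len, n_window=60, alpha=8.0):
    window = temporal_window(tpi, n_window, alpha)   # defined in the TPI sketch above
    label = classify_frame(clf, window)              # defined in the SVM sketch above
    if label is None:
        return None
    start = frame_idx - avg_len[label] + 1           # estimated start frame
    return label, start, frame_idx                   # class, start frame, end frame
```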

4.3 Classification Using Artificial Neural Networks

An Artificial Neural Network (ANN) is a computing system composed of a large number of highly interconnected units (called neurons) that emulate the organization and operation of biological nervous systems. It is one of the most widely used techniques for classification and pattern recognition problems [23].

ANNs are trained with special algorithms based on learning rules similar to the learning mechanisms of biological systems. There are many types and architectures of neural networks, depending fundamentally on their learning mechanisms. For gesture recognition we have chosen a three layer Multilayer Perceptron Neural Network (MLPNN) with a back propagation training algorithm, consisting of one input, one hidden and one output layer. The architecture of a typical three layer MLPNN is shown in Fig. 40.5. The input layer of our network consists of 625 features and the output layer has ten neurons, equal to the number of target classes. The number of nodes in the hidden layer has a great influence on the performance of the network; an optimum number is selected by trial and error on network performance. For the implementation of the ANNs we have used the OpenCV CvANN class. For back propagation based training, the library uses the algorithm proposed in [24].

Fig. 40.5 Architecture of a multi-layered artificial neural network
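A sketch of this network using OpenCV's current cv2.ml.ANN_MLP interface (as a stand-in for the legacy CvANN class) is given below. The 625-196-10 layer sizes and the 100-epoch limit follow the text (see Sect. 40.5.2); the learning rate, momentum and activation parameters are illustrative assumptions.

```python
# Sketch of the 625-196-10 MLP with back-propagation training (Sect. 40.4.3).
import numpy as np
import cv2

def train_mlp(X, y, n_classes=10, hidden=196, epochs=100):
    # X: N x 625 float32 feature matrix, y: N integer labels in [0, n_classes)
    targets = np.full((len(y), n_classes), -1.0, dtype=np.float32)
    targets[np.arange(len(y)), y] = 1.0          # one-of-N targets for symmetric sigmoid outputs
    mlp = cv2.ml.ANN_MLP_create()
    mlp.setLayerSizes(np.array([X.shape[1], hidden, n_classes], dtype=np.int32))
    mlp.setActivationFunction(cv2.ml.ANN_MLP_SIGMOID_SYM, 1.0, 1.0)
    mlp.setTrainMethod(cv2.ml.ANN_MLP_BACKPROP, 0.01, 0.1)   # learning rate, momentum
    mlp.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER, epochs, 1e-4))
    mlp.train(cv2.ml.TrainData_create(X.astype(np.float32),
                                      cv2.ml.ROW_SAMPLE, targets))
    return mlp

def predict_mlp(mlp, X):
    _, out = mlp.predict(X.astype(np.float32))
    return out.argmax(axis=1)                    # class with the highest output neuron
```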

5 Experiments and Results

This section explains the experimental setup. Specifically, we discuss the choice of video data sets, the number of training examples, the validation data set and the test sets. We also present the classification results and compare them with existing methods.

5.1 Results Using Support Vector Machine

The training data set has only 30 videos, 3 from each of the 10 signers. Extracting a single TPI image per gesture class from every video therefore yields only 30 samples for training, which is too few to train the SVMs well. One way of increasing the number of samples is to extract multiple TPI images per gesture class from each video; for example, several TPI images captured above 80 % gesture completion can be extracted to increase the number of training examples. The following tables present the early stage results of the experimentation. As seen in Table 40.1, the classification accuracy for the difficult data set is very poor. This is due to the background motion in the difficult data set. We are currently working on methods to incorporate multiple hand hypotheses with low time complexity.

Table 40.1 Class wise accuracy and false positives for easy (E) and difficult (D) data sets

5.2 Results Using Artificial Neural Networks

We have a total of 4,634 samples of hand gestures corresponding to the ten target classes 0–9. Sixty percent of the data from each class is used for training, twenty percent for cross validation (CV) and the remaining twenty percent as test data. The network is trained with the back propagation algorithm, and a regularization parameter is tuned to overcome both overfitting and underfitting. The number of epochs is limited to 100. Using the best results on the CV data, an optimum network architecture with 196 nodes in the hidden layer is selected. The network is then used to classify the test data. The overall network accuracy is presented in Table 40.2, and the confusion matrix for the test data is presented in Table 40.3.

Table 40.2 Accuracy across the training, cross validation and test data sets
Table 40.3 The confusion matrix for ANN

6 Future Work and Conclusions

We presented a method of recognizing gestures with minimal computation, making run-time gesture classification possible. Currently the algorithm is designed to recognize one-handed gestures, but it can be modified to accommodate two-handed gestures or even other classes of gestures. Detection of the right and left hands using existing techniques such as that of Viola and Jones [25] is possible; in this way two-handed gestures could be recognized at run time without compromising speed.

One major drawback of the algorithm is the drop in classification accuracy in the presence of background movement (as observed on the difficult data set). This is mainly due to the implementation decision of using a single, most reliable hand location for the impression on the TPI image. Although we do train our classifier with background noise, which increases recognition accuracy, other techniques are needed to heuristically minimize this noise.

Hence the algorithm is suitable for time critical applications, or for slower systems incapable of running traditional Markov Chain based algorithms, provided that background movement is fairly minimal.