
1 Introduction

Hand gesture recognition, a natural means of human-computer interaction (HCI), is an area of active research with many potential applications, since it promises simpler and more natural forms of interaction that require no extra devices [1, 2]. To achieve natural interaction, the human hand can itself be treated as an input device. Hand gestures are a powerful means of human communication, and vision-based hand gesture recognition has proven advantages over traditional HCI devices: gestures are less intrusive and more convenient for exploring, for example, three-dimensional (3D) virtual worlds. Their expressiveness, however, has not yet been fully exploited for HCI applications, and recognizing the shape (posture) and the movement (gesture) of the hand in images or videos remains a complex task [3].

The approach normally used for vision-based hand gesture recognition consists of identifying the pixels in the image that belong to the hand, extracting features from those pixels, and using those features to train classifiers that recognize the occurrence of a specific pose, or of a sequence of poses, as a gesture.

In this paper we present a comparative study of seven algorithms for hand feature extraction, aimed at static hand gesture classification. The features were analysed with RapidMiner (http://rapid-i.com) in order to find the best of four learners: k-NN, Naïve Bayes, ANN and SVM. We defined our own gesture vocabulary of 10 gestures, shown in Fig. 1, and recorded videos of 20 users performing the gestures, without any previous training, for later processing. Our goal in the present study is to identify the features that, used in isolation, respond best across the varied situations found in human-computer interaction. The features were selected for their computational simplicity and efficiency in terms of computation time, and for the good recognition rates reported in other areas of study, such as human detection [4]. The results show that the radial signature and the centroid distance are the features that obtain the best results when used separately, while remaining simple in terms of computational complexity. The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the data pre-processing stage and feature extraction. Section 4 describes machine learning for the purpose of gesture classification. Section 5 explains the datasets and experimental methodology. Section 6 presents and discusses the results. Conclusions and future work are drawn in Sect. 7.

Fig. 1. The defined gesture vocabulary.

2 Related Work

Hand gesture recognition is a challenging task in which two main approaches can be distinguished: hand model-based and appearance-based methods [5, 6]. Appearance-based methods, although view-dependent, are more efficient in computation time: they aim at recognizing a gesture from a vocabulary of template gestures learned from training data, whereas hand model-based methods are used to recover the exact 3D hand pose. Appearance-based models extract features that represent the object under study, and in most cases these features must be invariant to translation, rotation and scale changes. Many studies on gesture recognition and its methodologies are well presented in [7, 8].

Wang et al. [9] used the discrete AdaBoost learning algorithm integrated with SIFT features to achieve in-plane rotation-invariant, scale-invariant and multi-view hand detection. Conceil et al. [6] compared two shape descriptors, Fourier descriptors and Hu moments, for the recognition of 11 hand postures in a vision-based approach, and concluded that Fourier descriptors give better recognition rates than Hu moments. Barczak et al. [10] compared the performance of Fourier descriptors and geometric moment invariants on an American Sign Language database; the results showed that both descriptors are unable to differentiate some classes in the database. Bourennane et al. [3] presented a shape descriptor comparison for hand posture recognition from video, with the objective of finding a good compromise between recognition accuracy and computational load for a real-time application. They ran experiments on two families of contour-based Fourier descriptors and two sets of region-based moments, all invariant to translation, rotation and scale changes of the hand, with systematic tests on the Triesch benchmark database [11] and on their own database with, as they claim, more realistic conditions. Overall, the common set of Fourier descriptors combined with a k-nearest neighbour classifier had the highest recognition rate, reaching 100 % on the learning set and 88 % on the test set. Huynh [12] evaluated the SIFT (scale invariant feature transform), colour SIFT and SURF (speeded up robust features) descriptors on very low resolution images, comparing their precision and recall against ground-truth matching data. His experimental results showed that both SIFT and colour SIFT are more robust under changes of viewing angle and viewing distance, whereas SURF is superior under changes of illumination and blurring; in terms of computation time, the SURF descriptors offer a good alternative to SIFT and CSIFT. To address the cost of acquiring large numbers of labelled samples, and the time usually spent on training and on converting or normalizing features into a unified feature space, Fang et al. [13] presented a hand posture recognition approach with a co-training strategy [14]: two different classifiers train each other, and the performance of both is improved with unlabelled samples. They claim that their method improves recognition performance with less labelled data, in a semi-supervised way. Rayi et al. [15] used centroid distance Fourier descriptors as hand shape descriptors in sign language recognition.
Their test results showed that the Fourier descriptors, with a Manhattan distance-based classifier, achieved recognition rates of 95 % with small computational latency.

Classification involves a learning procedure, for which the number of training images and the number of gestures are important factors. Machine learning algorithms have been applied successfully in many fields of research, such as face recognition [16], automatic recognition of musical gestures by a computer [17], classification of robotic soccer formations [18], classification of human physical activity from on-body accelerometers [19], automatic road-sign detection [20, 21], and static hand gesture classification [2]. The k-Nearest Neighbour (k-NN) classifier was used in [16, 18]. It represents each example as a point in d-dimensional space, where d is the number of attributes; given a test sample, its proximity to the data points in the training set is computed with a measure of similarity or dissimilarity, normally the standard Euclidean distance, although other metrics can be used [22]. An artificial neural network is a mathematical/computational model that attempts to simulate the structure of biological neural systems, accepting features as inputs and producing decisions as outputs [23]. Artificial neural networks were used in [1, 18, 21, 24]: Maung [1] used one in a gesture recognition system, Faria et al. [18] for the classification of robotic soccer formations, Vicen-Buéno [21] for the problem of traffic sign recognition, and Stephan et al. for static hand gesture recognition for human-computer interaction. The Support Vector Machine (SVM) is a technique based on statistical learning theory that works very well with high-dimensional data; it finds the optimal separating hyperplane between two classes by maximizing the margin between them [25]. Faria et al. [16, 18] used it for the classification of robotic soccer formations and of facial expressions, Ke et al. [26] used it in the implementation of a real-time hand gesture recognition system for human-robot interaction, Maldonado-Báscon [20] used it for the recognition of road signs, and Masaki et al. used it in conjunction with a SOM (Self-Organizing Map) for the automatic learning of a gesture recognition model.

Trigueiros et al. [2] made a comparative study of four machine learning algorithms applied to two hand feature datasets in which several hand features were mixed. In the present paper, by contrast, each extracted feature is analysed individually with machine learning algorithms, to understand its performance and robustness in terms of scale-, translation- and rotation-invariant static hand gesture recognition.

3 Pre-processing and Feature Extraction

Hand segmentation and feature extraction are crucial steps in computer vision applications for hand gesture recognition. The pre-processing stage prepares the input image and extracts the features later used by the classification algorithms. In the present study, we used seven datasets with different features extracted from the segmented hand: the Radial Signature (RS), the Radial Signature Fourier Descriptors (RSFD), the Centroid Distance (CD), the Centroid Distance Fourier Descriptors (CDFD), the Histogram of Oriented Gradients (HoG), the Shi-Tomasi corner detector and the Uniform Local Binary Patterns (ULBP).

For the problem at hand, two types of images obtained with a Kinect camera were used in the feature extraction phase. The first, the hand grey-scale image, was used with the HoG operator, the LBP (local binary pattern) operator and the Shi-Tomasi corner detector. The second, the segmented hand blob, was used for the radial signature and the centroid distance signature, after contour extraction.

3.1 Radial Signature

A shape signature represents the shape contour of an object as a one-dimensional function constructed from the contour coordinates. The radial signature is one of several types of shape signature.

A simple method to describe the gesture is to measure the number of pixels from the hand centroid to the edge of the hand along a number of equally spaced radials [27]. For the present feature extraction problem, 100 equally spaced radials were used. When counting the pixels along a given radial, we take into account only those that are part of the hand, eliminating those that fall inside gaps, such as the ones between fingers or between the palm and a finger (Fig. 2). All the radial measurements are then scaled so that the longest radial has a constant length, giving a radial length signature that is invariant to the hand's distance from the camera.
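As an illustration, the following Python/numpy sketch captures our reading of this procedure; the actual implementation was done in C++ with OpenCV, so the function name, mask format and walking strategy here are assumptions.

```python
# Sketch of the radial signature: count hand pixels along equally spaced
# radials from the centroid of a binary hand mask (hand pixels = 1).
import numpy as np

def radial_signature(mask, n_radials=100):
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()                # hand centroid
    h, w = mask.shape
    max_r = int(np.hypot(h, w))                  # longest possible radial
    sig = np.zeros(n_radials)
    for i, theta in enumerate(np.linspace(0, 2 * np.pi, n_radials, endpoint=False)):
        dx, dy = np.cos(theta), np.sin(theta)
        count = 0
        for r in range(max_r):                   # walk outwards along the radial
            x, y = int(cx + r * dx), int(cy + r * dy)
            if not (0 <= x < w and 0 <= y < h):
                break
            count += mask[y, x]                  # gaps between fingers add nothing
        sig[i] = count
    return sig / sig.max()                       # longest radial -> constant length
```

Normalizing by the longest radial, as in the last line, is what makes the signature invariant to the hand's distance from the camera.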

Fig. 2. Hand radial signature: hand with drawn radials (left); obtained radial signature (right).

3.2 Histogram of Oriented Gradients (HoG)

Pixel intensities are sensitive to lighting variations, which leads to classification problems within the same gesture under different lighting conditions. The use of local orientation measures avoids this kind of problem, and the histogram gives us translation invariance: orientation histograms summarize how much of each shape is oriented in each possible direction, independently of the position of the hand inside the camera frame [28]. This statistical technique is most appropriate for close-ups of the hand. In our work, the hand is extracted and separated from the background, which provides a uniform black background and makes this technique a good method for identifying different static hand poses, as can be seen in Fig. 3.

Fig. 3. Hand gradients (left); histogram of gradients (right).

This method is insensitive to small changes in the size of the hand, but it is sensitive to changes in hand orientation.

We calculated the local orientation from image gradients, represented by horizontal and vertical image pixel differences. If \( d_x \) and \( d_y \) are the outputs of the derivative operators, then the gradient direction is \( \arctan(d_x, d_y) \) and the contrast is \( \sqrt{d_x^2 + d_y^2} \). A contrast threshold is set as some amount \( k \) times the mean image contrast, below which the orientation measurement is assumed to be inaccurate; a value of \( k = 1.2 \) was used in the experiments. We then blur the histogram in the angular domain, as in [29], with a (1 4 6 4 1) filter, which gives a gradual fall-off in the distance between orientation histograms.
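A hedged Python/OpenCV sketch of this computation follows; the number of bins, the circular padding and the function name are our assumptions, not the paper's code.

```python
# Orientation histogram: gradient direction histogram with a contrast
# threshold (k times the mean contrast) and a (1 4 6 4 1) angular blur.
import cv2
import numpy as np

def orientation_histogram(gray, n_bins=36, k=1.2):
    dx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)        # horizontal derivative
    dy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)        # vertical derivative
    angle = np.arctan2(dy, dx)                    # gradient direction
    contrast = np.sqrt(dx ** 2 + dy ** 2)
    keep = contrast > k * contrast.mean()         # drop unreliable measurements
    hist, _ = np.histogram(angle[keep], bins=n_bins, range=(-np.pi, np.pi))
    kernel = np.array([1, 4, 6, 4, 1], dtype=float)
    kernel /= kernel.sum()
    # circular blur in the angular domain: pad two bins on each side
    padded = np.concatenate([hist[-2:], hist, hist[:2]]).astype(float)
    hist = np.convolve(padded, kernel, mode='valid')
    return hist / (hist.sum() + 1e-9)             # normalized histogram
```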

This feature descriptor has been used extensively in many other areas, such as human detection [4, 30], and in conjunction with other operators, such as the Scale Invariant Feature Transform (SIFT) [31], the Kanade-Lucas-Tomasi (KLT) feature tracker [32] and local binary patterns for static hand gesture recognition [33]. Lu et al. [34] and Kaniche et al. [32] used temporal HoGs for action categorization and gesture recognition.

3.3 Centroid Distance Signature

The centroid distance signature is another type of shape signature. The centroid distance function expresses the distance of the hand contour boundary points from the centroid \( (x_c, y_c) \) of the shape. In our study we used \( N = 128 \) equally sampled points on the contour:

$$ d(i) = \sqrt{ \left( x_i - x_c \right)^2 + \left( y_i - y_c \right)^2 }, \quad i = 0, \ldots, N - 1 $$
(1)

where \( d(i) \) is the calculated distance and \( x_i \) and \( y_i \) are the coordinates of the contour points. In this way we obtain a one-dimensional function that represents the hand shape. Because the centroid, which represents the hand position, is subtracted from the boundary coordinates, the centroid distance representation is invariant to translation. Rayi et al. [15] demonstrated that this function is translation invariant and that a rotation of the hand results in a circularly shifted version of the original signature.
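A minimal Python/OpenCV sketch of Eq. (1) is given below, assuming a binary hand mask and the OpenCV 4 findContours return signature; the authors' implementation is in C++, and estimating the centroid from the sampled boundary points is an approximation of ours.

```python
# Centroid distance signature: distances from N equally sampled contour
# points to the shape centroid (Eq. 1).
import cv2
import numpy as np

def centroid_distance(mask, n_points=128):
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).squeeze()  # largest blob = hand
    idx = np.linspace(0, len(contour) - 1, n_points).astype(int)  # equal sampling
    pts = contour[idx].astype(float)
    xc, yc = pts.mean(axis=0)            # centroid approximated from boundary points
    return np.hypot(pts[:, 0] - xc, pts[:, 1] - yc)          # d(i) of Eq. (1)
```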

3.4 Local Binary Patterns

LBP (local binary pattern) is a grey-scale invariant local texture operator with powerful discrimination and low computational complexity [35–38]. The operator labels the pixels of an image by thresholding, for each pixel, the grey values \( g_p \) \( (p = 0, \ldots, P - 1) \) of P equally spaced neighbours on a circle of radius R (R > 0) against the grey value of the centre pixel \( g_c \), and considers the result as a binary code that describes the local texture [35, 37, 38].

The code is derived as follows:

$$ \mathrm{LBP}_{P,R} = \sum\limits_{p = 0}^{P - 1} s\left( g_p - g_c \right) 2^p $$
(2)

where

$$ s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} $$
(3)

Figure 4 illustrates the computation of \( \mathrm{LBP}_{8,1} \) for a single pixel in a rectangular 3 × 3 neighbourhood. \( g_0 \) is always assigned the grey value of the neighbour to the right of \( g_c \). In the general definition, LBP is defined in a circularly symmetric neighbourhood, which requires interpolation of the intensity values for exact computation. The coordinates of \( g_p \) are given by \( \left( -R\sin(2\pi p/P),\; R\cos(2\pi p/P) \right) \) [35].

Fig. 4. Example of computing \( \mathrm{LBP}_{8,1} \): pixel neighbourhood (left); thresholded version (middle); resulting binary code (right).

The \( \mathrm{LBP}_{P,R} \) operator produces \( 2^P \) different output values, corresponding to the \( 2^P \) different binary patterns that can be formed by the P pixels in the neighbourhood set. A rotation of a textured input image causes the LBP patterns to translate to a different location and to rotate about their origin; if rotation invariance is required, it can be achieved by a rotation-invariant mapping, in which each LBP binary code is circularly rotated into its minimum value:

$$ \mathrm{LBP}_{P,R}^{ri} = \min_{i} \; \mathrm{ROR}\left( \mathrm{LBP}_{P,R},\, i \right) $$
(4)

where \( \mathrm{ROR}(x, i) \) denotes the circular bitwise right shift of the P-bit number \( x \) by \( i \) steps. For P = 8, a total of 36 unique values is obtained. This operator was designated LBPROT in [39]. Ojala et al. [35] showed, however, that LBPROT as such does not provide very good discrimination. They observed that certain local binary patterns are fundamental properties of texture, providing the vast majority of all 3 × 3 patterns present in the observed textures. They called these fundamental patterns "uniform", as they have one thing in common: a uniform circular structure that contains very few spatial transitions. They introduced a uniformity measure U(pattern), which corresponds to the number of spatial transitions (bitwise 0/1 changes) in the pattern. Patterns with a U value of at most 2 are designated uniform, and the following operator for grey-scale and rotation-invariant texture description was proposed:

$$ \mathrm{LBP}_{P,R}^{riu2} = \begin{cases} \sum\nolimits_{p = 0}^{P - 1} s\left( g_p - g_c \right), & \text{if } U\left( \mathrm{LBP}_{P,R} \right) \le 2 \\ P + 1, & \text{otherwise} \end{cases} $$
(5)

Equation (5) assigns to each uniform pattern a unique label corresponding to the number of "1" bits in the pattern, while the non-uniform patterns are grouped under the "miscellaneous" label \( (P + 1) \). In practice, the mapping from \( \mathrm{LBP}_{P,R} \) to \( \mathrm{LBP}_{P,R}^{riu2} \) is best implemented with a lookup table of \( 2^P \) elements. The final texture feature employed in texture analysis is the histogram of the operator output (i.e., of the pattern labels).

In the present work, we used the histogram of the uniform local binary pattern operator, with R (radius) equal to 1 and P (number of pixels in the neighbourhood) equal to 8, as a feature vector for the hand pose classification.
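The following Python sketch, our own illustration rather than the authors' C++ code, combines the \( \mathrm{LBP}_{8,1} \) labelling of Eq. (2) with the riu2 mapping of Eq. (5), implemented via the \( 2^P \)-element lookup table suggested above; the neighbour ordering starts to the right of \( g_c \), as in Fig. 4.

```python
# Uniform LBP (riu2) histogram for P = 8, R = 1 on a grey-scale image.
import numpy as np

P = 8

def riu2_table(p=P):
    table = np.zeros(2 ** p, dtype=np.uint8)
    for code in range(2 ** p):
        bits = [(code >> i) & 1 for i in range(p)]
        u = sum(bits[i] != bits[(i + 1) % p] for i in range(p))  # spatial transitions
        table[code] = sum(bits) if u <= 2 else p + 1             # uniform: count of 1s
    return table

def ulbp_histogram(gray):
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                # centre pixels g_c
    # 8 neighbours on the R = 1 ring, starting to the right of g_c
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]
    code = np.zeros_like(c)
    for p, (dy, dx) in enumerate(offsets):
        gp = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (gp >= c).astype(np.int32) << p      # s(g_p - g_c) * 2^p
    labels = riu2_table()[code]                      # lookup-table mapping
    hist = np.bincount(labels.ravel(), minlength=P + 2)  # 10 bins for P = 8
    return hist / hist.sum()                         # the feature vector
```

The resulting 10-bin histogram (labels 0–8 for the uniform patterns plus the miscellaneous label 9) is the feature vector used for hand pose classification.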

3.5 Fourier Descriptors

Instead of using the original image representation in the spatial domain, feature values can also be derived after applying a Fourier transformation. A feature vector calculated from a data representation in the transform domain is called a Fourier descriptor [40]. Fourier descriptors describe the boundary of a region [23, 41] and are considered more robust to noise and minor boundary modifications. In the present study, Fourier descriptors were obtained for the radial signature and for the centroid distance signature. For computational efficiency of the FFT, the number of points is chosen to be a power of two [6]. The normalized length is generally chosen equal to the calculated signature length (N). The Fourier transform then leads to N Fourier coefficients \( C_k \):

$$ C_k = \sum\limits_{i = 0}^{N - 1} z_i \exp\left( \frac{2\pi j i k}{N} \right), \quad k = 0, \ldots, N - 1 $$
(6)

In Eq. (6), \( z_i \) denotes the i-th sample of the signature. Table 1 shows the relation between motions in the image domain and in the transform domain, which can be exploited to obtain several types of invariance.

Table 1. Equivalence between motions in the image and transform domains.

The first coefficient \( C_0 \) is discarded, since it contains only the hand position. Hand rotation affects only the phase information; thus, if rotation invariance is needed, it can be achieved by taking the magnitude of the coefficients. Division of the coefficients by the magnitude of the second coefficient, \( C_1 \), achieves scale invariance. In this way we obtain N − 2 Fourier descriptors \( I_k \):

$$ I_k = \frac{\left| C_k \right|}{\left| C_1 \right|}, \quad k = 2, \ldots, N - 1 $$
(7)

Conceil et al. [6] showed that the hand shape is well reconstructed with 20 coefficients, so we used this number in our experiments. Centroid distance Fourier descriptors, obtained by applying the Fourier transform to a centroid distance signature, have been empirically shown to outperform other Fourier descriptors [41, 42].
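For illustration, a hedged Python sketch of Eqs. (6) and (7) follows; the function name and truncation interface are assumptions of ours.

```python
# Fourier descriptors of a signature (radial or centroid distance):
# discard C0 (position), normalize by |C1| (scale), keep magnitudes (rotation).
import numpy as np

def fourier_descriptors(signature, n_keep=20):
    c = np.fft.fft(signature)        # the N Fourier coefficients C_k
    mags = np.abs(c)                 # dropping phase gives rotation invariance
    desc = mags[1:] / mags[1]        # discard C0, divide by |C1| for scale invariance
    return desc[1:n_keep + 1]        # I_k for k = 2, ..., n_keep + 1 (Eq. 7)
```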

3.6 The Shi-Tomasi Corner Detector

The Shi-Tomasi corner detector [43] is an improved version of the Harris corner detector [44]. The improvement lies in how a region within the image is scored (and thus treated as a corner or not). The Harris corner detector determines the score \( R \) from the eigenvalues \( \lambda_1 \) and \( \lambda_2 \) computed over two regions (the second region being a shifted version of the first, used to test whether the difference between the two is large enough to indicate a corner) in the following way:

$$ R = \lambda_1 \lambda_2 - k\left( \lambda_1 + \lambda_2 \right)^2 $$
(8)

Shi and Tomasi instead use the minimum of the two eigenvalues:

$$ R = \min\left( \lambda_1, \lambda_2 \right) $$
(9)

and the region is marked as a corner if R is greater than a predefined threshold. They demonstrated experimentally in their paper that this score criterion is much better.
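The criterion of Eq. (9) is what OpenCV exposes through goodFeaturesToTrack, which the paper's OpenCV-based implementation likely relied on; the parameter values in this minimal Python sketch are our assumptions, not the paper's settings.

```python
# Shi-Tomasi corners on the hand grey-scale image via OpenCV.
import cv2
import numpy as np

def hand_corners(gray, max_corners=30):
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01,  # threshold on min(l1, l2)
                                      minDistance=5)      # spacing between corners
    return corners.reshape(-1, 2) if corners is not None else np.empty((0, 2))
```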

4 Machine Learning

The study and computer modelling of learning processes in their multiple manifestations constitutes the subject of machine learning [45]. Machine learning is the task of programming computers to optimize a performance criterion using example data or past experience [46]; for that, it uses statistical theory to build mathematical models, since the core task is to make inferences from sample data. Two entities, the teacher and the learner, play a crucial role in machine learning: the teacher is the entity that has the knowledge required to perform a given task, and the learner is the entity that has to learn that knowledge in order to perform the task. Learning strategies can be distinguished by the amount of inference the learner performs on the information provided by the teacher. The learning problem can be stated as follows: given an example set of limited size, find a concise data description [45]. In our study, supervised learning was used, where the classification classes are known in advance: given a sample of input-output pairs, called the training sample, the task is to find a deterministic function or model that maps any input to an output and can predict future observations, minimizing the error as much as possible. The models were learned from the extracted hand features with the help of the RapidMiner tool, and the best learners identified for the produced datasets were k-NN (k-nearest neighbour), ANN (artificial neural network) and SVM (support vector machine).

5 Datasets and Experimental Methodology

For data analysis, careful feature selection, dataset preparation and data transformation are important phases: in order to construct the right model it is necessary to understand the data under study, and successful data mining involves far more than selecting a learning algorithm and running it over the data [22]. To process the recorded videos, a C++ application was developed using openFrameworks and the OpenCV [47] and OpenNI [48] libraries. The application runs through all the recorded video files and extracts the respective features for each algorithm. The features thus obtained are saved in text files, later converted to Excel files so that they can be imported into RapidMiner for data analysis in order to find the best learner for each dataset. The experiments were performed on an Intel Core i7 (2.8 GHz) Mac OS X computer with 4 GB of DDR3 RAM, under the k-fold cross-validation protocol, which is used to determine how accurately a learning algorithm will predict data that it was not trained with [16, 45]. A value of k = 10 (10-fold cross validation) was used, a good rule of thumb, although the best value depends on the algorithm and the dataset [22, 46]. The performance of the algorithms, based on the counts of test records correctly and incorrectly predicted by the model, was then analysed. Table 2 summarizes the best learner for each dataset with the corresponding parameters.

Table 2. ML algorithms identified as best learners for each dataset and used parameters.
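The analysis itself was performed in RapidMiner; as a rough analogue only, the same 10-fold protocol could be reproduced with scikit-learn as in the sketch below, where the k-NN choice and the data variables X (feature matrix) and y (gesture labels) are illustrative assumptions.

```python
# 10-fold cross-validation of a learner on one feature dataset.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y):
    clf = KNeighborsClassifier(n_neighbors=1)     # e.g. k-NN with k = 1
    scores = cross_val_score(clf, X, y, cv=10)    # 10-fold cross validation
    return scores.mean(), scores.std()            # accuracy estimate
```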

6 Results and Discussion

After analysing the different datasets, the results obtained were in most cases encouraging, although in some cases weaker than one could expect. In order to analyse how the classification errors are distributed among the classes, a confusion matrix was computed for each learner with the help of RapidMiner. In the following we present the results obtained with each dataset, in terms of the best learner, the respective confusion matrix and the average recognition accuracy.

For the Radial Signature dataset, the best learner was the neural network, with an accuracy of 91.0 %; Table 3 shows the respective confusion matrix. For the Centroid Distance dataset, the best learner was also the neural network, with an accuracy of 90.1 %; Table 4 shows the respective confusion matrix. The k-NN classifier, with k = 1, obtained the best values for the Radial Signature Fourier Descriptors and the Centroid Distance Fourier Descriptors, with accuracies of 82.28 % and 79.53 % respectively; Tables 5 and 6 show their confusion matrices. For the LBP and HoG operators, the best learner was the SVM with an RBF (radial basis function) kernel and soft margins, with C = 6 and C = 2 and a bias (offset) of 0.032 and 0.149 respectively. The achieved accuracy was 89.3 % for the LBP operator and 61.46 % for the HoG operator. The SVM library used was libSVM [49], since it supports multi-class classification; the obtained confusion matrices are shown in Tables 7 and 8. For the Shi-Tomasi corner detector, the best learner was the neural network with a learning rate of 0.1, but with very poor results. As can be seen from the HoG and Shi-Tomasi confusion matrices (Tables 8 and 9), a lot of misclassification occurred, resulting from similar feature values for different gestures.
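For reference, the reported SVM configuration corresponds, in scikit-learn's libSVM wrapper, to something like the following sketch; gamma is not reported in the paper and is therefore left at its default, an assumption on our part.

```python
# RBF-kernel SVMs with the soft-margin C values reported above.
from sklearn.svm import SVC

svm_lbp = SVC(kernel='rbf', C=6)   # LBP dataset (reported C = 6)
svm_hog = SVC(kernel='rbf', C=2)   # HoG dataset (reported C = 2)
```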

Table 3. Radial signature dataset confusion matrix.
Table 4. Centroid distance dataset confusion matrix.
Table 5. Radial signature Fourier confusion matrix.
Table 6. Centroid distance Fourier confusion matrix.
Table 7. Local binary patterns dataset confusion matrix.
Table 8. Histogram of gradients dataset confusion matrix.
Table 9. Shi-Tomasi corner detector confusion matrix.

7 Conclusions and Future Work

This paper presented a comparative study of seven algorithms for hand feature extraction, aimed at static hand gesture classification and recognition for human-computer interaction. We defined our own gesture vocabulary of 10 gestures (Fig. 1) and recorded videos of 20 persons performing the gestures for hand feature extraction. The main goal of the study was to test the robustness of each algorithm, applied individually, in terms of scale, translation and rotation invariance. After analysing the data and the obtained results, we conclude that further pre-processing of the video frames is necessary in order to minimize the number of different feature values obtained for the same hand posture. The depth images obtained with the Kinect have low resolution and some noise, and we concluded that some of the imprecision in the data recordings results from these problems, making class learning more difficult; there are several interpretations of noise, as explained in [46]. For this reason, it was decided that temporal and/or spatial filtering should be applied, and this will be tested and analysed to see whether better results are achieved. We found the radial signature and the centroid distance to be the best shape descriptors discussed in this paper in terms of robustness and computational complexity; sometimes we have to apply the principle known as Occam's razor, which states that "simpler explanations are more plausible and any unnecessary complexity should be shaved off". The Shi-Tomasi corner detector, as implemented in OpenCV, achieved the weakest results, and we will try it with a bag-of-features approach [50, 51]. Better results were expected from the Fourier descriptors, given the related work in the area, so we will evaluate them further once the temporal filtering of the video stream has been implemented. For the local binary pattern operator, different radii and numbers of neighbours will be tested to analyse whether better results can be obtained.

In recent implementations, a cumulative average of the hand position was used to minimize image noise without degrading performance in terms of frame rate. We were able to verify in those implementations that this method improves feature extraction accuracy, with positive implications for the final gesture classification.