
1 Introduction

Hand gesture recognition, a natural means of human-computer interaction (HCI), is an area of active research with many potential applications, since it promises simpler and more natural forms of interaction that require no extra devices [1, 2]. To achieve natural interaction, the human hand can itself be treated as an input device. Hand gestures are a powerful means of human communication, and vision-based hand gesture recognition has proven advantages over traditional HCI devices: gestures are less intrusive and more convenient for exploring, for example, three-dimensional (3D) virtual worlds. Their expressiveness, however, has not yet been fully exploited for HCI applications, and recognizing the shape (posture) and the movement (gesture) of the hand in images or videos remains a complex task [3].

The approach normally used for vision-based hand gesture recognition consists of identifying the pixels in the image that belong to the hand, extracting features from those pixels, and using those features to train classifiers that recognize the occurrence of a specific pose, or of a sequence of poses, as a gesture.

In this paper we present a comparative study of seven algorithms for hand feature extraction, aimed at static hand gesture classification. The features were analysed with RapidMiner (http://rapid-i.com) in order to find the best of four learners: k-NN, Naïve Bayes, ANN and SVM. We defined our own gesture vocabulary of 10 gestures, shown in Fig. 1, and recorded videos of 20 users performing the gestures, without any previous training, for later processing. Our goal in the present study is to identify the features that, used in isolation, respond best across the varied situations found in human-computer interaction. The features were selected for their computational simplicity and efficiency in terms of computation time, and for the good recognition rates reported in other areas of study, such as human detection [4]. The results show that the radial signature and the centroid distance are the features that obtain the best results when used separately, while remaining simple in terms of computational complexity. The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the data pre-processing stage and feature extraction. Section 4 describes machine learning for the purpose of gesture classification. Section 5 explains the datasets and experimental methodology. Section 6 presents and discusses the results. Conclusions and future work are drawn in Sect. 7.

Fig. 1. The defined gesture vocabulary.

2 Related Work

Hand gesture recognition is a challenging task in which two main approaches can be distinguished: hand model-based and appearance-based methods [5, 6]. Appearance-based methods, although view-dependent, are more efficient in computation time: they aim at recognizing a gesture from a vocabulary of template gestures learned from training data, whereas hand model-based methods are used to recover the exact 3D hand pose. Appearance-based models extract features that represent the object under study, and in most cases these features must be invariant to translation, rotation and scale changes. Many studies on gesture recognition and its methodologies are well presented in [7, 8].

Wang et al. [9] used the discrete AdaBoost learning algorithm integrated with SIFT features to achieve in-plane rotation-invariant, scale-invariant and multi-view hand detection. Conceil et al. [6] compared two shape descriptors, Fourier descriptors and Hu moments, for the recognition of 11 hand postures in a vision-based approach, and concluded that Fourier descriptors give better recognition rates than Hu moments. Barczak et al. [10] compared the performance of Fourier descriptors and geometric moment invariants on an American Sign Language database; the results showed that both descriptors are unable to differentiate some classes in the database. Bourennane et al. [3] presented a shape descriptor comparison for hand posture recognition from video, with the objective of finding a good compromise between recognition accuracy and computational load for a real-time application. They ran experiments on two families of contour-based Fourier descriptors and two sets of region-based moments, all invariant to translation, rotation and scale changes of the hand, with systematic tests on the Triesch benchmark database [11] and on their own database with, as they claim, more realistic conditions. Overall, the common set of Fourier descriptors combined with a k-nearest neighbour classifier had the highest recognition rate, reaching 100 % on the learning set and 88 % on the test set. Huynh [12] evaluated the SIFT (scale invariant feature transform), colour SIFT and SURF (speeded up robust features) descriptors on very low resolution images, comparing their precision and recall against ground-truth matching data. His experimental results showed that both SIFT and colour SIFT are more robust under changes of viewing angle and viewing distance, whereas SURF is superior under changes of illumination and blurring; in terms of computation time, the SURF descriptors offer a good alternative to SIFT and CSIFT. To address the cost of acquiring large numbers of labelled samples, and the time usually spent on training and on converting or normalizing features into a unified feature space, Fang et al. [13] presented a hand posture recognition approach with a co-training strategy [14]: two different classifiers train each other, and the performance of both is improved with unlabelled samples. They claim that their method improves recognition performance with less labelled data, in a semi-supervised way. Rayi et al. [15] used centroid distance Fourier descriptors as hand shape descriptors in sign language recognition.
Their test results showed that the Fourier descriptors, with a Manhattan distance-based classifier, achieved recognition rates of 95 % with small computational latency.

Classification involves a learning procedure, for which the number of training images and the number of gestures are important factors. Machine learning algorithms have been applied successfully in many fields of research, such as face recognition [16], automatic recognition of musical gestures by a computer [17], classification of robotic soccer formations [18], classification of human physical activity from on-body accelerometers [19], automatic road-sign detection [20, 21], and static hand gesture classification [2]. The k-Nearest Neighbour (k-NN) classifier was used in [16, 18]. It represents each example as a point in d-dimensional space, where d is the number of attributes; given a test sample, its proximity to the data points in the training set is computed with a measure of similarity or dissimilarity, normally the standard Euclidean distance, although other metrics can be used [22]. An artificial neural network is a mathematical/computational model that attempts to simulate the structure of biological neural systems, accepting features as inputs and producing decisions as outputs [23]. Artificial neural networks were used in [1, 18, 21, 24]: Maung [1] used one in a gesture recognition system, Faria et al. [18] for the classification of robotic soccer formations, Vicen-Buéno [21] for the problem of traffic sign recognition, and Stephan et al. for static hand gesture recognition for human-computer interaction. The Support Vector Machine (SVM) is a technique based on statistical learning theory that works very well with high-dimensional data; it finds the optimal separating hyperplane between two classes by maximizing the margin between them [25]. Faria et al. [16, 18] used it for the classification of robotic soccer formations and of facial expressions, Ke et al. [26] used it in the implementation of a real-time hand gesture recognition system for human-robot interaction, Maldonado-Báscon [20] used it for the recognition of road signs, and Masaki et al. used it in conjunction with a SOM (Self-Organizing Map) for the automatic learning of a gesture recognition model.

Trigueiros et al. [2] made a comparative study of four machine learning algorithms applied to two hand feature datasets in which several hand features were mixed. In the present paper, by contrast, each extracted feature is analysed individually with machine learning algorithms, to understand its performance and robustness in terms of scale-, translation- and rotation-invariant static hand gesture recognition.

3 Pre-processing and Feature Extraction

Hand segmentation and feature extraction are crucial steps in computer vision applications for hand gesture recognition. The pre-processing stage prepares the input image and extracts the features later used by the classification algorithms. In the present study, we used seven datasets with different features extracted from the segmented hand: the Radial Signature (RS), the Radial Signature Fourier Descriptors (RSFD), the Centroid Distance (CD), the Centroid Distance Fourier Descriptors (CDFD), the Histogram of Oriented Gradients (HoG), the Shi-Tomasi corner detector and the Uniform Local Binary Patterns (ULBP).

For the problem at hand, two types of images obtained with a Kinect camera were used in the feature extraction phase. The first, the hand grey-scale image, was used with the HoG operator, the LBP (local binary pattern) operator and the Shi-Tomasi corner detector. The second, the segmented hand blob, was used for the radial signature and the centroid distance signature, after contour extraction.

3.1 Radial Signature

A shape signature represents the shape contour of an object as a one-dimensional function constructed from the contour coordinates. The radial signature is one of several types of shape signature.

A simple method to describe the gesture is to measure the number of pixels from the hand centroid to the edge of the hand along a number of equally spaced radials [27]. For the present feature extraction problem, 100 equally spaced radials were used. When counting the pixels along a given radial, we take into account only those that are part of the hand, eliminating those that fall inside gaps, such as the ones between fingers or between the palm and a finger (Fig. 2). All the radial measurements are then scaled so that the longest radial has a constant length, giving a radial length signature that is invariant to the hand's distance from the camera.
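As an illustration, the following Python/numpy sketch captures our reading of this procedure; the actual implementation was done in C++ with OpenCV, so the function name, mask format and walking strategy here are assumptions.

```python
# Sketch of the radial signature: count hand pixels along equally spaced
# radials from the centroid of a binary hand mask (hand pixels = 1).
import numpy as np

def radial_signature(mask, n_radials=100):
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()                # hand centroid
    h, w = mask.shape
    max_r = int(np.hypot(h, w))                  # longest possible radial
    sig = np.zeros(n_radials)
    for i, theta in enumerate(np.linspace(0, 2 * np.pi, n_radials, endpoint=False)):
        dx, dy = np.cos(theta), np.sin(theta)
        count = 0
        for r in range(max_r):                   # walk outwards along the radial
            x, y = int(cx + r * dx), int(cy + r * dy)
            if not (0 <= x < w and 0 <= y < h):
                break
            count += mask[y, x]                  # gaps between fingers add nothing
        sig[i] = count
    return sig / sig.max()                       # longest radial -> constant length
```

Normalizing by the longest radial, as in the last line, is what makes the signature invariant to the hand's distance from the camera.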

Fig. 2. Hand radial signature: hand with drawn radials (left); obtained radial signature (right).

3.2 Histogram of Oriented Gradients (HoG)

Pixel intensities are sensitive to lighting variations, which leads to classification problems within the same gesture under different lighting conditions. The use of local orientation measures avoids this kind of problem, and the histogram gives us translation invariance: orientation histograms summarize how much of each shape is oriented in each possible direction, independently of the position of the hand inside the camera frame [28]. This statistical technique is most appropriate for close-ups of the hand. In our work, the hand is extracted and separated from the background, which provides a uniform black background and makes this technique a good method for identifying different static hand poses, as can be seen in Fig. 3.

Fig. 3. Hand gradients (left); histogram of gradients (right).

This method is insensitive to small changes in the size of the hand, but it is sensitive to changes in hand orientation.

We calculated the local orientation from image gradients, represented by horizontal and vertical image pixel differences. If \( d_x \) and \( d_y \) are the outputs of the derivative operators, then the gradient direction is \( \arctan(d_x, d_y) \) and the contrast is \( \sqrt{d_x^2 + d_y^2} \). A contrast threshold is set as some amount \( k \) times the mean image contrast, below which the orientation measurement is assumed to be inaccurate; a value of \( k = 1.2 \) was used in the experiments. We then blur the histogram in the angular domain, as in [29], with a (1 4 6 4 1) filter, which gives a gradual fall-off in the distance between orientation histograms.
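A hedged Python/OpenCV sketch of this computation follows; the number of bins, the circular padding and the function name are our assumptions, not the paper's code.

```python
# Orientation histogram: gradient direction histogram with a contrast
# threshold (k times the mean contrast) and a (1 4 6 4 1) angular blur.
import cv2
import numpy as np

def orientation_histogram(gray, n_bins=36, k=1.2):
    dx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)        # horizontal derivative
    dy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)        # vertical derivative
    angle = np.arctan2(dy, dx)                    # gradient direction
    contrast = np.sqrt(dx ** 2 + dy ** 2)
    keep = contrast > k * contrast.mean()         # drop unreliable measurements
    hist, _ = np.histogram(angle[keep], bins=n_bins, range=(-np.pi, np.pi))
    kernel = np.array([1, 4, 6, 4, 1], dtype=float)
    kernel /= kernel.sum()
    # circular blur in the angular domain: pad two bins on each side
    padded = np.concatenate([hist[-2:], hist, hist[:2]]).astype(float)
    hist = np.convolve(padded, kernel, mode='valid')
    return hist / (hist.sum() + 1e-9)             # normalized histogram
```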

This feature descriptor has been used extensively in many other areas, such as human detection [4, 30], and in conjunction with other operators, such as the Scale Invariant Feature Transform (SIFT) [31], the Kanade-Lucas-Tomasi (KLT) feature tracker [32] and local binary patterns for static hand gesture recognition [33]. Lu et al. [34] and Kaniche et al. [32] used temporal HoGs for action categorization and gesture recognition.

3.3 Centroid Distance Signature

The centroid distance signature is another type of shape signature. The centroid distance function expresses the distance of the hand contour boundary points from the centroid \( (x_c, y_c) \) of the shape. In our study we used \( N = 128 \) equally sampled points on the contour:

$$ d(i) = \sqrt{ \left( x_i - x_c \right)^2 + \left( y_i - y_c \right)^2 }, \quad i = 0, \ldots, N - 1 $$
(1)

where \( d(i) \) is the calculated distance and \( x_i \) and \( y_i \) are the coordinates of the contour points. In this way we obtain a one-dimensional function that represents the hand shape. Because the centroid, which represents the hand position, is subtracted from the boundary coordinates, the centroid distance representation is invariant to translation. Rayi et al. [15] demonstrated that this function is translation invariant and that a rotation of the hand results in a circularly shifted version of the original signature.
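A minimal Python/OpenCV sketch of Eq. (1) is given below, assuming a binary hand mask and the OpenCV 4 findContours return signature; the authors' implementation is in C++, and estimating the centroid from the sampled boundary points is an approximation of ours.

```python
# Centroid distance signature: distances from N equally sampled contour
# points to the shape centroid (Eq. 1).
import cv2
import numpy as np

def centroid_distance(mask, n_points=128):
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).squeeze()  # largest blob = hand
    idx = np.linspace(0, len(contour) - 1, n_points).astype(int)  # equal sampling
    pts = contour[idx].astype(float)
    xc, yc = pts.mean(axis=0)            # centroid approximated from boundary points
    return np.hypot(pts[:, 0] - xc, pts[:, 1] - yc)          # d(i) of Eq. (1)
```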

3.4 Local Binary Patterns

LBP (local binary pattern) is a grey-scale invariant local texture operator with powerful discrimination and low computational complexity [35–38]. The operator labels the pixels of an image by thresholding, for each pixel, the grey values \( g_p \) \( (p = 0, \ldots, P - 1) \) of P equally spaced neighbours on a circle of radius R (R > 0) against the grey value of the centre pixel \( g_c \), and considers the result as a binary code that describes the local texture [35, 37, 38].

The code is derived as follows:

$$ \mathrm{LBP}_{P,R} = \sum\limits_{p = 0}^{P - 1} s\left( g_p - g_c \right) 2^p $$
(2)

where

$$ s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} $$
(3)

Figure 4 illustrates the computation of \( \mathrm{LBP}_{8,1} \) for a single pixel in a rectangular 3 × 3 neighbourhood. \( g_0 \) is always assigned the grey value of the neighbour to the right of \( g_c \). In the general definition, LBP is defined in a circularly symmetric neighbourhood, which requires interpolation of the intensity values for exact computation. The coordinates of \( g_p \) are given by \( \left( -R\sin(2\pi p/P),\; R\cos(2\pi p/P) \right) \) [35].

Fig. 4. Example of computing \( \mathrm{LBP}_{8,1} \): pixel neighbourhood (left); thresholded version (middle); resulting binary code (right).

The \( \mathrm{LBP}_{P,R} \) operator produces \( 2^P \) different output values, corresponding to the \( 2^P \) different binary patterns that can be formed by the P pixels in the neighbourhood set. A rotation of a textured input image causes the LBP patterns to translate to a different location and to rotate about their origin; if rotation invariance is required, it can be achieved by a rotation-invariant mapping, in which each LBP binary code is circularly rotated into its minimum value:

$$ \mathrm{LBP}_{P,R}^{ri} = \min_{i} \; \mathrm{ROR}\left( \mathrm{LBP}_{P,R},\, i \right) $$
(4)

where \( \mathrm{ROR}(x, i) \) denotes the circular bitwise right shift of the P-bit number \( x \) by \( i \) steps. For P = 8, a total of 36 unique values is obtained. This operator was designated LBPROT in [39]. Ojala et al. [35] showed, however, that LBPROT as such does not provide very good discrimination. They observed that certain local binary patterns are fundamental properties of texture, providing the vast majority of all 3 × 3 patterns present in the observed textures. They called these fundamental patterns "uniform", as they have one thing in common: a uniform circular structure that contains very few spatial transitions. They introduced a uniformity measure U(pattern), which corresponds to the number of spatial transitions (bitwise 0/1 changes) in the pattern. Patterns with a U value of at most 2 are designated uniform, and the following operator for grey-scale and rotation-invariant texture description was proposed:

$$ \mathrm{LBP}_{P,R}^{riu2} = \begin{cases} \sum\nolimits_{p = 0}^{P - 1} s\left( g_p - g_c \right), & \text{if } U\left( \mathrm{LBP}_{P,R} \right) \le 2 \\ P + 1, & \text{otherwise} \end{cases} $$
(5)

Equation (5) assigns to each uniform pattern a unique label corresponding to the number of "1" bits in the pattern, while the non-uniform patterns are grouped under the "miscellaneous" label \( (P + 1) \). In practice, the mapping from \( \mathrm{LBP}_{P,R} \) to \( \mathrm{LBP}_{P,R}^{riu2} \) is best implemented with a lookup table of \( 2^P \) elements. The final texture feature employed in texture analysis is the histogram of the operator output (i.e., of the pattern labels).

In the present work, we used the histogram of the uniform local binary pattern operator, with R (radius) equal to 1 and P (number of pixels in the neighbourhood) equal to 8, as a feature vector for the hand pose classification.
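The following Python sketch, our own illustration rather than the authors' C++ code, combines the \( \mathrm{LBP}_{8,1} \) labelling of Eq. (2) with the riu2 mapping of Eq. (5), implemented via the \( 2^P \)-element lookup table suggested above; the neighbour ordering starts to the right of \( g_c \), as in Fig. 4.

```python
# Uniform LBP (riu2) histogram for P = 8, R = 1 on a grey-scale image.
import numpy as np

P = 8

def riu2_table(p=P):
    table = np.zeros(2 ** p, dtype=np.uint8)
    for code in range(2 ** p):
        bits = [(code >> i) & 1 for i in range(p)]
        u = sum(bits[i] != bits[(i + 1) % p] for i in range(p))  # spatial transitions
        table[code] = sum(bits) if u <= 2 else p + 1             # uniform: count of 1s
    return table

def ulbp_histogram(gray):
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                # centre pixels g_c
    # 8 neighbours on the R = 1 ring, starting to the right of g_c
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]
    code = np.zeros_like(c)
    for p, (dy, dx) in enumerate(offsets):
        gp = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (gp >= c).astype(np.int32) << p      # s(g_p - g_c) * 2^p
    labels = riu2_table()[code]                      # lookup-table mapping
    hist = np.bincount(labels.ravel(), minlength=P + 2)  # 10 bins for P = 8
    return hist / hist.sum()                         # the feature vector
```

The resulting 10-bin histogram (labels 0–8 for the uniform patterns plus the miscellaneous label 9) is the feature vector used for hand pose classification.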

3.5 Fourier Descriptors

Instead of using the original image representation in the spatial domain, feature values can also be derived after applying a Fourier transformation. A feature vector calculated from a data representation in the transform domain is called a Fourier descriptor [40]. Fourier descriptors describe the boundary of a region [23, 41] and are considered more robust to noise and minor boundary modifications. In the present study, Fourier descriptors were obtained for the radial signature and for the centroid distance signature. For computational efficiency of the FFT, the number of points is chosen to be a power of two [6]. The normalized length is generally chosen equal to the calculated signature length (N). The Fourier transform then leads to N Fourier coefficients \( C_k \):

$$ C_k = \sum\limits_{i = 0}^{N - 1} z_i \exp\left( \frac{2\pi j i k}{N} \right), \quad k = 0, \ldots, N - 1 $$
(6)

In Eq. (6), \( z_i \) denotes the i-th sample of the signature. Table 1 shows the relation between motions in the image domain and in the transform domain, which can be exploited to obtain several types of invariance.

Table 1. Equivalence between motions in the image and transform domains.

The first coefficient \( C_0 \) is discarded, since it contains only the hand position. Hand rotation affects only the phase information; thus, if rotation invariance is needed, it can be achieved by taking the magnitude of the coefficients. Division of the coefficients by the magnitude of the second coefficient, \( C_1 \), achieves scale invariance. In this way we obtain N − 2 Fourier descriptors \( I_k \):

$$ I_k = \frac{\left| C_k \right|}{\left| C_1 \right|}, \quad k = 2, \ldots, N - 1 $$
(7)

Conceil et al. [6] showed that the hand shape is well reconstructed with 20 coefficients, so we used this number in our experiments. Centroid distance Fourier descriptors, obtained by applying the Fourier transform to a centroid distance signature, have been empirically shown to outperform other Fourier descriptors [41, 42].
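For illustration, a hedged Python sketch of Eqs. (6) and (7) follows; the function name and truncation interface are assumptions of ours.

```python
# Fourier descriptors of a signature (radial or centroid distance):
# discard C0 (position), normalize by |C1| (scale), keep magnitudes (rotation).
import numpy as np

def fourier_descriptors(signature, n_keep=20):
    c = np.fft.fft(signature)        # the N Fourier coefficients C_k
    mags = np.abs(c)                 # dropping phase gives rotation invariance
    desc = mags[1:] / mags[1]        # discard C0, divide by |C1| for scale invariance
    return desc[1:n_keep + 1]        # I_k for k = 2, ..., n_keep + 1 (Eq. 7)
```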

3.6 The Shi-Tomasi Corner Detector

The Shi-Tomasi corner detector [43] is an improved version of the Harris corner detector [44]. The improvement lies in how a region within the image is scored (and thus treated as a corner or not). The Harris corner detector determines the score \( R \) from the eigenvalues \( \lambda_1 \) and \( \lambda_2 \) computed over two regions (the second region being a shifted version of the first, used to test whether the difference between the two is large enough to indicate a corner) in the following way:

$$ R = \lambda_1 \lambda_2 - k\left( \lambda_1 + \lambda_2 \right)^2 $$
(8)

Shi and Tomasi instead use the minimum of the two eigenvalues:

$$ R = \min\left( \lambda_1, \lambda_2 \right) $$
(9)

and the region is marked as a corner if R is greater than a predefined threshold. They demonstrated experimentally in their paper that this score criterion is much better.
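The criterion of Eq. (9) is what OpenCV exposes through goodFeaturesToTrack, which the paper's OpenCV-based implementation likely relied on; the parameter values in this minimal Python sketch are our assumptions, not the paper's settings.

```python
# Shi-Tomasi corners on the hand grey-scale image via OpenCV.
import cv2
import numpy as np

def hand_corners(gray, max_corners=30):
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01,  # threshold on min(l1, l2)
                                      minDistance=5)      # spacing between corners
    return corners.reshape(-1, 2) if corners is not None else np.empty((0, 2))
```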

4 Machine Learning

The study and computer modelling of learning processes in their multiple manifestations constitutes the subject of machine learning [45]. Machine learning is the task of programming computers to optimize a performance criterion using example data or past experience [46]; for that, it uses statistical theory to build mathematical models, since the core task is to make inferences from sample data. Two entities, the teacher and the learner, play a crucial role in machine learning: the teacher is the entity that has the knowledge required to perform a given task, and the learner is the entity that has to learn that knowledge in order to perform the task. Learning strategies can be distinguished by the amount of inference the learner performs on the information provided by the teacher. The learning problem can be stated as follows: given an example set of limited size, find a concise data description [45]. In our study, supervised learning was used, where the classification classes are known in advance: given a sample of input-output pairs, called the training sample, the task is to find a deterministic function or model that maps any input to an output and can predict future observations, minimizing the error as much as possible. The models were learned from the extracted hand features with the help of the RapidMiner tool, and the best learners identified for the produced datasets were k-NN (k-nearest neighbour), ANN (artificial neural network) and SVM (support vector machine).

5 Datasets and Experimental Methodology

For data analysis, careful feature selection, dataset preparation and data transformation are important phases: in order to construct the right model it is necessary to understand the data under study, and successful data mining involves far more than selecting a learning algorithm and running it over the data [22]. To process the recorded videos, a C++ application was developed using openFrameworks and the OpenCV [47] and OpenNI [48] libraries. The application runs through all the recorded video files and extracts the respective features for each algorithm. The features thus obtained are saved in text files, later converted to Excel files so that they can be imported into RapidMiner for data analysis in order to find the best learner for each dataset. The experiments were performed on an Intel Core i7 (2.8 GHz) Mac OS X computer with 4 GB of DDR3 RAM, under the k-fold cross-validation protocol, which is used to determine how accurately a learning algorithm will predict data that it was not trained with [16, 45]. A value of k = 10 (10-fold cross validation) was used, a good rule of thumb, although the best value depends on the algorithm and the dataset [22, 46]. The performance of the algorithms, based on the counts of test records correctly and incorrectly predicted by the model, was then analysed. Table 2 summarizes the best learner for each dataset with the corresponding parameters.

Table 2. ML algorithms identified as best learners for each dataset and used parameters.
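The analysis itself was performed in RapidMiner; as a rough analogue only, the same 10-fold protocol could be reproduced with scikit-learn as in the sketch below, where the k-NN choice and the data variables X (feature matrix) and y (gesture labels) are illustrative assumptions.

```python
# 10-fold cross-validation of a learner on one feature dataset.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y):
    clf = KNeighborsClassifier(n_neighbors=1)     # e.g. k-NN with k = 1
    scores = cross_val_score(clf, X, y, cv=10)    # 10-fold cross validation
    return scores.mean(), scores.std()            # accuracy estimate
```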

6 Results and Discussion

After analysing the different datasets, the results obtained were in most cases encouraging, although in some cases weaker than one could expect. In order to analyse how the classification errors are distributed among the classes, a confusion matrix was computed for each learner with the help of RapidMiner. In the following we present the results obtained with each dataset, in terms of the best learner, the respective confusion matrix and the average recognition accuracy.

For the Radial Signature dataset, the best learner was the neural network, with an accuracy of 91.0 %; Table 3 shows the respective confusion matrix. For the Centroid Distance dataset, the best learner was also the neural network, with an accuracy of 90.1 %; Table 4 shows the respective confusion matrix. The k-NN classifier, with k = 1, obtained the best values for the Radial Signature Fourier Descriptors and the Centroid Distance Fourier Descriptors, with accuracies of 82.28 % and 79.53 % respectively; Tables 5 and 6 show their confusion matrices. For the LBP and HoG operators, the best learner was the SVM with an RBF (radial basis function) kernel and soft margins, with C = 6 and C = 2 and a bias (offset) of 0.032 and 0.149 respectively. The achieved accuracy was 89.3 % for the LBP operator and 61.46 % for the HoG operator. The SVM library used was libSVM [49], since it supports multi-class classification; the obtained confusion matrices are shown in Tables 7 and 8. For the Shi-Tomasi corner detector, the best learner was the neural network with a learning rate of 0.1, but with very poor results. As can be seen from the HoG and Shi-Tomasi confusion matrices (Tables 8 and 9), a lot of misclassification occurred, resulting from similar feature values for different gestures.
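For reference, the reported SVM configuration corresponds, in scikit-learn's libSVM wrapper, to something like the following sketch; gamma is not reported in the paper and is therefore left at its default, an assumption on our part.

```python
# RBF-kernel SVMs with the soft-margin C values reported above.
from sklearn.svm import SVC

svm_lbp = SVC(kernel='rbf', C=6)   # LBP dataset (reported C = 6)
svm_hog = SVC(kernel='rbf', C=2)   # HoG dataset (reported C = 2)
```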

Table 3. Radial signature dataset confusion matrix.
Table 4. Centroid distance dataset confusion matrix.
Table 5. Radial signature Fourier confusion matrix.
Table 6. Centroid distance Fourier confusion matrix.
Table 7. Local binary patterns dataset confusion matrix.
Table 8. Histogram of gradients dataset confusion matrix.
Table 9. Shi-Tomasi corner detector confusion matrix.

7 Conclusions and Future Work

This paper presented a comparative study of seven algorithms for hand feature extraction, aimed at static hand gesture classification and recognition for human-computer interaction. We defined our own gesture vocabulary of 10 gestures (Fig. 1) and recorded videos of 20 persons performing the gestures for hand feature extraction. The main goal of the study was to test the robustness of each algorithm, applied individually, in terms of scale, translation and rotation invariance. After analysing the data and the obtained results, we conclude that further pre-processing of the video frames is necessary in order to minimize the number of different feature values obtained for the same hand posture. The depth images obtained with the Kinect have low resolution and some noise, and we concluded that some of the imprecision in the data recordings results from these problems, making class learning more difficult; there are several interpretations of noise, as explained in [46]. For this reason, it was decided that temporal and/or spatial filtering should be applied, and this will be tested and analysed to see whether better results are achieved. We found the radial signature and the centroid distance to be the best shape descriptors discussed in this paper in terms of robustness and computational complexity; sometimes we have to apply the principle known as Occam's razor, which states that "simpler explanations are more plausible and any unnecessary complexity should be shaved off". The Shi-Tomasi corner detector, as implemented in OpenCV, achieved the weakest results, and we will try it with a bag-of-features approach [50, 51]. Better results were expected from the Fourier descriptors, given the related work in the area, so we will evaluate them further once the temporal filtering of the video stream has been implemented. For the local binary pattern operator, different radii and numbers of neighbours will be tested to analyse whether better results can be obtained.

In recent implementations, a cumulative average of the hand position was used to minimize image noise without degrading performance in terms of frame rate. We were able to verify in those implementations that this method improves feature extraction accuracy, with positive implications for the final gesture classification.