1 Introduction

Hand gesture recognition has started to play a major role in various human–computer interaction applications. Both non-vision and vision-based approaches have been used to achieve hand gesture recognition. An example of a non-vision approach was reported in [1], where finger bending was detected by a pair of wired gloves. In general, vision-based approaches are more natural as they require no hand-fitting devices. Vision-based approaches can be divided into active and passive sensing. Active sensing approaches have proven successful for gesture recognition, in particular through the use of Kinect [2, 3] and time-of-flight cameras [4]. However, these approaches face challenges when deployed on resource-constrained, computation-limited devices such as smartphones.

Many passive vision-based hand gesture recognition techniques have been introduced in the literature, e.g. [5–7], where images from a single camera are used to achieve gesture recognition. Hand gestures can be classified into two categories: static and motional gestures. A recognition technique for static gestures was reported in [8], where features derived from elastic graph matching were used to identify hand postures in complex backgrounds, leading to a recognition rate of 85 %. In [9], a learning approach based on a disjunctive normal form, involving the use of normalized hand moments and compactness, led to a recognition rate of 93 %. In [10], finger spell recognition was achieved at a processing time of 125 ms per image frame using the CamShift algorithm. In [11], principal component analysis was used for hand gesture recognition. As far as motional hand gesture recognition is concerned, the following three major approaches have been utilized: optical flow, model-based, and HMM. In [5], a hand gesture model was devised using an Adaboost classifier and Haar features together with a Kalman predictor to cope with false detection. In [7], a model-based tracking of hand gestures was considered. The HMM approach to hand gesture recognition was covered in [12].

The use of stereo images for real-time passive vision-based hand gesture recognition has been fairly limited in the literature. In [13] and [14], a stereo camera with dedicated hardware was utilized to generate depth maps for hand gesture recognition; however, no real-time processing rates were reported in these references. Our solution presented in this paper is an attempt to perform hand gesture recognition in real time based on stereo images that are captured by an inexpensive stereo webcam, such as the one in [15]. The motivation here has been to increase the robustness of hand detection, and hence the robustness of hand gesture recognition, by using a pair of low-resolution stereo images instead of images taken from a single camera. The challenge in this attempt has been to achieve the increase in robustness in a computationally efficient manner so that a real-time throughput is reached. The introduced approach establishes a balance between robustness and computational complexity. On one hand, the developed solution is designed to be robust to different backgrounds and lighting conditions. On the other hand, it is designed to incorporate time-efficient and relatively simple functions to achieve a real-time throughput. The introduced approach merges the information from the left and right images of a stereo camera to increase the robustness of hand detection while meeting the real-time constraint.

In Sect. 2, the details of our approach are discussed. The results obtained are then reported in Sect. 3. This section also includes comparisons with two existing approaches. Finally, the conclusion is stated in Sect. 4.

2 Real-time robust hand gesture recognition

Two types of hand gestures have been considered in this paper: directional hand movement and finger number spelling. The developed recognition system consists of four main components: online calibration of the hand color, color-based hand detection, hand tracking, and finally hand gesture recognition.

2.1 Online color calibration

The goal of the online color calibration component is to adapt our subsequent color processing to the color characteristics of the light source under which images are captured. We have previously used this technique quite successfully for face detection in [16] to cope with the unknown color characteristics of various light sources encountered in practice. The calibration is done only once, at the beginning when the system is turned on. It involves building a GMM model in the CrCb color space to represent the color characteristics of the hand being captured in an online or on-the-fly manner. The calibration is performed easily by the user simply placing his or her hand in a box displayed at the image center, see Fig. 1. Representative skin color pixels are collected within this box using a two-cluster k-means clustering algorithm separating skin pixels from non-skin pixels. A GMM model is then trained and used for a region growing hand color segmentation within a region-of-interest specified by a tracking module mentioned next. More details of the online color calibration can be found in [16].
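The pixel-collection step described above can be sketched as follows. This is an illustrative sketch of the idea, not the authors' code: pixels inside the calibration box are split into two clusters by k-means on their (Cr, Cb) chroma values, and the tighter, larger cluster is taken as the skin sample set; the paper then fits a GMM to that set. All function and variable names here are my own.

```python
def kmeans_two_clusters(points, iters=20):
    """Plain two-cluster k-means on 2-D points, here (Cr, Cb) pairs."""
    # Initialise the two centroids with the two lexicographic extremes.
    c0, c1 = min(points), max(points)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d0 = (p[0] - c0[0]) ** 2 + (p[1] - c0[1]) ** 2
            d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
            groups[0 if d0 <= d1 else 1].append(p)
        def mean(g, old):
            if not g:
                return old  # keep old centroid if a cluster empties out
            return (sum(p[0] for p in g) / len(g),
                    sum(p[1] for p in g) / len(g))
        c0, c1 = mean(groups[0], c0), mean(groups[1], c1)
    return groups, (c0, c1)

# Toy data: a tight "skin" chroma cluster and a spread "background" cluster.
skin = [(150 + i % 3, 110 + i % 2) for i in range(30)]
background = [(90 + 5 * i, 160 + 3 * i) for i in range(10)]
groups, centroids = kmeans_two_clusters(skin + background)
```

In the real system the clustering runs on the chroma values of the pixels inside the displayed calibration box, so the user's hand dominates one cluster and the box background the other.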

Fig. 1

Online color calibration: a left camera calibration box, b right camera calibration box

2.2 Hand detection

Our hand detection approach involves two main steps: hand tracking, and robustness improvement using stereo images.

2.2.1 Hand tracking

Existing tracking methods, including sparse [17] or dense [18] optical flow and Kalman filtering, pose challenges for real-time operation due to their computational complexity. To obtain computationally efficient tracking, the CamShift algorithm [19] is adopted in our approach. In this algorithm, the hue component of color is used for tracking. Figure 2 shows a sample hue histogram associated with a hand. The histogram within a window is used as the hand tracking feature together with a searching window. The center of the window is used as the seed point for the so-called flood fill region growing operation [20] to achieve segmentation in a computationally efficient manner. The CamShift algorithm works similarly to the MeanShift algorithm but also copes with dynamically changing distributions by readjusting the search window size.
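A single CamShift-style iteration can be sketched as below. This is a minimal illustration of the principle (names are mine, and this is neither the paper's nor OpenCV's implementation): the search window is re-centered on the centroid of the hue back-projection probabilities it covers, and its size is readjusted from the zeroth moment, which is what lets CamShift cope with a hand that grows or shrinks in the image.

```python
import math

def camshift_step(prob, x, y, w, h):
    """One tracking step on a 2-D probability map 'prob' (rows of floats)."""
    m00 = m10 = m01 = 0.0
    for r in range(y, min(y + h, len(prob))):
        for c in range(x, min(x + w, len(prob[0]))):
            p = prob[r][c]
            m00 += p          # zeroth moment: total mass in the window
            m10 += c * p      # first moments give the centroid
            m01 += r * p
    if m00 == 0:
        return x, y, w, h                      # nothing to track in window
    cx, cy = m10 / m00, m01 / m00              # centroid of the window mass
    side = max(1, int(round(2 * math.sqrt(m00))))  # CamShift-style resize
    return int(round(cx - side / 2)), int(round(cy - side / 2)), side, side

# Toy back-projection: a bright 3x3 blob of skin probability at rows/cols 6..8.
prob = [[0.0] * 12 for _ in range(12)]
for r in range(6, 9):
    for c in range(6, 9):
        prob[r][c] = 1.0
x, y, w, h = 4, 4, 5, 5
for _ in range(5):                             # iterate until the window locks on
    x, y, w, h = camshift_step(prob, x, y, w, h)
```

After a few iterations the window centers on the blob at (7, 7), and its center then serves as the seed point for the flood fill segmentation.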

Fig. 2

Sample hue histogram used for CamShift hand tracking

The segmented areas from the left and right images are merged by aligning the left and right images as was previously reported in [21]. The merged area is then used for hand contour extraction. The flow chart of all the components involved in our introduced approach appears in Fig. 3.

Fig. 3

Flowchart of the introduced real-time solution using stereo images

2.2.2 Increasing robustness via stereo images

The following rules are introduced to merge the information from the left and right cameras, leading to a more robust hand detection as compared to using a single camera image. Let \( x_{i + 1} \) denote the current hand mask and \( x_{i} \) the previous hand mask. The superscripts l and r indicate the left and right camera label for the masks. Let S represent the mask area and δ a percentage parameter reflecting the mask area difference between the frames. Our experiments have revealed that a δ value in the range 25–30 % can cope with the variability in hand motions made by various subjects. Due to the continuity of motion, it is not physically possible to have a large mask area difference between the frames, either left to right or current to previous. Even when the hand is approaching the camera, the mask area is expected to grow consistently. Only when the current mask does not exhibit a large difference from the previous one in both the left and right images do the masks get merged. For instance, first the mask areas of the current left frame \( S(x_{i + 1}^{\text{l}} ) \) and the current right frame \( S(x_{i + 1}^{\text{r}} ) \) are compared. If there exists relatively little difference between them as per Eq. (1), the change is considered to be consistent. Next, the change in the mask areas between the previous and current frames is examined. If both the left and right area changes, that is \( |S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| \) and \( |S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| \), show a consistent change between the current and previous images as per Eq. (2), the current left and right hand masks are merged and updated as per Eq. (3). If one of the areas exhibits an inconsistent change as per Eq. (4) or (6), the update in Eq. (5) or (7) is performed; if neither side changes consistently, the previous mask is retained as per Eq. (8). If there exists a large difference between the current left and right frames, the image side that does not change consistently, as per Eq. (9) or (11), is not used for the next time frame and only the consistent image side is used, as per Eq. (10) or (12). Otherwise, no update is done for the next time frame, as per Eq. (13).

$$ {\text{if }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i + 1}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{l}} ) $$
(1)
$$ {\text{if }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| < \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{r}} ) $$
(2)
$$ x_{i + 1} = x_{i + 1}^{\text{l}} + x_{i + 1}^{\text{r}} $$
(3)
$$ {\text{elseif }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| < \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| > \delta *S(x_{i + 1}^{\text{r}} ) $$
(4)
$$ x_{i + 1} = x_{i + 1}^{\text{l}} $$
(5)
$$ {\text{elseif }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| > \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{r}} ) $$
(6)
$$ x_{i + 1} = x_{i + 1}^{\text{r}} $$
(7)

else

$$ x_{i + 1} = x_{i} $$
(8)

else

$$ \begin{aligned} {\text{if }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| < \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| > \delta *S(x_{i + 1}^{\text{r}} ) \\ {\text{and }}S(x_{i + 1}^{\text{l}} ) < S(x_{i + 1}^{\text{r}} ) \end{aligned} $$
(9)
$$ x_{i + 1} = x_{i + 1}^{\text{l}} $$
(10)
$$ \begin{aligned} {\text{elseif }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| > \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{r}} ) \\ {\text{and }}S(x_{i + 1}^{\text{l}} ) > S(x_{i + 1}^{\text{r}} ) \end{aligned} $$
(11)
$$ x_{i + 1} = x_{i + 1}^{\text{r}} $$
(12)

else

$$ x_{i + 1} = x_{i} $$
(13)
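The decision logic of Eqs. (1)–(13) can be transcribed into a single function operating only on mask areas. The sketch below is my own transcription, not the authors' code; it returns a label saying which mask(s) should be kept for the next frame.

```python
def select_mask(sl_cur, sl_prev, sr_cur, sr_prev, delta=0.3):
    """Apply the stereo consistency rules of Eqs. (1)-(13) to mask areas."""
    l_ok = abs(sl_cur - sl_prev) < delta * sl_cur   # left change consistent
    r_ok = abs(sr_cur - sr_prev) < delta * sr_cur   # right change consistent
    if abs(sl_cur - sr_cur) < delta * sl_cur:       # Eq. (1): views agree
        if l_ok and r_ok:
            return "merge"                          # Eqs. (2)-(3)
        if l_ok and not r_ok:
            return "left"                           # Eqs. (4)-(5)
        if not l_ok and r_ok:
            return "right"                          # Eqs. (6)-(7)
        return "previous"                           # Eq. (8)
    # Views disagree: keep only the side whose change is consistent.
    if l_ok and not r_ok and sl_cur < sr_cur:
        return "left"                               # Eqs. (9)-(10)
    if not l_ok and r_ok and sl_cur > sr_cur:
        return "right"                              # Eqs. (11)-(12)
    return "previous"                               # Eq. (13)
```

For example, with δ = 0.3, areas (100, 95) on the left and (110, 105) on the right are consistent in all respects and yield "merge", whereas a right mask that suddenly triples in area falls back to the left mask alone.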

2.2.3 Hand contour detection

For gestures that rely on the hand contour, the contour must be obtained in a computationally efficient manner. Many contour detection techniques have been discussed in the literature, involving the following four major approaches: prior knowledge [22], morphology [23], level set [24], and active contour model [25]. In our system, the morphology approach is adopted due to its computational efficiency.
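One common morphological way to obtain a contour is sketched below; this is an illustrative example of the morphology approach in general, not necessarily the authors' exact operator: the binary hand mask is eroded with a 3×3 structuring element and the eroded mask is subtracted, leaving only the one-pixel-wide boundary.

```python
def erode(mask):
    """Binary erosion with a full 3x3 structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # A pixel survives only if its whole 3x3 neighbourhood is set.
            out[r][c] = int(all(mask[r + dr][c + dc]
                                for dr in (-1, 0, 1) for dc in (-1, 0, 1)))
    return out

def contour(mask):
    """Morphological boundary: original mask minus its erosion."""
    er = erode(mask)
    return [[mask[r][c] - er[r][c] for c in range(len(mask[0]))]
            for r in range(len(mask))]

# A filled 4x4 square inside a 6x6 image: its contour is the square's border.
mask = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)]
        for r in range(6)]
edge = contour(mask)
```

Both passes are simple local operations, which is why the morphology approach meets the real-time budget better than level sets or active contours.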

2.3 Hand gesture recognition

As pointed out earlier, two types of hand gestures are considered in our system. The first type is motional gestures, consisting of seven directional hand gestures: rotation, forward, backward, left, right, up, and down. The second type is finger spelling, consisting of the six numbers zero through five. The first type of hand gesture is recognized via the dynamic time warping technique, while the second type is recognized via the convex hull technique.

2.3.1 Motional gesture recognition

As was reported in [26], although disparity can provide the hand depth information using a stereo camera, it loses its sensitivity when the hand is held far from the camera. Here, for forward and backward movements, the contour area variance is used due to its simplicity. For the other motional hand gestures, the Dynamic Time Warping (DTW) algorithm is used, as this algorithm is capable of generating the dynamic distance between an unknown gesture signal and a set of reference gesture signals in a computationally efficient manner while coping with different speeds of motional hand gestures. The sample gesture signal comes from the seed point of the CamShift tracking. The details of the DTW algorithm are discussed in [27, 28].
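The core DTW computation can be sketched in its standard textbook form (the paper's exact variant in [27, 28] may differ): a captured motion signal is compared against a reference signal recorded at a possibly different speed, and the warping absorbs the speed difference.

```python
def dtw(a, b):
    """DTW distance between two 1-D signals with |.| as the local cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of insertion, deletion and match at this cell.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

ref = [0, 1, 2, 3, 2, 1, 0]                         # reference trajectory
slow = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]   # same shape, half speed
other = [0, -1, -2, -3, -2, -1, 0]                  # opposite direction
```

Here `dtw(ref, slow)` is zero even though the signals have different lengths, which is exactly the speed invariance the system relies on, while the opposite-direction signal scores much worse.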

A tree structure is used to indicate the priority level of the hand gestures (see Fig. 4). The circumscribed angle of the seed point in the CamShift tracking serves as the motional signal of the "rotation" gesture, which is placed at the highest priority level, followed by the "forward" and "backward" gestures, for which the disparity of the seed point is used as the motional signal. An actual sample "rotation" signal and its corresponding reference signal are shown in Fig. 5. The gradient or position difference of the seed point between consecutive frames serves as the motional signal of the "left" gesture. An actual sample "left" signal and its corresponding reference signal are shown in Fig. 6.
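The priority tree of Fig. 4 can be read as an ordered dispatch: candidate gestures are tested from highest to lowest priority, and the first one whose motion signal scores below its acceptance threshold wins. The sketch below illustrates that reading only; the gesture names come from the paper, but the score/threshold mechanism and all values are placeholders of my own.

```python
def classify(scores, priority, thresholds):
    """scores: gesture -> DTW distance to its reference; smaller is better."""
    for gesture in priority:                 # highest priority first
        if scores.get(gesture, float("inf")) < thresholds[gesture]:
            return gesture
    return "none"                            # no gesture accepted

priority = ["rotation", "forward", "backward", "left", "right", "up", "down"]
thresholds = {g: 5.0 for g in priority}      # placeholder acceptance levels
```

With this ordering, a frame sequence that weakly matches "left" but strongly matches "rotation" is still reported as "rotation", since rotation sits higher in the tree.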

Fig. 4

Priority level of the hand gestures

Fig. 5

A sample “rotation” signal and corresponding reference signal

Fig. 6

A sample “left” signal and corresponding reference signal

2.3.2 Finger spelling recognition

Finger spell recognition is done by first performing hand contour extraction. Then, the convex hull of the detected contour is used to determine the number of finger tips, as reported in [29]. For numbers 2 through 5, recognition is easily achieved from the number of defects of the convex hull: if there are n convexity defects in a hand contour, there are n + 1 fingers. However, this rule does not hold for numbers 0 and 1. The area of the convex hull is thus used instead. As per Eq. (14), the area of the convex hull \( S_{\text{convex}} \) is compared with the contour area \( S_{\text{contour}} \),

$$ {\text{Number}} = \left\{ {\begin{array}{*{20}c} 0 & {S_{\text{convex}} \le \alpha *S_{\text{contour}} } \\ 1 & {\text{otherwise}} \\ \end{array} } \right. $$
(14)

where α denotes a parameter related to the camera distance range where gestures are made. Finger spell recognition requires a relatively high resolution of the hand contour. If the hand is too far from the camera, the recognition suffers due to images having low resolution. If the hand is too close to the camera, the recognition also suffers, this time due to too much variation in the hand contour. Our experiments indicated that the following α's provided relatively consistent outcomes: α = 10 % for a 15–35 cm camera distance range and 20 % for a 10–15 cm camera distance range. Sample finger spell contours and recognized numbers are shown in Fig. 7.
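The counting rule and Eq. (14) combine into a small classifier, written out below as my own sketch (in practice the defects would come from convexity-defect analysis of the extracted contour, e.g. OpenCV-style, which is not shown here):

```python
def finger_number(n_defects, s_convex, s_contour, alpha=0.10):
    """n_defects: convexity defects between raised fingers.
    alpha: distance-range parameter from Eq. (14)."""
    if n_defects >= 1:
        return n_defects + 1        # n defects imply n + 1 fingers (2..5)
    # Zero defects: distinguish a fist (0) from one finger via Eq. (14).
    return 0 if s_convex <= alpha * s_contour else 1
```

For example, three defects yield the number four regardless of the areas, while a defect-free contour is classified by the area comparison of Eq. (14) alone.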

Fig. 7

Finger spelling contours and recognized numbers


3 Recognition results and real-time processing

Examples are provided to show the increase in robustness when the information from the two cameras is merged versus using a single camera. The algorithm was written in C and run on a PC with a dual core 2.1 GHz processor. The input images were captured with the Novo Minoru stereo webcam, a relatively inexpensive device generating low-resolution images of size 640 × 480. Additional stereo images were examined using a Fuji stereo digital camera. Table 1 gives a comparison of the hand detection outcome using a single image versus a pair of stereo images under different lighting conditions. As can be seen from this table, both the single and stereo image approaches achieved a frame rate of 31–32 frames per second (fps), or about 30 ms per frame. However, as shown in Table 1, by merging the information from the left and right images, the average percentage detection rate was considerably improved (by nearly 60 %).

Table 1 Comparison of hand detection rates when using single images versus pairs of stereo images

In our experiments, 50 cases of each movement were considered under various lighting and background conditions. Tables 2 and 3 provide the recognition confusion matrices when using single images versus when using stereo images. As can be seen from Table 3, by combining the information from the left and right cameras, in particular for backward and forward motions, the overall recognition rate was significantly improved. For motional hand gestures, the average recognition rate reached 93 % when using stereo images, as compared to 66 % when using single images. Notice that there was a 4 % no-detection rate, caused by the low resolution of the captured hand images.

Table 2 Motional hand gesture recognition confusion matrix when using single images
Table 3 Motional hand gesture recognition confusion matrix when using pairs of stereo images

The finger spell recognition comparison outcome when using single images versus when using stereo images is provided in Tables 4 and 5. For finger spell recognition, the average recognition rate reached 92 % when using stereo images as compared to 62 % when using single images.

Table 4 Finger spelling recognition confusion matrix when using single images
Table 5 Finger spell recognition confusion matrix when using pairs of stereo images

In Table 6, the processing times of all the components in our real-time approach are listed, totaling approximately 30 ms per frame. Note that the online color calibration took 1 s but is not included in the table since it is done only once, at the beginning when the system is turned on.

Table 6 Average and standard deviation processing times of the components of our introduced approach

In a different set of experiments, our introduced approach was compared to two existing approaches in the literature that have been shown to provide high recognition rates, namely optical flow and HMM.

Table 7 provides a comparison of our introduced approach to these approaches. As shown in this table, although the recognition rates between the three approaches were more or less comparable, our approach achieved a higher frame rate leading to a real-time throughput.

Table 7 Comparison of average and standard deviation recognition and frame rates between two existing approaches and our introduced approach

4 Conclusion

In this paper, a real-time and robust approach to hand gesture recognition based on a pair of stereo images has been introduced. It has been shown that by merging the information from the left and right images of a stereo image pair, robust hand detection is achieved, leading to high recognition rates for two types of hand gestures. An average recognition rate of 93 % for seven motional hand gestures and an average recognition rate of 92 % for finger spelling hand gestures have been obtained under realistic lighting conditions and against various backgrounds. A careful selection of existing computationally efficient approaches has led to a real-time processing rate of 30 fps on the PC platform using an inexpensive stereo webcam.