1 Introduction

Hand gesture recognition has started to play a major role in various human–computer interaction applications. Both non-vision and vision-based approaches have been used to achieve hand gesture recognition. An example of a non-vision approach was reported in [1], where finger bending was detected by a pair of wired gloves. In general, vision-based approaches are more natural as they require no hand-fitting devices. Vision-based approaches can be divided into active and passive sensing. Active sensing approaches have proven successful for gesture recognition, in particular through the use of Kinect [2, 3] and time-of-flight cameras [4]. However, these approaches face challenges when deployed on resource-constrained, computation-limited devices such as smartphones.

Many passive vision-based hand gesture recognition techniques have been introduced in the literature, e.g. [5–7], where images from a single camera are used to achieve gesture recognition. Hand gestures can be classified into two categories: static and motional gestures. A recognition technique for static gestures was reported in [8], where features derived from elastic graph matching were used to identify hand postures in complex backgrounds, leading to a recognition rate of 85 %. In [9], a learning approach based on a disjunctive normal form, involving the use of normalized hand moments and compactness, led to a recognition rate of 93 %. In [10], finger spell recognition was achieved at a processing time of 125 ms per image frame using the CamShift algorithm. In [11], principal component analysis was used for hand gesture recognition. As far as motional hand gesture recognition is concerned, the following three major approaches have been utilized: optical flow, model-based, and HMM. In [5], a hand gesture model was devised using an Adaboost classifier and Haar features together with a Kalman predictor to cope with false detection. In [7], a model-based tracking of hand gestures was considered. The HMM approach to hand gesture recognition was covered in [12].

The use of stereo images for real-time passive vision-based hand gesture recognition has been fairly limited in the literature. In [13] and [14], a stereo camera with dedicated hardware was utilized to generate depth maps for hand gesture recognition; however, no real-time processing rates were reported in these references. Our solution presented in this paper is an attempt to perform hand gesture recognition in real time based on stereo images that are captured by an inexpensive stereo webcam, such as the one in [15]. The motivation here has been to increase the robustness of hand detection, and hence the robustness of hand gesture recognition, by using a pair of low-resolution stereo images instead of images taken from a single camera. The challenge in this attempt has been to achieve the increase in robustness in a computationally efficient manner so that a real-time throughput is reached. The introduced approach establishes a balance between robustness and computational complexity. On one hand, the developed solution is designed to be robust to different backgrounds and lighting conditions. On the other hand, it is designed to incorporate time-efficient and relatively simple functions to achieve a real-time throughput. The introduced approach merges the information from the left and right images of a stereo camera to increase the robustness of hand detection while meeting the real-time constraint.

In Sect. 2, the details of our approach are discussed. The results obtained are then reported in Sect. 3. This section also includes comparisons with two existing approaches. Finally, the conclusion is stated in Sect. 4.

2 Real-time robust hand gesture recognition

Two types of hand gestures have been considered in this paper: directional hand movement and finger number spelling. The developed recognition system consists of four main components: online calibration of the hand color, color-based hand detection, hand tracking, and finally hand gesture recognition.

2.1 Online color calibration

The goal of the online color calibration component is to adapt our subsequent color processing to the color characteristics of the light source under which images are captured. We have previously used this technique quite successfully for face detection in [16] to cope with the unknown color characteristics of various light sources encountered in practice. The calibration is done only once, at the beginning when the system is turned on. It involves building a GMM model in the CrCb color space to represent the color characteristics of the hand being captured in an online or on-the-fly manner. The calibration is performed easily by the user simply placing his or her hand in a box displayed at the image center, see Fig. 1. Representative skin color pixels are collected within this box using a two-cluster k-means clustering algorithm separating skin pixels from non-skin pixels. A GMM model is then trained and used for a region growing hand color segmentation within a region-of-interest specified by a tracking module mentioned next. More details of the online color calibration can be found in [16].
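The pixel-collection step described above can be sketched as follows. This is an illustrative sketch of the idea, not the authors' code: pixels inside the calibration box are split into two clusters by k-means on their (Cr, Cb) chroma values, and the tighter, larger cluster is taken as the skin sample set; the paper then fits a GMM to that set. All function and variable names here are my own.

```python
def kmeans_two_clusters(points, iters=20):
    """Plain two-cluster k-means on 2-D points, here (Cr, Cb) pairs."""
    # Initialise the two centroids with the two lexicographic extremes.
    c0, c1 = min(points), max(points)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d0 = (p[0] - c0[0]) ** 2 + (p[1] - c0[1]) ** 2
            d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
            groups[0 if d0 <= d1 else 1].append(p)
        def mean(g, old):
            if not g:
                return old  # keep old centroid if a cluster empties out
            return (sum(p[0] for p in g) / len(g),
                    sum(p[1] for p in g) / len(g))
        c0, c1 = mean(groups[0], c0), mean(groups[1], c1)
    return groups, (c0, c1)

# Toy data: a tight "skin" chroma cluster and a spread "background" cluster.
skin = [(150 + i % 3, 110 + i % 2) for i in range(30)]
background = [(90 + 5 * i, 160 + 3 * i) for i in range(10)]
groups, centroids = kmeans_two_clusters(skin + background)
```

In the real system the clustering runs on the chroma values of the pixels inside the displayed calibration box, so the user's hand dominates one cluster and the box background the other.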

Fig. 1

Online color calibration: a left camera calibration box, b right camera calibration box

2.2 Hand detection

Our hand detection approach involves two main steps: hand tracking, and robustness improvement using stereo images.

2.2.1 Hand tracking

Existing tracking methods, including sparse [17] or dense [18] optical flow and Kalman filtering, pose challenges for real-time operation due to their computational complexity. To obtain computationally efficient tracking, the CamShift algorithm [19] is adopted in our approach. In this algorithm, the hue component of color is used for tracking. Figure 2 shows a sample hue histogram associated with a hand. The histogram within a window is used as the hand tracking feature together with a searching window. The center of the window is used as the seed point for the so-called flood fill region growing operation [20] to achieve segmentation in a computationally efficient manner. The CamShift algorithm works similarly to the MeanShift algorithm but also copes with dynamically changing distributions by readjusting the search window size.
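A single CamShift-style iteration can be sketched as below. This is a minimal illustration of the principle (names are mine, and this is neither the paper's nor OpenCV's implementation): the search window is re-centered on the centroid of the hue back-projection probabilities it covers, and its size is readjusted from the zeroth moment, which is what lets CamShift cope with a hand that grows or shrinks in the image.

```python
import math

def camshift_step(prob, x, y, w, h):
    """One tracking step on a 2-D probability map 'prob' (rows of floats)."""
    m00 = m10 = m01 = 0.0
    for r in range(y, min(y + h, len(prob))):
        for c in range(x, min(x + w, len(prob[0]))):
            p = prob[r][c]
            m00 += p          # zeroth moment: total mass in the window
            m10 += c * p      # first moments give the centroid
            m01 += r * p
    if m00 == 0:
        return x, y, w, h                      # nothing to track in window
    cx, cy = m10 / m00, m01 / m00              # centroid of the window mass
    side = max(1, int(round(2 * math.sqrt(m00))))  # CamShift-style resize
    return int(round(cx - side / 2)), int(round(cy - side / 2)), side, side

# Toy back-projection: a bright 3x3 blob of skin probability at rows/cols 6..8.
prob = [[0.0] * 12 for _ in range(12)]
for r in range(6, 9):
    for c in range(6, 9):
        prob[r][c] = 1.0
x, y, w, h = 4, 4, 5, 5
for _ in range(5):                             # iterate until the window locks on
    x, y, w, h = camshift_step(prob, x, y, w, h)
```

After a few iterations the window centers on the blob at (7, 7), and its center then serves as the seed point for the flood fill segmentation.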

Fig. 2

Sample hue histogram used for CamShift hand tracking

The segmented areas from the left and right images are merged by aligning the left and right images as was previously reported in [21]. The merged area is then used for hand contour extraction. The flow chart of all the components involved in our introduced approach appears in Fig. 3.

Fig. 3

Flowchart of the introduced real-time solution using stereo images

2.2.2 Increasing robustness via stereo images

The following rules are introduced to merge the information from the left and right cameras, leading to a more robust hand detection as compared to using a single camera image. Let \( x_{i + 1} \) denote the current hand mask and \( x_{i} \) the previous hand mask. The superscripts l and r indicate the left and right camera label for the masks. Let S represent the mask area and δ a percentage parameter reflecting the mask area difference between the frames. Our experiments have revealed that a δ value in the range 25–30 % can cope with the variability in hand motions made by various subjects. Due to the continuity of motion, it is not physically possible to have a large mask area difference between the frames, either left to right or current to previous. Even when the hand is approaching the camera, the mask area is expected to grow consistently. Only when the current mask does not exhibit a large difference from the previous one in both the left and right images do the masks get merged. For instance, first the mask areas of the current left frame \( S(x_{i + 1}^{\text{l}} ) \) and the current right frame \( S(x_{i + 1}^{\text{r}} ) \) are compared. If there exists relatively little difference between them as per Eq. (1), the change is considered to be consistent. Next, the change in the mask areas between the previous and current frames is examined. If both the left and right area changes, that is \( |S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| \) and \( |S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| \), show a consistent change between the current and previous images as per Eq. (2), the current left and right hand masks are merged and updated as per Eq. (3). If one of the areas exhibits an inconsistent change as per Eq. (4) or (6), the update in Eq. (5) or (7) is performed; if neither side changes consistently, the previous mask is retained as per Eq. (8). If there exists a large difference between the current left and right frames, the image side that does not change consistently, as per Eq. (9) or (11), is not used for the next time frame and only the consistent image side is used, as per Eq. (10) or (12). Otherwise, no update is done for the next time frame, as per Eq. (13).

$$ {\text{if }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i + 1}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{l}} ) $$
(1)
$$ {\text{if }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| < \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{r}} ) $$
(2)
$$ x_{i + 1} = x_{i + 1}^{\text{l}} + x_{i + 1}^{\text{r}} $$
(3)
$$ {\text{elseif }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| < \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| > \delta *S(x_{i + 1}^{\text{r}} ) $$
(4)
$$ x_{i + 1} = x_{i + 1}^{\text{l}} $$
(5)
$$ {\text{elseif }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| > \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{r}} ) $$
(6)
$$ x_{i + 1} = x_{i + 1}^{\text{r}} $$
(7)

else

$$ x_{i + 1} = x_{i} $$
(8)

else

$$ \begin{aligned} {\text{if }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| < \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| > \delta *S(x_{i + 1}^{\text{r}} ) \\ {\text{and }}S(x_{i + 1}^{\text{l}} ) < S(x_{i + 1}^{\text{r}} ) \end{aligned} $$
(9)
$$ x_{i + 1} = x_{i + 1}^{\text{l}} $$
(10)
$$ \begin{aligned} {\text{elseif }}|S(x_{i + 1}^{\text{l}} ) - S(x_{i}^{\text{l}} )| > \delta *S(x_{i + 1}^{\text{l}} ){\text{ and }}|S(x_{i + 1}^{\text{r}} ) - S(x_{i}^{\text{r}} )| < \delta *S(x_{i + 1}^{\text{r}} ) \\ {\text{and }}S(x_{i + 1}^{\text{l}} ) > S(x_{i + 1}^{\text{r}} ) \end{aligned} $$
(11)
$$ x_{i + 1} = x_{i + 1}^{\text{r}} $$
(12)

else

$$ x_{i + 1} = x_{i} $$
(13)
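The decision logic of Eqs. (1)–(13) can be transcribed into a single function operating only on mask areas. The sketch below is my own transcription, not the authors' code; it returns a label saying which mask(s) should be kept for the next frame.

```python
def select_mask(sl_cur, sl_prev, sr_cur, sr_prev, delta=0.3):
    """Apply the stereo consistency rules of Eqs. (1)-(13) to mask areas."""
    l_ok = abs(sl_cur - sl_prev) < delta * sl_cur   # left change consistent
    r_ok = abs(sr_cur - sr_prev) < delta * sr_cur   # right change consistent
    if abs(sl_cur - sr_cur) < delta * sl_cur:       # Eq. (1): views agree
        if l_ok and r_ok:
            return "merge"                          # Eqs. (2)-(3)
        if l_ok and not r_ok:
            return "left"                           # Eqs. (4)-(5)
        if not l_ok and r_ok:
            return "right"                          # Eqs. (6)-(7)
        return "previous"                           # Eq. (8)
    # Views disagree: keep only the side whose change is consistent.
    if l_ok and not r_ok and sl_cur < sr_cur:
        return "left"                               # Eqs. (9)-(10)
    if not l_ok and r_ok and sl_cur > sr_cur:
        return "right"                              # Eqs. (11)-(12)
    return "previous"                               # Eq. (13)
```

For example, with δ = 0.3, areas (100, 95) on the left and (110, 105) on the right are consistent in all respects and yield "merge", whereas a right mask that suddenly triples in area falls back to the left mask alone.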

2.2.3 Hand contour detection

For gestures that rely on the hand contour, the contour must be obtained in a computationally efficient manner. Many contour detection techniques have been discussed in the literature, involving the following four major approaches: prior knowledge [22], morphology [23], level set [24], and active contour model [25]. In our system, the morphology approach is adopted due to its computational efficiency.
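One common morphological way to obtain a contour is sketched below; this is an illustrative example of the morphology approach in general, not necessarily the authors' exact operator: the binary hand mask is eroded with a 3×3 structuring element and the eroded mask is subtracted, leaving only the one-pixel-wide boundary.

```python
def erode(mask):
    """Binary erosion with a full 3x3 structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # A pixel survives only if its whole 3x3 neighbourhood is set.
            out[r][c] = int(all(mask[r + dr][c + dc]
                                for dr in (-1, 0, 1) for dc in (-1, 0, 1)))
    return out

def contour(mask):
    """Morphological boundary: original mask minus its erosion."""
    er = erode(mask)
    return [[mask[r][c] - er[r][c] for c in range(len(mask[0]))]
            for r in range(len(mask))]

# A filled 4x4 square inside a 6x6 image: its contour is the square's border.
mask = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)]
        for r in range(6)]
edge = contour(mask)
```

Both passes are simple local operations, which is why the morphology approach meets the real-time budget better than level sets or active contours.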

2.3 Hand gesture recognition

As pointed out earlier, two types of hand gestures are considered in our system. The first type is motional gestures, consisting of seven directional hand gestures: rotation, forward, backward, left, right, up, and down. The second type is finger spelling, consisting of the six numbers zero through five. The first type of hand gesture is recognized via the dynamic time warping technique, while the second type is recognized via the convex hull technique.

2.3.1 Motional gesture recognition

As was reported in [26], although disparity can provide the hand depth information using a stereo camera, it loses its sensitivity when the hand is held far from the camera. Here, for forward and backward movements, the contour area variance is used due to its simplicity. For the other motional hand gestures, the Dynamic Time Warping (DTW) algorithm is used, as this algorithm is capable of generating the dynamic distance between an unknown gesture signal and a set of reference gesture signals in a computationally efficient manner while coping with different speeds of motional hand gestures. The sample gesture signal comes from the seed point of the CamShift tracking. The details of the DTW algorithm are discussed in [27, 28].
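The core DTW computation can be sketched in its standard textbook form (the paper's exact variant in [27, 28] may differ): a captured motion signal is compared against a reference signal recorded at a possibly different speed, and the warping absorbs the speed difference.

```python
def dtw(a, b):
    """DTW distance between two 1-D signals with |.| as the local cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of insertion, deletion and match at this cell.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

ref = [0, 1, 2, 3, 2, 1, 0]                         # reference trajectory
slow = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]   # same shape, half speed
other = [0, -1, -2, -3, -2, -1, 0]                  # opposite direction
```

Here `dtw(ref, slow)` is zero even though the signals have different lengths, which is exactly the speed invariance the system relies on, while the opposite-direction signal scores much worse.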

A tree structure is used to indicate the priority level of the hand gestures (see Fig. 4). The circumscribed angle of the seed point in the CamShift tracking serves as the motional signal of the "rotation" gesture, which is placed at the highest priority level, followed by the "forward" and "backward" gestures, for which the disparity of the seed point is used as the motional signal. An actual sample "rotation" signal and its corresponding reference signal are shown in Fig. 5. The gradient or position difference of the seed point between consecutive frames serves as the motional signal of the "left" gesture. An actual sample "left" signal and its corresponding reference signal are shown in Fig. 6.
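The priority tree of Fig. 4 can be read as an ordered dispatch: candidate gestures are tested from highest to lowest priority, and the first one whose motion signal scores below its acceptance threshold wins. The sketch below illustrates that reading only; the gesture names come from the paper, but the score/threshold mechanism and all values are placeholders of my own.

```python
def classify(scores, priority, thresholds):
    """scores: gesture -> DTW distance to its reference; smaller is better."""
    for gesture in priority:                 # highest priority first
        if scores.get(gesture, float("inf")) < thresholds[gesture]:
            return gesture
    return "none"                            # no gesture accepted

priority = ["rotation", "forward", "backward", "left", "right", "up", "down"]
thresholds = {g: 5.0 for g in priority}      # placeholder acceptance levels
```

With this ordering, a frame sequence that weakly matches "left" but strongly matches "rotation" is still reported as "rotation", since rotation sits higher in the tree.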

Fig. 4

Priority level of the hand gestures

Fig. 5

A sample “rotation” signal and corresponding reference signal

Fig. 6

A sample “left” signal and corresponding reference signal

2.3.2 Finger spelling recognition

Finger spell recognition is done by first performing hand contour extraction. Then, the convex hull of the detected contour is used to determine the number of finger tips, as reported in [29]. For numbers 2 through 5, recognition is easily achieved from the number of defects of the convex hull: if there are n convexity defects in a hand contour, there are n + 1 fingers. However, this rule does not hold for numbers 0 and 1. The area of the convex hull is thus used instead. As per Eq. (14), the area of the convex hull \( S_{\text{convex}} \) is compared with the contour area \( S_{\text{contour}} \),

$$ {\text{Number}} = \left\{ {\begin{array}{*{20}c} 0 & {S_{\text{convex}} \le \alpha *S_{\text{contour}} } \\ 1 & {\text{otherwise}} \\ \end{array} } \right. $$
(14)

where α denotes a parameter related to the camera distance range where gestures are made. Finger spell recognition requires a relatively high resolution of the hand contour. If the hand is too far from the camera, the recognition suffers due to images having low resolution. If the hand is too close to the camera, the recognition also suffers, this time due to too much variation in the hand contour. Our experiments indicated that the following α's provided relatively consistent outcomes: α = 10 % for a 15–35 cm camera distance range and 20 % for a 10–15 cm camera distance range. Sample finger spell contours and recognized numbers are shown in Fig. 7.
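The counting rule and Eq. (14) combine into a small classifier, written out below as my own sketch (in practice the defects would come from convexity-defect analysis of the extracted contour, e.g. OpenCV-style, which is not shown here):

```python
def finger_number(n_defects, s_convex, s_contour, alpha=0.10):
    """n_defects: convexity defects between raised fingers.
    alpha: distance-range parameter from Eq. (14)."""
    if n_defects >= 1:
        return n_defects + 1        # n defects imply n + 1 fingers (2..5)
    # Zero defects: distinguish a fist (0) from one finger via Eq. (14).
    return 0 if s_convex <= alpha * s_contour else 1
```

For example, three defects yield the number four regardless of the areas, while a defect-free contour is classified by the area comparison of Eq. (14) alone.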

Fig. 7

Finger spelling contours and recognized numbers


3 Recognition results and real-time processing

Examples are provided to show the increase in robustness when the information from the two cameras is merged versus using a single camera. The algorithm was written in C and run on a PC with a dual core 2.1 GHz processor. The input images were captured with the Novo Minoru stereo webcam, a relatively inexpensive device generating low-resolution images of size 640 × 480. Additional stereo images were examined using a Fuji stereo digital camera. Table 1 gives a comparison of the hand detection outcome using a single image versus a pair of stereo images under different lighting conditions. As can be seen from this table, both the single and stereo image approaches achieved a frame rate of 31–32 frames per second (fps), or about 30 ms per frame. However, as shown in Table 1, by merging the information from the left and right images, the average percentage detection rate was considerably improved (by nearly 60 %).

Table 1 Comparison of hand detection rates when using single images versus pairs of stereo images

In our experiments, 50 cases of each movement were considered under various lighting and background conditions. Tables 2 and 3 provide the recognition confusion matrices when using single images versus when using stereo images. As can be seen from Table 3, by combining the information from the left and right cameras, in particular for backward and forward motions, the overall recognition rate was significantly improved. For motional hand gestures, the average recognition rate reached 93 % when using stereo images, as compared to 66 % when using single images. Notice that there was a 4 % no-detection rate, caused by the low resolution of the captured hand images.

Table 2 Motional hand gesture recognition confusion matrix when using single images
Table 3 Motional hand gesture recognition confusion matrix when using pairs of stereo images

The finger spell recognition comparison outcome when using single images versus when using stereo images is provided in Tables 4 and 5. For finger spell recognition, the average recognition rate reached 92 % when using stereo images as compared to 62 % when using single images.

Table 4 Finger spelling recognition confusion matrix when using single images
Table 5 Finger spell recognition confusion matrix when using pairs of stereo images

In Table 6, the processing times of all the components in our real-time approach are listed, totaling approximately 30 ms per frame. Note that the online color calibration took 1 s but is not included in the table since it is done only once, at the beginning when the system is turned on.

Table 6 Average and standard deviation processing times of the components of our introduced approach

In a different set of experiments, our introduced approach was compared to two existing approaches in the literature that have been shown to provide high recognition rates, namely optical flow and HMM.

Table 7 provides a comparison of our introduced approach to these approaches. As shown in this table, although the recognition rates between the three approaches were more or less comparable, our approach achieved a higher frame rate leading to a real-time throughput.

Table 7 Comparison of average and standard deviation recognition and frame rates between two existing approaches and our introduced approach

4 Conclusion

In this paper, a real-time and robust approach to hand gesture recognition based on a pair of stereo images has been introduced. It has been shown that by merging the information from the left and right images of a stereo image pair, robust hand detection is achieved, leading to high recognition rates for two types of hand gestures. An average recognition rate of 93 % for seven motional hand gestures and an average recognition rate of 92 % for finger spelling hand gestures have been obtained under realistic lighting conditions and against various backgrounds. A careful selection of existing computationally efficient approaches has led to a real-time processing rate of 30 fps on the PC platform using an inexpensive stereo webcam.