1 Introduction

The problem of efficient and accurate hand gesture recognition is theoretically interesting and challenging. Real-time hand gesture recognition allows users to interact with computers in more natural and intuitive ways, and is therefore widely used in virtual reality and computer games [1]. Conventional hand gesture recognition systems detect and segment hands using methods such as colored gloves [2] and skin color detection [3–7], both of which have advantages and drawbacks. Other systems detect and segment hands using marker-aided methods [2, 8], which are inconvenient compared with markerless, vision-based solutions. The key problem in gesture interaction is how to make hand gestures understandable to computers. Extra instruments or sensors, such as data gloves, make it easy to collect hand state information, but such equipment is expensive and inconvenient for users. Thus, markerless, vision-based hand gesture interaction has many appealing advantages.

Since Lindeberg [9] published his work on a scale-space framework for geometric feature detection, scale-space feature detection has been widely applied in object recognition, image processing, registration, etc. Bretzner et al. [3] and Fang et al. [10, 11] employ scale-space feature detection to find the blob and ridge structures of a hand, modeling the palm as a blob and the fingers as ridges. However, scale-space feature detection is too time-consuming for real-time applications. Moreover, it performs poorly against cluttered backgrounds, because background shapes similar to a palm or fingers can interfere with the detection results. Although Fang et al. [11] improve the detection method to reduce its computational cost for real-time applications, the accuracy of the recognition results decreases.

A significant amount of literature has been devoted to the problem of skin color detection and segmentation [3, 4, 7]. Lee et al. [7] use skin color segmentation for hand AR (Augmented Reality), which requires highly accurate hand contours. They employ a skin-color-based classifier with an adaptively learned skin color histogram, but this method needs many training samples. Argyros et al. [4] select a special color space to reduce the effects of background and illumination, and additionally propose a technique that avoids much of the burden of generating training data.

Depth imaging technology has advanced dramatically over the last few years. Depth cameras offer several advantages over traditional intensity sensors: they work in low light, are invariant to color and texture, and resolve silhouette ambiguities in pose [12]. Thanks to the recent arrival of inexpensive depth cameras, such as the Kinect sensor, new opportunities for gesture recognition have emerged. Although there are many successful applications in human body tracking [12] and face recognition [13], using the low-resolution depth map for hand gesture recognition remains a challenge. A robust hand gesture recognition system using the Kinect depth map has been developed and applied successfully [8]; however, the user needs to wear a black belt, which is inconvenient. Another work on 3D tracking of hand articulations using Kinect [14] models the hand well, but segments it using skin color, which is easily confused by the face, bare arms and skin-like objects. Real-time human pose recognition from single depth images is proposed in [12]: the authors predict 3D positions of body joints from a single depth image using training data, which proves the practical applicability of depth information. Another family of tracking and recognition methods is based on time-of-flight (ToF) cameras. Using a ToF camera, a system capable of recognizing gestures at the finger level in real time is constructed in [15–17]. A method for human full-body pose estimation from ToF camera images is presented in [18]; it tracks various full-body movements, including self-occlusions, and estimates 3D full-body poses with high accuracy. However, ToF-based methods struggle to produce accurate hand gesture recognition results because of their low resolution. The raw data from a depth sensor such as the Kinect contains numerous occlusions and uncovered areas due to the nature of the device and the environment; several studies inpaint the low-resolution depth image to obtain a qualified depth map [19, 20].

This paper presents a novel method that segments the hand precisely based on depth information, without any marker. With the help of depth map filtering, a qualified hand contour is available in real time. We then propose a new, robust hand recognition method based on approximate convex shape decomposition, which is useful in many graphics and vision tasks. Unlike conventional convex shape decomposition, which is time-consuming, our method is designed for real-time applications. Using this decomposition, the hand is split into a palm and fingers, which are then used for gesture recognition. Fingertips are detected with a simple method applied to the finger shapes obtained from the decomposition. We also provide a hand skeleton extraction method, which is successfully used for both single-hand and two-hand gesture recognition. A simple hand gesture dataset is collected to test the efficiency and accuracy of our method. Initial results of this work were presented in [21]; the improved experimental results reveal that the proposed method is very efficient.

The rest of the paper is organized as follows. In Section 2 we present the method for hand detection and segmentation. Section 3 explains the hand shape decomposition and representation approaches in detail. In Section 4, the proposed two-hand gesture recognition method is demonstrated. Experiments and test results for hand shape decomposition and hand gesture recognition are presented in Section 5. Section 6 presents the main conclusions of this work.

2 Hand Detection and Segmentation

The proposed hand detection scheme consists of three steps: foreground segmentation, palm localization and hand segmentation. We make several assumptions about the hand gesture. First, the hand is the nearest object to the camera. Second, the distance between the hand and the camera lies in the range [0.5, 3.5] m. Third, the angles between the palm plane and the camera plane are constrained by \(-20^{\circ}\leqslant \alpha_{x} \leqslant 20^{\circ}, -20^{\circ}\leqslant \alpha_{y} \leqslant 20^{\circ}, -180^{\circ}\leqslant \alpha _{z} \leqslant 180^{\circ}\), where \((\alpha _{x}, \alpha _{y}, \alpha _{z})\) are the three rotation angles between the two planes. The scheme starts by thresholding the depth frame to obtain the foreground F, given by:

$$ F=\left \{ \left ( p,z \left ( p \right ) \right )|z\left ( p \right )< z_{0}+z_{D} \right \}, $$
(1)

where \((p, z(p))\) denotes the pixel in the depth image at coordinate p with depth value \(z(p)\), \(z_{0}\) is the minimal value of the depth image and \(z_{D}\) is a threshold. We set \(z_{D} = 100\) mm to ensure that the whole hand region is extracted from the depth frame. The detection result for the rough hand region is shown in Fig. 1a and b. To detect a more precise hand shape, we define two further thresholds \(d_{1}\) and \(d_{2}\), with \(d_{1}+d_{2}=z_{D}\), as shown in Fig. 1c. Experimentally, we set \(d_{1} = 70\) mm and \(d_{2} = 30\) mm. The rough hand region is then segmented into the two parts shown in Fig. 1d. A Distance Transform is applied to compute the distance map of each part (Fig. 1e), and the point with the maximum distance is selected as the center of each part. We define \(R_{in}\left ( x,y \right )\) and \(R_{out}\left ( x,y \right )\) as the input and output hand regions, and \(l\left ( x,y \right )\) as the cut line function. The accurate hand region is computed using the following rule:

$$ R_{out}\left ( x, y \right ) = l\left ( x,y \right )<0\cap R_{in}\left ( x,y \right ). $$
(2)

In detail, \(R_{in}\left ( x,y \right )\) denotes the hand region before segmentation (the coarse hand region); \(R_{out}\left ( x,y \right )\) is the accurate hand region; and \(l\left ( x,y \right )<0\) is the half plane cut off by l. Hence \(R_{out}\left ( x,y \right )\) is the overlap of \(l\left ( x,y \right )<0\) with \(R_{in}\left ( x,y \right )\), giving the more precise hand region shown in Fig. 1g. The cut line l is perpendicular to the line segment \(\mathbf {c_{0}c_{1}}\) and splits it into two parts; experimentally, the intersection point of l and \(\mathbf {c_{0}c_{1}}\) is set to the midpoint of \(\mathbf {c_{0}c_{1}}\). Thus, given the two points \(c_0\) and \(c_1\), the cut line l is easy to compute. In Fig. 1f, \(c_0\) and \(c_1\) mark the centers of the two parts, and the line perpendicular to \(\mathbf {c_{0}c_{1}}\) is the cut line l.
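For illustration, the thresholding of Eq. (1), the two-part split and the cut-line rule of Eq. (2) could be sketched with OpenCV roughly as follows. This is a minimal sketch, assuming a uint16 depth map in millimeters with 0 marking invalid pixels; all names and defaults are illustrative, not the exact implementation.

```python
import cv2
import numpy as np

def segment_hand(depth, z_d=100, d1=70):
    """Rough-to-fine hand segmentation sketch (assumed uint16 depth in mm,
    0 = invalid pixel)."""
    valid = depth > 0
    z0 = depth[valid].min()                              # nearest point, assumed on the hand
    fg = (valid & (depth < z0 + z_d)).astype(np.uint8)   # Eq. (1): rough hand region

    near = (fg > 0) & (depth < z0 + d1)                  # slice within d1 of the nearest point
    far = (fg > 0) & ~near                               # remaining slice of thickness d2 = z_d - d1

    centers = []
    for part in (near, far):
        dist = cv2.distanceTransform(part.astype(np.uint8), cv2.DIST_L2, 5)
        centers.append(np.unravel_index(np.argmax(dist), dist.shape))  # max-distance point
    (y0, x0), (y1, x1) = centers                         # c0 (assumed palm side) and c1

    # Cut line l: perpendicular to c0c1 through its midpoint; keep the half plane
    # containing c0, intersected with the rough region (Eq. (2)).
    ys, xs = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
    l = (xs - (x0 + x1) / 2) * (x1 - x0) + (ys - (y0 + y1) / 2) * (y1 - y0)
    return ((l < 0) & (fg > 0)).astype(np.uint8), (y0, x0)
```

The sign convention is chosen so that \(l<0\) is the half plane containing \(c_0\); evaluating l at \(c_0\) gives a strictly negative value, matching Eq. (2).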

Figure 1

Graphical illustration of the proposed hand detection and segmentation method. a The rough hand region; b The binary region of (a); c The segmentation threshold \(d_1\) and \(d_2\); d The two parts of the rough hand region; e Distance Transform of d. In f, \(c_{0}\) and \(c_{1}\) demonstrate the centers of the two parts. The line perpendicular to \(\mathbf {c_{0}c_{1}}\) is the cut line. g presents the final hand shape after cutting.

Due to the nature of the depth sensor, the hand region in the depth map may have holes and cracks, which seriously affect the accuracy of hand shape decomposition. Although inpainting and filtering methods [20, 22, 23] can produce better results, those algorithms are usually too complex for real-time applications. We therefore employ a few simple morphological operations to achieve a qualified result.
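As a minimal sketch, the cleanup could be a morphological closing followed by an opening; the 5×5 elliptical kernel is an assumed default to be tuned per sensor.

```python
import cv2

def clean_mask(hand_mask):
    """Fill small holes/cracks in the binary hand mask (closing) and remove
    speckle noise (opening); the kernel size is an assumed default."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    closed = cv2.morphologyEx(hand_mask, cv2.MORPH_CLOSE, kernel)
    return cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
```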

3 Hand Shape Decomposition and Representation

Shape decomposition and representation is very useful in shape analysis, shape matching, topology extraction, collision detection and other geometric processing methods that employ divide-and-conquer strategies [24]. Lien et al. [25] propose methods to decompose polygons into approximately convex parts, usually yielding a small number of parts. Mi et al. [26] decompose shapes taking relativity into account to determine part boundaries, achieving better results. These methods, however, are usually complicated and time-consuming.

3.1 Radius Based Convex Shape Decomposition

We now present the main idea of our convex shape decomposition. Our method is inspired by the convex shape decomposition of [24], which employs the Reeb graph and Morse functions to compute candidate cuts. However, their algorithm computes multiple Morse functions over a number of directions, which is inefficient. As proposed in [24], each decomposed part need not be strictly convex, so a parameter \(\varepsilon \) indicating the convexity tolerance of the decomposed parts is defined. Formally, for a shape S, R(S, ε) is a decomposition in which the concavity of every part is at most \(\varepsilon \): \(R\left ( S,\varepsilon \right )=\cup _{i=1}^{n}P_{i}\), \(\forall _{i\neq j}P_{i}\cap P_{j}= \emptyset \) and \(\forall _{i\leqslant n}Concavity\left ( P_{i} \right )\leqslant \varepsilon \), where n is the number of decomposed parts, \(P_{i}\) is a decomposed part and \(Concavity\left ( P_{i} \right )\) denotes its degree of concavity.
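The measurement of concavity is detailed below; as a simple stand-in, one common measure (assumed purely for illustration, and not the paper's projection-based one) is the maximum convexity-defect depth of a part's contour:

```python
import cv2

def concavity(contour):
    """One common concavity measure (an assumed stand-in): the deepest
    convexity defect, i.e. how far the boundary dips inside the convex hull.
    `contour` is an OpenCV contour (int32 array of shape (N, 1, 2))."""
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)
    if defects is None:          # already convex
        return 0.0
    return defects[:, 0, 3].max() / 256.0   # OpenCV stores depth * 256
```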

\(Concavity\left ( P_{i} \right )\) is measured by projecting the shape contour under multiple Morse functions, obtained by changing the projection direction. As shown in Fig. 2a, a Morse function \(f: M\to \mathbb{R}\) is constructed using the height function. In Fig. 2b, the Reeb graph is determined by changes in the number of connected components of the level sets \(f^{-1}\). The Reeb graph has three nodes, reflecting partial topological information of the shape in Fig. 2a. However, because the topological information of a shape is assumed to be unknown, multiple Morse functions must be computed, which amounts to brute-force search. To better apply this method to hand shape decomposition, a new Morse function is proposed, shown in Fig. 3a: for every point p of the object, f(p) is the distance between p and the central point o, hence the name Radius Function. As with the height-based Morse function, its Reeb graph is shown in Fig. 3b. The radius-based Morse function is efficient for hand shape decomposition because only one Morse function needs to be computed once the central point o is specified. Its feasibility rests on the topological structure of the hand, which is known in advance: a hand consists of a palm and several fingers pointing outward around it, and the angle between any two fingers is less than \(\pi \).
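A minimal sketch of the radius-based Morse function: sweeping the level sets of \(f(p)=\left \| p-o \right \|\) and counting connected components traces the Reeb graph, and jumps in the count mark where fingers split off the palm. The banding into discrete levels is an illustrative choice.

```python
import cv2
import numpy as np

def radius_level_components(mask, center, n_levels=32):
    """Count connected components of each radius band of the binary hand
    mask around `center` (row, col); n_levels is an assumed discretization."""
    ys, xs = np.nonzero(mask)
    f = np.hypot(ys - center[0], xs - center[1])      # Morse function values
    levels = np.linspace(0, f.max(), n_levels + 1)
    counts = []
    for lo, hi in zip(levels[:-1], levels[1:]):
        band = np.zeros_like(mask, dtype=np.uint8)
        sel = (f >= lo) & (f < hi)
        band[ys[sel], xs[sel]] = 1
        n, _ = cv2.connectedComponents(band)          # n includes the background
        counts.append(n - 1)
    return counts  # e.g. 1, 1, ..., 5, ... once the level circle crosses the fingers
```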

Figure 2

a Height based Morse function of a hand gesture; b The Reeb Graph.

Figure 3

a Radius based Morse function of a hand gesture; b The Reeb Graph.

3.2 Candidate Cuts

To satisfy \(\forall _{i\leqslant n}Concavity\left ( P_{i} \right )\leqslant \varepsilon \), we employ candidate cuts that can separate any part \(S_{i}\) with \(Concavity\left ( S_{i} \right )> \varepsilon \). Such parts are found using the Reeb graph constructed from the radius-based Morse function f. The cuts between adjacent nodes of the Reeb graph are the candidate cuts. The n candidate cuts of a shape S form the candidate cut set \(C\left ( S \right )= \left \{cut_{1},\cdots ,cut_{n} \right \}\). The final decomposition consists of a subset of \(C\left ( S \right )\), denoted \(I\left ( S \right )\subseteq C\left ( S \right )\). A binary variable is assigned to each \(cut_{i}\) in \(C\left ( S \right )\):

$$ x_{i} = \left\{\begin{array}{ll} 1 & cut_{i}\in I\left (S \right ) \\ 0 & cut_{i}\notin I\left ( S \right ) \end{array}\right. $$
(3)

Thus \(\mathbf {x}=\left ( x_{1},x_{2},\cdots ,x_{n} \right )^{\mathrm {T}}\) is a binary vector indicating the selectivity of cuts from \(C\left ( S \right )\). Each is assigned a value to weight the cost of the cut, denoted by \(w\left ( cut_{i} \right )\). Define \(\mathbf {w}=\left ( w\left ( cut_{1} \right ),w\left ( cut_{2} \right ),\cdots ,w\left ( cut_{n} \right ) \right )^{\mathrm {T}}\), thus a decomposition problem is translated in to a integer linear programming. We use the same method proposed in [24] to solve the programming problem:

$$ \mathrm{min}\ \mathbf{w}^{\mathrm{T}}\mathbf{x}\qquad x_{i}\in \left \{ 0,1 \right \} $$
(4)

For a given cut \(cut_{i}\), \(w\left ( cut_{i} \right )\) is defined by (5):

$$ w\left ( cut_{i} \right )=\frac{length\left ( cut_{i} \right )}{dist\left ( cut_{i},o \right )-r} $$
(5)

where \(length\left ( cut_{i} \right )\) is the length of \(cut_{i}\), \(dist\left ( cut_{i},o \right )\) is the distance between the central point o and \(cut_{i}\), and r is the palm radius calculated by the Distance Transform of the shape. The central point o is specified when the accurate hand region is segmented. As shown in Fig. 1f, \(c_{0}\) and \(c_{1}\) are the two centers, so o is calculated as follows:

$$\mathbf{c}_{0}\mathbf{o}=\lambda \cdot \mathbf{c}_{0}\mathbf{c}_{1}\qquad0< \lambda < 1 $$
(6)

The procedure of the radius-based convex shape decomposition method is summarized as Algorithm 1.
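For illustration, the weight of Eq. (5) and the cut selection of Eq. (4) could be sketched as below. The constraint structure is an assumption on our part: following the spirit of [24], each concave feature whose concavity exceeds \(\varepsilon \) must be resolved by at least one selected cut, encoded here as a hypothetical 0/1 coverage matrix.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def cut_weight(length, dist_to_center, palm_radius):
    """Eq. (5): cheap cuts are short and close to the palm boundary."""
    return length / (dist_to_center - palm_radius)

def select_cuts(weights, covers):
    """Pick the cheapest subset of candidate cuts (Eq. (4)). `covers` is an
    assumed 0/1 matrix: covers[k, i] = 1 if cut_i resolves the k-th concave
    feature exceeding epsilon; each feature needs at least one selected cut."""
    n = len(weights)
    res = milp(c=np.asarray(weights, dtype=float),
               constraints=LinearConstraint(covers, lb=1),  # A x >= 1
               integrality=np.ones(n),                       # x_i in {0, 1}
               bounds=Bounds(0, 1))
    if res.x is None:                                        # infeasible input
        return np.array([], dtype=int)
    return np.flatnonzero(res.x > 0.5)                       # indices of selected cuts
```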

3.3 Hand Shape Decomposition

The proposed shape decomposition method is useful in at least two applications. First, it is well suited to shape representation: after decomposition, every part is approximately convex, so each can be approximated by its convex hull, yielding a compact representation of the original object. Such a representation captures not only the important topological information but also the important geometric information of the original object. Second, it makes topology extraction easy: regarding each part as a node, with an edge between two nodes if and only if the parts are adjacent, yields a graph named the convex graph [24]. The convex graph captures all the important topological information of the shape, which is useful in pattern recognition.
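A minimal sketch of building the convex graph from the decomposed parts, assuming each part is given as a binary mask; adjacency is tested by overlapping a one-pixel dilation, which is an illustrative choice.

```python
import cv2
import numpy as np

def convex_graph(part_masks):
    """Convex graph sketch [24]: one node per decomposed part (binary uint8
    mask), an edge when two parts touch (dilated mask overlaps the other)."""
    kernel = np.ones((3, 3), np.uint8)
    grown = [cv2.dilate(m, kernel) for m in part_masks]
    edges = set()
    for i in range(len(part_masks)):
        for j in range(i + 1, len(part_masks)):
            if np.any(grown[i] & part_masks[j]):
                edges.add((i, j))
    return edges   # e.g. {(0, 1), (0, 2), ...}: palm node adjacent to finger nodes
```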

To use the above algorithm for hand shape decomposition, two parameters, \(\varepsilon \) and \(\lambda \), must be specified: \(\varepsilon \) is the shape concavity threshold, and \(\lambda \) specifies the central point o. Figure 4 shows decomposition results with different central points and concavity thresholds: Fig. 4b, c and d use the same threshold \(\varepsilon = 10\), while Fig. 4e uses \(\varepsilon =4\). The final parameters were chosen from a large number of experiments. With a proper set of parameters, hand shapes extracted from real environments with a depth camera are correctly separated, shown in different colors in Fig. 5c. In Fig. 5, row (a) shows the color images, used only for a better view of the hand gestures; row (b) shows the binary images obtained by the proposed depth-based hand detection and segmentation methods; row (d) presents the convex graphs of these hand shapes.

Figure 4

Shape decomposition results with different parameters (\(\lambda \), which specifies the central point o, and the shape concavity threshold \(\varepsilon \)). a The original shape contour. b, c, d, e Decomposition results with different parameters, where the black circle denotes the Morse function center.

Figure 5

Hand shape decomposition. a The color images of some hand gestures. The color images are only used for a better view of the decomposition results. b The binary hand maps obtained using the proposed hand detection and segmentation methods. c Hand shape decomposition results. d The convex graphs of these hand shapes.

3.4 Fingertips Detection

Fingertips are detected from the hand shape decomposition result, from which the fingers and palm are easy to identify: the decomposed part containing the hand center is the palm, and the other parts are fingers. For a finger shape \(S_{f}\), there is a corresponding cut \(cut_{i}\). The fingertip point \(t_{tip}\) is defined as:

$$t_{tip}=\underset{t_{j}\in S_{f}}{\arg \max }\ dist\left ( cut_{i},t_{j} \right ) $$
(7)

which means that the fingertip is the point of the finger shape at maximum distance from the cut line. We then define \(T\left ( S \right )\) as the fingertip set of a shape S. The validity of this method rests on the convexity of the finger shapes and the topological structure of the hand. Fingertip detection results are shown in Fig. 6, where the red contours are the recognized hand shape contours, the green circles are the palm centers and the black circles denote the detected fingertips in each hand shape.
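A minimal sketch of Eq. (7), assuming the finger contour and the cut endpoints are given as NumPy arrays of (x, y) coordinates:

```python
import numpy as np

def fingertip(finger_pts, cut_p, cut_q):
    """Eq. (7): the fingertip is the finger-contour point farthest from the
    cut line through cut_p and cut_q (all inputs float arrays of (x, y))."""
    d = cut_q - cut_p
    n = np.array([-d[1], d[0]], dtype=float)
    n /= np.linalg.norm(n)                    # unit normal of the cut line
    dist = np.abs((finger_pts - cut_p) @ n)   # point-to-line distances
    return finger_pts[np.argmax(dist)]
```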

Figure 6

Fingertips detection results. The black circles are the detected fingertips.

Each finger has a unique functional significance. From the radial side of the hand to the ulnar side, the fingers are ordered: thumb, index finger, middle finger, ring finger, little finger. With the fingertip detection above, the number of fingers is easy to obtain; if it is 5, we only need to identify the thumb or the little finger. In general, however, it is hard to identify individual fingers because some hand gesture shapes are approximately symmetric. The number and positions of the fingers should therefore be taken into account when defining hand gestures, to avoid ambiguity.

3.5 Skeleton Extraction

A skeleton can be viewed as a compact shape representation, in that the shape can be completely reconstructed from it [27]. Skeletons have been applied to tasks such as human motion tracking [28] and graph matching [29]. Although we have the shape decomposition result and its convex graph, they are not enough to recognize complex gestures, so the shape skeleton is a good additional representation. The proposed skeleton extraction method builds on the shape decomposition and fingertip detection results. For a hand gesture shape S, we define \(c_{b}\) as the base point of the shape: the intersection of the cut line l and the segment joining \(c_{0}\) and \(c_{1}\), shown in Fig. 1f. Connecting \(c_{b}\) and \(c_{0}\) gives the first skeleton fragment. From the final cut set I(S), it is easy to obtain the midpoint \(p_{i}\) of each cut segment \(cut_{i}\); the midpoint set of S is denoted H(S). The segment connecting each \(p_{i}\) to its corresponding fingertip is a skeleton fragment, and finally the hand center \(c_{0}\) is connected to each \(p_{i}\) in \(H\left (S \right )\). In this way a shape skeleton is simplified to a set of line segments. Adding a direction to each segment turns the skeleton into a vector set. Thus, for a shape S, the skeleton \(K\left ( S \right )\) is hierarchically defined as:

$$K\left ( S \right )=\left \{\mathbf{c_{b}c_{0}}\right \}\cup \left \{\mathbf{c_{0}p_{i}}|p_{i}\in H\left ( S \right ) \right \}\cup \left \{\mathbf{p_{i}t_{i}}|p_{i}\in H\left ( S \right ),t_{i}\in T\left ( S \right ) \right \}$$
(8)
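A minimal sketch of assembling the vector set of Eq. (8), assuming the cut midpoints and fingertips are supplied paired in matching order:

```python
import numpy as np

def build_skeleton(c_b, c0, midpoints, fingertips):
    """Eq. (8): wrist -> palm center, palm center -> each cut midpoint, and
    cut midpoint -> matching fingertip. All inputs are 2D points; midpoints
    and fingertips are assumed paired in order."""
    segs = [(np.asarray(c_b), np.asarray(c0))]
    for p, t in zip(midpoints, fingertips):
        segs.append((np.asarray(c0), np.asarray(p)))
        segs.append((np.asarray(p), np.asarray(t)))
    return segs   # directed segments: the skeleton as a vector set
```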

A few skeleton extraction results are shown in Fig. 7, where the black line segments plot the skeleton of each shape. With this skeleton representation, gesture recognition reduces to a distance measure between the gesture skeleton and predefined gesture template skeletons. In Fig. 8a, each of the three hand shapes is decomposed into 4 parts, yet they are defined as different gestures, so the skeleton distance is employed to distinguish them. To calculate the distance between two skeletons, we encode each vector in the skeleton vector set by its direction, achieving invariance to translation and scale; the distance is then easy to obtain. In the case shown in Fig. 8b, we assume the shapes signify the same gesture.
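The paper does not spell out the encoding, so the following sketch is only one plausible reading: each skeleton vector is reduced to its direction (invariant to translation and scale), and two codes are compared elementwise, with an assumed fixed penalty per unmatched segment.

```python
import numpy as np

def skeleton_code(vectors):
    """Encode a skeleton vector set by direction only (an illustrative
    encoding, not necessarily the paper's exact one); sorting makes the
    code independent of segment order."""
    return np.sort([np.arctan2(dy, dx) for (dx, dy) in vectors])

def skeleton_distance(code_a, code_b, mismatch=np.pi):
    """Compare two direction codes; differing segment counts (e.g. finger
    counts) pay an assumed fixed per-segment penalty."""
    m = min(len(code_a), len(code_b))
    d = np.abs(code_a[:m] - code_b[:m]).sum()
    return d + mismatch * abs(len(code_a) - len(code_b))
```

Recognition then assigns the input skeleton to the template with the smallest distance, which keeps the matcher extensible: adding a gesture only requires adding a template code.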

Figure 7

Skeleton extraction results. The skeleton of each hand shape is denoted as black line segments.

Figure 8

Some hand shapes with the same number of components but different skeletons.

4 Two-Hand Gesture Recognition

Two-hand gesture recognition is an extension of single-hand gesture recognition. Since the number of gestures a single hand can make is limited, two-hand gestures leave much more room for gesture design; moreover, using both hands is a more natural way for people to interact with computers. A two-hand gesture carries not only the gesture of each hand but also relative information such as the relative positions of the hands. We assume, first, that the two hands are the objects nearest to the camera; second, that the two hands do not overlap from the camera's viewpoint; and third, that the average depth difference between the two hands is less than a threshold \(z_{M}=30\) mm. Under these assumptions, the depth image is divided into two parts, each containing one hand region. In detail, the foreground F is given by:

$$ F=\left \{ \left ( p,z \left ( p \right ) \right )|z\left ( p \right )< z_{0}+z_{D}+z_{M} \right \} $$
(9)

which is similar to (1). Under the assumptions above, the foreground F contains two main parts. We find the central point of each part using the distance transform, and a cut line based on the two central points splits the original depth image into two parts. Finally, the proposed hand detection and segmentation method, combined with the hand shape decomposition approach, is applied to each part.
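A minimal sketch of the two-hand split, reusing the distance-transform centers and a perpendicular-bisector cut; it assumes a uint16 depth map in millimeters and at least two foreground blobs, and all names are illustrative.

```python
import cv2
import numpy as np

def split_two_hands(depth, z_d=100, z_m=30):
    """Eq. (9): threshold the foreground, locate the two largest blobs'
    centers via the distance transform, and split the image along the
    perpendicular bisector of the segment joining the centers."""
    valid = depth > 0
    z0 = depth[valid].min()
    fg = (valid & (depth < z0 + z_d + z_m)).astype(np.uint8)

    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    big = 1 + np.argsort(stats[1:, cv2.CC_STAT_AREA])[-2:]   # two largest blobs
    centers = []
    for lab in big:
        dist = cv2.distanceTransform((labels == lab).astype(np.uint8),
                                     cv2.DIST_L2, 5)
        centers.append(np.unravel_index(np.argmax(dist), dist.shape))
    (y0, x0), (y1, x1) = centers

    ys, xs = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
    side = (xs - (x0 + x1) / 2) * (x1 - x0) + (ys - (y0 + y1) / 2) * (y1 - y0)
    return fg * (side < 0), fg * (side >= 0)    # one hand region per half plane
```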

5 Experiments

5.1 Hand Segmentation

To evaluate the proposed hand segmentation method, the accuracy of hand shape decomposition is used as a proxy for segmentation quality, since high-quality segmentation yields accurate decomposition results. First, we test the two thresholds \(d_{1}\) and \(d_{2}\) (see Section 2); because \(d_{1}+d_{2}=100\) mm, only one of them needs to be varied. Figure 9a presents the accuracy as \(d_{1}\) changes. We then test the cut line l defined in Section 2: defining \(c_{x}\) as the intersection of l and \(\mathbf {c_{0}c_{1}}\), the ratio \(r=\frac {\left | \mathbf {c_{0}c_{x}} \right |}{\left | \mathbf {c_{0}c_{1}} \right |}\) determines the line l. The result is shown in Fig. 9b. The configuration \(d_{1}=70\) mm and \(r=0.5\) is the best choice and is retained in all further experiments. Note that r is fixed at 0.5 in Fig. 9a and \(d_{1}=70\) mm is used in Fig. 9b.

Figure 9

Hand segmentation evaluation. a Accuracy with changing \(d_{1}\). b Accuracy with changing ratio r.

5.2 Shape Decomposition and Skeleton Representation

To assess the validity of our approach, we use real-world depth image sequences obtained by a Kinect sensor to test the proposed hand shape decomposition and skeleton representation methods. Some test results are shown in Fig. 10, where the hands are correctly decomposed and the skeleton of each hand is drawn as black line segments. Two-hand gesture recognition from real-world depth images is then performed; a few indicative recognition results are shown in Fig. 11, in the same style as Fig. 10. The results demonstrate that hand shape decomposition and skeleton extraction perform as intended.

Figure 10

Single hand gesture test samples. For each hand gesture, the hand contour and skeleton are shown, and a hand shape decomposition result is combined with each hand.

Figure 11

Two-hand gesture test samples. For each two-hand gesture, hand contours and skeletons are shown, and a hand shape decomposition result is combined with each hand.

5.3 Quantitative Evaluation

We defined a simple hand gesture dataset, shown in Fig. 12, to test our method, and recorded more than 2000 frames in the experiments. We first evaluate the concavity threshold \(\varepsilon\): the recognition accuracy as the concavity changes is shown in Fig. 13. The configuration \(\varepsilon = 8\) mm is the best choice and is used in the other experiments. We then evaluate our recognition method with this best configuration. Table 1 shows the detailed results of the hand gesture recognition experiments (we use 'b-right', 'b-left', 't-right' and 't-left' as abbreviations of 'bottom-right', 'bottom-left', 'top-right' and 'top-left'). The average recognition accuracy in this experiment is about 0.912. The lower accuracy of the 'top-left' and 'top-right' gestures is caused by the difficulty of posing them: the two angles \(\alpha_{x}\) and \(\alpha_{y}\) between the hand and the camera plane become large, which significantly degrades performance. Qualitatively, \(\alpha_{z}\) does not affect performance, since the projected shape is constant as \(\alpha_{z}\) changes; we therefore evaluate only \(\alpha_{x}\) and \(\alpha_{y}\), using synthetic data. The results are shown in Fig. 14, where the optimal range of \(\alpha_{x}\) and \(\alpha_{y}\) is from −20° to 20°. Changing \(\alpha_{x}\) and \(\alpha_{y}\) simultaneously is not tested, since it is qualitatively similar to changing each separately.

Figure 12

Gesture definition.

Figure 13

Recognition accuracy with changing concavity.

Figure 14

Recognition accuracy with changing \(\alpha_{x}\) and \(\alpha_{y}\) separately.

Table 1 Recognition results of the gestures for single hand performance.

The effect of varying the distance between the hand and the depth sensor is considered in Fig. 15; these experiments are performed on real-world sequences. The accuracy increases with depth until the average depth reaches about 1.5 m, then declines increasingly quickly. From the plot, the effective distance ranges from 0.5 m to 3.5 m, and the optimal distance from 0.5 m to 2.5 m, within which the accuracy exceeds 0.9.

Figure 15

Recognition accuracy with changing distance.

5.4 Comparisons

5.4.1 Comparison to Geometry-based Method

We also compare the proposed method with previous work. These experiments are based on the dataset provided in [8], which includes 10 gestures, denoted '1', '2', \(\cdots \), '10', with 100 cases per gesture. Although each case consists of a color image and a depth map, we use only the depth map, which is all our method requires. The confusion matrix of the test results on this dataset using the proposed method is shown in Fig. 16; the mean accuracy is about 91.9 %. Compared with [8], our method is much more efficient for real-time applications, as shown in Table 2.

Figure 16

The confusion matrix of the testing results on the dataset from [8].

Table 2 Comparative testing results of the method in [8] and proposed method.

5.4.2 Comparison to Classification-based Method

A ToF camera provides range data similar to the depth image of the Kinect sensor. In [17], hand features are extracted from range data after segmentation and used for training and classification. We employ real-world depth sequences from a Kinect sensor to test the gestures used in [17] (denoted by IDs 1 to 9: EnumOne, EnumTwo, EnumThree, EnumFour, EnumFive, Stop, Fist, OkLeft, OkRight); the results are shown in Fig. 17, with 100 cases per gesture. The overall accuracy is about 0.941, which is very close to the accuracy of 0.939 reported in [17]. However, their method needs a training process, which is more complex. In general, classification-based methods employ appearance features for training and classification, so they are not easy to extend; in contrast, our system can add new gestures simply by providing template gesture skeletons.

Figure 17

The confusion matrix of the testing results with gestures used in [17].

5.4.3 Comparison to Color-based Method

Because there is no ground-truth data for comparing the proposed depth-based method with color-based methods, depth and color sequences generated synchronously by a Kinect sensor are used in further experiments. We implement the method proposed in [10], which integrates scale-space feature detection into gesture recognition, and use the six gestures in Table 3 for quantitative evaluation. Table 3 shows that we achieve an accuracy of 0.938, compared with 0.903 for the method in [10].

Table 3 Comparative recognition results of the method in [10] and proposed method.

6 Conclusions

Real-time markerless hand gesture recognition has a wide range of applications, such as virtual interaction, robot control and other interactive applications. In this paper, we have proposed an efficient method for hand gesture recognition. Hand shapes are detected and segmented from low-resolution depth images obtained from a depth sensor, and a radius-based convex shape decomposition method is introduced to decompose the hand shapes. With the decomposition result, fingertips are easily detected. The shape decomposition and fingertip detection, combined with the skeleton extraction, address the accuracy and efficiency problems of hand gesture recognition to a certain extent. Extensive experimental results demonstrate the accuracy, efficiency and robustness of our method.