1 Introduction

Among the various gesture recognition methodologies using the body, face, or hands, hand-based gesture or sign recognition is one of the most efficient and natural approaches for human-machine interfaces. Potential applications of hand gesture or sign recognition include human-computer interfaces, automatic sign language interpretation [8], remote control of electronic appliances [20, 26], and smart robotics [14]. A key function in these applications is recognizing hand poses, movements, or both, which convey meaningful information about users' intentions and instructions to machines. Among the recognition of hand signs such as letters, numbers, and symbols, hand number gesture recognition is an important task for interfacing between humans and machines. Although various approaches have been proposed for hand number gesture recognition, they can be divided into two commonly used categories: inertial sensor glove-based approaches and vision-based approaches [14].

The sensor glove-based approach uses an array of sensors attached to a hand glove that transduces finger and hand actions into electrical signals for recognizing hand number gestures. In [5], a sensor glove was used to measure the degree of flexion of each finger, and a tri-axis accelerometer placed on the back of the hand provided its orientation; a neural network was then employed to recognize hand number gestures. In [4, 24], six accelerometers on a hand glove measured the relative angles between the fingers and the palm, and each number gesture was recognized by a fuzzy rule-based classification system. Another method [15] introduced a 3D input device with inertial sensors to recognize number-drawing gestures in 3D space: angular velocity and acceleration signals served as motion features, and Fisher discriminant analysis was used for recognition. Although these studies reported some success in hand number gesture recognition, the methods require wearing a glove and positioning sensors on the glove or hand. Therefore, motion sensor- or glove-based hand gesture recognition approaches are generally not considered natural or convenient, despite their high sensitivity [6].

The vision-based approach uses imaging sensors or cameras to obtain hand gesture features such as color, shape, orientation, texture, contours, moments, or skeletons [9, 14]. In [2], the authors proposed a method to recognize the hand number gestures of Persian sign language from 1 to 10 using the skeleton information of hand silhouettes; the endpoints of the skeleton were extracted as fingertips for recognizing the ten gestures. Similarly, in [25], the authors developed a hand gesture recognition system using hand orientation and contour features with Hidden Markov Models. Another method [11, 23] used a color glove with a different color imprinted on each of the five fingers; finger recognition was then performed based on color information to classify hand number gestures. However, these studies achieved only limited success in hand number gesture recognition with color images, since color images and their features are generally sensitive to noise, lighting conditions, cluttered backgrounds, and especially occlusions.

Recently, with the introduction of new depth imaging sensors, some studies have focused on using depth features for hand gesture recognition [22]. In [13], a method was presented for recognizing hand number gestures by template matching with the Chamfer distance, measuring shape similarity based on orientation, size, shape, and position features from depth images. In [7], hand gesture recognition combined two types of hand pose features, namely distance and curvature features computed from the hand contours, and a support vector machine classifier was employed to recognize hand sign gestures. In these approaches, the same or similar features as those of color images are extracted from depth images, resulting in only marginally improved recognition rates. In another method [19], the fingers were detected by hand shape decomposition, and hand number gestures were recognized via template matching by minimizing a dissimilarity distance using the proposed Finger-Earth Mover's Distance metric. However, this approach suffered from ambiguity in the height of the thumb, which is often shorter than the other fingers in hand poses, when using a height threshold for finger detection, resulting in a reduced recognition rate. In general, these previous depth image-based approaches lack information about individual finger parts, despite the advantages of depth images such as low sensitivity to lighting conditions.

In this study, we have developed a novel approach to recognizing hand number gestures by recognizing, or labeling, hand parts in depth images. Our proposed approach consists of two main processes: hand parts recognition by a random forests (RFs) classifier and rule-based hand number gesture recognition. The main advantage of our approach is that the state of each finger is directly identified through the recognized hand parts, and the number gestures are then recognized based on the state of each finger. We have tested and validated our approach on synthetic and real data. Our experiments achieved a 97.8 % recognition rate over ten hand number gestures with five different subjects. Our hand number gesture recognition system should be useful in human-machine interaction applications.

The rest of the paper is organized as follows. In Section 2, we describe the processes of our proposed methodology, including hand parts recognition and hand number gesture recognition. For hand parts recognition, we present the techniques of synthetic hand database (DB) creation and per-pixel training and classification via RFs. For hand number gesture recognition, we describe the recognition steps. Section 3 presents the parameter optimizations and obtained results. Section 4 presents our conclusions and discussion.

Fig. 1 The framework of our hand number gesture recognition system

2 The proposed methodology

The overall process of our proposed system for hand number gesture recognition, shown in Fig. 1, consists of two main parts. In the first part, a synthetic hand DB, which contains pairs of depth maps and their corresponding hand parts-labeled maps, was generated; the DB was then used to train the RFs. In the recognition stage, a depth image was first captured from a depth camera, and a hand depth silhouette was extracted by removing the background. The hand parts of the depth silhouette were then recognized using the trained RFs, and a set of features was extracted from the labeled hand parts. Finally, based on the extracted features, hand number gestures were recognized by our rule-based approach.

The ten hand number gestures representing the digits 0 to 9 are shown in Fig. 2. Our final goal is to recognize these gestures using the information on each finger part recognized by our proposed methodology.

Fig. 2 Our target hand number gestures, representing numbers from 0 to 9

2.1 Hand parts recognition

In this section, we describe our methodology for hand parts recognition based on per-pixel classification of a hand depth silhouette via RFs. The details of hand parts recognition are presented in the following subsections.

2.1.1 Synthetic hand DB generation

To recognize hand parts from a hand depth silhouette via RFs, a synthetic hand DB containing pairs of depth images and their corresponding hand parts-labeled maps is needed to train the RFs. We created such a DB with a synthetic hand model using 3Ds-max, a commercial 3D graphics package [1], as shown in Fig. 3a. To identify hand parts, twelve labels were assigned to the hand model, as shown in Fig. 3b, c. The five fingers, namely the thumb, index, middle, ring, and pinkie, were represented by ten labels covering the front and back sides of the fingers: the front sides were coded with the index numbers 2, 3, 4, 5, and 6, and the back sides with the index numbers 7, 8, 9, 10, and 11, respectively. This distinction is critical in our recognition system for identifying the open and closed state of each finger. In addition, the palm and wrist parts were given the index numbers 1 and 12, respectively. Our DB comprised 5,000 pairs covering the ten hand number gestures shown in Fig. 2: a set of 3,000 pairs was used for training and the remaining 2,000 pairs for testing. Each hand number gesture was represented by a set of 500 pairs captured under many different viewing angles, with orientations ranging from −45 to +45 degrees relative to the vertical orientation. Samples of the 3-D hand model and the corresponding labeled hand parts maps are shown in Fig. 3. The images in the DB had a size of 240 × 320 pixels with 16-bit depth values.
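For reference in the sketches that follow, the label encoding just described can be written down directly. The constant names below are our own illustrative choices, not from the original implementation; the index values follow the text.

```python
# Hand-part label indices: 1 = palm, 2-6 = finger fronts (open),
# 7-11 = finger backs (closed), 12 = wrist.
PALM = 1
FRONT = {"thumb": 2, "index": 3, "middle": 4, "ring": 5, "pinkie": 6}
BACK  = {"thumb": 7, "index": 8, "middle": 9, "ring": 10, "pinkie": 11}
WRIST = 12
```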

Fig. 3 Hand model: a 3-D hand model in 3Ds-max and b, c hand parts with twelve labels

2.1.2 Depth feature extraction

To extract depth features f for a pixel p of a depth silhouette I, as described in [12, 21], we computed a set of depth features of the pixel p based on the difference between the depth values of a pair of neighborhood pixels in I. The positions of the pixel pairs were selected randomly on the depth silhouette; they are related to the position of the considered pixel p through two offset vectors \(o_1\) and \(o_2\), each with x and y components. The depth features f are computed as follows:

$$ f(I,p)\,=\,I_{d}\left(x_{p}\,+\,\frac{o_{1x}}{I_{d}(x_{p},y_{p})}, y_{p}\,+\,\frac{o_{1y}}{I_{d}(x_{p},y_{p})}\right)-I_{d}\left(x_{p}\,+\,\frac{o_{2x}}{I_{d}(x_{p},y_{p})}, y_{p}\,+\,\frac{o_{2y}}{I_{d}(x_{p},y_{p})}\right) $$
(1)

where \(I_d(x_p,y_p)\) is the depth value at the coordinates \((x_p,y_p)\) and \((o_1,o_2)\) is an offset pair. The maximum value of the \((o_1,o_2)\) offsets was 30 pixels, which corresponds to 0.6 meters, the distance of the subject from the camera. The normalization of the offsets by \(\frac{1}{I_d(x_p,y_p)}\) ensures that the features are distance invariant.
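As an illustration, a minimal Python sketch of the feature in (1) is given below. Boundary handling is our assumption, as the paper does not specify it; probes falling outside the image read as a large background depth.

```python
import numpy as np

def depth_feature(I, p, o1, o2, background=1e6):
    """Depth-difference feature of Eq. (1) at pixel p = (x, y).

    I is a depth map (H x W); o1 and o2 are (ox, oy) offset pairs.
    Dividing the offsets by the depth at p makes the feature distance
    invariant. Assumes p lies on the hand, so I[y, x] > 0.
    """
    x, y = p
    d = float(I[y, x])

    def probe(o):
        # Offset probe position, normalized by the depth at p
        u = int(round(x + o[0] / d))
        v = int(round(y + o[1] / d))
        if 0 <= v < I.shape[0] and 0 <= u < I.shape[1]:
            return float(I[v, u])
        return background  # assumption: out-of-image probes read as background

    return probe(o1) - probe(o2)
```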

2.1.3 RFs for pixel classification

In our work, we utilized RFs for hand parts recognition. RFs are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [3, 10, 12, 21]. These concepts are illustrated in Fig. 4. Figure 4a presents the learning process of a single decision tree as a tree predictor. The use of a multitude of decision trees for training and testing RFs on the same DB S is described in Fig. 4b. The sample sets \(\{S_i\}_{i=1}^{n}\) are drawn randomly from the training data S by the bootstrap algorithm [10].
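A minimal sketch of the bootstrap draw is shown below; this is illustrative, and RF libraries such as scikit-learn perform the resampling internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sets(S, n_trees):
    """Draw n_trees sample sets S_i from S with replacement."""
    m = len(S)
    return [S[rng.integers(0, m, size=m)] for _ in range(n_trees)]

S = np.arange(10)                    # stand-in training data
sets = bootstrap_sets(S, n_trees=3)  # one bootstrap set per tree
```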

Fig. 4 RFs for pixel-based classification: a a binary decision tree and b an ensemble of n decision trees for training RFs

In training, we used an ensemble of 21 decision trees with a maximum depth of 20. Each tree in the RFs was trained with different pixels sampled randomly from the DB: a subset of 500 training pixels was drawn randomly from each synthetic depth silhouette, and for each sample pixel, 800 candidate features were computed as in (1). At each splitting node of a tree, a subset of 28 candidate features was considered. For pixel classification, the candidate features were computed for each pixel p of a test depth silhouette. In each tree, starting from the root node, if the value of the splitting function was less than the node's threshold, p was passed to the left child; otherwise, it was passed to the right child. The optimal threshold for splitting each node was determined by maximizing the information gain during training. The probability distribution over the 12 hand parts was computed at the leaf nodes of each tree, and the final decision to label each depth pixel with a specific hand part was based on the vote of all trees in the RFs.
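The training configuration above can be sketched with scikit-learn's RandomForestClassifier as a stand-in; the paper does not specify its RF implementation, so this mapping of parameters is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=21,     # 21 decision trees in the ensemble
    max_depth=20,        # maximum tree depth
    max_features=28,     # candidate features considered per split node
    criterion="entropy", # split by information gain, as in the text
)

# X: (num_pixels, 800) matrix of depth-difference features from Eq. (1),
# with 500 pixels sampled per synthetic silhouette; y: part labels 1..12.
# forest.fit(X, y)
# part_labels = forest.predict(X_test)  # vote over all trees
```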

2.2 Hand number gesture recognition

In this part, we describe how hand number gestures are recognized from an incoming hand depth silhouette. Based on the recognized hand parts of the silhouette, a set of features was extracted for recognizing hand number gestures.

2.2.1 Hand detection

As shown in Fig. 5, we used the Creative Interactive Gesture Camera (www.intel.com/software/perceptua), a device capable of close-range depth data acquisition. The depth imaging parameters were set to an image size of 240 × 320 and a frame rate of 30 fps. The distance from the hand to the camera ranged from 0.15 m to 1.0 m, and the hand could be captured within a 70-degree field of view.

Fig. 5 Hand detection and segmentation: a the captured hand depth image and b its corresponding hand depth silhouette

To detect the hand area and remove the background, we used an adaptive depth threshold whose value was determined based on the distance from the depth camera to the hand. To extract hand depth silhouettes, we used the Intel Perceptual Computing SDK 2013 library, which supports the Creative Labs Senz3D Depth and Gesture Camera (www.intel.com/software/perceptua). The detected and segmented hand is shown in Fig. 5. From the detected hand depth silhouette, the hand orientation is determined by computing the silhouette's principal eigenvector, and the silhouette is rotated to the vertical orientation before being used for hand parts recognition.
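A minimal sketch of the segmentation and orientation estimation follows, assuming a fixed close-range depth window; the paper instead adapts the threshold to the measured hand distance.

```python
import numpy as np

def segment_and_orient(depth, near=0.15, far=0.6):
    """Depth-threshold hand segmentation plus orientation estimation.

    depth is an H x W map in meters; near/far are illustrative bounds.
    Assumes the hand is the only object inside the close-range window.
    """
    mask = (depth > near) & (depth < far)       # hand silhouette
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    cov = pts.T @ pts / len(pts)                # 2x2 covariance of pixel coords
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]      # principal eigenvector
    angle = np.degrees(np.arctan2(major[0], major[1]))  # tilt from vertical
    return mask, angle  # rotate the silhouette by -angle before recognition
```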

2.2.2 Hand parts recognition

To recognize the hand parts of each hand depth silhouette, all of its pixels were classified by the trained RFs, assigning each pixel a corresponding label out of the 12 indices. A centroid point was then computed for each recognized hand part to represent it, as illustrated in Fig. 6. Figure 6a shows a hand depth silhouette and Fig. 6b presents the resulting labeled hand parts with the indices 1, 2, 3, 9, 10, 11, and 12. According to Table 1, the recognized hand parts are the palm, thumb (front), index (front), middle (back), ring (back), pinkie (back), and wrist, and their centroid points are shown as large green dots.
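Computing a representative centroid per labeled part is straightforward; a minimal sketch:

```python
import numpy as np

def part_centroids(label_map, num_parts=12):
    """Centroid of each recognized hand part (labels 1..12).

    label_map holds the per-pixel RF output; parts absent from the
    silhouette are simply skipped.
    """
    centroids = {}
    for part in range(1, num_parts + 1):
        ys, xs = np.nonzero(label_map == part)
        if len(xs) > 0:
            centroids[part] = (xs.mean(), ys.mean())
    return centroids
```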

Fig. 6 An example of labeled hand parts recognition: a a hand depth silhouette and b the labeled hand parts

Table 1 A list of the named and labeled parts in the hand model

2.2.3 Feature extraction and derived rules for hand number gesture recognition

From the recognized hand parts, we extracted a set of features. In our labeling, each finger is represented by two different labels: one for its front side, corresponding to the open state of the finger, and another for its back side, corresponding to the closed state. From the recognized hand parts, we therefore identify the open or closed state of each finger. The states of the five labeled fingers were identified and saved as the features \(f_{Thumb}\), \(f_{Index}\), \(f_{Middle}\), \(f_{Ring}\), and \(f_{Pinkie}\), respectively:

$$ f_{Finger}(i) = \left\{ \begin{array}{ll} 1:Open & \text{for Label}=2,3,4,5,~ \text{or} ~6\\ 0:Close& \text{for Label}=7,8,9,10,~ \text{or} ~11\end{array} \right. $$
(2)

For example, as shown in Fig. 6b, \(f_{Thumb}\) and \(f_{Index}\) become 1, corresponding to the open state of those fingers, while \(f_{Middle}\), \(f_{Ring}\), and \(f_{Pinkie}\) become 0, corresponding to the closed state, per (2). To recognize hand number gestures, we derived a set of recognition rules: the set of five features encoding the states of all fingers was used to encode the meaning of the ten hand number poses. The derived recognition rules are given in Table 2. One problem in finger state identification is that, depending on the configuration of a finger, both its front and back labels may appear among the recognized finger parts. To resolve the binary state of such a finger, we used a maximum likelihood criterion, selecting the label whose pixels have the higher mean probability over all labeled pixels in each set.
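A sketch of this state selection follows, assuming the per-pixel class-probability maps from the RFs are available; the layout of probs below is our assumption.

```python
import numpy as np

def finger_state(probs, label_map, front_label, back_label):
    """Binary open/closed state of one finger, per Eq. (2).

    label_map is the per-pixel RF output; probs is assumed to be an
    H x W x 13 array whose channel k holds the probability of label k.
    When both the front and back labels of a finger appear, the side
    with the higher mean probability over its pixels is kept.
    """
    def mean_prob(label):
        ys, xs = np.nonzero(label_map == label)
        return probs[ys, xs, label].mean() if len(xs) else 0.0

    return 1 if mean_prob(front_label) > mean_prob(back_label) else 0  # 1 = open
```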

Table 2 Recognition rules of the number gestures based on the states of the five fingers
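The rule lookup itself reduces to matching the five-element state vector against the rows of Table 2. Since the full table is not reproduced here, the entries below are illustrative only.

```python
# Hypothetical excerpt of the Table 2 rules: keys are
# (f_Thumb, f_Index, f_Middle, f_Ring, f_Pinkie), values are digits.
RULES = {
    (0, 0, 0, 0, 0): 0,  # illustrative: all fingers closed
    (0, 1, 0, 0, 0): 1,  # illustrative: index open
    (0, 1, 1, 0, 0): 2,  # illustrative: index and middle open
    (1, 1, 1, 1, 1): 5,  # illustrative: all fingers open
    # ... remaining rows follow Table 2
}

def recognize_gesture(states):
    return RULES.get(tuple(states))  # None if no rule matches
```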

3 Experimental results

In this part, we first optimized the main parameters on our own synthetic hand DB and then evaluated hand parts recognition through quantitative and qualitative assessments on synthetic and real data. Finally, we evaluated our proposed hand number gesture recognition methodology on real data.

3.1 Parameter optimization

We performed experiments on our own DB to optimize the major parameters, including the number of trees in the RFs, the number of extracted features, and the maximum value of the \((o_1,o_2)\) offset pairs used for feature extraction.

Figure 7 shows the average pixel recognition rates as a function of the number of trees in the RFs. The pixel recognition accuracy increases slowly as the number of trees grows from 1 to around 15. The accuracy peaks at 21 trees and decreases beyond around 25 trees.

Fig. 7 The effects of the number of trees in the RFs on recognition accuracy

As can be seen in Fig. 8, the average pixel recognition rates vary with the selected maximum value of the \((o_1,o_2)\) offset pairs. In this experiment, the distance of the subject from the camera was set to 0.6 meters. The recognition accuracy was highest when the maximum offset value was set to 30, i.e., when the \((o_1,o_2)\) offsets ranged from 0 to 30 pixels. The accuracy decreases slowly for maximum offset values of around 25, 35, 40, 50, and 60.

Fig. 8 The effects of the \((o_1,o_2)\) offset value pair on recognition accuracy

As can be seen in Fig. 9, the number of features extracted per pixel in the RFs yielded maximum recognition accuracy at around 800.

Fig. 9 The effects of the number of pixel features on recognition accuracy

3.2 Results of hand parts recognition

To evaluate our hand parts recognition via the RFs trained on our own synthetic database with the optimized parameters, both qualitatively and quantitatively, we tested on a series of 2,000 hand depth silhouettes covering the ten hand number poses. A qualitative assessment of some hand pose samples on synthetic data is presented in Fig. 10: the hand parts are labeled with different color codes, and their centroid points are shown as large green dots superimposed on the labeled hand parts.

Fig. 10 Sample results of labeled hand parts via RFs on synthetic data

The quantitative assessment results on synthetic data are provided as a confusion matrix over the 12 recognized hand parts in Table 3. As Table 3 reveals, the average recognition rate is highest for the palm part and lowest for the pinkie part. For the finger parts, the recognition rates are higher on the front sides than on the back sides. The average recognition rate over all hand parts was 96.60 %.

Table 3 The accuracy of pixel-based recognition in hand parts via RFs on synthetic data (%)

3.3 Results of hand number gesture recognition

To evaluate our proposed hand number gesture recognition system on real data, we collected a hand pose dataset using the Creative Labs Senz3D Depth and Gesture Camera. A set of 500 hand pose samples was captured from five different subjects: each subject was asked to make 100 hand poses covering the 10 different number gestures in front of the depth camera. The detected centroid points in the labeled hand regions are illustrated in Fig. 11. Since ground truth labels are not available for real data, only qualitative visual inspection is possible. The top and bottom rows of Fig. 11 present the hand depth silhouettes and their pixel-wise labeled hand parts, respectively; the recognized hand parts are shown with different color codes, and the green dots indicate their detected centroids. Figure 12 shows sample results at each step for the four hand number gestures 0, 3, 6, and 9 in our recognition system under a real scenario.

Fig. 11 Sample results of labeled hand parts via RFs on real data

Fig. 12 Some sample results of our hand number gesture recognition system under a real scenario: a input scenario, b hand depth silhouette, c labeled hand parts, and d recognized hand number gesture

Based on the set of derived rules in Table 2 and the states inferred from the detected centroid points in the pixel-wise labeled hand regions, the hand number gestures were recognized. Table 4 presents the confusion matrix of the hand number gesture recognition; the average recognition rate achieved was 97.80 %.

Table 4 The confusion matrix of hand number gesture recognition using our proposed system (%)

3.4 Comparison with another method

In this part, we evaluated our proposed hand number gesture recognition methodology against an edge moments-based methodology, which has been actively utilized for static hand gesture recognition [16–18]. In this methodology, hand detection and segmentation are first applied to color images. From the segmented hand silhouettes, the hand boundaries are extracted by the Canny edge detection algorithm. Next, Hu's moments are computed on the hand boundaries, yielding the moment features. Finally, these features are used as input to an SVM classifier. In our experiments, we utilized color and depth images captured at the same time: a set of 1,000 color images of the hand number gestures was used for training, and a set of 500 color images from five different subjects, collected as in our real-data experiment above, was used for testing. Table 5 shows the results of hand number gesture recognition. As can be seen in Table 5, many gestures are severely confused in the moments-based methodology, such as Gestures 5 and 6, Gestures 7 and 8, and Gestures 8 and 9. This is likely because their gesture shapes are very similar and the moment features could not differentiate them. The average hand number recognition rate of the color image-based approach was 88.20 %, whereas our proposed approach achieved 97.80 %.
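A sketch of this baseline with OpenCV and scikit-learn follows; the paper's exact parameters are not given, so the Canny thresholds and the log-scaling of the moments are our assumptions.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def hu_features(hand_gray):
    """Hu moment features of the hand boundary edges."""
    edges = cv2.Canny(hand_gray, 100, 200)   # hand boundary edges
    hu = cv2.HuMoments(cv2.moments(edges)).flatten()
    # Log-scale the moments, a common normalization (an assumption here)
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)

# X = np.array([hu_features(img) for img in training_images])
# clf = SVC().fit(X, y)                      # SVM gesture classifier
```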

Table 5 The confusion matrix of hand number gesture recognition based on the edge moments-based approach (%)

4 Conclusion and discussion

In this paper, we have presented a novel hand number gesture recognition system based on hand parts labeled via trained RFs from a hand depth silhouette. We created our own synthetic hand DB and then trained the RFs on it, with the parameters optimized on our hand DB. As a result, we achieved a mean recognition rate of 97.80 % over the ten hand number gestures from five different subjects.

In comparison with the conventional color image-based approaches in [2, 11, 23, 25], the main contribution of our proposed approach is that it identifies each finger by recognizing the hand parts and determines the state of each finger by recognizing its front and back sides, thereby identifying the open or closed state of each finger. The hand number gestures are then recognized according to our derived recognition rules over these states.

Our presented work should be applicable to recognizing sign language gestures as well as to human-machine interaction applications such as household appliance control and health-care support systems, which help improve quality of life, especially for people with learning disabilities, disabled people, and the elderly.