1 Introduction

In recent years, the field of human-computer interaction has seen increasingly diverse interaction methods, including speech recognition [16], hand gesture recognition [3], and touch recognition [20]. Hand gesture, as the second most common human communication method, provides solutions for interactive environments that use non-touch interfaces. It has already been applied in many fields, such as virtual reality systems [6], somatosensory games [22], and sign language communication [25]. Meanwhile, the increasing demand for intuitive interaction promotes research on hand gesture recognition.

Fig. 1.

Captured depth images and hand skeleton images. Each hand skeleton consists of 22 joints, including: one joint for the center of the palm, one joint for the position of the wrist and four joints for each finger.

At present, research on hand gesture recognition can be divided into two categories: static and dynamic. The former recognizes hand gestures from a single image, while the latter, which has broader application value, aims to understand the information conveyed by hand gesture sequences. In recent years, low-cost depth sensors have been able to capture the hand pose with reasonably good quality and provide a precise 3D hand skeleton. Fig. 1 shows depth images and hand skeleton images captured by an Intel RealSense camera. Compared with the original RGB images, hand skeleton data provide more intuitive information and are more robust to varying lighting conditions and occlusions. Therefore, skeleton-based dynamic hand gesture recognition has gradually become a research hotspot.

The hand is an object with complex topology and no fixed variation period, which makes skeleton-based dynamic hand gesture recognition a challenging topic. Original skeleton data cannot effectively reflect temporal motion features and spatial structure features. Meanwhile, most methods do not make full use of multi-scale features to provide a discriminative basis for hand gesture recognition. To address these problems, we propose a novel DR-Net for dynamic hand gesture recognition. On the one hand, it uses a temporal representation encoder to obtain short-term and long-term motion features, which have lower intra-class variance and higher inter-class variance; it also introduces the TFM to perceive multi-scale temporal features, which effectively reduces temporal dependency. On the other hand, it uses a spatial representation encoder to obtain low-frequency and high-frequency spatial features, which reduces the impact of location-viewpoint variation on the recognition result, and it uses the SFM to enhance important spatial features. DR-Net adopts the cross-entropy loss as its loss term, which effectively improves the recognition results.

Our main contributions are summarized as follows:

  • We propose a temporal representation encoder and a spatial representation encoder to enrich the original skeleton data, enabling DR-Net to use short-term motion features, long-term motion features, low-frequency spatial features, and high-frequency spatial features as input sources.

  • We design an efficient feature fusion module for DR-Net in the temporal and spatial domains, respectively. Specifically, our proposed TFM can effectively capture multi-scale temporal features, while SFM can effectively enhance important spatial features.

  • We conduct comprehensive experiments on two public benchmark datasets to verify the effectiveness of our method. Related experimental results demonstrate that DR-Net is competitive with the state-of-the-art methods.

2 Related Works

2.1 Skeleton Representations

Different skeleton representations can have an important impact on hand gesture recognition and human action recognition. Li et al. [17] proposed to use the Lie group to model the skeleton representation, which can effectively describe the three-dimensional geometric relationship between joints. Jiang et al. [15] proposed a spatial-temporal skeleton transformation descriptor, which describes the relative transformations of skeletons, including the rotation and translation during movement. Wei et al. [30] proposed a novel high-order joint relative motion feature to describe the instantaneous status of the skeleton joint, which consists of the relative position, velocity, and acceleration. Caetano et al. [2] proposed to encode the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Liu et al. [19] proposed to use 3D hand posture evolution volume and 2D hand movement map to represent hand posture variations and hand movements, respectively.

2.2 Deep Neural Networks

In recent years, deep neural networks have been widely used in dynamic hand gesture recognition and have achieved satisfactory results. Nguyen et al. [23] presented a new neural network for hand gesture recognition that learns a discriminative SPD matrix encoding first-order and second-order statistics. Chen et al. [4] proposed a novel motion feature augmented network for hand gesture recognition. Guo et al. [10] proposed a novel spatial-based GCN, called the normalized edge convolutional network, for hand gesture recognition. Nunez et al. [24] proposed a deep learning approach based on a combination of a convolutional neural network and a long short-term memory network for hand gesture recognition. Chen et al. [5] proposed a dynamic graph-based spatial-temporal attention network for skeleton-based hand gesture recognition. Hou et al. [11] proposed an end-to-end spatial-temporal attention residual temporal convolutional network for hand gesture recognition. Weng et al. [31] proposed a deformable pose traversal convolution network for dynamic hand gesture recognition.

Fig. 2.

The overall architecture of our proposed DR-Net. The temporal perception branch (TPB) consists of the temporal representation encoder (TRE) and the temporal fusion module (TFM). The spatial perception branch (SPB) consists of the spatial representation encoder (SRE) and the spatial fusion module (SFM).

3 Our Approach

3.1 Overview

As shown in Fig. 2, our proposed DR-Net mainly contains a temporal perception branch and a spatial perception branch. The former uses the TRE to extract long-term and short-term motion features, and the TFM to effectively capture and fuse multi-scale temporal features. The latter uses the SRE to extract high-frequency and low-frequency spatial features, and the proposed SFM to effectively enhance and fuse important spatial features. To balance model size and recognition accuracy, DR-Net adopts two consecutive TFMs and SFMs.
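For concreteness, the following is a minimal sketch of how the four input sources could feed the two branches and be fused for classification. It is not our exact implementation: the layer widths and the single convolutions standing in for the stacked TFMs and SFMs are illustrative assumptions.

```python
from tensorflow.keras import layers, Model

# Hypothetical shapes: T = 32 frames, J = 22 joints, 14 gesture classes.
T, J, NUM_CLASSES = 32, 22, 14

def dr_net_sketch():
    # Temporal perception branch inputs (Sect. 3.2): frame differences.
    short_in = layers.Input(shape=(T - 1, J * 3), name="short_term")
    long_in = layers.Input(shape=(T - 1, J * 3), name="long_term")
    # Spatial perception branch inputs (Sect. 3.3): per-frame upper
    # triangles of the J x J frequency-filtered distance maps.
    low_in = layers.Input(shape=(T, J * (J - 1) // 2), name="low_freq")
    high_in = layers.Input(shape=(T, J * (J - 1) // 2), name="high_freq")

    # Single convolutions stand in for the two stacked TFMs / SFMs.
    tpb = layers.Concatenate()([short_in, long_in])
    tpb = layers.Conv1D(64, 3, padding="same", activation="relu")(tpb)
    tpb = layers.GlobalAveragePooling1D()(tpb)

    spb = layers.Concatenate()([low_in, high_in])
    spb = layers.Conv1D(64, 3, padding="same", activation="relu")(spb)
    spb = layers.GlobalAveragePooling1D()(spb)

    # Fuse both branches and classify with a softmax (cross-entropy loss).
    fused = layers.Concatenate()([tpb, spb])
    out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)
    return Model([short_in, long_in, low_in, high_in], out)
```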

3.2 Temporal Perception Branch

Dynamic hand gesture recognition needs not only to obtain the spatial information between the joints within a frame, but also to extract the temporal information of each joint across frames. To this end, we propose the temporal representation encoder to process original skeleton data. In this paper, we assume the total number of frames is T and the number of joints in each frame is J. In the 3D Cartesian coordinate system, the j-th skeleton joint of the t-th frame can be expressed as \(S_{j}^{t}=(x_{j}^{t},y_{j}^{t},z_{j}^{t})\), and the set of all skeleton joints in the t-th frame of the k-th hand gesture as \(G_{k}^{t}=\left\{ S_{1}^{t},S_{2}^{t},S_{3}^{t},\cdots ,S_{J}^{t}\right\} \). Our temporal representation encoder derives two different temporal skeleton representations as input sources. As shown in Fig. 3, short-term motion features are the differences between adjacent frames, while long-term motion features are the differences between all other frames and the first skeleton frame.

Fig. 3.

Illustration of the temporal representation encoder. The left figure shows the acquisition method of short-term motion features, and the right figure shows the acquisition method of long-term motion features.
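Both temporal representations reduce to simple frame differences. A minimal sketch in NumPy, with the array shape (T, J, 3) assumed from the notation above:

```python
import numpy as np

def temporal_representations(skeleton):
    """Compute the two temporal skeleton representations of Sect. 3.2.

    skeleton: array of shape (T, J, 3) holding x, y, z per joint per frame.
    """
    # Short-term motion: difference between adjacent frames, shape (T-1, J, 3).
    short_term = skeleton[1:] - skeleton[:-1]
    # Long-term motion: difference between every later frame and the
    # first skeleton frame, shape (T-1, J, 3).
    long_term = skeleton[1:] - skeleton[0]
    return short_term, long_term
```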

To effectively fuse short-term and long-term motion features, we design the temporal fusion module, which helps DR-Net obtain multi-scale motion features. As shown in Fig. 4, the TFM has two inputs and two outputs. For two consecutive TFMs, the outputs of the former serve as the inputs of the latter, and the latter fuses them with a concatenation operation. The TFM processes motion features with convolution kernels of different receptive-field scales, which tolerates the variety of temporal extents within a complex hand gesture. Specifically, we fuse the short-term and long-term motion features and send them into three branches whose kernel sizes are all 3 and whose dilation rates are 1, 3, and 5, respectively. We aggregate the multi-scale results by concatenation rather than addition, which avoids loss of information. Finally, we apply an average pooling operation to the aggregated motion features.

Fig. 4.

The left figure shows the overall architecture of the TFM. The right figure shows the internal details of the TFM.
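The multi-scale perception inside the TFM can be sketched as follows. The three dilated convolutions follow the kernel sizes and dilation rates given above; the filter count and the pooling size are illustrative assumptions.

```python
from tensorflow.keras import layers

def temporal_fusion_module(x, filters=32):
    """Sketch of the multi-scale perception inside the TFM (Fig. 4).

    x: tensor of shape (batch, T, C) holding the fused motion features.
    """
    # Three parallel 1D convolutions, kernel size 3, dilation rates 1, 3, 5,
    # covering receptive fields over different temporal extents.
    branches = [
        layers.Conv1D(filters, 3, dilation_rate=d, padding="same",
                      activation="relu")(x)
        for d in (1, 3, 5)
    ]
    # Concatenation rather than addition, to avoid losing information.
    fused = layers.Concatenate()(branches)
    # Average pooling aggregates the multi-scale motion features.
    return layers.AveragePooling1D(pool_size=2)(fused)
```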

Fig. 5.

Images generated by the spatial representation encoder. The left figure is the original spatial map. The middle figure is the high-frequency spatial map. The right figure is the low-frequency spatial map.

3.3 Spatial Perception Branch

To enhance the generalization ability of the model, we often need to apply data augmentation operations such as translation, flipping, and rotation. However, Cartesian coordinate features are very sensitive to these operations, whereas geometric features fully reflect the spatial relationships of the skeleton joints. Therefore, we design an effective spatial representation encoder to extract geometric features. As shown in Fig. 5, the SRE generates three different skeleton maps. For each skeleton frame, the original spatial map is obtained by computing the Euclidean distance between every pair of joints:

$$\begin{aligned} D_{i,j}^{t}=\sqrt{\left( x_{i}^{t}-x_{j}^{t}\right) ^{2}+\left( y_{i}^{t}-y_{j}^{t}\right) ^{2}+\left( z_{i}^{t}-z_{j}^{t}\right) ^{2}} \end{aligned}$$
(1)

where i and j are the indices of the skeleton joints, and x, y, and z are the joint coordinates on the respective axes. Next, we use the fast Fourier transform to convert the spatial-domain image into a frequency-domain image, and apply a circular filter to remove either the low-frequency or the high-frequency information. Finally, we use the inverse fast Fourier transform to generate the high-frequency and low-frequency spatial maps. To reduce redundant information, we keep only the upper triangle of each map to represent the geometric features of each hand skeleton.
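A sketch of the SRE for one skeleton frame, assuming a hypothetical cutoff radius for the circular filter (the text does not specify it):

```python
import numpy as np

def spatial_maps(frame, radius=8):
    """Generate the three spatial maps of Sect. 3.3 for one skeleton frame.

    frame: array of shape (J, 3); radius is an assumed cutoff for the
    circular frequency filter.
    """
    # Original spatial map: Euclidean distance between any two joints (Eq. 1).
    diff = frame[:, None, :] - frame[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))

    # Circular mask around the (shifted) zero-frequency component.
    J = dist.shape[0]
    yy, xx = np.ogrid[:J, :J]
    mask = (yy - J // 2) ** 2 + (xx - J // 2) ** 2 <= radius ** 2

    # FFT to the frequency domain, filter, then inverse FFT back.
    spectrum = np.fft.fftshift(np.fft.fft2(dist))
    low = np.abs(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
    high = np.abs(np.fft.ifft2(np.fft.ifftshift(spectrum * ~mask)))

    # Keep only the upper triangle to drop redundant (symmetric) entries.
    iu = np.triu_indices(J, k=1)
    return dist[iu], low[iu], high[iu]
```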

To effectively fuse low-frequency and high-frequency spatial features, we design a spatial fusion module with an attention mechanism, similar in structure to the temporal fusion module. The SFM also has two inputs and two outputs. In the upper and lower branches, we use 1D convolutions with different dilation rates to perceive multi-scale spatial features and fuse them by concatenation. In addition, we subtract the low-frequency features from the high-frequency features to obtain salient difference features, feed this difference into the middle branch, and use a basic block and a fully connected layer to obtain attention weights, which re-weight the features by multiplication.
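A sketch of the SFM under the same conventions; the filter counts, the dilation rates of the two branches, and the exact form of the "basic block" are assumptions, since only the overall structure is described above.

```python
from tensorflow.keras import layers

def spatial_fusion_module(low, high, filters=32):
    """Sketch of the SFM: multi-scale perception plus difference attention.

    low, high: tensors of shape (batch, T, C) holding the low- and
    high-frequency spatial features.
    """
    # Upper and lower branches: dilated 1D convolutions, then concatenation.
    def branch(x):
        convs = [layers.Conv1D(filters, 3, dilation_rate=d, padding="same",
                               activation="relu")(x) for d in (1, 3)]
        return layers.Concatenate()(convs)

    low_out, high_out = branch(low), branch(high)

    # Middle path: high minus low yields the salient difference features;
    # a small block plus a fully connected layer turns them into weights.
    diff = layers.Subtract()([high, low])
    w = layers.Conv1D(2 * filters, 3, padding="same", activation="relu")(diff)
    w = layers.Dense(2 * filters, activation="sigmoid")(w)

    # Re-weight both branch outputs by multiplication.
    return layers.Multiply()([low_out, w]), layers.Multiply()([high_out, w])
```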

4 Experiments

In this section, we evaluate our method on two public datasets: the FPHA dataset and the SHREC'17 Track dataset. Extensive ablation studies and comparative results show the effectiveness of our model.

4.1 Datasets

SHREC’17 Track Dataset. The SHREC’17 Track Dataset [27] is a public dynamic hand gesture dataset, which contains 2800 sequences. 28 participants perform each gesture between 1 and 10 times in two ways: using one finger and the whole hand. The depth images and hand skeletons are captured at 30 frames per second, with a resolution of 640\(\times \)480. The length of hand gestures ranges from 20 to 50 frames. Each skeleton frame provides the coordinates of 22 hand joints in the 3D world space, which forms a full hand skeleton.

FPHA Dataset. The FPHA dataset [9] is a challenging 3D hand pose dataset, which provides first-person dynamic hand gestures interacting with 3D objects. The dataset contains 1,175 action videos of 45 different action categories, performed by 6 actors in 3 different scenarios. It provides the 3D coordinates of 21 hand joints (all joints except the palm). We use the 1:1 setting, with 600 sequences for training and 575 sequences for testing.

4.2 Implementation Details

We perform all our experiments on an NVIDIA GeForce RTX 2080 Ti with Keras using the TensorFlow backend. The learning rate is initially set to 0.001; if the loss remains unchanged for 25 epochs, it is multiplied by 0.5, down to a minimum of \(10^{-8}\). The batch size is set to 64 and the network is trained for 400 epochs. We employ the Adam optimizer with default parameters. Besides, we apply median filtering to preprocess the original skeleton data and use linear interpolation to adjust skeleton sequences of different lengths to 32 frames. To avoid over-fitting, we set the dropout rate to 0.5.
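The schedule above maps directly onto standard Keras utilities. A sketch, where the median filter window is an assumption (the text does not specify its size):

```python
import numpy as np
from scipy.ndimage import median_filter
from tensorflow.keras import callbacks, optimizers

def resample(seq, target_len=32):
    """Linearly interpolate a (T_orig, J, 3) sequence to target_len frames."""
    t_old = np.linspace(0.0, 1.0, len(seq))
    t_new = np.linspace(0.0, 1.0, target_len)
    flat = seq.reshape(len(seq), -1)
    out = np.stack([np.interp(t_new, t_old, flat[:, c])
                    for c in range(flat.shape[1])], axis=1)
    return out.reshape(target_len, *seq.shape[1:])

def preprocess(seq):
    # Median filtering along the temporal axis (window size is assumed),
    # then resampling to 32 frames.
    return resample(median_filter(seq, size=(3, 1, 1)))

# Optimizer and learning-rate schedule as described above.
optimizer = optimizers.Adam(learning_rate=1e-3)
reduce_lr = callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5,
                                        patience=25, min_lr=1e-8)
# model.compile(optimizer=optimizer, loss="categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, epochs=400,
#           callbacks=[reduce_lr])
```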

4.3 Ablation Study

Different Network Branches. To examine the influence of different network branches on hand gesture recognition accuracy, we conduct ablation experiments with different input sources. As shown in Table 1, the performance of the temporal perception branch is significantly better than that of the spatial perception branch. In addition, compared with long-term motion features, short-term motion features provide a more discriminative recognition basis for the network. Meanwhile, fusing the SPB and the TPB yields better recognition results, achieving 96.31% and 93.21% accuracy on 14 and 28 hand gestures, respectively.

Table 1. Recognition accuracy (%) of our method for different network branches on the SHREC’17 Track dataset.
Table 2. Recognition accuracy (%) of our method for different joint distances on the SHREC’17 Track dataset.

Different Joint Distances. To investigate the influence of different joint distances on recognition accuracy, we design four related ablation experiments. As shown in Table 2, using the Euclidean distance as the metric obtains the best recognition performance, reaching 96.31% and 93.21% on 14 and 28 hand gestures, respectively. We find that the correlation distance gives the worst recognition result, mainly because it reduces the inter-class variance to a certain extent. The cityblock distance also obtains satisfactory results, as it reflects the spatial position relationships of the skeleton joints.
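For reference, the three metrics named above can be computed with scipy.spatial.distance.cdist; the fourth metric in Table 2 is not named in the text, so it is omitted here:

```python
import numpy as np
from scipy.spatial.distance import cdist

frame = np.random.rand(22, 3)  # one skeleton frame, J = 22 joints
# Joint-to-joint distance maps under the metrics named in the text
# (SciPy metric names).
for metric in ("euclidean", "cityblock", "correlation"):
    d = cdist(frame, frame, metric=metric)  # shape (22, 22)
    print(metric, d[0, 1])
```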

4.4 Comparison with the State-of-the-Art

In this section, we compare our method with the state-of-the-art methods on the SHREC'17 Track dataset [27] and the FPHA dataset [9]. For the former, we use 1960 sequences for training and 840 sequences for testing. As shown in Table 3, our proposed DR-Net achieves 96.31% and 93.21% recognition accuracy for 14 and 28 hand gestures, respectively. DR-Net adopts temporal and spatial skeleton representations as input sources, which effectively improves the recognition results. For the latter, we quote the results of the compared methods from [23] to demonstrate the effectiveness of DR-Net. As shown in Table 4, our approach outperforms the state-of-the-art methods. Moreover, ST-TS-HGR-NET [23] uses an SVM for hand gesture recognition, which is not suitable for larger datasets, while HMM+HPEV [19] uses deep neural networks, which perform poorly on smaller datasets. These results demonstrate that our method is suitable for datasets of various sizes.

Table 3. Comparison of recognition accuracy (%) with the state-of-the-art methods on the SHREC’17 Track dataset.
Table 4. Comparison of recognition accuracy (%) with the state-of-the-art methods on the FPHA dataset (C, D, S represent color images, depth images and skeleton data).

5 Conclusion

In this paper, we propose a novel DR-Net for skeleton-based dynamic hand gesture recognition, which decouples the original skeleton data into different skeleton representations as input. For the temporal perception branch, we use short-term and long-term motion features as the temporal skeleton representation, and design the TFM to capture multi-scale temporal features. For the spatial perception branch, we use low-frequency and high-frequency spatial features as the spatial skeleton representation, and propose the SFM to enhance important spatial features. Experimental results on two public benchmark datasets demonstrate that our proposed DR-Net is competitive with the state-of-the-art methods. At present, our method cannot adaptively learn spatial geometric relationships, which leads to unsatisfactory performance of the spatial perception branch. In the future, we intend to use GCNs to automatically learn the spatial geometric relationships between joints.