
1 Introduction

Human pose estimation (HPE), which aims to build representations of the human body (such as the body skeleton and body shape), is a longstanding computer vision problem. 3D human pose estimation has numerous applications, including motion recognition and analysis, human-computer interaction, virtual reality (VR), and security identification. However, the task is extremely challenging due to 1) the depth ambiguity of the 2D-to-3D transformation, where a single 2D keypoint corresponds to multiple 3D poses, 2) the scarcity of labeled datasets owing to the high cost of obtaining labels, and 3) self-occlusion of the human pose. Thanks to the success of deep neural networks, the generalization performance of deep-learning-based methods has risen sharply [13, 16, 25], giving them broad prospects.

Fig. 1. Architecture of our proposed global-local features fusion network for 3D human pose estimation. Parallel full-connection and group-connection modules capture the global and local features of human pose, respectively.

Improving the generalization performance of deep-learning-based 3D human pose estimation models remains a challenging problem. Surprisingly, we found some inspiring observations in existing 3D human pose estimators: the prediction errors of human body keypoints correlate strongly with the structure of the human pose, as shown in Fig. 2. Moreover, human poses across different actions share high similarities in local features; some examples are shown in Fig. 2(c). These observations indicate that communal global features exist across actions, while regular, similar local features recur in poses of different action categories. This raises the question of whether a model for 3D human pose estimation can be designed around this inspiring pattern.

To realize this idea, we design a parallel fusion network in which a full-connection network and a group-connection network learn to capture the global and local features of human pose, respectively. The full-connection network connects all input and output features indiscriminately to learn global information. The group-connection network connects input and output features within each group, together with a global information representation, to focus on learning the local information of human pose. Based on this motivation, a parallel fusion network that learns human structural local and global joint features (JointFusionNet) is designed for 3D human pose estimation, as shown in Fig. 1.

In extensive comparisons to state-of-the-art techniques, JointFusionNet exhibits considerable performance gains on 3D human pose estimation. More importantly, experiments show that JointFusionNet not only outperforms previous work, but also yields large improvements on poses with more similar local features, such as Sit, Greet and Phone. Moreover, various ablation studies validate our proposed approach. The main contributions are summarized as follows:

  • Global- and local-feature capture modules are proposed to model human pose based on the inspiring observations.

  • A network structure that fuses the global and local joint features of human pose is designed to improve the estimation performance.

  • Extensive comparisons and various ablation studies validate our proposed JointFusionNet for single-frame 3D human pose estimation.

2 Related Works

Extensive research has been conducted on 3D human pose estimation and global-local features fusion. In the following, we briefly review methods that are related to our approach.

2.1 3D Human Pose Estimation

Existing deep-learning-based 3D pose estimation methods mainly follow two frameworks: end-to-end methods and two-stage methods. End-to-end methods regress 3D pose directly from the input image [15]; although they avoid the error accumulation of two stages, labeled datasets for them are extremely expensive to acquire. Thanks to the high accuracy of 2D pose estimators, the two-stage approach has become the dominant solution for 3D human pose estimation. Two-stage methods [13, 25, 26] first employ off-the-shelf 2D pose estimators to extract the 2D pose from the input image and then establish the mapping from 2D pose to 3D pose. Considering the sequential information in video, Pavllo et al. [16] proposed a network that combines multi-frame sequence information for pose estimation. Multi-view fusion methods [5] have also been applied to 3D human pose estimation, exploiting the multiple camera views or sensors naturally present in datasets or in practice. Furthermore, transformer networks based on the attention mechanism [10] have been used to mine spatial and temporal information for 3D human pose estimation. Meanwhile, the lack of diversity in existing labeled 3D human pose datasets restricts the generalization ability of deep-learning-based methods. Therefore, Li et al. [9] proposed a method to synthesize massive paired 2D-3D human skeletons with an evolution strategy. JointPose [21] further jointly performs pose network estimation and data augmentation by designing a reward/penalty strategy for effective joint training. In this paper, we focus on the universal transformation from 2D pose to 3D pose in the two-stage framework, fusing the global and local features of human pose as much as possible at the same time.

2.2 Global-Local Features Fusion

The idea of considering global and local features together has long been applied in deep learning models, such as the part-based branching network [17] for 2D human pose estimation. Martinez et al. [13] proposed a simple yet effective full-connection network to learn the 2D-to-3D mapping, in which the keypoints are fully connected but local connection features receive no attention. Based on the connection relationships of a graph model, Ma et al. [11] proposed a pose estimation model that considers context node information but does not consider local groups based on human pose structure. Zeng et al. [22] proposed a grouping-and-reorganization pose estimation model based on local groups derived from human structure, which does not fully consider the global information of human pose. In this paper, we consider the global and local joint features of human pose in parallel and hence propose JointFusionNet, which fuses global and local features for 3D human pose estimation.

3 Method

In this section, we propose global- and local-feature capture modules to learn the observed patterns of human pose and design a parallel fusion network for 3D human pose estimation.

3.1 Inspiring Pattern of Human Pose

The inspiring, regular patterns observed in human pose can be applied to 3D human pose estimation models. Pattern I concerns estimation performance across keypoints in an existing 3D human pose estimator [22]: the estimation performance differs substantially across keypoints and follows a regular pattern, as shown in Fig. 2(a) and Fig. 2(b). Pattern II concerns the similar local features of human pose: poses across different actions share specific similarities in local features. Some examples from H36M are shown in Fig. 2(c).

Fig. 2. Inspiring observation patterns in human pose.

The pattern of estimation performance across keypoints reveals that per-keypoint performance correlates strongly with the structure of the human pose. The closer a keypoint is to the center of the body (like the hip, shoulder and spine), the smaller its estimation error; the farther away it is (like the foot, wrist and head), the greater its error compared to a structurally adjacent keypoint. Keypoints with low error represent global features of human pose, which are related to all keypoints; keypoints with high error represent local features, which are related to only a subset of keypoints. Based on Pattern I, and also considering the partitioning of human pose used in [14], the keypoints are divided into groups, for example, the 3 groups shown in different colors in Fig. 2(a) and Fig. 2(b).
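To make this concrete, the snippet below shows one plausible partition of a 17-joint Human3.6M-style skeleton into three groups; the joint indices are an illustrative assumption on our part, not the paper's exact grouping.

```python
# Hypothetical 3-way grouping of a 17-joint Human3.6M-style skeleton.
# The joint indices follow a common H36M ordering and are an
# illustrative assumption, not the paper's exact partition.
GROUPS = {
    "torso_head": [0, 7, 8, 9, 10],          # hip, spine, thorax, neck, head
    "legs":       [1, 2, 3, 4, 5, 6],        # right/left hip, knee, ankle
    "arms":       [11, 12, 13, 14, 15, 16],  # shoulders, elbows, wrists
}
```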

The pattern of similar local features reveals that poses across different actions share high similarities in local features. Given the structure and kinematics of the human body, the local features are limited in number and appear repeatedly across actions. For example, the lower-body local feature of sitting appears in the Phone action category, and the lower-body local feature of standing appears in Purchases, Greet and so on. Furthermore, the groups of local features are not strongly related to each other; for example, the postures of the arms and of the legs are not highly correlated.

3.2 Global and Local Features Fusion

Inspired by the observed patterns of estimation error across keypoints and of similar local features, we exploit these patterns to design the architecture of our 3D pose estimation network and propose JointFusionNet, as shown in Fig. 1. In JointFusionNet, the 2D pose is estimated from the RGB image by an existing 2D human pose estimator; then the full-connection module and the group-connection module capture the global and local features of human pose. Finally, these features are fused and regressed to the 3D pose.

We propose the global- and local-feature capture modules, as shown in Fig. 3, to learn to capture the global and local feature information of human pose. A parallel structure is then used to design the fusion network that fuses the learned features.

Fig. 3. Global-local features capture module.

The global-feature capture module is a full-connection layer network (FCN) [13] that learns the global features of human pose by processing an encoding vector representing the global information of the human body, as shown in Fig. 3(a). Note that in this module, every output feature and every intermediate feature is connected to all input features indiscriminately, allowing it to learn the global information represented by each feature. Residual connections [4] are also used to improve generalization performance.
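A minimal PyTorch sketch of such a residual full-connection module, in the spirit of the baseline [13], is given below; the hidden width, dropout rate, and number of blocks are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualFCBlock(nn.Module):
    """One residual full-connection block in the spirit of [13];
    hidden width and dropout rate are illustrative assumptions."""

    def __init__(self, dim=1024, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection [4]

class GlobalFCN(nn.Module):
    """Global-feature capture module: every feature is connected to all
    input features, so the output encodes the whole pose."""

    def __init__(self, num_joints=17, dim=1024, num_blocks=2):
        super().__init__()
        self.inp = nn.Linear(2 * num_joints, dim)  # lift flattened 2D pose
        self.blocks = nn.Sequential(*[ResidualFCBlock(dim) for _ in range(num_blocks)])

    def forward(self, x):                # x: (batch, 2 * num_joints)
        return self.blocks(self.inp(x))  # global feature F_global
```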

The local-feature capture module is a group-connection layer network (GCN) with a Low-Dimensional Global Context (LDGC) [22] that learns the local features of human pose by processing encoding vectors representing the grouped local information of the human body, as shown in Fig. 3(b). Following the observed patterns and previous research [14, 22], we divide the keypoints into groups, which are used to capture the local features of human pose, and the Low-Dimensional Global Context learns to represent the relationship between the local features and the whole pose.
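The sketch below illustrates one way such a group-connection module with a low-dimensional global context might be realized; the group sizes, context dimension, and layer widths are our assumptions.

```python
import torch
import torch.nn as nn

class GroupGCN(nn.Module):
    """Local-feature capture module sketch: one small MLP per keypoint
    group, each conditioned on a low-dimensional global context (LDGC)
    in the spirit of [22]. Group sizes and widths are assumptions."""

    def __init__(self, group_sizes=(5, 6, 6), ctx_dim=16, dim=128):
        super().__init__()
        # LDGC: a compact summary of the whole pose shared by all groups.
        self.ctx = nn.Linear(2 * sum(group_sizes), ctx_dim)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Linear(2 * n + ctx_dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )
            for n in group_sizes
        )

    def forward(self, x_groups, x_full):
        # x_groups: list of (batch, 2 * N_k) tensors; x_full: (batch, 2 * N)
        c = self.ctx(x_full)
        return [
            branch(torch.cat([xg, c], dim=-1))
            for branch, xg in zip(self.branches, x_groups)
        ]
```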

Given the keypoints of a 2D human pose \( X = \left\{ X_i \mid i = 1, \ldots , N \right\} \in \mathbb {R}^{2N} \), where N is the number of keypoints, the global feature of human pose can formally be expressed as

$$\begin{aligned} F_{global} = FCN(X). \end{aligned}$$
(1)

The keypoints can then be divided into groups \( X^k = \left\{ X_i^k \mid i = 1, \ldots , N_k \right\} \in \mathbb {R}^{2N_k} \), where k indexes the group and \(N_k\) represents the number of keypoints in the \(k^{th}\) group. The local feature of human pose can be expressed as

$$\begin{aligned} F_{local}^k = GCN^k(X^k), \end{aligned}$$
(2)

where \(F_{local}^k\) represents the local feature of the \(k^{th}\) group.
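Combining Eqs. (1) and (2), a minimal sketch of the parallel fusion could look as follows; it reuses the GlobalFCN and GroupGCN sketches above, and the concatenation-plus-linear fusion head is our simplification, not the paper's exact design.

```python
import torch
import torch.nn as nn

class JointFusionSketch(nn.Module):
    """Parallel fusion sketch: concatenate F_global with every F_local^k
    and regress the 3D pose. Assumes the GlobalFCN and GroupGCN sketches
    above; the linear fusion head is our simplification."""

    def __init__(self, num_joints=17, global_dim=1024,
                 group_sizes=(5, 6, 6), local_dim=128):
        super().__init__()
        assert sum(group_sizes) == num_joints
        self.global_net = GlobalFCN(num_joints, global_dim)
        self.local_net = GroupGCN(group_sizes, dim=local_dim)
        fused_dim = global_dim + local_dim * len(group_sizes)
        self.head = nn.Linear(fused_dim, 3 * num_joints)  # 3D pose output

    def forward(self, x, x_groups):
        f_global = self.global_net(x)            # Eq. (1)
        f_locals = self.local_net(x_groups, x)   # Eq. (2), all groups k
        return self.head(torch.cat([f_global, *f_locals], dim=-1))
```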

Fig. 4. Conceptual comparison of parallel and sequential arrangements of the full-connection module and group-connection module.

Arrangement of Feature Capture Modules. The full-connection module and the group-connection module capture the global and local features of human pose, respectively. Once the representations of the global and local features are determined, how to fuse them becomes the key issue. The global-feature capture module and the local-feature capture module can be placed in a parallel or a sequential manner, as shown in Fig. 4. Based on the previous feature-fusion-based method [22] and our experiments, the parallel arrangement gives better results than the sequential arrangement, i.e., learning the local and global features separately and then fusing the two, as sketched below.
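The contrast between the two arrangements can be summarized in a few lines; the modules below are toy stand-ins with arbitrary dimensions.

```python
import torch
import torch.nn as nn

# Toy stand-ins, just to contrast the two arrangements in Fig. 4.
global_mod = nn.Linear(34, 64)   # plays the role of the full-connection module
local_mod = nn.Linear(34, 64)    # plays the role of the group-connection module
fuse = nn.Linear(128, 51)        # fusion/regression head

x = torch.randn(8, 34)           # a batch of flattened 2D poses

# Parallel (used in JointFusionNet): both modules see the raw input, then fuse.
y_parallel = fuse(torch.cat([global_mod(x), local_mod(x)], dim=-1))

# Sequential: the local module consumes the global module's features.
local_on_global = nn.Linear(64, 51)
y_sequential = local_on_global(global_mod(x))
```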

4 Experiments

In this section, we quantitatively evaluate the effectiveness of JointFusionNet, visualize the observed patterns, and further explain the performance of JointFusionNet across actions. The ablation study analyzes the effects of global and local features, the representation dimension, and the grouping strategy.

Table 1. The MPJPE (mm) of the SOTA methods on the H36M dataset under Protocol #1 and Protocol #2, respectively. Best performance is marked in bold. Dim: representation dimension.

4.1 Datasets, Evaluation Metrics and Details

Human3.6M [6] is a large benchmark widely used for 3D human pose estimation, with 11 professional actors captured by a motion capture system. Following convention, data from 5 actors (subjects 1, 5, 6, 7, 8) are used for training, and data from 2 other actors (subjects 9, 11) are used for testing. We use MPJPE and PA-MPJPE for evaluation.

3DPW [12] is the first in-the-wild dataset with more complicated motions and scenes for 3D human pose estimation evaluation. To verify the generalization of the proposed method, we use its test set for evaluation with MPJPE and PA-MPJPE as metrics.

Evaluation Metrics. Following convention, we use the mean per joint position error (MPJPE) [6] for evaluation, as follows

$$\begin{aligned} MPJPE = \frac{1}{N} \sum _{i=1}^{N} {\Vert J_i - J_i^* \Vert }_2, \end{aligned}$$
(3)

where N is the number of joints, and \(J_i\) and \(J_i^* \) are respectively the ground-truth position and the estimated position of the \(i\)th joint. Protocol #1 computes this error directly; Protocol #2 (Procrustes Analysis MPJPE, PA-MPJPE) computes it after a rigid transformation aligning the prediction to the ground truth.
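For reference, both metrics can be sketched in a few lines of NumPy; the Procrustes alignment below follows the standard similarity-transform solution and is our implementation, not the authors' code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Protocol #1: mean per-joint position error, Eq. (3).
    pred, gt: (N, 3) arrays of per-joint 3D positions."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def pa_mpjpe(pred, gt):
    """Protocol #2: MPJPE after similarity alignment of pred to gt
    via Procrustes analysis (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:      # avoid an improper reflection
        Vt[-1] *= -1
        s[-1] *= -1
        R = (U @ Vt).T
    scale = s.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R.T + mu_g, gt)
```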

Implementation Details. To train the 3D human pose estimation network, we adopt the Adam optimizer [8] with a learning rate initialized to 0.001 and decayed at a rate of 0.95 after each epoch. We train the JointFusionNet model for 60 epochs in the PyTorch framework on an NVIDIA RTX 2080 Ti GPU.
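This setup can be reproduced with a few lines of PyTorch; the model below is a placeholder stand-in, and ExponentialLR realizes the per-epoch 0.95 decay described above.

```python
import torch
import torch.nn as nn

model = nn.Linear(34, 51)  # placeholder stand-in for JointFusionNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(60):
    x = torch.randn(8, 34)          # dummy batch of flattened 2D poses
    loss = model(x).pow(2).mean()   # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                # lr <- lr * 0.95 after each epoch
```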

4.2 Comparison with State-of-the-Art Methods

In this setting, we use the 2D pose detected by an off-the-shelf 2D pose estimator as the input to JointFusionNet, set the representation dimension to 4096, and use the 5-group strategy. We first compare our proposed method with the state-of-the-art methods using the standard subject protocol under Protocol #1 and Protocol #2. Table 1 shows that JointFusionNet yields an overall improvement over state-of-the-art methods, indicating strong generalization ability for 3D human pose estimation.

4.3 Cross-Dataset Results on 3DPW

In this setting, we examine the cross-dataset generalization ability of JointFusionNet by training the model on the Human3.6M training set and evaluating on the 3DPW test set. JointFusionNet generally outperforms previous work by a large margin. Notably, it yields an overall improvement of 14.8 mm (a relative 13.8\(\%\) improvement) over the previous best method [2] on the 3DPW dataset. As shown in Table 2, the proposed approach achieves the best cross-dataset generalization performance.

Table 2. Performance on the 3DPW test set
Fig. 5. Visualization of example poses with large and slight improvements over the 3D HPE baseline across actions.

4.4 Visualization and Explanation

This section visualizes example human poses with similar local features in H36M and explains the performance comparison between JointFusionNet and the 3D HPE baseline method across actions, as shown in Fig. 5. The local features of sitting and of standing in the lower body appear similarly across different action categories. Correspondingly, the estimation performance on actions with more similar local features (such as Sit and Phone) improves greatly (relative improvements of 26.4\(\%\) and 36.8\(\%\) over the 3D HPE baseline method [13]). Conversely, actions with relatively few similar local features (such as Photo and SitD) still improve, though it is difficult for JointFusionNet to fully learn the relationship between the global and local features of an action with few similar local features.

4.5 Ablation Study

Effect of Global and Local Features. In the proposed JointFusionNet, the global-feature capture module and the local-feature capture module focus on learning the global and local features of human pose, respectively. This set of experiments therefore explores the roles of the global and local features separately, using the global-feature capture module, the local-feature capture module, and the parallel global-local-feature capture module to capture features, respectively. Compared to capturing the global or local features individually, the proposed global-local features fusion network is more effective, as shown in Table 3.

Table 3. Performance under capturing different features

Effect of Representation Dimension. The global-feature capture module uses a high-dimensional feature representation of human pose. In this experiment, we vary the representation dimension to explore its effect. Higher-dimensional features have the potential to capture more complex interconnections, although they inevitably pose challenges to network training, as shown in Table 4.

Table 4. Performance under different representation dimensions

Effect of Grouping Strategy. We compare the results of using different numbers of local groups in Table 5. Although more complex grouping strategies are possible, we evaluate 3 commonly used strategies to explore the effect of grouping. The grouping reflects the structural information of the human body. Performance is best when the grouping method is consistent with intuitive perception, indicating that a strong physical relationship among the joints in a group is a prerequisite for learning effective local features.

Table 5. Performance under different grouping strategies

5 Conclusion

In this paper, we proposed JointFusionNet, a structural global and local joint features fusion approach based on the inspiring observed patterns, which improves generalization performance in 3D human pose estimation. The key idea is a parallel fusion network that captures global and local features for more effective learning. Experimental results and ablation studies show that JointFusionNet outperforms state-of-the-art techniques, especially for poses with more similar local features.