1 Introduction

3D human pose and mesh estimation aims to recover 3D human joint and mesh vertex locations simultaneously. It is a challenging task due to the depth and scale ambiguity, and the complex human body and hand articulation. There have been diverse approaches to address this problem, and recently, deep learning-based methods have shown noticeable performance improvement.

Most of the deep learning-based methods rely on human mesh models, such as SMPL  [32] and MANO  [48]. They can be generally categorized into a model-based approach and a model-free approach. The model-based approach trains a network to predict the model parameters and generates a human mesh by decoding them  [4,5,6, 23, 27, 29, 41, 42, 46]. On the contrary, the model-free approach regresses the coordinates of a 3D human mesh directly  [13, 28]. Both approaches compute the 3D human pose by multiplying the output mesh with a joint regression matrix, which is defined in the human mesh models  [32, 48].

Although the recent deep learning-based approaches have shown significant improvement, they have two major drawbacks. First, they cannot benefit from the training data of controlled settings  [19, 22], which have accurate 3D annotations, without overfitting to the image appearance. This overfitting occurs because the image appearance in the controlled settings, such as monotonous backgrounds and the simple clothes of subjects, is quite different from that of in-the-wild images. The second drawback is that the pose parameters of the human mesh models might not be an appropriate regression target, as addressed in Kolotouros et al.  [28]. The SMPL pose parameters, for example, represent 3D rotations in an axis-angle form, which can suffer from the non-uniqueness problem (i.e., periodicity). While many works  [23, 29, 41] tried to avoid the periodicity by using a rotation matrix as the prediction target, it still has a non-minimal representation issue.

Fig. 1. The overall pipeline of Pose2Mesh.

To resolve the above issues, we propose Pose2Mesh, a graph convolutional system that recovers 3D human pose and mesh from the 2D human pose, in a model-free fashion. It has two advantages over existing methods. First, the input 2D human pose makes the proposed system free from the overfitting related to image appearance, while providing essential geometric information on the human articulation. In addition, the 2D human pose can be estimated accurately from in-the-wild images, since many well-performing methods  [9, 38, 52, 59] are trained on large-scale in-the-wild 2D human pose datasets  [1, 30]. The second advantage is that Pose2Mesh avoids the representation issues of the pose parameters, while exploiting the human mesh topology (i.e., face and edge information). It directly regresses the 3D coordinates of mesh vertices using a graph convolutional neural network (GraphCNN) with graphs constructed from the mesh topology.

We designed Pose2Mesh in a cascaded architecture, which consists of PoseNet and MeshNet. PoseNet lifts the 2D human pose to the 3D human pose. MeshNet takes both 2D and 3D human poses to estimate the 3D human mesh in a coarse-to-fine manner. During the forward propagation, the mesh features are initially processed in a coarse resolution and gradually upsampled to a fine resolution. Figure 1 depicts the overall pipeline of the system.

The experimental results show that the proposed Pose2Mesh outperforms the previous state-of-the-art 3D human pose and mesh estimation methods  [23, 27, 28] on various publicly available 3D human body and hand datasets  [19, 33, 62]. Particularly, our Pose2Mesh achieves the state-of-the-art result on the in-the-wild dataset  [33], even when it is trained only on the controlled-setting dataset  [19].

We summarize our contributions as follows.

  • We propose a novel system, Pose2Mesh, that recovers 3D human pose and mesh from the 2D human pose. It is free from overfitting to image appearance, and thus generalizes well on in-the-wild data.

  • Our Pose2Mesh directly regresses 3D coordinates of a human mesh using GraphCNN. It avoids representation issues of the model parameters and leverages the pre-defined mesh topology.

  • We show that Pose2Mesh outperforms previous 3D human pose and mesh estimation methods on various publicly available datasets.

2 Related Works

3D Human Body Pose Estimation. Current 3D human body pose estimation methods can be categorized into two approaches according to the input type: an image-based approach and a 2D pose-based approach. The image-based approach takes an RGB image as an input for 3D body pose estimation. Sun et al.  [53] proposed to use compositional loss, which exploits the joint connection structure. Sun et al.  [54] employed soft-argmax operation to regress the 3D coordinates of body joints in a differentiable way. Sharma et al.  [50] incorporated a generative model and depth ordering of joints to predict the most reliable 3D pose that corresponds to the estimated 2D pose.

The 2D pose-based approach lifts the 2D human pose to the 3D space. Martinez et al.  [34] introduced a simple network that consists of consecutive fully-connected layers, which lifts the 2D human pose to the 3D space. Zhao et al.  [60] developed a semantic GraphCNN to use spatial relationships between joint coordinates. Our work follows the 2D pose-based approach, to make Pose2Mesh more robust to the domain difference between the controlled environment of the training set and the in-the-wild environment of the testing set.

3D Human Body and Hand Pose and Mesh Estimation. A model-based approach trains a neural network to estimate the human mesh model parameters  [32, 48]. It has been widely used for 3D human mesh estimation, since it does not necessarily require 3D annotations for mesh supervision. Pavlakos et al.  [46] proposed a system that can be supervised only by 2D joint coordinates and silhouettes. Omran et al.  [41] trained a network with 2D joint coordinates, which takes human part segmentation as input. Kanazawa et al.  [23] utilized an adversarial loss to regress plausible SMPL parameters. Baek et al.  [4] trained a CNN to estimate parameters of the MANO model using a neural renderer  [25]. Kolotouros et al.  [27] introduced a self-improving system that consists of an SMPL parameter regressor and an iterative fitting framework  [5].

Recently, the advance of fitting frameworks  [5, 44] has motivated a model-free approach, which estimates human mesh coordinates directly. It enabled researchers to obtain 3D mesh annotation, which is essential for the model-free methods, from in-the-wild data. Kolotouros et al.  [28] proposed a GraphCNN, which learns the deformation of the template body mesh to the target body mesh. Ge et al.  [13] adopted a GraphCNN to estimate vertices of hand mesh. Moon et al.  [40] proposed a new heatmap representation, called lixel, to recover 3D human meshes.

Our Pose2Mesh differs from the above methods, which are image-based, in that it uses the 2D human pose as an input. It can benefit from data with 3D annotations, which are captured in controlled settings  [19, 22], without overfitting to the image appearance.

GraphCNN for Mesh Processing. Recently, many methods consider a mesh as a graph structure and process it using the GraphCNN, since it can fully exploit mesh topology compared with simple stacked fully-connected layers. Wang et al.  [58] adopted a GraphCNN to learn a deformation from an initial ellipsoid mesh to the target object mesh in a coarse-to-fine manner. Verma et al.  [56] proposed a novel graph convolution operator and evaluated it on the shape correspondence problem. Ranjan et al.  [47] also proposed a GraphCNN-based VAE, which learns a latent space of the human face meshes in a hierarchical manner.

3 PoseNet

3.1 Synthesizing Errors on the Input 2D Pose

PoseNet estimates the root joint-relative 3D pose \(\mathbf {P}^{\text {3D}} \in \mathbb {R}^{J \times 3}\) from the 2D pose, where J denotes the number of human joints. We define the root joint of the human body and hand as pelvis and wrist, respectively. However, the estimated 2D pose often contains errors  [49], especially under severe occlusions or challenging poses. To make PoseNet robust to the errors, we synthesize 2D input poses by adding realistic errors on the ground truth 2D pose, following  [38, 39], during the training stage. We represent the estimated 2D pose or the synthesized 2D pose as \(\mathbf {P}^{\text {2D}} \in \mathbb {R}^{J \times 2}\).
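As a simple illustration of this idea, the sketch below adds Gaussian jitter to every joint and occasionally displaces a joint by a large offset. The function name and the error statistics are assumptions made for illustration only; the actual synthesis in the paper follows the realistic error distributions of  [38, 39].

```python
import torch

def synthesize_2d_pose_errors(gt_pose_2d, jitter_std=2.0,
                              misplace_prob=0.05, misplace_std=30.0):
    """gt_pose_2d: (J, 2) groundtruth 2D joint coordinates in pixels."""
    # small jitter on every joint, imitating localization noise
    noisy = gt_pose_2d + torch.randn_like(gt_pose_2d) * jitter_std
    # occasionally move a joint far away, imitating detector failures under occlusion
    mask = (torch.rand(gt_pose_2d.shape[0], 1) < misplace_prob).float()
    return noisy + mask * torch.randn_like(gt_pose_2d) * misplace_std
```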

3.2 2D Input Pose Normalization

We apply standard normalization to \(\mathbf {P}^{\text {2D}}\), following  [39, 57]. For this, we subtract the mean from \(\mathbf {P}^{\text {2D}}\) and divide it by the standard deviation, which yields \(\bar{\mathbf {P}}^{\text {2D}}\). The mean and the standard deviation of \(\mathbf {P}^{\text {2D}}\) represent the 2D location and scale of the subject, respectively. This normalization is necessary because \(\mathbf {P}^{\text {3D}}\) is independent of the scale and location of the 2D input pose \(\mathbf {P}^{\text {2D}}\).
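A minimal sketch of this normalization is given below; computing the statistics per axis is an assumption about the exact implementation.

```python
import torch

def normalize_2d_pose(pose_2d, eps=1e-8):
    """pose_2d: (J, 2) estimated or synthesized 2D joints."""
    mean = pose_2d.mean(dim=0, keepdim=True)   # 2D location of the subject
    std = pose_2d.std(dim=0, keepdim=True)     # 2D scale of the subject
    return (pose_2d - mean) / (std + eps)
```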

3.3 Network Architecture

The architecture of PoseNet is based on that of  [34, 39]. The normalized 2D input pose \(\bar{{\mathbf {P}}}^{\text {2D}}\) is converted to a 4096-dimensional feature vector through a fully-connected layer. Then, it is fed to two residual blocks  [18]. Finally, the output feature vector of the residual blocks is converted to a (3J)-dimensional vector, which represents \(\mathbf {P}^{\text {3D}}\), by a fully-connected layer.
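The following PyTorch sketch reflects this description; the layer details inside the residual blocks, such as batch normalization and the activation function, are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=4096):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(inplace=True))

    def forward(self, x):
        return x + self.block(x)

class PoseNet(nn.Module):
    def __init__(self, num_joints):
        super().__init__()
        self.num_joints = num_joints
        self.fc_in = nn.Linear(num_joints * 2, 4096)     # 2D pose -> 4096-dim feature
        self.blocks = nn.Sequential(ResidualBlock(), ResidualBlock())
        self.fc_out = nn.Linear(4096, num_joints * 3)    # feature -> 3J coordinates

    def forward(self, pose_2d):                          # pose_2d: (B, J, 2), normalized
        x = self.fc_in(pose_2d.flatten(1))
        x = self.blocks(x)
        return self.fc_out(x).view(-1, self.num_joints, 3)  # root-relative 3D pose
```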

3.4 Loss Function

We train the PoseNet by minimizing the L1 distance between the predicted 3D pose \(\mathbf {P}^{\text {3D}}\) and the groundtruth. The loss function \(L_{\text {pose}}\) is defined as follows:

$$\begin{aligned} L_{\text {pose}} = \Vert \mathbf {P}^{\text {3D}} - {\mathbf {P}^{\text {3D}}}^{*}\Vert _1, \end{aligned}$$
(1)

where the asterisk indicates the groundtruth.

Fig. 2. The coarsening process initially generates multiple coarse graphs from \(\mathcal {G}_\text {M}\), and adds fake nodes without edges to each graph, following  [11]. The numbers of vertices range from 96 to 12288 and from 68 to 1088, for body and hand meshes, respectively.

Fig. 3. The network architecture of MeshNet.

4 MeshNet

4.1 Graph Convolution on Pose

MeshNet concatenates \(\bar{\mathbf {P}}^{\text {2D}}\) and \(\mathbf {P}^{\text {3D}}\) into \(\mathbf {P} \in \mathbb {R}^{J \times 5}\). Then, it estimates the root joint-relative 3D mesh \(\mathbf {M} \in \mathbb {R}^{V \times 3}\) from \(\mathbf {P}\), where V denotes the number of human mesh vertices. To this end, MeshNet uses the spectral graph convolution  [7, 51], which can be defined as the multiplication of a signal \(x \in \mathbb {R}^N\) with a filter \(g_\theta =\textit{diag}(\theta )\) in Fourier domain as follows:

$$\begin{aligned} g_\theta * x = U g_\theta U ^T x, \end{aligned}$$
(2)

where the graph Fourier basis \( U \) is the matrix of the eigenvectors of the normalized graph Laplacian \( L \)  [10], and \( U ^Tx\) denotes the graph Fourier transform of x. Specifically, to reduce the computational complexity, we design MeshNet based on the Chebyshev spectral graph convolution  [11].

Graph Construction. We construct a graph of \(\mathbf {P}\), \(\mathcal {G}_\text {P}=(\mathcal {V}_\text {P}, A _\text {P})\), where \(\mathcal {V}_\text {P}=\mathbf {P}=\{\mathbf {p}_i\}^{J}_{i=1}\) is a set of J human joints, and \( A _\text {P} \in \{0,1\}^{J \times J}\) is an adjacency matrix. \( A _\text {P}\) defines the edge connections between the joints based on the human skeleton and symmetrical relationships  [8], where \(( A _\text {P})_{ij}=1\) if joints i and j are the same or connected, and \(( A _\text {P})_{ij}=0\) otherwise. The normalized Laplacian is computed as \( L _{\text {P}}= I _J - D ^{-1/2}_{\text {P}} A _{\text {P}} D ^{-1/2}_{\text {P}}\), where \( I _J\) is the identity matrix, and \( D _{\text {P}}\) is the diagonal degree matrix of \(\mathcal {V}_\text {P}\) with \(( D _{\text {P}})_{ii}=\sum _j ( A _\text {P})_{ij}\). The scaled Laplacian is computed as \(\tilde{ L _{\text {P}}}=2 L _{\text {P}}/\lambda _{\text {max}}- I _J\), where \(\lambda _{\text {max}}\) denotes the largest eigenvalue of \( L _{\text {P}}\).
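A minimal sketch of this construction, assuming a dense adjacency matrix and an illustrative (not the paper's exact) edge list:

```python
import torch

def build_pose_graph(num_joints, edges):
    """edges: list of (i, j) joint index pairs from the skeleton and symmetry relations."""
    A = torch.eye(num_joints)                                  # self-connections: (A)_ii = 1
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0                                # undirected edges
    deg = A.sum(dim=1)                                         # degree of each joint
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    L = torch.eye(num_joints) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized Laplacian
    lmax = torch.linalg.eigvalsh(L).max()                      # largest eigenvalue
    L_scaled = 2.0 * L / lmax - torch.eye(num_joints)          # scaled Laplacian
    return A, L_scaled
```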

Spectral Convolution on Graph. Then, MeshNet performs the spectral graph convolution on \(\mathcal {G}_\text {P}\), which is defined as follows:

$$\begin{aligned} F _{\text {out}} = \sum _{k=0}^{K-1} T _k \big (\tilde{ L _{\text {P}}}\big ) F _{\text {in}} \varTheta _k, \end{aligned}$$
(3)

where \( F _{\text {in}} \in \mathbb {R}^{J \times f_{ \text {in}}}\) and \( F _{\text {out}} \in \mathbb {R}^{J \times f_{\text {out}}}\) are the input and output feature maps respectively, \( T _k \big (x\big )=2x T _{k-1} \big (x\big )- T _{k-2} \big (x\big )\) is the Chebyshev polynomial  [15] of order k, with \( T _0 \big (x\big )=1\) and \( T _1 \big (x\big )=x\), and \( \varTheta _k \in \mathbb {R}^{f_{\text {in}} \times f_{\text {out}}}\) is the kth Chebyshev coefficient matrix, whose elements are the trainable parameters of the graph convolutional layer. \(f_{\text {in}}\) and \(f_{\text {out}}\) are the input and output feature dimensions respectively. The initial input feature map \( F _{\text {in}}\) is \(\mathbf {P}\) in practice, where \(f_{\text {in}}=5\). This graph convolution is K-localized, which means at most K-hop neighbor nodes from each node are affected  [11, 26], since it is a K-order polynomial in the Laplacian. Our MeshNet sets \(K=3\) for all graph convolutional layers following  [13].
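A minimal sketch of the graph convolutional layer of Eq. 3, assuming a dense scaled Laplacian; the weight initialization and the absence of a bias term are assumptions.

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, f_in, f_out, K=3):
        super().__init__()
        self.K = K
        # K Chebyshev coefficient matrices Theta_k of shape (f_in, f_out)
        self.theta = nn.Parameter(torch.randn(K, f_in, f_out) * 0.01)

    def forward(self, F_in, L_scaled):
        # F_in: (N, f_in) node features, L_scaled: (N, N) scaled graph Laplacian
        Tx = [F_in, L_scaled @ F_in]                       # T_0(L) F, T_1(L) F
        for _ in range(2, self.K):
            Tx.append(2 * L_scaled @ Tx[-1] - Tx[-2])      # Chebyshev recurrence
        return sum(Tx[k] @ self.theta[k] for k in range(self.K))
```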

4.2 Coarse-to-fine Mesh Upsampling

We gradually upsample \(\mathcal {G}_\text {P}\) to the graph of \(\mathbf {M}\), \(\mathcal {G}_\text {M}=(\mathcal {V}_\text {M}, A _\text {M})\), where \(\mathcal {V}_\text {M}=\mathbf {M}=\{\mathbf {m}_i\}^{V}_{i=1}\) is a set of V human mesh vertices, and \( A _\text {M} \in \{0,1\}^{V \times V}\) is an adjacency matrix defining edges of the human mesh. To this end, we apply the graph coarsening  [12] technique to \(\mathcal {G}_\text {M}\), which creates various resolutions of graphs, \(\{\mathcal {G}_\text {M}^c=(\mathcal {V}_\text {M}^c, A _\text {M}^c)\}_{c=0}^C\), where C denotes the number of coarsening steps, following Defferrard et al.  [11]. Figure 2 shows the coarsening process and a balanced binary tree structure of mesh graphs, where the ith vertex in \(\mathcal {G}_\text {M}^{c+1}\) is a parent node of the \(2i-1\)th and 2ith vertices in \(\mathcal {G}_\text {M}^{c}\), and \(2|\mathcal {V}_\text {M}^{c+1}|=|\mathcal {V}_\text {M}^c|\). i starts from 1. The final output of MeshNet is \(\mathcal {V}_\text {M}\), which is converted from \(\mathcal {V}_\text {M}^0\) by a pre-defined indices mapping. During the forward propagation, MeshNet first upsamples the \(\mathcal {G}_\text {P}\) to the coarsest mesh graph \(\mathcal {G}_\text {M}^C\) by reshaping and a fully-connected layer. Then, it performs the spectral graph convolution on each resolution of mesh graphs as follows:

$$\begin{aligned} F _{\text {out}} = \sum _{k=0}^{K-1} T _k \big (\tilde{ L _{\text {M}}^c}\big ) F _{\text {in}} \varTheta _k, \end{aligned}$$
(4)

where \(\tilde{ L _{\text {M}}^c}\) denotes the scaled Laplacian of \(\mathcal {G}_\text {M}^c\), and the other notations are defined in the same manner as Eq. 3. Following  [13], MeshNet performs mesh upsampling by copying features of each parent vertex in \(\mathcal {G}_\text {M}^{c+1}\) to the corresponding children vertices in \(\mathcal {G}_\text {M}^c\). The upsampling process is defined as follows:

$$\begin{aligned} F _c = \psi ( F _{c+1}^T)^T, \end{aligned}$$
(5)

where \( F _{c} \in \mathbb {R}^{|\mathcal {V}_\text {M}^{c}| \times f_{c}}\) is the first feature map of \(\mathcal {G}_\text {M}^{c}\), \( F _{c+1} \in \mathbb {R}^{|\mathcal {V}_\text {M}^{c+1}| \times f_{c+1}}\) is the last feature map of \(\mathcal {G}_\text {M}^{c+1}\), \(\psi :\mathbb {R}^{f_{c+1} \times |\mathcal {V}_\text {M}^{c+1}|} \rightarrow \mathbb {R}^{f_{c+1} \times |\mathcal {V}_\text {M}^c|}\) denotes a nearest-neighbor upsampling function, and \(f_{c}\) and \(f_{c+1}\) are the feature dimensions of vertices in \( F _{c}\) and \( F _{c+1}\) respectively. The nearest-neighbor upsampling function copies the feature of the ith vertex in \(\mathcal {G}_\text {M}^{c+1}\) to the \(2i-1\)th and 2ith vertices in \(\mathcal {G}_\text {M}^c\). To facilitate the learning process, we additionally incorporate a residual connection between each resolution. Figure 3 shows the overall architecture of MeshNet.
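Because of the balanced binary tree structure, the nearest-neighbor upsampling of Eq. 5 reduces to copying each parent row twice; a minimal sketch:

```python
import torch

def upsample_mesh_features(F_coarse):
    """F_coarse: (V_coarse, f) features of the coarser graph -> (2 * V_coarse, f)."""
    # repeat_interleave copies row i to rows 2i and 2i+1 (0-based), which is exactly
    # the parent-to-children copy of the balanced binary tree described above.
    return torch.repeat_interleave(F_coarse, repeats=2, dim=0)
```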

4.3 Loss Function

To train our MeshNet, we use four loss functions.

Vertex Coordinate Loss. We minimize the L1 distance between the predicted 3D mesh coordinates \(\mathbf {M}\) and the groundtruth, which is defined as follows:

$$\begin{aligned} L_{\text {vertex}} = \Vert \mathbf {M} - \mathbf {M}^{*}\Vert _1, \end{aligned}$$
(6)

where the asterisk indicates the groundtruth.

Joint Coordinate Loss. We use an L1 loss between the groundtruth root-relative 3D pose and the 3D pose regressed from \(\mathbf {M}\), to train our MeshNet to estimate mesh vertices aligned with joint locations. The 3D pose is calculated as \(\mathcal {J}\mathbf {M}\), where \(\mathcal {J} \in \mathbb {R}^{J \times V}\) is a joint regression matrix defined in the SMPL or MANO model. The loss function is defined as follows:

$$\begin{aligned} L_{\text {joint}} = \Vert \mathcal {J}\mathbf {M} - {\mathbf {P}^{\text {3D}}}^{*}\Vert _1, \end{aligned}$$
(7)

where the asterisk indicates the groundtruth.

Surface Normal Loss. We supervise the normal vectors of the output mesh surface to be consistent with the groundtruth. This consistency loss improves surface smoothness and local details  [58]. The loss function \(L_{\text {normal}}\) is defined as follows:

$$\begin{aligned} L_{\text {normal}} = \sum _{f} \sum _{\{i,j\} \subset f} \Big |\Big \langle \frac{\mathbf {m}_{i} - \mathbf {m}_{j}}{\Vert \mathbf {m}_{i} - \mathbf {m}_{j}\Vert _2}, n^*_f \Big \rangle \Big |, \end{aligned}$$
(8)

where f and \(n^*_f\) denote a triangle face in the human mesh and a groundtruth unit normal vector of f, respectively. \(\mathbf {m}_i\) and \(\mathbf {m}_j\) denote the ith and jth vertices in f.
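A minimal sketch of Eq. 8, assuming the groundtruth face normals are computed from the groundtruth mesh by a cross product; the reduction (a plain sum over faces and edges) follows the equation.

```python
import torch

def normal_loss(mesh_pred, mesh_gt, faces, eps=1e-8):
    """mesh_pred, mesh_gt: (V, 3) vertices; faces: (F, 3) long tensor of vertex indices."""
    v0, v1, v2 = mesh_gt[faces[:, 0]], mesh_gt[faces[:, 1]], mesh_gt[faces[:, 2]]
    n_gt = torch.cross(v1 - v0, v2 - v0, dim=1)
    n_gt = n_gt / (n_gt.norm(dim=1, keepdim=True) + eps)        # groundtruth unit normals

    loss = 0.0
    for a, b in [(0, 1), (0, 2), (1, 2)]:                        # the three edges of each face
        e = mesh_pred[faces[:, a]] - mesh_pred[faces[:, b]]
        e = e / (e.norm(dim=1, keepdim=True) + eps)              # predicted unit edge direction
        loss = loss + (e * n_gt).sum(dim=1).abs().sum()          # |<edge, gt normal>|
    return loss
```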

Surface Edge Loss. We define an edge length consistency loss between predicted and groundtruth edges, following  [58]. The edge loss is effective in recovering the smoothness of hands, feet, and the mouth, which have dense vertices. The loss function \(L_{\text {edge}}\) is defined as follows:

$$\begin{aligned} L_{\text {edge}} = \sum _{f} \sum _{\{i,j\} \subset f} | \Vert \mathbf {m}_{i} - \mathbf {m}_{j}\Vert _2 - \Vert \mathbf {m}^{*}_{i} - \mathbf {m}^{*}_{j}\Vert _2 |, \end{aligned}$$
(9)

where f and the asterisk denote a triangle face in the human mesh and the groundtruth, respectively. \(\mathbf {m}_i\) and \(\mathbf {m}_j\) denote the ith and jth vertices in f.
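A minimal sketch of Eq. 9, iterating over the three edges of each triangle face:

```python
import torch

def edge_loss(mesh_pred, mesh_gt, faces):
    """mesh_pred, mesh_gt: (V, 3) vertices; faces: (F, 3) long tensor of vertex indices."""
    loss = 0.0
    for a, b in [(0, 1), (0, 2), (1, 2)]:                        # the three edges of each face
        len_pred = (mesh_pred[faces[:, a]] - mesh_pred[faces[:, b]]).norm(dim=1)
        len_gt = (mesh_gt[faces[:, a]] - mesh_gt[faces[:, b]]).norm(dim=1)
        loss = loss + (len_pred - len_gt).abs().sum()            # | ||e|| - ||e*|| |
    return loss
```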

We define the total loss of our MeshNet, \(L_{\text {mesh}}\), as a weighted sum of all four loss functions:

$$\begin{aligned} L_{\text {mesh}} = \lambda _\text{ v } L_{\text {vertex}} + \lambda _\text{ j } L_{\text {joint}} + \lambda _\text{ n } L_{\text {normal}} + \lambda _\text{ e } L_{\text {edge}}, \end{aligned}$$
(10)

where \(\lambda _\text{ v }=1,\;\lambda _\text{ j }=1,\;\lambda _\text{ n }=0.1,\) and \(\lambda _\text{ e }=20\).
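Putting the four terms together with the weights above (and reusing the normal_loss and edge_loss sketches), a minimal sketch of Eq. 10 could read as follows; the L1 terms are written as plain sums to match the equations, although the exact reduction is an implementation detail we do not claim.

```python
def mesh_loss(mesh_pred, mesh_gt, pose3d_gt, J_reg, faces, lambda_e=20.0):
    """J_reg: (J, V) joint regression matrix from the SMPL/MANO model."""
    L_vertex = (mesh_pred - mesh_gt).abs().sum()             # Eq. 6
    L_joint = (J_reg @ mesh_pred - pose3d_gt).abs().sum()    # Eq. 7
    L_normal = normal_loss(mesh_pred, mesh_gt, faces)        # Eq. 8
    L_edge = edge_loss(mesh_pred, mesh_gt, faces)            # Eq. 9
    return 1.0 * L_vertex + 1.0 * L_joint + 0.1 * L_normal + lambda_e * L_edge
```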

5 Implementation Details

PyTorch  [43] is used for implementation. We first pre-train our PoseNet, and then train the whole network, Pose2Mesh, in an end-to-end manner. Empirically, this two-step training strategy gives better performance than one-step training. The weights are updated by the RMSprop optimizer  [55] with a mini-batch size of 64. We pre-train PoseNet for 60 epochs with a learning rate of \(10^{-3}\), which is reduced by a factor of 10 after the 30th epoch. After integrating the pre-trained PoseNet into Pose2Mesh, we train the whole network for 15 epochs with a learning rate of \(10^{-3}\), which is reduced by a factor of 10 after the 12th epoch. In addition, we set \(\lambda_\text{e}\) to 0 until the 7th epoch of the second training stage, since it tends to cause local optima at the early training phase. We used four NVIDIA RTX 2080 Ti GPUs for Pose2Mesh training, which took between half a day and two and a half days, depending on the training datasets. At inference time, we use 2D pose outputs from Sun et al.  [52] and Xiao et al.  [59]. They run at 5 fps and 67 fps respectively, and our Pose2Mesh runs at 37 fps. Thus, the proposed system can process from 4 fps to 22 fps in practice, which shows its applicability to real-time applications.
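A minimal sketch of this two-stage schedule; the data loaders, models, and loss functions are assumed to exist elsewhere.

```python
import torch

def train_stage(model, loader, loss_fn, num_epochs, decay_epoch):
    # RMSprop, mini-batch size 64 (set in the loader), learning rate 1e-3,
    # reduced by a factor of 10 after `decay_epoch` epochs.
    opt = torch.optim.RMSprop(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[decay_epoch], gamma=0.1)
    for epoch in range(num_epochs):
        for inputs, targets in loader:
            opt.zero_grad()
            loss_fn(model(inputs), targets).backward()
            opt.step()
        sched.step()

# Stage 1: pre-train PoseNet for 60 epochs, decaying the learning rate after the 30th epoch.
# Stage 2: train the whole Pose2Mesh for 15 epochs, decaying after the 12th epoch.
```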

6 Experiment

6.1 Dataset and Evaluation Metric

Human3.6M. Human3.6M  [19] is a large-scale indoor 3D body pose benchmark, which consists of 3.6M video frames. The groundtruth 3D poses are obtained using a motion capture system, but there are no groundtruth 3D meshes. As a result, for 3D mesh supervision, most of the previous 3D pose and mesh estimation works  [23, 27, 28] used pseudo-groundtruth obtained from Mosh  [31]. However, because of a license issue, the pseudo-groundtruth from Mosh is not currently publicly accessible. Thus, we generate new pseudo-groundtruth 3D meshes by fitting SMPL parameters to the 3D groundtruth poses using SMPLify-X  [44]. For a fair comparison, we trained and tested the previous state-of-the-art methods on this pseudo-groundtruth using their officially released code. Following  [23, 45], all methods are trained on 5 subjects (S1, S5, S6, S7, S8) and tested on 2 subjects (S9, S11).

We report our performance for the 3D pose using two evaluation metrics. One is mean per joint position error (MPJPE)  [19], which measures the Euclidean distance in millimeters between the estimated and groundtruth joint coordinates, after aligning the root joint. The other one is PA-MPJPE, which calculates MPJPE after further alignment (i.e., Procrustes analysis (PA)  [14]). \(\mathcal {J}\mathbf {M}\) is used for the estimated joint coordinates. We only evaluate 14 joints out of 17 estimated joints following  [23, 27, 28, 46].
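A minimal sketch of MPJPE with root alignment; PA-MPJPE additionally applies a rigid Procrustes alignment between the prediction and the groundtruth before measuring the same distance (omitted here).

```python
import torch

def mpjpe(pred, gt, root_idx=0):
    """pred, gt: (J, 3) joint coordinates in millimeters."""
    pred = pred - pred[root_idx:root_idx + 1]   # align the root joint of the prediction
    gt = gt - gt[root_idx:root_idx + 1]         # align the root joint of the groundtruth
    return (pred - gt).norm(dim=1).mean()       # mean Euclidean distance over joints
```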

3DPW. 3DPW  [33] is captured from in-the-wild and contains 3D body pose and mesh annotations. It consists of 51K video frames, and IMU sensors are leveraged to acquire the groundtruth 3D pose and mesh. We only use the test set of 3DPW for evaluation following  [27]. MPJPE and mean per vertex position error (MPVPE) are used for evaluation. 14 joints from \(\mathcal {J}\mathbf {M}\), whose joint set follows that of Human3.6M, are evaluated for MPJPE as above. MPVPE measures the Euclidean distance in millimeters between the estimated and groundtruth vertex coordinates, after aligning the root joint.

COCO. COCO  [30] is an in-the-wild dataset with various 2D annotations such as detection and human joints. To exploit this dataset on 3D mesh learning, Kolotouros et al.  [27] fitted SMPL parameters to 2D joints using SMPLify  [5]. Following them, we use the processed data for training.

MuCo-3DHP. MuCo-3DHP  [36] is synthesized from the existing MPI-INF-3DHP 3D single-person pose estimation dataset  [35]. It consists of 200K frames, and half of them have augmented backgrounds. For the background augmentation, we use COCO images that do not include humans, following Moon et al.  [37]. Following them, we use this dataset only for training.

FreiHAND. FreiHAND  [62] is a large-scale 3D hand pose and mesh dataset. It consists of a total of 134K frames for training and testing. Following Zimmermann et al.  [62], we report PA-MPVPE, F-scores, and additionally PA-MPJPE of Pose2Mesh. \(\mathcal {J}\mathbf {M}\) is evaluated for the joint errors.

Table 1. The performance comparison between four combinations of regression target and network design tested on Human3.6M. ‘no. param.’ denotes the number of parameters of a network, which estimates SMPL parameters or vertex coordinates from the output of PoseNet.

6.2 Ablation Study

To analyze each component of the proposed system, we trained different networks on Human3.6M, and evaluated on Human3.6M and 3DPW. The test 2D input poses used in Human3.6M and 3DPW evaluation are outputs from Integral Regression  [54] and HRNet  [52] respectively, using groundtruth bounding boxes.

Regression Target and Network Design. To demonstrate the effectiveness of regressing the 3D mesh vertex coordinates using GraphCNN, we compare MPJPE and PA-MPJPE of four different combinations of regression target and network design in Table 1. First, vertex-GraphCNN, our Pose2Mesh, substantially improves the joint errors compared to vertex-FC, which regresses vertex coordinates with a network of fully-connected layers. This proves the importance of exploiting the human mesh topology with GraphCNN when estimating the 3D vertex coordinates. Second, vertex-GraphCNN provides better performance than both networks estimating SMPL parameters, while having considerably fewer network parameters. Taken together, the effectiveness of our mesh coordinate regression scheme using GraphCNN is clearly justified.

In this comparison, the same PoseNet and cascaded architecture are employed for all networks. On top of the PoseNet, vertex-FC and param-FC used a series of fully-connected layers, whereas param-GraphCNN added fully-connected layers on top of Pose2Mesh. For the fair comparison, when training param-FC and param-GraphCNN, we also supervised the reconstructed mesh from the predicted SMPL parameters with \(L_{\text {vertex}}\) and \(L_{\text {joint}}\). The networks estimating SMPL parameters incorporated Zhou et al.’s method  [61] for continuous rotations following  [27].

Table 2. The performance comparison on Human3.6M between two upsampling schemes. GPU mem. and fps denote the required memory during training and the frames per second at inference time, respectively.
Table 3. The MPJPE comparison between four architectures tested on 3DPW.
Table 4. The upper bounds of the two different graph convolutional networks that take a 2D pose and a 3D pose. Tested on Human3.6M.

Coarse-to-Fine Mesh Upsampling. We compare a coarse-to-fine mesh upsampling scheme and a direct mesh upsampling scheme. The direct upsampling method performs graph convolution on the lowest resolution mesh until the middle layer of MeshNet, and then directly upsamples it to the highest one (e.g., 96 to 12288 for the human body mesh). While it has the same number of graph convolution layers and almost the same number of parameters, our coarse-to-fine model consumes half as much GPU memory and runs 1.5 times faster than the direct upsampling method. It is because graph convolution on the highest resolution takes much more time and memory than graph convolution on lower resolutions. In addition, the coarse-to-fine upsampling method provides a slightly lower joint error, as shown in Table 2. These results confirm the effectiveness of our coarse-to-fine upsampling strategy.

Cascaded Architecture Analysis. We analyze the cascaded architecture of Pose2Mesh to demonstrate its validity in Table 3. To be specific, we construct (a) a GraphCNN that directly takes a 2D pose, (b) a cascaded network that predicts mesh coordinates from the 3D pose of a pretrained PoseNet, and (c) our Pose2Mesh. All methods are trained with synthesized 2D poses. First, (a) outperforms (b), which implies that the 3D pose output from PoseNet may lose geometric information present in the 2D input pose. If we concatenate the 3D pose output with the 2D input pose, as in (c), we obtain the lowest errors. This indicates that depth information in the 3D pose can positively affect 3D mesh estimation.

To further verify the superiority of the cascaded architecture, we explore the upper bounds of (a) and of (d), a GraphCNN that takes a 3D pose, in Table 4. To this end, we fed the groundtruth 2D pose and 3D pose to (a) and (d) as test inputs, respectively. Since the input 3D pose contains additional depth information, the upper bound of (d) is considerably higher than that of (a). We also fed state-of-the-art 3D pose outputs from  [37] to (d), to validate the practical potential for performance improvement. Surprisingly, the performance is comparable to the upper bound of (a). Thus, our Pose2Mesh will substantially outperform (a), a graph convolutional network that directly takes a 2D pose, if we can improve the performance of PoseNet.

In summary, the above results prove the validity of our cascaded architecture of Pose2Mesh.

Table 5. The accuracy comparison between state-of-the-art methods and Pose2Mesh on Human3.6M. The dataset names on top are training sets.
Table 6. The accuracy comparison between state-of-the-art methods and Pose2Mesh on 3DPW. The dataset names on top are training sets.

6.3 Comparison with State-of-the-art Methods

Human3.6M. We compare our Pose2Mesh with the previous state-of-the-art 3D body pose and mesh estimation methods on Human3.6M in Table 5. First, when we train all methods only on Human3.6M, our Pose2Mesh significantly outperforms the other methods. However, when the methods are additionally trained on COCO, the performance of the previous baselines increases, while that of Pose2Mesh slightly decreases. The performance gain of the other methods is a well-known phenomenon  [54] among image-based methods, which tend to generalize better when trained with diverse in-the-wild images. In contrast, our Pose2Mesh does not benefit from more images in the same manner, since it only takes the 2D pose as input. We attribute the performance drop to the fact that the test set and train set of Human3.6M contain similar poses, drawn from the same action categories; overfitting the network to the poses of Human3.6M can therefore lead to better accuracy on its test set. Nevertheless, in both cases, our Pose2Mesh outperforms the previous methods in both MPJPE and PA-MPJPE. The test 2D input poses for Pose2Mesh are estimated by the method of Sun et al.  [54] trained on the MPII dataset  [2], using groundtruth bounding boxes.

3DPW. We compare MPJPE, PA-MPJPE, and MPVPE of our Pose2Mesh with the previous state-of-the-art 3D body pose and mesh estimation works on 3DPW, which is an in-the-wild dataset, in Table 6. First, when the image-based methods are trained only on Human3.6M, they give extremely high errors. This is because the image-based methods are overfitted to the image appearance of Human3.6M. Indeed, since Human3.6M is an indoor dataset captured in a controlled setting, its image features are very different from in-the-wild image features. On the other hand, since our Pose2Mesh takes a 2D human pose as input, it does not overfit to a particular image appearance. As a result, the proposed system gives far better performance on in-the-wild images from 3DPW, even when it is trained only on Human3.6M while the other methods are additionally trained on COCO. By utilizing the accurate 3D annotations of lab-recorded 3D datasets  [19] without image appearance overfitting, Pose2Mesh does not require 3D data captured in-the-wild. This property can reduce the data capture burden significantly, because capturing 3D data in-the-wild is very challenging. The test 2D input poses for Pose2Mesh are estimated by HRNet  [52] and Simple Baselines  [59] trained on COCO, using groundtruth bounding boxes. The average precision (AP) of  [52] and  [59] is 85.1 and 82.8 on the 3DPW test set, and 72.1 and 70.4 on the COCO validation set, respectively.

Table 7. The accuracy comparison between state-of-the-art methods and Pose2Mesh on FreiHAND.

FreiHAND. We present the comparison between our Pose2Mesh and other state-of-the-art 3D hand pose and mesh estimation works in Table 7. The proposed system outperforms other methods in various metrics, including PA-MPVPE and F-scores. The test 2D input poses for Pose2Mesh are estimated by HRNet  [52] trained on FreiHAND  [62], using bounding boxes from Mask R-CNN  [17] with ResNet-50 backbone  [18].

Comparison with Different Train Sets. We report MPJPE and PA-MPJPE of Pose2Mesh trained on Human3.6M, COCO, and MuCo-3DHP, and of other methods trained on different train sets, in Table 8. The train sets include Human3.6M, COCO, MPII  [2], LSP  [20], LSP-Extended  [21], UP  [29], and MPI-INF-3DHP  [35]; each method is trained on a different subset of them. In the table, the errors of  [23, 27, 28] decrease by a large margin compared to the errors in Tables 5 and 6. Although this shows that image-based methods can improve their generalizability with weak supervision from in-the-wild 2D pose datasets, Pose2Mesh still provides the lowest errors on 3DPW, which is the in-the-wild benchmark. This suggests that avoiding image appearance overfitting while benefiting from the accurate 3D annotations of controlled-setting datasets is important. We measured the PA-MPJPE of Pose2Mesh on Human3.6M by testing only on the frontal camera set, following the previous works  [23, 27, 28].

Table 8. The accuracy comparison between state-of-the-art methods and Pose2Mesh on Human3.6M and 3DPW. Different train sets are used.
Fig. 4. Qualitative results of the proposed Pose2Mesh. First to third rows: COCO; fourth row: FreiHAND.

Figure 4 shows the qualitative results on COCO validation set and FreiHAND test set. Our Pose2Mesh outputs visually decent human meshes without post-processing, such as model fitting  [28]. More qualitative results can be found in the supplementary material.

7 Discussion

Although the proposed system benefits from the image-appearance-invariant property of the 2D input pose, recovering various 3D shapes solely from the pose could be challenging. While this is partly true, we found that the 2D pose still carries enough information to reason about the corresponding 3D shape to some degree. In the literature, SMPLify  [5] experimentally verified that under the canonical body pose, utilizing the 2D pose significantly reduces the body shape fitting error compared to using the mean body shape. We show that Pose2Mesh can recover various body shapes from the 2D pose in the supplementary material.

8 Conclusion

We propose a novel and general system, Pose2Mesh, for 3D human mesh and pose estimation from a 2D human pose. The 2D input pose enables the system to benefit from data captured in controlled settings without overfitting to the image appearance. The model-free approach using GraphCNN allows it to fully exploit the mesh topology, while avoiding the representation issues of the 3D rotation parameters. We plan to enhance the shape recovery capability of Pose2Mesh using denser keypoints or part segmentation, while maintaining the above advantages.