1 Introduction

Hands play an indispensable role in daily human interactions. Reconstructing 3D hand geometry is of key significance in a variety of computer graphics applications such as computer animation, 3D games, virtual reality (VR), augmented reality (AR) and human–computer interaction [2, 10, 49].

A great number of approaches for estimating 3D hand joint positions have been explored in the literature. Early work generally relies on optical markers [34, 54] or glove techniques [12, 26, 47], which can recover joint positions with high fidelity. Recent developments show that deep neural networks (DNNs) are very promising for reconstructing 3D hand poses from RGB/D images taken by consumer-level cameras [40, 42, 45, 52, 56], although predicting 3D joint positions with accuracy comparable to traditional methods remains challenging for these approaches due to the complexity of hand articulation and self-occlusion.

For most computer graphics applications such as animation and virtual/mixed reality, joint positions must further be converted to motion parameters. Traditional approaches usually leverage inverse kinematics to estimate these parameters, while recent works resort to deep neural networks to regress them. However, the former are usually time-consuming due to the nonlinear optimization, while the latter still need to improve in accuracy.

We propose a stable, accurate and fast method to estimate motion parameters from given 3D hand joint positions. The problem is formulated as a nonlinear optimization with finger movement range constraints. To force the estimate to fall within the hand motion space, we exploit hand biomechanical constraints to restrict the rotation of hand joints to a specific range. We then attack the optimization with a block coordinate descent method by fully decomposing it into a set of optimizations over single joint rotations. Specifically, we alternately optimize each coordinate block along the kinematic chain from the root joint to the leaf joints. When optimizing the motion parameters of a joint, we fix those of all other joints. This not only helps deal with the motion constraints but also makes it possible to derive closed-form solutions. To sum up, our approach makes the following contributions:

  • Biomechanical constraints are introduced to reduce the solution space of hand poses. They not only make it easier to obtain reasonable hand poses but also help accelerate the solution of the optimization.

  • A block coordinate descent algorithm is designed to solve the optimization, which cooperates with the hand motion constraints to fit the MANO model to 2D or 3D joint positions.

  • A variety of experiments show that our method obtains more accurate motion parameters and more reasonable hand poses than state-of-the-art approaches.

2 Related work

A variety of approaches for acquiring 3D hand shape and pose meshes exist in the literature. According to the input data, 3D hand mesh models can be reconstructed from a single image, multiview images, videos, or 3D information (from scanners or glove sensors). We mainly focus on image-based modeling approaches, grouped by whether they involve parametric models or not. For a comprehensive review, one can refer to the survey by Ahmad et al. [1].

2.1 Nonparametric 3D hand reconstruction

2.1.1 Direct hand pose and shape reconstruction

Early methods acquire dense point clouds of hands either by scanning or via multiview image reconstruction. Fewer works in this category are devoted specifically to hand reconstruction [21, 33] because of the difficulty of incorporating hand priors.

The introduction of machine learning approaches has changed this situation [6]. Given a hand image, Kulon et al. [22] employ an encoder to extract its latent code and then generate a 3D hand mesh. Ge et al. [11] build graph CNNs to recover 3D hand models from images. Peng et al. [21, 33] leverage a three-stage, coarse-to-fine GCN to regress the vertex coordinates of the hand mesh. Nevertheless, these approaches are still at the stage of preliminary exploration and usually find it difficult to achieve high accuracy.

2.1.2 Hand pose reconstruction based on database retrieval

By building a database of different 3D hand shapes and poses as well as their different view images, one can retrieve an approximate mesh model from the database for a given image [3, 16, 37]. Miyamoto et al. [28] propose a tree structure to speed up the database retrieval process. In addition, to improve the matching accuracy, Imai et al. [17] introduce a mismatched likelihood index. Wang et al. [47] build an image dataset of hands wearing customized color gloves. However, it is difficult to include all possible poses in a database due to the diversity of hand poses, which makes it hard for this kind of approach to achieve high accuracy.

2.2 3D hand reconstruction based on parametric models

This category assumes a hand parametric model has been built, so only shape and motion parameters need to be estimated to reconstruct 3D hands. In the following, we first briefly recall some hand parametric models and then review the reconstruction methods.

2.2.1 Hand parametric models

Inspired by the idea of linear blend skinning (LBS) for the human body [25], a number of parametric models have been proposed to represent hand shape and pose [49]. For example, Bray et al. [5] create an LBS hand model with a mesh template of 9051 vertices and a skeleton of 30 degrees of freedom. Oikonomidis et al. [30] substitute 37 cylinders and spheres for the mesh template to approximate hand shape and pose for hand tracking. Melax et al. [27] further reduce the geometry to convex polyhedra for fast skeletal hand tracking.

Wheatland et al. [48] perform PCA on an American sign language database to extract pose PCA bases in order to reduce the dimensionality of the hand motion space. A more popular model is MANO by Romero et al. [38], which adds a set of shape parameters to the LBS model in order to capture the hand shapes of different individual subjects. The PCA technique is incorporated to simplify the model. Qian et al. [36] extend MANO by augmenting it with a parametric texture model.

Different skeletal structures include different numbers of joints. For example, the NYU dataset [45] is built on 14 joints, the ICVL dataset [44] adopts 16 joints, and the MSRA dataset [35] involves 21 joints. Most parametric models adopt 21 joints to describe hand motion [38].

Fig. 1 Illustration of the MANO hand model [38]: the hand mesh and its skeleton tree, in which colored segments and circles illustrate the bones and joints of the hand skeleton, respectively

2.2.2 Hand pose reconstruction

Most earlier methods aim at optical-marker-based motion capture, which can obtain joint positions with high accuracy. Fitting the LBS model to these tracked joints is the so-called inverse kinematics (IK) problem. Numerical methods such as the Jacobian inverse technique [20] are usually employed to address it. The core idea is to linearize the forward kinematics equation via Taylor expansion to simplify the solution.

To improve the accuracy of joint prediction, some works [15, 23, 43, 54] introduce additional image features such as edges, silhouettes and textures to guide the pose fitting. IK becomes an energy term of a more general minimization problem. La Gorce et al. [23] employ a quasi-Newton method to tackle the optimization, while Zhao et al. [54] apply the particle swarm algorithm to address the issue.

Both Xiang et al. [51] and Pavlakos et al. [31] take advantage of gradient descent to address an optimization mixing constraints on joint positions, image features and priors to recover whole-body shape/pose from a single image. Such an algorithm may get trapped in local minima because of improper initialization and suffers low efficiency due to the complexity of the search space. Using PCA to reduce the space of motion parameters can accelerate the process but makes it hard to impose joint constraints because the PCA coefficients lack semantic meaning [38, 48]. Instead of resorting to image features, we impose hand biomechanical constraints on the optimization and solve it with an iterative coordinate descent algorithm. We clearly define the motion space of hand joints by considering the degrees of freedom (DOF) of every joint according to the hand biomechanical constraints. This not only keeps our results as close as possible to valid poses but also increases the efficiency and stability of the solution.

Deep neural networks are also employed for parametrization-driven hand pose reconstruction. Zhou et al. [55] use existing motion capture data to train a six-layer multilayer perceptron (MLP) to regress 2D joint positions and 3D joint angles. Zhang et al. [53] and Boukhayma et al. [4] respectively propose end-to-end neural networks to predict motion parameters from a 2D image. Qian et al. [36] leverage the network in [4] to obtain the motion parameters of MANO and further refine the mesh model with a photometric loss. Other deep learning-based approaches either leverage depth information [29, 32] or combine image and depth information. As pose data are usually distributed near the mean pose of the dataset, neural networks are prone to smooth the pose and bias the result toward the mean [41].

Fig. 2 Hinge joint (left) and saddle joint (right)

3 Preliminaries

For convenience of description, we first introduce the formulation of MANO and then present the notion of hand anatomical kinematics as preliminaries.

3.1 MANO

MANO (hand Model with Articulated and Non-rigid deformations) [38] is a hand parametric model built after SMPL [24], a human body parametric model. Like SMPL, MANO employs a group of PCA bases to capture the shape variation of specific hands and the skeleton skinning technique [25] to generate motion gestures of a specific hand. The skeleton used in MANO has 21 joints [35, 44, 45], as shown in Fig. 1, in which colored line segments represent bones (edges of the tree) and colored circles indicate joints. Vertices of the hand mesh in Fig. 1 are evaluated using MANO [38].

Denote the average hand mesh in rest (zero) pose by \(\overline{T}=<\overline{V},E,F>\), where \(\overline{V}=\{\overline{v}_i ,i=1,\ldots ,N\}\) is the set of vertices, and \(E\subset [1 \cdots N]^2\), \(F\subset [1 \cdots N]^3\) are, respectively, the set of edges and the set of triangles. The shape variation of a specific hand with respect to \(\overline{T}\) is captured by the set of PCA bases \(\mathcal{{S}}=\{\ S_i\in R^{3N} : i=1,\ldots ,|\mathcal{{S}}|\}\), where \(|\mathcal{{S}}|\) indicates the element number in \(\mathcal{{S}}\). Without causing confusion, we also use \(\overline{V}\) to denote the 3N-dimensional vector concatenated by the coordinates of its vertices.

The variation of an arbitrary hand shape with respect to \(\overline{T}\) can then be computed by blending the bases with coefficients \(\beta =\{\ {\beta }_i\in R: i=1,\ldots ,|\mathcal{{S}}|\}\):

$$\begin{aligned} B_S(\beta ,\mathcal{{S}})=\sum _{i=1}^{|\mathcal{{S}}|} {\beta }_i S_i \end{aligned}$$

To improve the skinning accuracy, MANO adds a pose-dependent corrective offset to the rest pose in terms of the motion parameters

$$\begin{aligned} B_P(\theta ,\mathcal{{P}})=\sum _{i=1}^{9K} (R_i(\theta )-R_i({\theta }^*)) P_i \end{aligned}$$

where K is the number of bones, \(\mathcal{{P}}=\{\ P_i, i=1,\ldots ,{9K}\}\) is a set of pose bases for blending vertex offsets, \({\varvec{\theta }}=({\varvec{\omega }}_1,{\varvec{\omega }}_2,\ldots ,{\varvec{\omega }}_{20})\) collects the motion parameters of an arbitrary pose, \({\theta }^*\) denotes the rest pose, and \(R_i(\theta )\) indicates the ith entry of the concatenated rotation matrices (K rotation matrices in total, each with \(3 \times 3=9\) elements). The full form of MANO can then be written as

$$\begin{aligned} \text{ M }(\beta ,\theta )=\text{ W }(T_P(\beta ,\theta ),\text{ J }(\beta ),\theta ,\mathcal {W}) \end{aligned}$$

where \(\text{ W }\) is the skinning function (linear blend skinning), \(\text{ J }(\beta )=(\mathbf{j} _0,\mathbf{j} _1,\ldots ,\mathbf{j} _{20})\in R^{21 \times 3}\) is the set of joint positions in the rest pose, \(\mathcal {W}\) is the skinning weight matrix, and

$$\begin{aligned} T_P(\beta ,\theta )=\overline{V}+B_S(\beta ,\mathcal{{S}})+B_P(\theta ,\mathcal{{P}}). \end{aligned}$$

\(\overline{T}\), \(\mathcal {S}\), \(\mathcal {P}\) and \(\mathcal {W}\) are known for our reconstruction task.
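To make the blend formulation concrete, the following minimal sketch evaluates \(B_S\) and \(B_P\) as plain linear blends. It uses toy dimensions and randomly generated tensors in place of the real MANO data, which are not reproduced here; the names and sizes are illustrative assumptions, not the MANO reference implementation.

```python
import numpy as np

def shape_offsets(beta, S):
    """B_S(beta, S) = sum_i beta_i * S_i: blend the shape PCA bases.
    beta: (n_shape,) coefficients; S: (n_shape, 3N) basis matrix."""
    return beta @ S

def pose_offsets(R_theta, R_rest, P):
    """B_P(theta, P) = sum_i (R_i(theta) - R_i(theta*)) * P_i.
    R_theta, R_rest: (K, 3, 3) bone rotations for the current/rest pose;
    P: (9K, 3N) pose-corrective basis matrix."""
    return (R_theta - R_rest).reshape(-1) @ P

# Toy sizes (the real MANO model has N = 778 vertices and K = 15 bones).
N, K, n_shape = 10, 3, 4
rng = np.random.default_rng(0)
V_bar = rng.normal(size=3 * N)                  # mean template vertices
S = rng.normal(size=(n_shape, 3 * N))
P = rng.normal(size=(9 * K, 3 * N))
R_rest = np.tile(np.eye(3), (K, 1, 1))
T_P = V_bar + shape_offsets(rng.normal(size=n_shape), S) \
            + pose_offsets(R_rest, R_rest, P)   # at rest pose, B_P vanishes
```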

3.2 Hand anatomical kinematics

We follow the anatomical kinematic structure of a hand animation approach [9], which is also adopted for robotic arms [46]. This structure employs the same skeleton as shown in Fig. 1, in which hand joints are classified into hinge joints (joints 2, 3, 6, 7, 10, 11, 14, 15, 18 and 19) and saddle joints (joints 1, 5, 9, 13 and 17).

Specifically, a hinge joint has one degree of freedom: bones shooting from this kind of joint can only perform a bending motion, as shown in Fig. 2 (left). A saddle joint has two degrees of freedom, as depicted in Fig. 2 (right), which are described by two rotational angles (flexion/extension and abduction/adduction) as shown in Fig. 3. Biomechanical constraints [50] are exploited to bound the motion range of these angles.

Fig. 3 Rotation range of a saddle joint

4 The proposed method

We now describe our framework, which involves the biomechanical constraints of hand motion, the mathematical model for reconstructing hand poses from joint positions, and the solution of the model.

4.1 Hand motion constraints

According to [9, 50] (see Sect. 3.2), there are two kinds of joints in a hand, i.e., hinge joints and saddle joints. For the sake of formal description, we create a local coordinate system for each joint (say, joint k) to describe the orientation of the bone starting from k. As shown in Fig. 3, its origin is placed at k, its x axis points from the parent joint of k to k itself, and the rotational axis of the bending motion around k is taken as the z axis, such that the bending motion obeys the right-hand rule.

In the local coordinate system of hinge joint k, the motion of the bone starting from k is actually a rotation around axis z. Let \(\phi _{z,k}\) be the rotational angle. We can then express the rotation matrix as [50]

$$\begin{aligned} e^{{\varvec{\omega }}_k} = e^{\mathbf{n} _{z} \phi _{z,k}} \end{aligned}$$
(1)

with \(\mathbf{n} _{z} = (0,0,1)\) and \(\phi _{z,k} \in [\phi _{z,k}^{\mathrm{min}},\phi _{z,k}^{\mathrm{max}}]\), where \(e^{(\cdot )}\) denotes the Rodrigues formula mapping an axis–angle vector to a rotation matrix.
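As a reference for later steps, a direct implementation of this exponential map (the closed-form Rodrigues formula) might look as follows; this is a generic sketch written for illustration, not code from the paper.

```python
import numpy as np

def rodrigues(axis, phi):
    """Rodrigues formula e^{n phi}: axis-angle -> rotation matrix,
    R = I + sin(phi) [n]_x + (1 - cos(phi)) [n]_x^2 for a unit axis n."""
    n = np.asarray(axis, dtype=float)
    K = np.array([[0.0, -n[2], n[1]],
                  [n[2], 0.0, -n[0]],
                  [-n[1], n[0], 0.0]])   # skew-symmetric matrix [n]_x
    return np.eye(3) + np.sin(phi) * K + (1.0 - np.cos(phi)) * (K @ K)

# Hinge-joint bending: rotation about the local z axis by 30 degrees.
R_z = rodrigues((0.0, 0.0, 1.0), np.deg2rad(30.0))
```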

For saddle joint k, the motion of the bone shooting from it is a composition of two rotations around the z and y axes, respectively. Denoting the corresponding angles by \(\phi _{z,k}\) and \(\phi _{y,k}\), we have [50]

$$\begin{aligned} e^{{\varvec{\omega }}_k}=e^{(\mathbf{n} _{z} \phi _{z,k})}e^{(\mathbf{n} _{y} \phi _{y,k})}, \end{aligned}$$
(2)

where \(\mathbf{n} _{y} = (0,1,0)\). The above rotational angles satisfy the following ellipse constraint

$$\begin{aligned} \left( \frac{\phi _{y,k}}{\bar{\phi }_{y,k}^{*}} \right) ^2+\left( \frac{\phi _{z,k}}{\bar{\phi }_{z,k}^{*}} \right) ^2 \le 1 \end{aligned}$$
(3)

where \(*= '\hbox {min}' ~\text {or} ~'\hbox {max}' \) as given in Table 1 [2, 17], depending on which quadrant the child joint of joint k lies in.

The above rotation transformations are incorporated into MANO [38]. Figure 4 depicts the z axes of the local systems of all joints, which we determine using skin surface details and registration data. Notice that leaf joints (fingertips) carry no parameters since no bone shoots from them. On the other hand, as the root of the kinematic tree, the wrist joint is free of constraints. In practical terms, the translation of the root joint can be viewed as the inverse translation of the camera, so we fix the root joint and only estimate the global transformation.

Fig. 4 Right-hand mesh template (left) and the z axes of its joints (blue arrows, right)

Table 1 Rotation range of hand joints (unit: degree)
Fig. 5 Visual results of the proposed algorithm: left, the input joint positions; right, the corresponding results

4.2 Hand pose reconstruction from 3D joint positions

In our setting, we neglect the hand shape and only reconstruct the pose using MANO. Denote the given 3D joint positions by \(\mathbf{J} ^{e}=(\mathbf{j} _0^{e},\mathbf{j} _1^{e},\ldots , \mathbf{j} _{20}^{e})\in R^{21 \times 3}\), and denote their rest pose counterparts by \(\mathbf{J} ^{*}=(\mathbf{j} _0^{*},\mathbf{j} _1^{*},\ldots ,\mathbf{j} _{20}^{*})\in R^{21 \times 3}\) (see Fig. 4).

Pose change is captured by rotations around joints. In our setting, only 15 joints are rotatable: the wrist joint is fixed and its orientation is described by the camera parameters instead, and the five fingertips carry no parameters. Namely, \({\varvec{\omega }}_0\) remains unchanged, while \({\varvec{\omega }}_4,{\varvec{\omega }}_8,{\varvec{\omega }}_{12}, {\varvec{\omega }}_{16}\) and \({\varvec{\omega }}_{20}\) have no impact on the pose. According to forward kinematics, this yields the following global transformation matrix for joint k:

$$\begin{aligned} \mathbf{G} _k({\varvec{\theta }},\mathbf{J} ^{*})=\left( \prod _{i \in A(k)} \begin{bmatrix} e^{({\varvec{\omega }}_i)} &{} \mathbf{j} _i^*-\mathbf{j} _{p(i)}^* \\ 0 &{} 1 \end{bmatrix} \right) \begin{bmatrix} \mathbf{I} &{} \mathbf{j} _k^*-\mathbf{j} _{p(k)}^* \\ 0 &{} 1 \end{bmatrix} \end{aligned}$$
(4)

where \(e^{({\varvec{\omega }}_i)}\) is the rotation matrix of joint i; A(k) represents the node path from the second-level ancestor of k (the joint adjacent to node 0) to the parent node of joint k; p(i) denotes the parent node of i; and \(\mathbf{G} _k({\varvec{\theta }},\mathbf{J} ^*)\) describes the transformation of joint k relative to the world system. Equation (4) can further be written compactly as

$$\begin{aligned} \mathbf{G} _k({\varvec{\theta }},\mathbf{J} ^{*})= \begin{bmatrix} \mathbf{R} _k &{} \mathbf{t} _k \\ 0 &{} 1 \end{bmatrix} \end{aligned}$$
(5)

where \(\mathbf{R} _k\) is a \(3 \times 3\) rotation matrix and \(\mathbf{t} _k\) is the translation of joint k. Let \([\mathbf{R} ^g|\mathbf{t} ^g]\) be the global rigid transformation. Given a set of joint positions \(\mathbf{J} ^e\), motion estimation can then be formulated as minimizing the mean squared distance between the predicted joint positions and the given ones under the hand biomechanical constraints:

$$\begin{aligned} \mathop {\arg \min }_{{\varvec{\theta }},\mathbf{R} ^g,\mathbf{t} ^g} \ \sum _{k}(\mathbf{R} ^g \mathbf{t} _k({\varvec{\theta }})+\mathbf{t} ^g-\mathbf{j} _k^e)^2 \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} \text {s.t.} \\&\text {(i) if DOF(k)=1,} {\varvec{\omega }}_k=\mathbf{n} _{z}\phi _{z,k}, \phi _{z,k} \in [\phi _{z,k}^{min},\phi _{z,k}^{max}]; \\&\text {(ii) if DOF(k)=2,} e^{({\varvec{\omega }}_k)}= e^{(\mathbf{n} _{z}\phi _{z,k})} e^{(\mathbf{n} _{y}\phi _{y,k})}; \\&~~~~\text {and} \left( \frac{\phi _{y,k}}{\bar{\phi }_{y,k}^{*}} \right) ^2+\left( \frac{\phi _{z,k}}{\bar{\phi }_{z,k}^{*}} \right) ^2 \le 1. \\ \end{aligned} \end{aligned}$$

The arguments to be optimized are the pose parameters \({\varvec{\theta }}\) and the global transformation \(\mathbf{R} ^g\) and \(\mathbf{t} ^g\). It is difficult to handle all the constraints simultaneously. Hence, inspired by the idea of ICP and based on the hierarchical structure of the hand kinematic tree, we devise a block coordinate descent scheme to address Eq. 6.

4.3 Numerical solution

We divide the variables into blocks according to their joint number in order to solve Eq. 6. For each block, we minimize a subproblem while fixing all other blocks. The blocks are optimized in joint-number order, following the hierarchy: first \(\mathbf{R} ^g\) and \(\mathbf{t} ^g\), then from \({\varvec{\omega }}_1\) to \({\varvec{\omega }}_{19}\). In this subsection, we discuss the subproblems separately according to their types: camera parameters, saddle joints and hinge joints.
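The overall pattern is easiest to see on a toy problem. The sketch below applies the same alternate-and-clamp scheme to a planar two-link chain: each block (one joint angle) is solved in closed form with the other fixed and then clamped to its motion range. It is an illustration of the scheme under simplified assumptions, not the 3D solver itself; link lengths and limits are made up.

```python
import numpy as np

def rot2d(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def fit_bcd(x1, x2, l1=1.0, l2=1.0,
            lim1=(-np.pi / 2, np.pi / 2), lim2=(0.0, 2.5), iters=20):
    """Planar 2-link IK by block coordinate descent with angle limits."""
    t1, t2 = 0.0, 0.0
    for _ in range(iters):
        # Block t1: with t2 fixed, both chain points rotate rigidly with t1,
        # so the optimal t1 is a closed-form 2D Procrustes fit, then clamped.
        q1 = np.array([l1, 0.0])
        q2 = q1 + rot2d(t2) @ np.array([l2, 0.0])
        dot = q1 @ x1 + q2 @ x2
        crs = (q1[0] * x1[1] - q1[1] * x1[0]) + (q2[0] * x2[1] - q2[1] * x2[0])
        t1 = float(np.clip(np.arctan2(crs, dot), *lim1))
        # Block t2: with t1 fixed, only the end point depends on t2; align
        # bone 2 with the residual direction, then clamp.
        p1 = rot2d(t1) @ q1
        d = x2 - p1
        a = np.arctan2(d[1], d[0]) - t1
        a = (a + np.pi) % (2 * np.pi) - np.pi    # wrap to (-pi, pi]
        t2 = float(np.clip(a, *lim2))
    return t1, t2

t1, t2 = fit_bcd(np.array([0.0, 1.0]), np.array([-1.0, 1.0]))
```

For a reachable target, a couple of sweeps suffice; the same root-to-leaf alternation carries over to the hand skeleton, with the closed-form fits replaced by the subproblems below.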

4.3.1 Initialization

In the beginning, we initialize the current pose. Specifically, we take the mean pose of the MANO dataset as the initial pose and use it to estimate the camera extrinsic parameters. \(\mathbf{G} _k\) in Eq. 5 can then be obtained from the current pose parameters \({\varvec{\theta }}\).

4.3.2 Update of global rigid transformation

The global transformation takes the place of the root pose; applying it transforms all hand joint positions accordingly. When only \(\mathbf{R} ^g\) and \(\mathbf{t} ^g\) are considered, the optimization takes the form:

$$\begin{aligned} \mathop {\arg \min }_{\mathbf{R} ^g,\mathbf{t} ^g} \ \sum _{k}(\mathbf{R} ^g \mathbf{t} _k+\mathbf{t} ^g-\mathbf{j} _k^e)^2,\nonumber \\ \qquad \qquad \text {s.t.} \qquad (\mathbf{R} ^g)^T\mathbf{R} ^g= \mathbf{I} \end{aligned}$$
(7)

A standard way to address this problem solves \(\mathbf{R} ^g\) with the Kabsch algorithm [19] on the barycentered point sets and then recovers the translation \(\mathbf{t} ^g\) from the difference between the barycenters of the source and target points.
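A compact sketch of this recipe (generic Kabsch/Procrustes alignment, written here for illustration) is:

```python
import numpy as np

def kabsch(P, Q):
    """Best rigid (R, t) minimizing sum ||R p_i + t - q_i||^2.
    P, Q: (n, 3) corresponding point sets."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                       # reflection-safe rotation
    t = cq - R @ cp                          # barycenter difference
    return R, t
```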

Table 2 MPJPE and MPVPE on the MANO dataset
Table 3 PA MPJPE and PA MPVPE on the MANO dataset
Table 4 Hand reconstruction from a single image: comparison

4.3.3 Update of saddle joint parameters

Saddle joints lie at the second level of the kinematic chain, directly adjacent to the root. Updating the transformation of such a joint only influences the positions of its descendant joints. Therefore, the optimization for saddle joint k reduces to

$$\begin{aligned}&\mathop {\arg \min }_{{\varvec{\omega }}_{k}} \sum _{i \in D(k)}(\mathbf{R} ^g (\mathbf{R} _{k} e^{{\varvec{\omega }}_k} (\mathbf{j} _i^*-\mathbf{j} _k^*)+\mathbf{t} _k) + \mathbf{t} ^g-\mathbf{j} _i^e)^2,\nonumber \\&\text {s.t.} \nonumber \\&\qquad e^{{\varvec{\omega }}_k} = e^{(\mathbf{n} _{z}\phi _{z,k})}\, e^{(\mathbf{n} _{y}\phi _{y,k})}; \end{aligned}$$
(8)

and

$$\begin{aligned} \left( \frac{\phi _{y,k}}{\bar{\phi }_{y,k}^{*}} \right) ^2+\left( \frac{\phi _{z,k}}{\bar{\phi }_{z,k}^{*}} \right) ^2 \le 1. \end{aligned}$$

where D(k) is the descendant set of joint k. Equation 8 solves the rotation of joint k by minimizing the error between the predicted and given positions of the descendants of joint k. Noting that \((R^g)^{-1}= (R^g)^{T}\) and \(R_{k}^{-1} = R_{k}^{T}\), we obtain the following equivalent form of Eq. 8

$$\begin{aligned} \begin{aligned}&\mathop {\arg \min }_{{\varvec{\omega }}_{k}} \sum _{i \in D(k)}(e^{{\varvec{\omega }}_k}(\mathbf{j} ^*_i-\mathbf{j} ^*_k)+\mathbf{R} _{k}^T(\mathbf{t} _k+ \\&\qquad \qquad \qquad (\mathbf{R} ^g)^T(\mathbf{t} ^g-\mathbf{j} _i^e)))^2 \\ \end{aligned} \end{aligned}$$
(9)

Equation 9 is actually the orthogonal Procrustes problem [13] if the constraints of Eq. 8 are neglected. Hence, we tackle it in two steps. First, the Kabsch algorithm is employed to solve the unconstrained problem, yielding \(\mathbf{R} _{\mathrm{temp}}\). Second, the Euler angles around the z and y axes are extracted from \(\mathbf{R} _{\mathrm{temp}}\) [8]. Viewing this pair of Euler angles as a point, we find its nearest point within the ellipse using the Newton–Raphson algorithm. The rotation matrix \(\mathbf{R} _{k}\) for joint k is then reconstructed from the new Euler angles \(\phi _{y,k}\) and \(\phi _{z,k}\) around the y and z axes, respectively.
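For the projection step, the nearest point on (or inside) the ellipse can be found with a short Newton–Raphson iteration on a Lagrange multiplier. The sketch below is one standard way to do it and is only an illustration of this step; a and b stand for the range bounds \(\bar{\phi }_{y,k}^{*}\) and \(\bar{\phi }_{z,k}^{*}\).

```python
import numpy as np

def clamp_to_ellipse(py, pz, a, b, iters=30):
    """Project (py, pz) onto the region (y/a)^2 + (z/b)^2 <= 1.
    Points already inside are returned unchanged; otherwise the nearest
    boundary point is found via Newton-Raphson on the multiplier t, with
    y = a^2 py / (t + a^2) and z = b^2 pz / (t + b^2)."""
    if (py / a) ** 2 + (pz / b) ** 2 <= 1.0:
        return py, pz
    t = 0.0                                   # g(0) > 0 for outside points
    for _ in range(iters):
        fy = a * py / (t + a * a)
        fz = b * pz / (t + b * b)
        g = fy * fy + fz * fz - 1.0           # boundary condition g(t) = 0
        dg = -2.0 * (fy * fy / (t + a * a) + fz * fz / (t + b * b))
        t -= g / dg
    return a * a * py / (t + a * a), b * b * pz / (t + b * b)

phi_y, phi_z = clamp_to_ellipse(1.2, 0.9, a=1.0, b=0.8)
```

Since g(t) is convex and decreasing for t ≥ 0, starting from t = 0 the iteration converges monotonically to the unique root.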

It should be noted that an additional internal rotation must be considered for the thumb joints. We formulate it as \(\mathbf{R} _{in}=e^{(\mathbf{j} _{c(k)}-\mathbf{j} _k) \alpha \phi _{y,k}}\), where c(k) denotes the child joint of k and \(\alpha \) is a predefined heuristic parameter. The final transformation for thumb joint k is then computed as \(R_{in}R_k\). The internal rotation of the other fingers can be ignored because their z axes are generally perpendicular to the palm.

Fig. 6 PA MPJPE and PA MPVPE curves of the three algorithms (ours, GDC [38] and MLP [55]) for inputs with different levels of Gaussian noise. The errors are evaluated between the reconstructed mesh and the ground truth

Fig. 7 PA MPJPE curves of the three algorithms (ours, GDC [38] and MLP [55]) for inputs with different levels of Gaussian noise. The errors are evaluated between the reconstructed joint positions and the ground truth

4.3.4 Update of hinge joint parameters

Updating hinge joints is similar to updating saddle joints, but the constraints are simpler: only one rotational angle is constrained per joint. Hence, the approach for saddle joints also applies here. Specifically, we first solve the unconstrained problem to obtain a rotation, then extract the Euler angle about the fixed axis, and finally apply the range constraint to this angle to compute the final rotation.
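For a hinge joint, the projection step degenerates to clamping a single angle. A sketch (illustrative only, using the closed-form "closest rotation about z" as the extraction step) is:

```python
import numpy as np

def hinge_angle(R_temp, phi_min, phi_max):
    """Best bending angle about the local z axis given the unconstrained
    rotation estimate R_temp: the rotation about z closest to R_temp (in
    the Frobenius sense) has angle atan2(R10 - R01, R00 + R11); the angle
    is then clamped to the joint's motion range."""
    phi_z = np.arctan2(R_temp[1, 0] - R_temp[0, 1],
                       R_temp[0, 0] + R_temp[1, 1])
    return float(np.clip(phi_z, phi_min, phi_max))
```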

4.3.5 Stop criteria and collision avoidance

In one iteration of updating all the parameters, our algorithm successively optimizes the parameters of each single joint, as shown in Algorithm 1. It stops when there is no improvement or the iteration count exceeds a threshold. To avoid possible finger intersections, we introduce an additional collision detection step that treats each bone as a capsule: once the distance between two bone line segments falls below a threshold determined by the thickness of the two bones, the bones are considered to intersect. In this case, we move one of the bones in the opposite direction to restore the distance threshold.
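The capsule test reduces to a segment–segment distance computation. A sketch of the standard clamped closed-form solution (illustrative; the lengths, radii and threshold below are made-up values) is:

```python
import numpy as np

def segment_distance(p0, p1, q0, q1, eps=1e-12):
    """Minimum distance between bone segments [p0, p1] and [q0, q1]."""
    u, v, w = p1 - p0, q1 - q0, p0 - q0
    a, b, c = u @ u, u @ v, v @ v
    d, e = u @ w, v @ w
    den = a * c - b * b                      # ~0 for parallel segments
    s = 0.0 if den < eps else np.clip((b * e - c * d) / den, 0.0, 1.0)
    t = 0.0 if c < eps else np.clip((b * s + e) / c, 0.0, 1.0)
    s = 0.0 if a < eps else np.clip((b * t - d) / a, 0.0, 1.0)
    return float(np.linalg.norm((p0 + s * u) - (q0 + t * v)))

# Two bones collide if closer than the sum of their capsule radii.
colliding = segment_distance(np.zeros(3), np.array([0.0, 0.0, 30.0]),
                             np.array([5.0, 0.0, 0.0]),
                             np.array([5.0, 0.0, 30.0])) < 8.0
```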

Algorithm 1
Fig. 8 Single-image reconstruction: each row shows two examples, eight examples in total. Each example includes three images: the input image, the hand model reconstructed by our method, and the result by FrankMoCap [39]

4.4 Reconstruction from images

Our approach can be applied to reconstruct 3D poses from a single image. The process consists of two steps. First, we employ existing methods to detect 2D joints in the image and lift them to 3D joint positions; in our experiments, we use PoseNet [7] for this. After that, Algorithm 1 estimates the motion data, which are finally used to skin the MANO model and generate hand meshes.

5 Experiments

The proposed algorithm is implemented in Python on a PC with an Intel(R) Core(TM) i7-4470 CPU @ 3.4GHz. This section presents a variety of experiments to show the performance of the proposed algorithm.

5.1 Visual results

We first use some examples with typical hand poses to show the effectiveness of our method, which solves the motion of each finger independently and combines the results to express complicated poses. In each example, the 3D joint positions are given. Figure 5 illustrates that our approach is able to rotate and redirect the fingers to exactly register the joint positions.

5.2 Accuracy on the MANO registration dataset

To quantitatively evaluate our approach, we conduct an experiment on the MANO dataset [38], which is built by using the MANO hand parametric model to register the MoCap scans. The dataset includes 1554 poses of real human hands from 31 subjects. Each sample consists of a set of 3D hand joint positions as well as the corresponding pose parameters. We recover the pose parameters from the 3D joint positions and then compare them with the ground truth.

Two error metrics are used: mean per-joint position error (MPJPE) and mean per-vertex position error (MPVPE). In addition, we compute PA MPJPE and PA MPVPE, the Procrustes-aligned (rigidly aligned) versions of MPJPE and MPVPE. The gradient descent (GDC) method [38] and the MLP method [55] are selected for comparison with our approach. Table 2 summarizes the mean errors over all samples in the dataset, their standard deviations (std) and timings. Results for PA MPJPE and PA MPVPE are listed in Table 3. Both tables illustrate that our method outperforms the other two in both accuracy and stability (smaller std). In addition, GDC is the slowest, while our method is comparable to the MLP method (Table 3).
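For reference, the two metrics and their Procrustes-aligned variants can be computed as in the following sketch; MPVPE is identical with mesh vertices in place of joints. This is a generic illustration, not the paper's evaluation script.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance, (n, 3) inputs."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align pred to gt (rotation and
    translation via the Kabsch recipe) before measuring the error."""
    cp, cg = pred.mean(axis=0), gt.mean(axis=0)
    U, _, Vt = np.linalg.svd((pred - cp).T @ (gt - cg))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return mpjpe((pred - cp) @ R.T + cg, gt)
```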

5.2.1 Accuracy on the dataset with simulated noise

Considering that real data acquired by motion capture devices are usually contaminated by noise, we evaluate our method on inputs with different levels of simulated noise. As the average length of hand bones is 33.4 mm in the template pose, we add Gaussian noise with standard deviations of 2, 5, 10 and 20 mm, respectively, to every joint position of the MANO registration dataset.

Figure 6 shows the PA MPJPE and PA MPVPE curves between the reconstructed hand mesh and the ground truth, while Fig. 7 depicts the PA MPJPE curves of the joint positions. Both demonstrate that our approach outperforms the other two state-of-the-art approaches [38, 55] in reconstruction accuracy owing to the introduced biomechanical constraints. Nevertheless, our method is more sensitive to changes in the noise level. This is because some noisy inputs may happen to form a valid pose, and our algorithm still fits the pose parameters to such a deformed pose. An elaborate comparison with the state-of-the-art methods [4, 14, 55, 57] is given in Table 4; our approach achieves the highest accuracy.

5.3 Pose reconstruction from a single image

To reconstruct a hand pose from a single RGB image, we first apply PoseNet [7] to estimate 3D joint positions from the image and then compute motion parameters with our algorithm. The experiment is conducted on the FreiHAND dataset [57], which consists of 130K training images and 4K test images with MANO pose parameters. Figure 8 depicts some visual results (MANO models), together with the results of FrankMoCap [39] for comparison. Visually, the hand gestures produced by our method are better than those by FrankMoCap [39]; in the first example, the FrankMoCap gesture is even wrong (see column 3 of row 1 in the figure).

We also compare our algorithm with the state-of-the-art approaches described in [4, 14, 55, 57]. PoseNet [7] is trained on the training images to estimate the 3D joint positions used as input to our method. All reconstruction algorithms involved are used directly without additional training. Figure 9 depicts the PCK curves of these approaches and ours; it can be observed that our approach performs best.

Table 4 summarizes the reconstruction errors of the involved approaches under a variety of error metrics, and also shows that our method outperforms the state-of-the-art methods on almost all indices. Though minimal hand [55] is faster than our approach, its accuracy indices are the worst.

Fig. 9 3D hand mesh reconstruction PCK curves for our method and the approaches in [4, 14, 55, 57]

To show the stability of our approach, we apply it to reconstruct a sequence frame by frame. The data come from sequence 171204_pose6 of the database in [18]. In total, 100 frames (indices 20,701 to 20,800) are reconstructed (see the attached video). Five of these frames, with indices 20,720, 20,740, 20,760, 20,780 and 20,800, are depicted in Fig. 10. The PA MPJPE of our method is 5.3302 with std = 0.4320, while that of FrankMoCap is 5.3442 with std = 0.5151; our approach is thus more stable than FrankMoCap [39].

Fig. 10 Sequence reconstruction: the top row shows the input images, the middle row the results of our approach, and the bottom row the results of FrankMoCap [39]

5.4 Limitations

We accomplish the task of reconstructing hand poses from hand joint coordinates by proposing an iterative algorithm based on biomechanical constraints. A variety of experiments demonstrate its excellent performance compared to state-of-the-art approaches. However, it does not distinguish the shapes of different hands. In addition, it depends on PoseNet when recovering motion parameters from images; once PoseNet fails to predict the 3D hand joint positions correctly, our algorithm cannot rectify the result.

6 Conclusions

We propose a coordinate descent algorithm for reconstructing hand motion parameters from joint positions, in which the joint rotation of each finger bone is solved successively while fixing the other motion parameters. The natural structure of hands is used to constrain joint motions in order to reduce the search space. These two contributions give our algorithm advantages in accuracy, robustness and running time. As future work, it would be interesting to extend the proposed framework to estimate hand motion sequences, utilizing inter-frame coherence to sustain the stability of the sequence.