Abstract
Estimating 3D hand pose from 2D images is a difficult inverse problem due to the inherent scale and depth ambiguities. Current state-of-the-art methods train fully supervised deep neural networks with 3D ground-truth data. However, acquiring 3D annotations is expensive, typically requiring calibrated multi-view setups or labour-intensive manual annotations. While annotations of 2D keypoints are much easier to obtain, how to efficiently leverage such weakly-supervised data to improve the task of 3D hand pose prediction remains an important open question. The key difficulty stems from the fact that direct application of additional 2D supervision mostly benefits the 2D proxy objective but does little to alleviate the depth and scale ambiguities. Embracing this challenge, we propose a set of novel losses that constrain the prediction of a neural network to lie within the range of biomechanically feasible 3D hand configurations. We show by extensive experiments that our proposed constraints significantly reduce the depth ambiguity and allow the network to more effectively leverage additional 2D annotated images. For example, on the challenging FreiHAND dataset, using additional 2D annotation without our proposed biomechanical constraints reduces the depth error by only \(15\%\), whereas the error is reduced by \(50\%\) when the proposed biomechanical constraints are used.
A. Spurr—This work was done during an internship at NVIDIA.
1 Introduction
Vision-based reconstruction of the 3D pose of human hands is a difficult problem that has applications in many domains. Given that RGB sensors are ubiquitous, recent work has focused on estimating the full 3D pose [6, 18, 25, 34, 44] and dense surface [5, 13, 15] of human hands from 2D imagery alone. This task is challenging due to the dexterity of the human hand, self-occlusions, varying lighting conditions and interactions with objects. Moreover, any given 2D point in the image plane can correspond to multiple 3D points in world space, all of which project onto that same 2D point. This makes 3D hand pose estimation from monocular imagery an ill-posed inverse problem in which depth and the resulting scale ambiguity pose a significant difficulty.
Most of the recent methods use deep neural networks for hand pose estimation and rely on a combination of fully labeled real and synthetic training data (e.g., [4, 6, 15, 18, 25, 34, 46, 48]). However, acquiring full 3D annotations for real images is very difficult as it requires complex multi-view setups and labour-intensive manual annotations of 2D keypoints in all views [14, 45, 49]. On the other hand, synthetic data does not generalize well to realistic scenarios due to domain discrepancies. Some works attempt to alleviate this by leveraging additional 2D annotated images [5, 18]. This kind of weakly-supervised data is far easier to acquire for real images than full 3D annotations. These methods use the annotations in a straightforward way, either in the form of a reprojection loss [5] or as supervision for the 2D component only [18]. However, we find that the improvements stemming from including the weakly-supervised data in such a manner are mainly a result of 3D poses that agree with the 2D projection. Yet, the uncertainties arising from depth ambiguities remain largely unaddressed and the resulting 3D poses can still be implausible. Therefore, these methods still rely on large amounts of fully annotated training data to reduce these ambiguities. In contrast, our goal is to minimize the requirement of 3D annotated data as much as possible and maximize the utility of weakly-labeled real data.
To this end, we propose a set of biomechanically inspired constraints (BMC) which can be integrated in the training of neural networks to enable anatomically plausible 3D hand poses even for data with 2D supervision only. Our key insight is that the human hand is subject to a set of limitations imposed by its biomechanics. We model these limitations in a differentiable manner as a set of soft constraints. Note that this is a challenging problem. While the bone length constraints have been used successfully [37, 47], capturing other biomechanical aspects is more difficult. Instead of fitting a hand model to the predictions, we extract the quantities in question directly from the predictions to impose our constraints. As such, the method of extraction has to be carefully designed to work under noisy and malformed 3D joint predictions while simultaneously being fully differentiable under any pose. We propose to encode these constraints into a set of losses that are fully differentiable, interpretable and which can be incorporated into the training of any deep learning architecture that predicts 3D joint configurations. Due to this integration, we do not require a post-refinement step during test time. More specifically, our set of soft constraints consists of three equations that define i) the range of valid bone lengths, ii) the range of valid palm structure, and iii) the range of valid joint angles of the thumb and fingers. The main advantage of our set of constraints is that all parameters are interpretable and can either be set manually, opening up the possibility of personalization, or be obtained from a small set of data points for which 3D labels are available. As backbone model, we use the 2.5D representation proposed by Iqbal et al. [18] due to its superior performance. We identify an issue in absolute depth calculation and remedy it via a novel refinement network. In summary, we contribute:
- A novel set of differentiable soft constraints inspired by the biomechanical structure of the human hand.
- Quantitative and qualitative evidence that demonstrates that our proposed set of constraints improves 3D prediction accuracy in weakly supervised settings, resulting in an improvement of \(55\%\) as opposed to \(32\%\) as yielded by straightforward use of weakly-supervised data.
- A neural network architecture that extends [18] with a refinement step.
- Achieving state-of-the-art performance on Dexter+Object using only synthetic and weakly-supervised real data, indicating cross-data generalizability.
The proposed constraints require no special data nor are they specific to a particular backbone architecture.
2 Related Work
Hand pose estimation from monocular RGB has gained traction in recent years due to numerous possible applications. Generally, there are two trains of thought.
Model-based methods ensure plausible poses by fitting a hand model to the observation via optimization. As they are not learning-based, they are sensitive to initial conditions, rely on temporal information [17, 26,27,28] or do not take the image into consideration during optimization [28]. Whereas some make use of geometric primitives [26,27,28], others simply model the joint angles directly [8, 11, 20, 22, 31, 41], learn a lower dimensional embedding of the joints [23] or pose [17], or go a step further and model the muscles of the hand [1]. Different to these methods, we propose to incorporate these constraints directly into the training procedure of a neural network in a fully differentiable manner. As such, we do not fit a hand model to the prediction, but extract and constrain the biomechanical quantities from them directly. The resulting network predicts biomechanically-plausible poses and does not suffer from the same disadvantages.
Learning-based methods utilize neural networks that either directly regress the 3D positions of the hand keypoints [18, 25, 34, 38, 44, 48] or predict the parameters of a deformable hand model [4, 5, 15, 42, 46]. Zimmermann et al. [48] are the first to use a deep neural network for root-relative 3D hand pose estimation from RGB images via a multi-staged approach. Spurr et al. [34] learn a unified latent space that projects multiple modalities into the same space, learning a lower level embedding of the hands. Similarly, Yang et al. [44] learn a latent space that disentangles background, camera and hand pose. However, all these methods require large numbers of fully labeled training data. Cai et al. [6] try to alleviate this problem by introducing an approach that utilizes paired RGB-D images to regularize the depth predictions. Mueller et al. [25] attempt to improve the quality of synthetic training data by learning a GAN model that minimizes the discrepancies between real and synthetic images. Iqbal et al. [18] decompose the task into learning 2D and root-relative depth components. This decomposition allows the use of weakly-labeled real images with only 2D pose annotations, which are cheap to acquire. While these methods demonstrate better generalization by adding a large number of weakly-labeled training samples, the main drawback of this approach is that the depth ambiguities remain unaddressed. As such, training using only 2D pose annotations does not impact the depth predictions. This may result in 3D poses with accurate 2D projections, but due to depth ambiguities the 3D poses can still be implausible. In contrast, in this work, we propose a set of biomechanical constraints that encourages the predicted 3D poses to be anatomically plausible during training (see Fig. 1). We formulate these constraints in the form of fully differentiable loss functions which can be incorporated into any deep learning architecture that predicts 3D joint configurations.
We use a variant of Iqbal et al. [18] as a baseline and demonstrate that the requirement of fully labeled real images can be significantly minimized while still maintaining performance on par with fully-supervised methods.
Other recent methods directly predict the parameters of a deformable hand model, e.g., MANO [32], from RGB images [5, 15, 29, 42, 46]. The predicted parameters consist of the shape and pose deformations wrt. a mean shape and pose that are learned using large amounts of 3D scans of the hand. Alternatively, [13, 21] circumvent the need for a parametric hand model by directly predicting the mesh vertices from RGB images. These methods require both shape and pose annotations for training, therefore obtaining such kind of training data is even harder. Hence, most methods rely on synthetic training data. Some methods [4, 5, 46] alleviate this by introducing re-projection losses that measure the discrepancy between the projection of 3D mesh with labeled 2D poses [5] or silhouettes [4, 46]. Even though they utilize strong hand priors in form of a mean hand shape and by operating on a low-dimensional PCA space, using re-projection losses with weakly-labeled data still does not guarantee that the resulting 3D poses will be anatomically plausible. Therefore, all these methods rely on a large number of fully labeled training data. In body pose estimation, such methods generally resort to adversarial losses to ensure plausibility [19].
Biomechanical constraints have also been used in the literature to encourage plausible 3D poses by imposing biomechanical limits on the structure of the hands [9, 10, 12, 24, 33, 36, 39, 40, 43] or via a learned refinement model [7]. Most methods [2, 9, 10, 24, 33, 36, 39, 43] impose these limits via inverse kinematics in a post-processing step, so the possibility of integrating them into neural network training remains unanswered. Our proposed soft constraints are fully integrated into the network, which does not require a post-refinement step during test time. Similar to our method, [12, 40] also penalize invalid bone lengths. However, we additionally model the joint limits and palmar structure.
3 Method
Our method is summarized in Fig. 2. Our key contribution is a set of novel constraints that constitute a biomechanical model of the human hand and capture the bone lengths, joint angles and shape of the palm. We emphasize that we do not fit a kinematic model to the predictions, but instead extract the quantities in question directly from the predictions in order to constrain them. Therefore the method of extraction is carefully designed to work under noisy and malformed 3D joint predictions while simultaneously being fully differentiable in any configuration. These biomechanical constraints provide an inductive bias to the neural network. Specifically, the network is guided to predict anatomically plausible hand poses for weakly-supervised data (i.e. 2D only), which in turn increases generalizability. The model can be combined with any backbone architecture that predicts 3D keypoints. We first introduce the notations used in this paper followed by the details of the proposed biomechanical losses. Finally, we discuss the integration with a variant of [18].
Notation. We use bold capital font for matrices, bold lowercase for vectors and roman font for scalars. We assume a right hand. The joints define a kinematic chain of the hand starting from the root joint \(\mathbf {j}^{3D}_1\) and ending in the fingertips. For the sake of simplicity, the joints of the hands are grouped by the fingers, denoted as the respective set \(F1,\dots ,F5\), visualized in Fig. 3a. Each \(\mathbf {j}^{3D}_i\), except the root joint (CMC), has a parent, denoted as p(i). We define a bone \(\mathbf {b}_i = \mathbf {j}^{3D}_{i+1} - \mathbf {j}^{3D}_{p(i+1)}\) as the vector pointing from the parent joint to its child joint. The bones are named according to the child joint. For example, the bone connecting MCP to PIP is called the PIP bone. We define the five root bones as the MCP bones, where one endpoint is the root \(\mathbf {j}^{3D}_1\). Intuitively, the root bones are those that lie within and define the palm. We define the bones \(\mathbf {b}_i\) with \(i=1,\dots ,5\) to correspond to the root bones of fingers \(F1, \dots , F5\). We denote the angle between the vectors \(\mathbf {v}_1, \mathbf {v}_2\) as \(\alpha (\mathbf {v}_1, \mathbf {v}_2) = \mathrm {arccos}\big (\frac{\mathbf {v}_1^T \mathbf {v}_2}{||\mathbf {v}_1||_2 \, ||\mathbf {v}_2||_2}\big )\). The interval loss is defined as \(\mathcal {I}(x;a,b) = \max (a-x,0) + \max (x-b,0)\). The normalized vector is defined as \(\mathrm {norm}(\mathbf {x}) = \frac{\mathbf {x}}{||\mathbf {x}||_2}\). Lastly, \(\mathrm {P}_{\mathbf {xy}}(\mathbf {v})\) is the orthogonal projection operator, projecting \(\mathbf {v}\) orthogonally onto the \(\mathbf {x}\)-\(\mathbf {y}\) plane where \(\mathbf {x}\),\(\mathbf {y}\) are vectors.
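The operators defined in the notation can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names and the small `EPS` stabilizer are assumptions, and the projection helper assumes that the two spanning vectors are orthonormal.

```python
import numpy as np

EPS = 1e-8  # assumed numerical stabilizer, not taken from the paper


def normalize(x):
    """norm(x) = x / ||x||_2."""
    return x / (np.linalg.norm(x) + EPS)


def angle(v1, v2):
    """alpha(v1, v2): angle between two vectors, in radians."""
    cos = np.dot(normalize(v1), normalize(v2))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def interval_loss(x, a, b):
    """I(x; a, b) = max(a - x, 0) + max(x - b, 0); zero inside [a, b]."""
    return max(a - x, 0.0) + max(x - b, 0.0)


def project_onto_plane(v, x, y):
    """P_xy(v): orthogonal projection of v onto the plane spanned by
    the orthonormal vectors x and y (orthonormality is assumed)."""
    return np.dot(v, x) * x + np.dot(v, y) * y
```

The interval loss is zero anywhere inside the valid range and grows linearly outside it, which is what allows the constraints to describe a range of valid hands rather than a single mean shape.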
3.1 Biomechanical Constraints
Our goal is to integrate our biomechanical soft constraints (BMC) into the training procedure that encourages the network to predict feasible hand poses. We seek to avoid iterative optimization approaches such as inverse kinematics in order to avert significant increases in training time.
The proposed model consists of three functional parts, visualized in Fig. 3. First, we consider the length of the bones, including the root bones of the palm. Second, we model the structure and shape of the palmar region, consisting of a rigid structure made up of individual joints. To account for inter-subject variability of bones and palm structure, it is important to not enforce a specific mean shape. Instead, we allow for these properties to lie within a valid range. Lastly, the model describes the articulation of the individual fingers. The finger motion is described via modeling of the flexion and abduction of individual bones. As their limits are interdependent, they need to be modeled jointly. As such, we propose a novel constraint that takes this interdependence into account.
The limits for each constraint can be attained manually from measurements, from the literature (e.g. [9, 33]), or acquired in a data-driven way from 3D annotations, should they be available.
Bone Length. For each bone i, we define an interval \([b^{\min }_i, b^{\max }_i]\) of valid bone length and penalize if the length \(||\mathbf {b}_i||_2\) lies outside of this interval:
This loss encourages keypoint predictions that yield valid bone lengths. Figure 3a shows the length of a bone in blue.
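The displayed equation for this loss is not reproduced in this copy. A form consistent with the definitions above applies the interval loss to every bone length and averages the result. The sketch below follows that reading; the averaging, the function name and the `parents` array layout are assumptions.

```python
import numpy as np


def bone_length_loss(joints, parents, b_min, b_max):
    """Mean interval penalty over all bones (assumed normalization).

    joints:  (J, 3) predicted 3D joints, root first.
    parents: length-J sequence; parents[k] is the parent index of joint k
             (the entry for the root is ignored).
    b_min, b_max: (J-1,) per-bone length limits.
    """
    joints = np.asarray(joints, dtype=float)
    b_min = np.asarray(b_min, dtype=float)
    b_max = np.asarray(b_max, dtype=float)
    parent_idx = np.asarray(parents[1:], dtype=int)
    bones = joints[1:] - joints[parent_idx]      # child joint minus parent joint
    lengths = np.linalg.norm(bones, axis=1)
    penalty = np.maximum(b_min - lengths, 0.0) + np.maximum(lengths - b_max, 0.0)
    return float(penalty.mean())
```

For a three-joint chain with bone lengths 1 and 2 and limits [0.5, 1.5] for both bones, only the second bone is penalized (by 0.5), giving a mean loss of 0.25.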
Root Bones. To attain valid palmar structures we first interpret the root bones as spanning a mesh and compute its curvature by following [30]:
Where \(\mathbf {e}_i\) is the edge normal at bone \(\mathbf {b}_i\):
Positive values of \(c_i\) denote an arched hand, for example when pinky and thumb touch. A flat hand has no curvature. Figure 3b visualizes the mesh in dashed yellow and the triangle over which the curvature is computed in dashed purple.
We ensure that the root bones fall within correct angular ranges by defining the angular distance between neighbouring \(\mathbf {b}_i\),\(\mathbf {b}_{i+1}\) across the plane they span:
We constrain both the curvature \(c_i\) and angular distance \(\phi _i\) to lie within a valid range \([c_i^{\min }, c_i^{\max }]\) and \([\phi _i^{\min }, \phi _i^{\max }]\):
\(\mathcal {L}_{\mathrm {RB}}\) ensures that the predicted joints of the palm define a valid structure, which is crucial since the kinematic chains of the fingers originate from this region.
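The equations for the curvature, edge normals and angular distance are not reproduced in this copy, so the following NumPy sketch is only one plausible reading of the text. It assumes that the plane normal of two neighbouring root bones is their normalized cross product (the operand order, and hence the sign convention, is a guess), that edge normals at inner bones average the two adjacent plane normals, and that the curvature is the change in edge normal along the change in bone direction. All function names are hypothetical.

```python
import numpy as np


def _unit(v):
    return v / (np.linalg.norm(v) + 1e-8)


def _interval(x, lo, hi):
    return np.maximum(lo - x, 0.0) + np.maximum(x - hi, 0.0)


def root_bone_terms(root_bones):
    """root_bones: (5, 3) array holding b_1..b_5.
    Returns curvature c (4,) and angular distance phi (4,)."""
    b = np.asarray(root_bones, dtype=float)
    # Plane normals of neighbouring root bones (cross-product order assumed).
    n = np.stack([_unit(np.cross(b[i + 1], b[i])) for i in range(4)])
    # Edge normals: outer bones reuse the adjacent normal, inner bones average two.
    e = np.stack([n[0]] + [_unit(n[k] + n[k - 1]) for k in range(1, 4)] + [n[3]])
    d = b[1:] - b[:-1]
    c = np.sum((e[1:] - e[:-1]) * d, axis=1) / (np.sum(d * d, axis=1) + 1e-8)
    cos = np.sum(b[:-1] * b[1:], axis=1) / (
        np.linalg.norm(b[:-1], axis=1) * np.linalg.norm(b[1:], axis=1) + 1e-8)
    phi = np.arccos(np.clip(cos, -1.0, 1.0))
    return c, phi


def root_bone_loss(root_bones, c_lim, phi_lim):
    """Mean interval penalty on curvature and angular distance."""
    c, phi = root_bone_terms(root_bones)
    return float(np.mean(_interval(c, *c_lim) + _interval(phi, *phi_lim)))
```

With this reading, five coplanar root bones fanned out at equal angles yield zero curvature, matching the statement that a flat hand has no curvature.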
Joint Angles. To compute the joint angles, we first need to define a consistent frame \(\mathbf {F}_i\) of a local coordinate system for each finger bone \(\mathbf {b}_i\). \(\mathbf {F}_i\) must be consistent with respect to the movements of the finger. In other words, if one constructs \(\mathbf {F}_i\) given a pose \(\mathbf {J}^{3D}_1\), then moves the fingers and corresponding \(\mathbf {F}_i\) into pose \(\mathbf {J}^{3D}_2\), the resulting \(\mathbf {F}_i\) should be the same as if constructed from \(\mathbf {J}^{3D}_2\) directly.
We assume right-handed coordinate systems. To construct \(\mathbf {F}_i\), we define two out of three axes based on the palm. We start with the first layer of finger bones (PIP bones). We define their respective z-component of \(\mathbf {F}_i\) as the normalized bone of their respective parent bone (in this case, the root bones): \(\mathbf {z}_i = \text {norm}(\mathbf {b}_{p(i)})\). Next, we define the x-axis, based on the plane normals spanned by two neighbouring root bones:
Where \(\mathbf {n}_i\) is defined as in Eq. 2. Lastly, we compute the last axis \(\mathbf {y}_i = \text {norm}(\mathbf {z}_i \times \mathbf {x}_i)\). Given \(\mathbf {F}_i\), we can now define the flexion and abduction angles. Each of these angles is given with respect to the local z-axis of \(\mathbf {F}_i\). Given \(\mathbf {b}_i\) in its local coordinates \(\mathbf {b}_i^{\mathbf {F}_i}\) wrt. \(\mathbf {F}_i\), we define the flexion and abduction angles as:
Figure 3c visualizes \(\mathbf {F}_i\) and the resulting angles. Note that this formulation leads to ambiguities, where different bone orientations can map to the same (\(\theta ^{\mathrm {f}}_i\), \(\theta ^{\mathrm {a}}_i\))-point. We resolve this via an octant lookup, which leads to angles in the intervals \(\theta ^{\mathrm {f}}_i \in [-\pi ,\pi ]\) and \(\theta ^{\mathrm {a}}_i \in [-\pi /2,\pi /2]\) respectively. See appendix for more details.
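The angle definitions themselves are missing from this copy. One reading that agrees with the text treats flexion as the angle between the local z-axis and the bone projected onto the local x-z plane, and abduction as the angle between the bone and that projection. The NumPy sketch below follows this assumption, returns unsigned angles only, and does not reproduce the octant lookup; the function names are hypothetical.

```python
import numpy as np


def _angle(u, v):
    """Unsigned angle via atan2, which stays stable for near-parallel vectors."""
    return float(np.arctan2(np.linalg.norm(np.cross(u, v)), np.dot(u, v)))


def flexion_abduction(bone, frame):
    """bone: (3,) vector in world coordinates.
    frame: (3, 3) matrix whose rows are the orthonormal local x, y, z axes."""
    b_local = frame @ bone                              # bone in local coordinates
    b_xz = np.array([b_local[0], 0.0, b_local[2]])      # projection onto local x-z plane
    z_axis = np.array([0.0, 0.0, 1.0])
    flexion = _angle(b_xz, z_axis)
    abduction = _angle(b_xz, b_local)
    return flexion, abduction
```

A bone tilted by 30 degrees within the x-z plane gives a flexion of 30 degrees and no abduction, while a bone tilted by 20 degrees towards the y-axis gives an abduction of 20 degrees and no flexion.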
Given the angles of the first set of finger bones, we can then construct the remaining two rows of finger bones. Let \(\mathbf {R}^{\theta _i}\) denote the rotation matrix that rotates by \(\theta ^{\mathrm {f}}_i\) and \(\theta ^{\mathrm {a}}_i\) such that \(\mathbf {R}^{\theta _i} \mathbf {z}_i = \mathbf {b}_i^{\mathbf {F}_i}\), then we iteratively construct the remaining frames along the kinematic chain of the fingers:
This method of frame construction via rotating by \(\theta ^{\mathrm {f}}_i\) and \(\theta ^{\mathrm {a}}_i\) ensures consistency across poses. The remaining angles can be acquired as described in Eq. 5.
Lastly, the angles need to be constrained. One way to do this is to consider each angle independently and penalize them if they lie outside an interval. This corresponds to constraining them within a box in a 2D space, where the endpoints are the min/max of the limits. However, finger angles have inter-dependency, therefore we propose an alternative approach to account for this. Given points \(\theta _i = (\theta ^{\mathrm {f}}_i, \theta ^{\mathrm {a}}_i)\) that define a range of motion, we approximate their convex hull on the \((\theta ^{\mathrm {f}}, \theta ^{\mathrm {a}})\)-plane with a fixed set of points \(\mathcal {H}_i\). The angles are constrained to lie within this structure by minimizing their distance to it:
Where \(D_H\) is the distance of point \(\theta _i\) to the hull \(\mathcal {H}_i\). Details on the convex hull approximation and implementation can be found in the appendix.
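To illustrate the idea of penalizing the distance to a hull, the sketch below computes the Euclidean distance from a point \((\theta ^{\mathrm {f}}, \theta ^{\mathrm {a}})\) to a convex polygon: zero inside, otherwise the distance to the nearest edge. It is a plain-Python illustration under the assumption that the hull is given as vertices in counter-clockwise order, and it is not the differentiable approximation described in the paper's appendix. All names are hypothetical.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    denom = dx * dx + dy * dy
    if denom == 0.0:
        t = 0.0
    else:
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / denom
        t = max(0.0, min(1.0, t))                 # clamp to the segment
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def _inside_convex(p, hull):
    """True if p lies inside or on a convex polygon with CCW vertices."""
    n = len(hull)
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        if (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) < 0:
            return False
    return True


def hull_distance(theta, hull):
    """Zero inside the hull, otherwise distance to the closest edge."""
    if _inside_convex(theta, hull):
        return 0.0
    n = len(hull)
    return min(_point_segment_distance(theta, hull[i], hull[(i + 1) % n])
               for i in range(n))
```

Compared with a per-angle box constraint, a hull can exclude corner regions of the box, which is how the interdependence between flexion and abduction limits is captured.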
3.2 \(\mathbf {Z}^{\mathrm {root}}\) Refinement
The 2.5D joint representation allows us to recover the value of the absolute pose \(Z^{root}\) up to a scaling factor. This is done by solving a quadratic equation dependent on the 2D projection \(\mathbf {J}^{2D}\) and relative depth values \(\mathbf {z}^r\), as proposed in [18]. In practice, small errors in \(\mathbf {J}^{2D}\) or \(\mathbf {z}^r\) can result in large deviations of \(Z^{root}\). This leads to big fluctuations in the translation and scale of the predicted pose, which is undesirable. To alleviate these issues, we employ an MLP to refine and smooth the calculated \(\hat{Z}^{root}\):
Where \(M_\mathrm {MLP}\) is a multilayer perceptron with parameters \(\mathbf {\omega }\) that takes the predicted and calculated values as input and outputs a residual term. Alternatively, one could predict \(Z^{root}\) directly using an MLP with the same input. However, as the exact relationship between the predicted variables and \(Z^{root}\) is known, we resort to the refinement approach instead of requiring a model to learn what is already known.
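The residual structure of the refinement can be sketched as below. The exact inputs and layer sizes of the MLP are not recoverable from this copy, so the feature layout (relative depths, 2D keypoints and the calculated root depth concatenated into one vector), the single ReLU hidden layer and all names are assumptions made for illustration.

```python
import numpy as np


def refine_z_root(z_root_hat, z_rel, joints_2d, params):
    """Return the calculated root depth plus an MLP residual.

    z_root_hat: scalar root depth obtained from the closed-form solution.
    z_rel:      (J,) predicted root-relative depths.
    joints_2d:  (J, 2) predicted 2D keypoints.
    params:     (w1, b1, w2, b2) weights of a one-hidden-layer MLP (assumed).
    """
    w1, b1, w2, b2 = params
    features = np.concatenate([np.ravel(z_rel), np.ravel(joints_2d), [z_root_hat]])
    hidden = np.maximum(w1 @ features + b1, 0.0)   # ReLU hidden layer
    residual = float(w2 @ hidden + b2)
    return z_root_hat + residual
```

Because the network only outputs a correction, an untrained MLP with near-zero output leaves the closed-form estimate essentially unchanged, which is the motivation given for refining instead of predicting \(Z^{root}\) from scratch.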
3.3 Final Loss
The biomechanical soft constraints are constructed as follows:
Our final model is trained on the following loss function:
where \(\mathcal {L}_{\mathbf {J}^{2D}}\), \(\mathcal {L}_{\mathbf {z}^r}\) and \(\mathcal {L}_{\mathrm {Z^{root}}}\) are the L1 loss on any available \(\mathbf {J}^{2D}\), \(\mathbf {z}^r\) and \(Z^{root}\) labels respectively. The weights \(\lambda _i\) balance the individual loss terms.
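The way supervised terms are applied only where labels exist can be sketched in plain Python. The weighting scheme, the unweighted sum of the three biomechanical terms and the dictionary keys are assumptions, since the loss equations themselves are not reproduced in this copy.

```python
def l1(pred, target):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(pred, labels, bmc_terms, weights):
    """Weighted sum of the available supervised terms and the BMC terms.

    pred, labels: dicts keyed by 'j2d', 'z_rel', 'z_root'; a label that is
                  None (e.g. depth for a 2D-only sample) is skipped.
    bmc_terms:    (L_BL, L_RB, L_A), which need no labels at all.
    weights:      dict with one lambda per key plus 'bmc'.
    """
    loss = 0.0
    for key in ("j2d", "z_rel", "z_root"):
        if labels.get(key) is not None:
            loss += weights[key] * l1(pred[key], labels[key])
    loss += weights["bmc"] * sum(bmc_terms)
    return loss
```

For a weakly-supervised sample only the 2D term and the biomechanical term contribute, so the constraints are the sole training signal acting on the depth of that sample.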
4 Implementation
We use a ResNet-50 backbone [16]. The input to our model is a \(128 \times 128\) RGB image from which the 2.5D representation is directly regressed. The model and its refinement step are trained on fully supervised and weakly-supervised data. The network was trained for 70 epochs using SGD with a step-wise learning rate decay of 0.1 after every 30 epochs. We apply the biomechanical constraints directly on the predicted 3D keypoints \(\mathbf {J}^{3D}\).
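The step-wise decay described above can be written as a small schedule function. The initial learning rate is not available in this copy, so `base_lr` is a placeholder argument and the function name is hypothetical.

```python
def step_lr(base_lr, epoch, decay=0.1, step=30):
    """Learning rate after multiplying by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Over a 70-epoch run the rate changes twice: at epoch 30 and at epoch 60.
schedule = [step_lr(1.0, epoch) for epoch in range(70)]
```

With a 70-epoch budget this gives three phases: epochs 0 to 29 at the base rate, epochs 30 to 59 at one tenth of it, and epochs 60 to 69 at one hundredth.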
5 Evaluation
Here we introduce the datasets used, show the performance of our proposed \(\mathcal {L}_{\mathrm {\mathbf {BMC}}}\) and compare in extensive settings. Specifically, we study the effect of adding weakly supervised data to complement fully supervised training. All experiments are conducted in a setting where we assume access to a fully supervised dataset, as well as a supplementary weakly supervised real dataset. Therefore we have access to 2D ground-truth annotations and the computed constraint limits. We study two cases of 3D supervision sources:
Synthetic Data. We choose RHD. Acquiring fully labeled synthetic data is substantially easier than acquiring real data. Sections 5.3–5.5 consider this setting.
Partially Labeled Real Data. In Sect. 5.6 we gradually increase the number of real 3D labeled samples to study how the proposed approach works under different ratios of fully to weakly supervised data.
To make clear what kind of supervision is used, we write \(\mathbf {3D}_\mathrm {A}\) if 3D annotation from dataset \(\mathrm {A}\) is used, and \(\mathbf {2D}_\mathrm {A}\) if 2D annotation from dataset \(\mathrm {A}\) is used. Sections 5.3 and 5.4 are evaluated on FH.
5.1 Datasets
Each dataset that provides 3D labels comes with the camera intrinsics. Hence the 2D pose can be easily acquired from the 3D pose. Table 1 provides an overview of the datasets used. The test sets of HO-3D and FH are available only via a submission system with a limited number of total submissions. Therefore, for the ablation study (Sect. 5.4) and for inspecting the effect of weak-supervision (Sect. 5.3), we divide the training set into a training and validation split. For these sections, we choose to evaluate on FH due to its large number of samples and variability in both hand pose and shape.
5.2 Evaluation Metric
HO-3D. The error given by the submission system is the mean joint error in mm. The INTERP is the error on test frames sampled from training sequences that are not present in the training set. The EXTRAP is the error on test samples that have neither hand shapes nor objects present in the training set. We used the version of the dataset that was available at the time [3].
FH. The error given by the submission system is the mean joint error in mm. Additionally, the area under the curve (AUC) of the percentage of correct keypoints (PCK) plot is reported. The PCK thresholds lie in the interval from 0 mm to 50 mm with 100 equally spaced thresholds. Both the aligned (using Procrustes analysis) and unaligned scores are given. We report the aligned score. The unaligned score can be found in the appendix.
D+O. We report the AUC for the PCK thresholds of 20 to 50 mm to be comparable with prior work [5, 46, 49]. For [18, 25, 34, 48] we report the numbers as presented in [46] as they consolidate all AUC of related work in a consistent manner using the same PCK thresholds. For [4], we recomputed the AUC for the same interval based on the values provided by the authors.
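As an illustration of the metric, the PCK curve and its AUC over 100 equally spaced thresholds between 0 mm and 50 mm can be computed as sketched below. The trapezoidal integration and the normalization by the threshold range are assumptions about how the benchmark integrates the curve, and the function name is hypothetical.

```python
def pck_auc(errors_mm, t_min=0.0, t_max=50.0, steps=100):
    """Area under the PCK curve, normalized to [0, 1].

    errors_mm: per-keypoint Euclidean errors in millimetres.
    """
    thresholds = [t_min + (t_max - t_min) * k / (steps - 1) for k in range(steps)]
    # PCK(t): fraction of keypoints whose error is at most t.
    pck = [sum(e <= t for e in errors_mm) / len(errors_mm) for t in thresholds]
    area = sum((pck[k] + pck[k + 1]) / 2.0 * (thresholds[k + 1] - thresholds[k])
               for k in range(steps - 1))
    return area / (t_max - t_min)
```

A model whose keypoints all have zero error reaches an AUC of 1, while a model whose errors all exceed 50 mm scores 0.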
5.3 Effect of Weak-Supervision
We first inspect how weak-supervision affects the performance of the model. We decompose the 3D prediction error on the validation set of FH in terms of its 2D (\(\mathbf {J}^{2D}\)) and depth component (\(\mathbf {Z}\)) via the pinhole camera model \(\mathbf {Z}^{-1}\mathbf {K}\mathbf {J}^{3D} = \mathbf {J}^{2D}\) and evaluate their individual error.
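The decomposition can be sketched in NumPy: 3D joints are projected with the intrinsics \(\mathbf {K}\) to obtain the 2D component, and the depth component is compared separately. How exactly the depth error is aggregated in the paper is not stated here, so the mean absolute difference below and the function names are assumptions.

```python
import numpy as np


def project(joints_3d, K):
    """Pinhole projection J2D = (1/Z) K J3D for (J, 3) camera-space joints."""
    uvw = joints_3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def decompose_error(pred_3d, gt_3d, K):
    """Split the 3D error into a 2D part (pixels) and a depth part."""
    err_2d = np.linalg.norm(project(pred_3d, K) - project(gt_3d, K), axis=1).mean()
    err_z = np.abs(pred_3d[:, 2] - gt_3d[:, 2]).mean()
    return float(err_2d), float(err_z)
```

A prediction that lies on the same camera ray as the ground truth but at a different depth has zero 2D error and a non-zero depth error, which is exactly the ambiguity that 2D supervision alone cannot resolve.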
We train four models using different data sources: 1) full 3D supervision on both synthetic RHD and real FH (\(\mathbf {3D}_\mathrm {RHD}+\mathbf {3D}_\mathrm {FH}\)), which serves as an upper bound for when all 3D labels are available; 2) full supervision on RHD only (\(\mathbf {3D}_\mathrm {RHD}\)), which constitutes our lower bound on accuracy; 3) full supervision on RHD with naive application of weakly-supervised FH (\(+\mathbf {2D}_\mathrm {FH}\)); 4) the same as setting 3) but with our proposed constraints added (\(\mathbf {+ \mathcal {L}_\mathrm {\mathbf {BMC}}}\)).
Table 2 shows the results. The model trained with full 3D supervision from real and synthetic data reflects the best setting. Adding \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) during training slightly reduces 3D error (8.78 mm to 8.6 mm), primarily due to a regularization effect. When the model is trained only on synthetic data (\(\mathbf {3D}_\mathrm {RHD}\)) we observe a significant rise (8.78 mm to 30.82 mm) in 3D error due to the poor generalization from synthetic data. When weak-supervision is provided from the real data (\(+\mathbf {2D}_\mathrm {FH}\)), the error is reduced (30.82 mm to 20.92 mm). However, inspecting this more closely we observe that the improvement comes mainly from 2D error reduction (12.35 px to 3.8 px), whereas the depth component is improved only marginally (20.02 mm to 17.02 mm). Observing these samples qualitatively (Fig. 1), we see that many do not adhere to the biomechanical limits of the human hand. By penalizing such violations via our proposed losses \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) in the weakly supervised setting, we see a significant improvement in 3D error (20.92 mm to 13.78 mm), which is due to improved depth accuracy (20.02 mm to 9.97 mm). Inspecting the predictions (e.g. Fig. 1) more closely, we see that the model predicts the correct 3D pose in challenging settings such as heavy self- and object occlusion, despite having never seen such samples in 3D. Since \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) describes a valid range rather than a specific pose, slight deviations from the ground truth 3D pose have to be expected, which explains the small remaining quantitative gap from the fully supervised model.
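The relative improvements quoted in the abstract and in the contribution list follow from these numbers when each reduction is measured against the synthetic-only baseline. A short worked calculation, using only values stated in the text (the helper name is hypothetical):

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Depth error in mm, relative to training on synthetic data only (20.02 mm).
depth_2d_only = relative_reduction(20.02, 17.02)     # roughly 15 %
depth_with_bmc = relative_reduction(20.02, 9.97)     # roughly 50 %

# Full 3D error in mm, relative to the same baseline (30.82 mm).
pose_2d_only = relative_reduction(30.82, 20.92)      # roughly 32 %
pose_with_bmc = relative_reduction(30.82, 13.78)     # roughly 55 %
```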
5.4 Ablation Study
We quantify the individual contributions of our proposals on the validation set of FH and reproduce these results on HO-3D in the supplementary material. Each error metric is computed for the root-relative 3D pose.
Refinement Network. Table 3 shows the impact of \(Z^{root}\) refinement (Sect. 3.2). We train two models that include (w. refinement) or omit (w/o refinement) the refinement step, using full supervision on FH (\(\mathbf {3D}_\mathrm {FH}\)). Using refinement, the mean error is reduced by 1.44 mm, which indicates that refining effectively reduces outliers.
Components of BMC. In Table 4, we perform a series of experiments where we incrementally add each of the proposed constraints. For 3D guidance, we use the synthetic RHD and only use the 2D labels of FH. We first run the baseline model trained only on this data (\(\mathbf {3D}_\mathrm {RHD} + \mathbf {2D}_\mathrm {FH}\)). Next, we add the bone length loss \(\mathcal {L}_{\mathrm {BL}}\), followed by the root bone loss \(\mathcal {L}_{\mathrm {RB}}\) and the angle loss \(\mathcal {L}_\mathrm {A}\). An upper bound is given by our model trained fully supervised on both datasets (\(\mathbf {3D}_\mathrm {RHD} + \mathbf {3D}_\mathrm {FH}\)). Each component contributes positively towards the final performance, totalling a decrease of 6.24 mm in mean error compared to our weakly-supervised baseline and significantly closing the gap to the fully supervised upper bound. A qualitative assessment of the individual losses can be seen in Fig. 4.
Co-dependency of Angles. In Table 5, we show the importance of modeling the dependencies between the flexion and abduction angle limits (Sect. 3), instead of regarding them independently. Co-dependent angle limits yield a decrease in mean error of 1.40 mm.
Constraint Limits. In Table 6, we investigate the effect of the used limits on the final performance, as one may have to resort to approximations. For this, we instead take the hand parameters from RHD and perform the same weakly-supervised experiment as before (\(+\mathcal {L}_\mathrm {\mathbf {BMC}}\)). Approximating the limits from another dataset slightly increases the error, but still clearly outperforms the 2D baseline.
5.5 Bootstrapping with Synthetic Data
We validate \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) on the test sets of FH and HO-3D. We train the same four models as in Sect. 5.3 using fully supervised RHD and weakly-supervised real data R\(\in \)[FH,HO-3D].
For all results here we perform training on the full dataset and evaluate on the official test split via the online submission system. Additionally, we evaluate the cross-dataset performance on the D+O dataset to show how our proposed constraints improve generalizability and compare with prior work [4, 5, 18, 25, 46].
FH. The second column of Table 7 shows the dataset performance for R = FH. Training solely on RHD (\(\mathbf {3D_\mathrm {RHD}}\)) performs the worst. Adding real data (\(+ \mathbf {2D_\mathrm {FH}}\)) with 2D labels reduces the error, as we reduce the real/synthetic domain gap. Including the proposed \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) results in an accuracy boost.
HO-3D. The third column of Table 7 shows a similar trend for R = HO-3D. Most notably, our constraints yield a decrease of 14.85 mm for INTERP. This is significantly larger than the relative decrease the 2D data adds (-8.41 mm). For EXTRAP, BMC yields an improvement of 1.15 mm, which is close to the 1.27 mm gained from 2D data. This demonstrates that \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) is beneficial in leveraging 2D data more effectively in unseen scenarios.
D+O. In Table 8 we demonstrate the cross-data performance on D+O for R = FH. Most recent works have made use of MANO [4, 5, 46], leveraging a low-dimensional embedding of highly detailed hand scans and require custom synthetic data [4, 5] to fit the shape. Using only fully supervised synthetic data and weakly-supervised real data in conjunction with \(\mathcal {L}_\mathrm {\mathbf {BMC}}\), we reach state-of-the-art.
5.6 Bootstrapping with Real Data
We study the impact of our biomechanical constraints on reducing the number of labeled samples required in scenarios where few real 3D labeled samples are available. We train a model in a setting where a fraction of the data contains the full 3D labels and the remainder contains only 2D supervision.
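The mixed-label setting described above can be sketched as a per-sample masked loss. The example assumes L1 losses on 2D keypoints and on depth, a boolean flag `has_3d` marking fully labeled samples, and a precomputed scalar `bmc_term` standing in for \(\mathcal {L}_\mathrm {\mathbf {BMC}}\); the function name, the weighting `w_bmc`, and the exact loss form are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def mixed_supervision_loss(pred_2d, pred_z, gt_2d, gt_z, has_3d,
                           bmc_term=0.0, w_bmc=1.0):
    """Combine 2D, depth and (optional) biomechanical terms for one batch.

    pred_2d, gt_2d: (B, J, 2) image-plane keypoints, supervised for every sample.
    pred_z, gt_z:   (B, J) depth values, supervised only where has_3d is True.
    has_3d:         (B,) boolean mask of samples that carry full 3D labels.
    bmc_term:       precomputed scalar penalty (hypothetical placeholder).
    """
    # The 2D term uses every sample, weakly and fully labeled alike.
    loss_2d = np.abs(pred_2d - gt_2d).mean()

    # The depth term is restricted to samples with 3D ground truth.
    if has_3d.any():
        loss_z = np.abs(pred_z[has_3d] - gt_z[has_3d]).mean()
    else:
        loss_z = 0.0

    return loss_2d + loss_z + w_bmc * bmc_term
```

In this sketch the biomechanical term needs no labels at all, so it can be evaluated on the predictions of the weakly labeled samples, where no depth supervision is available.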
Here we choose \(R = \) FH, use the entire training set and evaluate on the test set. For each fraction of fully labeled data we evaluate two models. The first is trained on both the fully and weakly labeled samples. The second is trained with the addition of our proposed constraints. We show the results in Fig. 5. For a given AUC, we plot the number of labeled samples required to reach it. We observe that for lower labeling percentages, the amount of labeled data required is approximately halved when using \(\mathcal {L}_\mathrm {\mathbf {BMC}}\). This showcases its effectiveness in low-label settings and demonstrates the reduced requirement for fully annotated training data.
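As an illustration of how such a plot can be read, the sketch below linearly interpolates the number of labeled samples needed to reach a target AUC from a measured (samples, AUC) curve. The helper name is hypothetical, the linear interpolation is an assumption, and the numbers in the usage lines are placeholders chosen for illustration, not values from Fig. 5.

```python
import numpy as np


def samples_to_reach(target_auc, n_labeled, auc):
    """Interpolate the number of labeled samples needed to reach target_auc.

    n_labeled and auc are matching 1D sequences sorted so that auc is
    increasing. Returns nan if target_auc lies outside the measured range.
    """
    n_labeled = np.asarray(n_labeled, dtype=float)
    auc = np.asarray(auc, dtype=float)
    if target_auc < auc[0] or target_auc > auc[-1]:
        return float("nan")
    # np.interp expects increasing x-coordinates; here the x-axis is AUC.
    return float(np.interp(target_auc, auc, n_labeled))


# Placeholder curves (NOT results from the paper), for illustration only.
n = [100, 200, 400]
auc_without_bmc = [0.5, 0.6, 0.7]
auc_with_bmc = [0.6, 0.7, 0.8]

needed_without = samples_to_reach(0.65, n, auc_without_bmc)
needed_with = samples_to_reach(0.65, n, auc_with_bmc)
ratio = needed_with / needed_without  # fraction of labels needed with the constraints
```

A ratio close to 0.5 at a given AUC corresponds to the halving of required labeled data described above.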
6 Conclusion
We propose a set of fully differentiable biomechanical losses to more effectively leverage weakly supervised data. Our method encourages anatomically correct predictions of a backbone network via a set of novel losses that penalize invalid bone lengths, joint angles and palmar structures. Furthermore, we have shown experimentally that our constraints leverage weakly-supervised data more effectively, improving both within- and cross-dataset performance. Our method reaches state-of-the-art performance on the aligned D+O objective using 3D synthetic and 2D real data and reduces the need for training data by half in low-label settings on FH.
References
Albrecht, I., Haber, J., Seidel, H.P.: Construction and animation of anatomically based human hand models. In: SIGGRAPH (2003)
Aristidou, A.: Hand tracking with physiological constraints. Vis. Comput. 34(2), 213–228 (2018). https://doi.org/10.1007/s00371-016-1327-8
Armagan, A., et al.: Measuring generalisation to unseen viewpoints, articulations, shapes and objects for 3D hand pose estimation under hand-object interaction. In: ECCV (2020)
Baek, S., Kim, K.I., Kim, T.K.: Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In: CVPR (2019)
Boukhayma, A., de Bem, R., Torr, P.H.: 3D hand shape and pose from images in the wild. In: CVPR (2019)
Cai, Y., Ge, L., Cai, J., Yuan, J.: Weakly-supervised 3D hand pose estimation from monocular RGB images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 678–694. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_41
Cai, Y., et al.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: CVPR (2019)
Cerveri, P., De Momi, E., Lopomo, N., Baud-Bovy, G., Barros, R., Ferrigno, G.: Finger kinematic modeling and real-time hand motion estimation. Ann. Biomed. Eng. 35(11), 1989–2002 (2007). https://doi.org/10.1007/s10439-007-9364-0
Chen Chen, F., Appendino, S., Battezzato, A., Favetto, A., Mousavi, M., Pescarmona, F.: Constraint study for a hand exoskeleton: human hand kinematics and dynamics. J. Robot. (2013)
Cobos, S., Ferre, M., Uran, M.S., Ortego, J., Pena, C.: Efficient human hand kinematics for manipulation tasks. In: IROS (2008)
Cordella, F., Zollo, L., Guglielmelli, E., Siciliano, B.: A bio-inspired grasp optimization algorithm for an anthropomorphic robotic hand. Int. J. Interact. Des. Manuf. 6(2), 113–122 (2012). https://doi.org/10.1007/s12008-012-0149-9
Dibra, E., Wolf, T., Oztireli, C., Gross, M.: How to refine 3D hand pose estimation from unlabelled depth data? In: 3DV (2017)
Ge, L., et al.: 3D hand shape and pose estimation from a single RGB image. In: CVPR (2019)
Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: HOnnotate: a method for 3D annotation of hand and object poses. In: CVPR (2020)
Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Heap, T., Hogg, D.: Towards 3D hand tracking using a deformable model. In: FG (1996)
Iqbal, U., Molchanov, P., Breuel, T., Gall, J., Kautz, J.: Hand pose estimation via latent 2.5D heatmap regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 125–143. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_8
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
Kuch, J.J., Huang, T.S.: Vision based hand modeling and tracking for virtual teleconferencing and telecollaboration. In: CVPR (1995)
Kulon, D., Wang, H., Güler, R.A., Bronstein, M., Zafeiriou, S.: Single image 3D hand reconstruction with mesh convolutions. In: BMVC (2019)
Lee, J., Kunii, T.L.: Model-based analysis of hand posture. IEEE Comput. Graph. Appl. 15(5), 77–86 (1995)
Lin, J., Wu, Y., Huang, T.S.: Modeling the constraints of human hand motion. In: IEEE Workshop on Human Motion (2000)
Melax, S., Keselman, L., Orsten, S.: Dynamics based 3D skeletal hand tracking. In: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (2013)
Mueller, F., et al.: GANerated hands for real-time 3D hand tracking from monocular RGB. In: CVPR (2018)
Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In: ICCV (2011)
Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Efficient model-based 3D tracking of hand articulations using Kinect. In: BMVC (2011)
Panteleris, P., Oikonomidis, I., Argyros, A.: Using a single RGB frame for real time 3D hand pose estimation in the wild. In: WACV (2017)
Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR (2019)
Reed, N.: What is the simplest way to compute principal curvature for a mesh triangle? (2019). https://computergraphics.stackexchange.com/questions/1718/what-is-the-simplest-way-to-compute-principal-curvature-for-a-mesh-triangle
Rhee, T., Neumann, U., Lewis, J.P.: Human hand modeling from surface anatomy. In: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (2006)
Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. In: SIGGRAPH-Asia (2017)
Ryf, C., Weymann, A.: The neutral zero method–a principle of measuring joint function. Injury 26, 1–11 (1995)
Spurr, A., Song, J., Park, S., Hilliges, O.: Cross-modal deep variational hand pose estimation. In: CVPR (2018)
Sridhar, S., Mueller, F., Zollhöfer, M., Casas, D., Oulasvirta, A., Theobalt, C.: Real-time joint tracking of a hand manipulating an object from RGB-D input. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 294–310. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_19
Sridhar, S., Oulasvirta, A., Theobalt, C.: Interactive markerless articulated hand motion tracking using RGB and depth data. In: ICCV (2013)
Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: ICCV (2017)
Tekin, B., Bogo, F., Pollefeys, M.: H+O: unified egocentric recognition of 3D hand-object poses and interactions. In: CVPR (2019)
Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. (ToG) 33(5), 1–10 (2014)
Wan, C., Probst, T., Gool, L.V., Yao, A.: Self-supervised 3D hand pose estimation through training by fitting. In: CVPR (2019)
Wu, Y., Huang, T.S.: Capturing articulated human hand motion: a divide-and-conquer approach. In: ICCV (1999)
Xiang, D., Joo, H., Sheikh, Y.: Monocular total capture: posing face, body, and hands in the wild. In: CVPR (2019)
Xu, C., Cheng, L.: Efficient hand pose estimation from a single depth image. In: ICCV (2013)
Yang, L., Yao, A.: Disentangling latent hands for image synthesis and pose estimation. In: CVPR (2019)
Zhang, J., Jiao, J., Chen, M., Qu, L., Xu, X., Yang, Q.: 3D hand pose tracking and estimation using stereo matching. arXiv:1610.07214 (2016)
Zhang, X., Li, Q., Mo, H., Zhang, W., Zheng, W.: End-to-end hand mesh recovery from a monocular RGB image. In: ICCV (2019)
Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: ICCV (2017)
Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: ICCV (2017)
Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In: ICCV (2019)
Acknowledgments
We are grateful to Christoph Gebhardt and Shoaib Ahmed Siddiqui for the aid in figure creation and Abhishek Badki for helpful discussions.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Spurr, A., Iqbal, U., Molchanov, P., Hilliges, O., Kautz, J. (2020). Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12362. Springer, Cham. https://doi.org/10.1007/978-3-030-58520-4_13
DOI: https://doi.org/10.1007/978-3-030-58520-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58519-8
Online ISBN: 978-3-030-58520-4