
1 Introduction

Vision-based reconstruction of the 3D pose of human hands is a difficult problem that has applications in many domains. Given that RGB sensors are ubiquitous, recent work has focused on estimating the full 3D pose [6, 18, 25, 34, 44] and dense surface [5, 13, 15] of human hands from 2D imagery alone. This task is challenging due to the dexterity of the human hand, self-occlusions, varying lighting conditions and interactions with objects. Moreover, any given 2D point in the image plane can correspond to multiple 3D points in world space, all of which project onto that same 2D point. This makes 3D hand pose estimation from monocular imagery an ill-posed inverse problem in which depth and the resulting scale ambiguity pose a significant difficulty.

Fig. 1.

Impact of the proposed biomechanical constraints (BMC). (b, e) Supplementing fully supervised data with 2D-annotated data yields 3D poses with correct 2D projections that are nonetheless anatomically implausible. (c, f) Adding our biomechanical constraints significantly improves the pose prediction quantitatively and qualitatively. The resulting 3D poses are anatomically valid and display more accurate depth/scale even under severe self- and object occlusions, and are thus closer to the ground truth (d, g).

Most recent methods use deep neural networks for hand pose estimation and rely on a combination of fully labeled real and synthetic training data (e.g., [4, 6, 15, 18, 25, 34, 46, 48]). However, acquiring full 3D annotations for real images is very difficult as it requires complex multi-view setups and labour-intensive manual annotation of 2D keypoints in all views [14, 45, 49]. On the other hand, synthetic data does not generalize well to realistic scenarios due to domain discrepancies. Some works attempt to alleviate this by leveraging additional 2D-annotated images [5, 18]. Such weakly-supervised data is far easier to acquire for real images than full 3D annotations. These methods use the 2D annotations in a straightforward way, either as a reprojection loss [5] or as supervision for the 2D component only [18]. However, we find that the improvements stemming from including weakly-supervised data in such a manner are mainly the result of 3D poses that better agree with the 2D projection. The uncertainties arising from depth ambiguities remain largely unaddressed, and the resulting 3D poses can still be implausible. Therefore, these methods still rely on large amounts of fully annotated training data to reduce these ambiguities. In contrast, our goal is to minimize the requirement for 3D-annotated data and maximize the utility of weakly-labeled real data.

To this end, we propose a set of biomechanically inspired constraints (BMC) which can be integrated into the training of neural networks to enable anatomically plausible 3D hand poses even for data with 2D supervision only. Our key insight is that the human hand is subject to a set of limitations imposed by its biomechanics. We model these limitations in a differentiable manner as a set of soft constraints. Note that this is a challenging problem: while bone-length constraints have been used successfully [37, 47], capturing other biomechanical aspects is more difficult. Instead of fitting a hand model to the predictions, we extract the quantities in question directly from the predictions in order to constrain them. As such, the method of extraction has to be carefully designed to work under noisy and malformed 3D joint predictions while simultaneously being fully differentiable for any pose. We encode these constraints as a set of losses that are fully differentiable, interpretable and can be incorporated into the training of any deep learning architecture that predicts 3D joint configurations. Due to this integration, we do not require a post-refinement step at test time. More specifically, our set of soft constraints consists of three terms that define i) the range of valid bone lengths, ii) the range of valid palm structures, and iii) the range of valid joint angles of the thumb and fingers. The main advantage of our constraints is that all parameters are interpretable and can either be set manually, opening up the possibility of personalization, or be obtained from a small set of data points for which 3D labels are available. As our backbone, we use the 2.5D representation proposed by Iqbal et al. [18] due to its superior performance. We identify an issue in the absolute depth calculation and remedy it via a novel refinement network. In summary, we contribute:

  • A novel set of differentiable soft constraints inspired by the biomechanical structure of the human hand.

  • Quantitative and qualitative evidence that demonstrates that our proposed set of constraints improves 3D prediction accuracy in weakly supervised settings, resulting in an improvement of \(55\%\) as opposed to the \(32\%\) yielded by straightforward use of weakly-supervised data.

  • A neural network architecture that extends [18] with a refinement step.

  • Achieving state-of-the-art performance on Dexter+Object using only synthetic and weakly-supervised real data, indicating cross-data generalizability.

The proposed constraints require no special data nor are they specific to a particular backbone architecture.

2 Related Work

Hand pose estimation from monocular RGB has gained traction in recent years due to its numerous possible applications. Generally, there are two schools of thought.

Model-based methods ensure plausible poses by fitting a hand model to the observation via optimization. As they are not learning-based, they are sensitive to initial conditions, rely on temporal information [17, 26, 27, 28] or do not take the image into consideration during optimization [28]. Whereas some make use of geometric primitives [26, 27, 28], others model the joint angles directly [8, 11, 20, 22, 31, 41], learn a lower-dimensional embedding of the joints [23] or the pose [17], or go a step further and model the muscles of the hand [1]. In contrast to these methods, we incorporate such constraints directly into the training procedure of a neural network in a fully differentiable manner. As such, we do not fit a hand model to the prediction, but extract and constrain the biomechanical quantities from it directly. The resulting network predicts biomechanically plausible poses and does not suffer from the same disadvantages.

Learning-based methods utilize neural networks that either directly regress the 3D positions of the hand keypoints [18, 25, 34, 38, 44, 48] or predict the parameters of a deformable hand model [4, 5, 15, 42, 46]. Zimmermann et al. [48] are the first to use a deep neural network for root-relative 3D hand pose estimation from RGB images via a multi-staged approach. Spurr et al. [34] learn a unified latent space that projects multiple modalities into the same space, yielding a lower-dimensional embedding of the hand. Similarly, Yang et al. [44] learn a latent space that disentangles background, camera and hand pose. However, all these methods require large amounts of fully labeled training data. Cai et al. [6] try to alleviate this problem by introducing an approach that utilizes paired RGB-D images to regularize the depth predictions. Mueller et al. [25] attempt to improve the quality of synthetic training data by learning a GAN model that minimizes the discrepancies between real and synthetic images. Iqbal et al. [18] decompose the task into learning 2D and root-relative depth components. This decomposition allows the use of weakly-labeled real images with only 2D pose annotations, which are cheap to acquire. While these methods demonstrate better generalization by adding a large number of weakly-labeled training samples, the main drawback of this approach is that the depth ambiguities remain unaddressed. As such, training using only 2D pose annotations does not impact the depth predictions. This may result in 3D poses with accurate 2D projections that are, due to depth ambiguities, still implausible. In contrast, in this work, we propose a set of biomechanical constraints that ensures that the predicted 3D poses are always anatomically plausible during training (see Fig. 1). We formulate these constraints in the form of fully-differentiable loss functions which can be incorporated into any deep learning architecture that predicts 3D joint configurations. We use a variant of Iqbal et al. [18] as a baseline and demonstrate that the requirement for fully labeled real images can be significantly reduced while still maintaining performance on par with fully-supervised methods.

Other recent methods directly predict the parameters of a deformable hand model, e.g., MANO [32], from RGB images [5, 15, 29, 42, 46]. The predicted parameters consist of shape and pose deformations wrt. a mean shape and pose that are learned from large amounts of 3D hand scans. Alternatively, [13, 21] circumvent the need for a parametric hand model by directly predicting the mesh vertices from RGB images. These methods require both shape and pose annotations for training; obtaining such training data is even harder. Hence, most methods rely on synthetic training data. Some methods [4, 5, 46] alleviate this by introducing re-projection losses that measure the discrepancy between the projection of the 3D mesh and labeled 2D poses [5] or silhouettes [4, 46]. Even though they utilize strong hand priors in the form of a mean hand shape and by operating in a low-dimensional PCA space, using re-projection losses with weakly-labeled data still does not guarantee that the resulting 3D poses will be anatomically plausible. Therefore, all these methods rely on large amounts of fully labeled training data. In body pose estimation, such methods generally resort to adversarial losses to ensure plausibility [19].

Biomechanical constraints have also been used in the literature to encourage plausible 3D poses by imposing biomechanical limits on the structure of the hand [9, 10, 12, 24, 33, 36, 39, 40, 43] or via a learned refinement model [7]. Most methods [2, 9, 10, 24, 33, 36, 39, 43] impose these limits via inverse kinematics in a post-processing step; hence the possibility of integrating them into neural network training remains unexplored. Our proposed soft constraints are fully integrated into the training of the network, which therefore does not require a post-refinement step at test time. Similar to our method, [12, 40] also penalize invalid bone lengths. However, we additionally model the joint limits and the palmar structure.

3 Method

Our method is summarized in Fig. 2. Our key contribution is a set of novel constraints that constitute a biomechanical model of the human hand and capture the bone lengths, joint angles and shape of the palm. We emphasize that we do not fit a kinematic model to the predictions, but instead extract the quantities in question directly from the predictions in order to constrain them. Therefore, the method of extraction is carefully designed to work under noisy and malformed 3D joint predictions while simultaneously being fully differentiable in any configuration. These biomechanical constraints provide an inductive bias to the neural network. Specifically, the network is guided to predict anatomically plausible hand poses for weakly-supervised data (i.e. 2D only), which in turn increases generalizability. The model can be combined with any backbone architecture that predicts 3D keypoints. We first introduce the notation used in this paper, followed by the details of the proposed biomechanical losses. Finally, we discuss the integration with a variant of [18].

Fig. 2.

Method overview. A model takes an RGB image and predicts the 3D joints on which we apply our proposed BMC. These guide the model to predict plausible poses.

Notation. We use bold capital font for matrices, bold lowercase for vectors and roman font for scalars. We assume a right hand. The joints define a kinematic chain of the hand starting from the root joint \(\mathbf {j}^{3D}_1\) and ending in the fingertips. For the sake of simplicity, the joints of the hand are grouped by finger, denoted as the respective sets \(F1,\dots ,F5\), visualized in Fig. 3a. Each \(\mathbf {j}^{3D}_i\), except the root joint (CMC), has a parent, denoted as p(i). We define a bone \(\mathbf {b}_i = \mathbf {j}^{3D}_{i+1} - \mathbf {j}^{3D}_{p(i+1)}\) as the vector pointing from the parent joint to its child joint. Hence there are 20 bones in total. The bones are named according to the child joint; for example, the bone connecting MCP to PIP is called the PIP bone. We define the five root bones as the MCP bones, where one endpoint is the root \(\mathbf {j}^{3D}_1\). Intuitively, the root bones are those that lie within and define the palm. We define the bones \(\mathbf {b}_i\) with \(i=1,\dots ,5\) to correspond to the root bones of fingers \(F1, \dots , F5\). We denote by \(\alpha (\mathbf {v}_1, \mathbf {v}_2) = \mathrm {arccos}\big (\frac{\mathbf {v}_1^T \mathbf {v}_2}{||\mathbf {v}_1||_2 \, ||\mathbf {v}_2||_2}\big )\) the angle between the vectors \(\mathbf {v}_1, \mathbf {v}_2\). The interval loss is defined as \(\mathcal {I}(x;a,b) = \max (a-x,0) + \max (x-b,0)\). The normalized vector is defined as \(\mathrm {norm}(\mathbf {x}) = \frac{\mathbf {x}}{||\mathbf {x}||_2}\). Lastly, \(\mathrm {P}_{\mathbf {xy}}(\mathbf {v})\) is the orthogonal projection operator, projecting \(\mathbf {v}\) orthogonally onto the \(\mathbf {x}\)-\(\mathbf {y}\) plane, where \(\mathbf {x}\), \(\mathbf {y}\) are vectors.
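To make the notation concrete, the following is a minimal PyTorch sketch of the basic operators (the interval penalty \(\mathcal {I}\), the angle \(\alpha\), normalization, and projection onto a plane). The function names are ours, and the plane projection assumes orthonormal axes, as is the case for the local frames used below.

```python
import torch

def interval_loss(x, a, b):
    # I(x; a, b) = max(a - x, 0) + max(x - b, 0): zero inside [a, b], linear outside.
    return torch.clamp(a - x, min=0.0) + torch.clamp(x - b, min=0.0)

def norm(v, eps=1e-8):
    # norm(x) = x / ||x||_2, with a small epsilon for numerical stability.
    return v / (v.norm(dim=-1, keepdim=True) + eps)

def angle(v1, v2, eps=1e-7):
    # alpha(v1, v2) = arccos(v1^T v2 / (||v1||_2 ||v2||_2)); clamped so the gradient stays finite.
    cos = (norm(v1) * norm(v2)).sum(dim=-1)
    return torch.acos(cos.clamp(-1.0 + eps, 1.0 - eps))

def project_onto_plane(v, x_axis, y_axis):
    # P_xy(v): orthogonal projection of v onto the plane spanned by x and y
    # (assumes x_axis and y_axis are orthonormal).
    return (v * x_axis).sum(-1, keepdim=True) * x_axis + (v * y_axis).sum(-1, keepdim=True) * y_axis
```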

Fig. 3.

Illustration of our proposed biomechanical structure.

3.1 Biomechanical Constraints

Our goal is to integrate our biomechanical soft constraints (BMC) into the training procedure in order to encourage the network to predict feasible hand poses. We seek to avoid iterative optimization approaches such as inverse kinematics in order to avert significant increases in training time.

The proposed model consists of three functional parts, visualized in Fig. 3. First, we consider the lengths of the bones, including the root bones of the palm. Second, we model the structure and shape of the palmar region, which forms a rigid structure made up of individual joints. To account for inter-subject variability of bones and palm structure, it is important not to enforce a specific mean shape; instead, we allow these properties to lie within a valid range. Lastly, the model describes the articulation of the individual fingers. Finger motion is modeled via the flexion and abduction of the individual bones. As their limits are interdependent, they need to be modeled jointly, and we propose a novel constraint that takes this interdependence into account.

The limits for each constraint can be attained manually from measurements, from the literature (e.g. [9, 33]), or acquired in a data-driven way from 3D annotations, should they be available.

Bone Length. For each bone i, we define an interval \([b^{\min }_i, b^{\max }_i]\) of valid bone lengths and penalize the prediction if the length \(||\mathbf {b}_i||_2\) lies outside this interval:

$$\begin{aligned} \mathcal {L}_{\mathrm {BL}}(\mathbf {J}^{3D}) = \frac{1}{20} \sum _{i=1}^{20} \mathcal {I}(||\mathbf {b}_i||_2; b^{\min }_i, b^{\max }_i) \end{aligned}$$

This loss encourages keypoint predictions that yield valid bone lengths. Figure 3a shows the length of a bone in blue.
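A minimal sketch of \(\mathcal {L}_{\mathrm {BL}}\), assuming the predicted joints are stored as a \(21 \times 3\) tensor and that a parent-index table encoding the kinematic chain is available (the tensor layout and names are ours):

```python
import torch

def bone_vectors(joints3d, parent):
    # joints3d: (21, 3) predicted 3D joints; parent[i] is the index of the parent of joint i+1.
    # b_i = j_{i+1} - j_{p(i+1)}: one bone per non-root joint, i.e. 20 bones.
    return joints3d[1:] - joints3d[parent]                     # (20, 3)

def bone_length_loss(joints3d, parent, b_min, b_max):
    # L_BL: mean interval penalty on the 20 bone lengths, zero when all lengths are valid.
    lengths = bone_vectors(joints3d, parent).norm(dim=-1)      # (20,)
    viol = torch.clamp(b_min - lengths, min=0) + torch.clamp(lengths - b_max, min=0)
    return viol.mean()
```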

Root Bones. To attain valid palmar structures we first interpret the root bones as spanning a mesh and compute its curvature by following [30]:

$$\begin{aligned} c_i = \frac{(\mathbf {e}_{i+1} - \mathbf {e}_i)^T(\mathbf {b}_{i+1} - \mathbf {b}_i)}{||\mathbf {b}_{i+1} - \mathbf {b}_i||^2}, \text { for } i \in \{1,2,3,4\} \end{aligned}$$
(1)

Where \(\mathbf {e}_i\) is the edge normal at bone \(\mathbf {b}_i\):

(2)

Positive values of \(c_i\) denote an arched hand, for example when pinky and thumb touch. A flat hand has no curvature. Figure 3b visualizes the mesh in dashed yellow and the triangle over which the curvature is computed in dashed purple.

We ensure that the root bones fall within correct angular ranges by defining the angular distance between neighbouring \(\mathbf {b}_i\),\(\mathbf {b}_{i+1}\) across the plane they span:

$$\begin{aligned} \phi _i = \alpha (\mathbf {b}_i, \mathbf {b}_{i+1}) \end{aligned}$$
(3)

We constrain both the curvature \(c_i\) and angular distance \(\phi _i\) to lie within a valid range \([c_i^{\min }, c_i^{\max }]\) and \([\phi _i^{\min }, \phi _i^{\max }]\):

$$\begin{aligned} \mathcal {L}_{\mathrm {RB}}(\mathbf {J}^{3D}) = \frac{1}{4}\sum _{i=1}^4 \big (\mathcal {I}(c_i;c_i^{\min },c_i^{\max }) + \mathcal {I}(\phi _i;\phi _i^{\min }, \phi _i^{\max })\big ) \end{aligned}$$

\(\mathcal {L}_{\mathrm {RB}}\) ensures that the predicted joints of the palm define a valid structure, which is crucial since the kinematic chains of the fingers originate from this region.
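A sketch of \(\mathcal {L}_{\mathrm {RB}}\), under the assumption that the edge normals \(\mathbf {e}_i\) of Eq. 2 are computed separately and passed in; the tensor shapes and the limit format are illustrative:

```python
import torch

def root_bone_loss(root_bones, edge_normals, c_lim, phi_lim, eps=1e-7):
    # root_bones: (5, 3) root-bone vectors b_1..b_5; edge_normals: (5, 3) edge normals e_i (Eq. 2).
    # c_lim, phi_lim: (4, 2) tensors holding [min, max] limits per neighbouring pair.
    b, e = root_bones, edge_normals
    db = b[1:] - b[:-1]                                        # b_{i+1} - b_i, (4, 3)
    de = e[1:] - e[:-1]                                        # e_{i+1} - e_i, (4, 3)
    # curvature between neighbouring root bones (Eq. 1)
    c = (de * db).sum(-1) / (db.norm(dim=-1) ** 2 + eps)       # (4,)
    # angular distance between neighbouring root bones (Eq. 3)
    cos = (b[:-1] * b[1:]).sum(-1) / (b[:-1].norm(dim=-1) * b[1:].norm(dim=-1) + eps)
    phi = torch.acos(cos.clamp(-1 + eps, 1 - eps))             # (4,)

    def interval(x, lim):                                      # I(x; min, max)
        return torch.clamp(lim[:, 0] - x, min=0) + torch.clamp(x - lim[:, 1], min=0)

    return (interval(c, c_lim) + interval(phi, phi_lim)).mean()
```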

Joint Angles. To compute the joint angles, we first need to define a consistent frame \(\mathbf {F}_i\) of a local coordinate system for each finger bone \(\mathbf {b}_i\). \(\mathbf {F}_i\) must be consistent with respect to the movements of the finger. In other words, if one constructs \(\mathbf {F}_i\) given a pose \(\mathbf {J}^{3D}_1\), then moves the fingers and corresponding \(\mathbf {F}_i\) into pose \(\mathbf {J}^{3D}_2\), the resulting \(\mathbf {F}_i\) should be the same as if constructed from \(\mathbf {J}^{3D}_2\) directly.

We assume right-handed coordinate systems. To construct \(\mathbf {F}_i\), we define two of the three axes based on the palm. We start with the first layer of finger bones (the PIP bones). We define the z-axis of their respective \(\mathbf {F}_i\) as the normalized parent bone (in this case, the corresponding root bone): \(\mathbf {z}_i = \text {norm}(\mathbf {b}_{p(i)})\). Next, we define the x-axis based on the plane normals spanned by two neighbouring root bones:

(4)

Where \(\mathbf {n}_i\) is defined as in Eq. 2. Lastly, we compute the remaining axis \(\mathbf {y}_i = \text {norm}(\mathbf {z}_i \times \mathbf {x}_i)\). Given \(\mathbf {F}_i\), we can now define the flexion and abduction angles. Each of these angles is given with respect to the local z-axis of \(\mathbf {F}_i\). Given \(\mathbf {b}_i\) in its local coordinates \(\mathbf {b}_i^{\mathbf {F}_i}\) wrt. \(\mathbf {F}_i\), we define the flexion and abduction angles as:

$$\begin{aligned} \begin{aligned} \theta ^{\mathrm {f}}_i&= \alpha (\mathrm {P}_{xz}(\mathbf {b}_i^{\mathbf {F}_i}), \mathbf {z}_i)\\ \theta ^{\mathrm {a}}_i&= \alpha (\mathrm {P}_{xz}(\mathbf {b}_i^{\mathbf {F}_i}), \mathbf {b}_i^{\mathbf {F}_i}) \end{aligned} \end{aligned}$$
(5)

Figure 3c visualizes \(\mathbf {F}_i\) and the resulting angles. Note that this formulation leads to ambiguities, where different bone orientations can map to the same (\(\theta ^{\mathrm {f}}_i\), \(\theta ^{\mathrm {a}}_i\))-point. We resolve this via an octant lookup, which leads to angles in the intervals \(\theta ^{\mathrm {f}}_i \in [-\pi ,\pi ]\) and \(\theta ^{\mathrm {a}}_i \in [-\pi /2,\pi /2]\) respectively. See appendix for more details.
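As a sketch, assuming a bone has already been expressed in its local frame \(\mathbf {F}_i\) (components ordered x, y, z), the two angles of Eq. 5 can be extracted as follows; the octant lookup that resolves the ambiguity is omitted, so this version returns unsigned angles:

```python
import torch

def flexion_abduction(b_local, eps=1e-7):
    # b_local: (..., 3) bone expressed in its local frame F_i, components ordered (x, y, z).
    x, y, z = b_local[..., 0], b_local[..., 1], b_local[..., 2]
    b_xz = torch.stack([x, torch.zeros_like(y), z], dim=-1)    # P_xz(b): drop the y component

    def angle(v1, v2):
        cos = (v1 * v2).sum(-1) / (v1.norm(dim=-1) * v2.norm(dim=-1) + eps)
        return torch.acos(cos.clamp(-1 + eps, 1 - eps))

    z_axis = torch.tensor([0.0, 0.0, 1.0]).expand_as(b_local)
    theta_f = angle(b_xz, z_axis)    # flexion: angle between P_xz(b) and the local z-axis
    theta_a = angle(b_xz, b_local)   # abduction: angle between P_xz(b) and the bone itself
    return theta_f, theta_a
```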

Given the angles of the first layer of finger bones, we can then construct the frames for the remaining two layers of finger bones. Let \(\mathbf {R}^{\theta _i}\) denote the rotation matrix that rotates by \(\theta ^{\mathrm {f}}_i\) and \(\theta ^{\mathrm {a}}_i\) such that \(\mathbf {R}^{\theta _i} \mathbf {z}_i = \mathbf {b}_i^{\mathbf {F}_i}\); we then iteratively construct the remaining frames along the kinematic chain of the fingers:

$$\begin{aligned} \begin{aligned} \mathbf {F}_i = \mathbf {R}^{\theta _i}\mathbf {F}_{p(i)} \end{aligned} \end{aligned}$$
(6)

This method of frame construction via rotating by \(\theta ^{\mathrm {f}}_i\) and \(\theta ^{\mathrm {a}}_i\) ensures consistency across poses. The remaining angles can be acquired as described in Eq. 5.

Lastly, the angles need to be constrained. One way to do this is to consider each angle independently and penalize it if it lies outside an interval. This corresponds to constraining the angles to an axis-aligned box in the 2D angle space whose corners are given by the min/max limits. However, the flexion and abduction angles are interdependent; we therefore propose an alternative approach that accounts for this. Given points \(\theta _i = (\theta ^{\mathrm {f}}_i, \theta ^{\mathrm {a}}_i)\) that define a range of motion, we approximate their convex hull on the \((\theta ^{\mathrm {f}}, \theta ^{\mathrm {a}})\)-plane with a fixed set of points \(\mathcal {H}_i\). The angles are constrained to lie within this structure by minimizing their distance to it:

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mathrm {A}(\mathbf {J}^{3D}) = \frac{1}{15} \sum _{i=1}^{15} D_H(\theta _i, \mathcal {H}_i) \end{aligned} \end{aligned}$$
(7)

Where \(D_H\) is the distance of point \(\theta _i\) to the hull \(\mathcal {H}_i\). Details on the convex hull approximation and implementation can be found in the appendix.
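One possible differentiable proxy for \(D_H\) is sketched below, assuming each hull \(\mathcal {H}_i\) is stored as its vertices in counter-clockwise order; the penalty is the largest half-plane violation, which is zero whenever the angle pair lies inside the hull (the exact approximation used here is described in the appendix):

```python
import torch

def hull_distance(theta, hull, eps=1e-8):
    # theta: (2,) point (theta_f, theta_a); hull: (K, 2) convex-polygon vertices in CCW order.
    v0 = hull                                   # edge start points (K, 2)
    v1 = torch.roll(hull, shifts=-1, dims=0)    # edge end points   (K, 2)
    edge = v1 - v0
    # outward unit normals of a counter-clockwise polygon
    normal = torch.stack([edge[:, 1], -edge[:, 0]], dim=-1)
    normal = normal / (normal.norm(dim=-1, keepdim=True) + eps)
    # signed distance of theta to every edge line (positive = outside that half-plane)
    signed = ((theta.unsqueeze(0) - v0) * normal).sum(-1)
    # zero inside the hull, otherwise the largest violation
    return torch.clamp(signed.max(), min=0.0)
```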

3.2 \(\mathbf {Z}^{\mathrm {root}}\) Refinement

The 2.5D joint representation allows us to recover the absolute root depth \(Z^{root}\) up to a scaling factor. This is done by solving a quadratic equation dependent on the 2D projection \(\mathbf {J}^{2D}\) and the relative depth values \(\mathbf {z}^r\), as proposed in [18]. In practice, small errors in \(\mathbf {J}^{2D}\) or \(\mathbf {z}^r\) can result in large deviations of \(Z^{root}\). This leads to large fluctuations in the translation and scale of the predicted pose, which is undesirable. To alleviate these issues, we employ an MLP to refine and smooth the calculated \(\hat{Z}^{root}\):

$$\begin{aligned} \hat{Z}^{root}_\mathrm {ref} = \hat{Z}^{root} + M_{\mathrm {MLP}}(\mathbf {z}^r, \mathbf {K}^{-1}\mathbf {J}^{2D}, \hat{Z}^{root}; \mathbf {\omega }) \end{aligned}$$
(8)

Where \(M_\mathrm {MLP}\) is a multilayer perceptron with parameters \(\mathbf {\omega }\) that takes the predicted and calculated values \(\mathbf {z}^r\), \(\mathbf {K}^{-1}\mathbf {J}^{2D}\) and \(\hat{Z}^{root}\), and outputs a residual term. Alternatively, one could predict \(Z^{root}\) directly using an MLP with the same input. However, as the exact relationship between the predicted variables and \(Z^{root}\) is known, we resort to the refinement approach instead of requiring a model to learn what is already known.
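A PyTorch sketch of the refinement module of Eq. 8 is given below; the layer sizes and the use of homogeneous rays for \(\mathbf {K}^{-1}\mathbf {J}^{2D}\) are our assumptions:

```python
import torch
import torch.nn as nn

class ZRootRefiner(nn.Module):
    """Predicts a residual correction to the analytically computed Z_root (Eq. 8)."""
    def __init__(self, n_joints=21, hidden=256):
        super().__init__()
        # inputs: relative depths z^r, back-projected rays K^{-1} J^2D (3 values per joint), Z_root
        in_dim = n_joints + n_joints * 3 + 1
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_rel, rays, z_root):
        # z_rel: (B, 21), rays: (B, 21, 3), z_root: (B, 1)
        x = torch.cat([z_rel, rays.flatten(1), z_root], dim=-1)
        return z_root + self.mlp(x)              # refined Z_root = Z_root + residual
```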

3.3 Final Loss

The biomechanical soft constraint loss is constructed as follows:

$$\begin{aligned} \mathcal {L}_{\mathrm {BMC}} = \lambda _{\mathrm {BL}}\mathcal {L}_{\mathrm {BL}} + \lambda _{\mathrm {RB}}\mathcal {L}_{\mathrm {RB}} + \lambda _{\mathrm {A}}\mathcal {L}_{\mathrm {A}} \end{aligned}$$
(9)

Our final model is trained on the following loss function:

$$\begin{aligned} \mathcal {L} = \lambda _{\mathbf {J}^{2D}}\mathcal {L}_{\mathbf {J}^{2D}} + \lambda _{\mathbf {z}^r}\mathcal {L}_{\mathbf {z}^r} + \lambda _{\mathrm {Z_\mathrm {ref}^{root}}}\mathcal {L}_{\mathrm {Z^{root}}} + \mathcal {L}_{\mathrm {BMC}} \end{aligned}$$
(10)

where \(\mathcal {L}_{\mathbf {J}^{2D}}\), \(\mathcal {L}_{\mathbf {z}^r}\) and \(\mathcal {L}_{\mathrm {Z^{root}}}\) are the L1 loss on any available \(\mathbf {J}^{2D}\), \(\mathbf {z}^r\) and \(Z^{root}\) labels respectively. The weights \(\lambda _i\) balance the individual loss terms.
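A small sketch of how the terms of Eqs. 9 and 10 can be combined during training; the dictionary keys and weights are placeholders, and the masking of the L1 terms to the labels that are actually available is omitted:

```python
def total_loss(pred, target, bmc, lam):
    # Eq. 9: weighted sum of the biomechanical soft constraints
    l_bmc = lam["BL"] * bmc["BL"] + lam["RB"] * bmc["RB"] + lam["A"] * bmc["A"]
    # Eq. 10: L1 supervision on J^2D, z^r and Z^root where labels exist, plus L_BMC
    l_sup = (lam["J2D"] * (pred["j2d"] - target["j2d"]).abs().mean()
             + lam["zr"] * (pred["z_rel"] - target["z_rel"]).abs().mean()
             + lam["Zroot"] * (pred["z_root"] - target["z_root"]).abs().mean())
    return l_sup + l_bmc
```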

4 Implementation

We use a ResNet-50 backbone [16]. The input to our model is a \(128 \times 128\) RGB image from which the 2.5D representation is directly regressed. The model and its refinement step are trained on fully supervised and weakly-supervised data. The network is trained for 70 epochs using SGD with a step-wise learning rate decay of 0.1 after every 30 epochs. We apply the biomechanical constraints directly on the predicted 3D keypoints \(\mathbf {J}^{3D}\).
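An illustrative sketch of this setup follows; the regression head and the base learning rate (which is not specified in the text) are assumptions, and the actual 2.5D regression of [18] may differ:

```python
import torch
import torch.nn as nn
import torchvision

class HandNet25D(nn.Module):
    """ResNet-50 features regressing a 2.5D pose (2D keypoints + relative depths) for 21 joints."""
    def __init__(self, n_joints=21):
        super().__init__()
        resnet = torchvision.models.resnet50()
        self.features = nn.Sequential(*list(resnet.children())[:-1])  # (B, 2048, 1, 1)
        self.head = nn.Linear(2048, n_joints * 3)                     # (u, v, z^r) per joint

    def forward(self, img):                       # img: (B, 3, 128, 128)
        f = self.features(img).flatten(1)
        out = self.head(f).view(-1, 21, 3)
        return out[..., :2], out[..., 2]          # J^2D and z^r

model = HandNet25D()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # placeholder learning rate
# 70 epochs, decaying the learning rate by a factor of 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```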

5 Evaluation

Here we introduce the datasets used, show the performance of our proposed \(\mathcal {L}_{\mathrm {\mathbf {BMC}}}\) and compare against prior work in a range of settings. Specifically, we study the effect of adding weakly supervised data to complement fully supervised training. All experiments are conducted in a setting where we assume access to a fully supervised dataset as well as a supplementary weakly supervised real dataset. Therefore we have access to 2D ground-truth annotations and to the computed constraint limits. We study two sources of 3D supervision:

Synthetic Data. We choose RHD. Acquiring fully labeled synthetic data is substantially easier than acquiring labeled real data. Sections 5.3 to 5.5 consider this setting.

Partially Labeled Real Data. In Sect. 5.6 we gradually increase the number of real 3D labeled samples to study how the proposed approach works under different ratios of fully to weakly supervised data.

To make clear what kind of supervision is used, we write \(\mathbf {3D}_\mathrm {A}\) if 3D annotations from dataset \(\mathrm {A}\) are used and \(\mathbf {2D}_\mathrm {A}\) if only its 2D annotations are used. Sections 5.3 and 5.4 are evaluated on FH.

Table 1. Overview of datasets used for evaluation.

5.1 Datasets

Each dataset that provides 3D labels comes with the camera intrinsics. Hence the 2D pose can be easily acquired from the 3D pose. Table 1 provides an overview of the datasets used. The test sets of HO-3D and FH are available only via a submission system with a limited number of total submissions. Therefore, for the ablation study (Sect. 5.4) and for inspecting the effect of weak supervision (Sect. 5.3), we divide the training set into training and validation splits. For these sections, we choose to evaluate on FH due to its large number of samples and variability in both hand pose and shape.

5.2 Evaluation Metric

HO-3D. The error given by the submission system is the mean joint error in mm. INTERP is the error on test frames sampled from the training sequences but not included in the training set (i.e. their hand shapes and objects are seen during training). EXTRAP is the error on test samples with neither hand shapes nor objects present in the training set. We use the version of the dataset that was available at the time [3].

FH. The error given by the submission system is the mean joint error in mm. Additionally, the area under the curve (AUC) of the percentage of correct keypoints (PCK) plot is reported. The PCK thresholds span the interval from 0 mm to 50 mm with 100 equally spaced values. Both the aligned (using Procrustes analysis) and unaligned scores are given. We report the aligned score; the unaligned score can be found in the appendix.
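A small sketch of the AUC-of-PCK metric, assuming per-joint errors have already been computed (after Procrustes alignment for the aligned score); the threshold handling follows the description above:

```python
import numpy as np

def pck_auc(errors_mm, t_min=0.0, t_max=50.0, steps=100):
    # errors_mm: flat array of per-joint Euclidean errors.
    thresholds = np.linspace(t_min, t_max, steps)
    pck = np.array([(errors_mm <= t).mean() for t in thresholds])   # fraction of correct keypoints
    # normalized area under the PCK curve (trapezoidal rule)
    auc = ((pck[1:] + pck[:-1]) / 2 * np.diff(thresholds)).sum()
    return auc / (t_max - t_min)
```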

D+O. We report the AUC for PCK thresholds of 20 mm to 50 mm, which is comparable with prior work [5, 46, 49]. For [18, 25, 34, 48] we report the numbers as presented in [46], as they consolidate the AUCs of related work in a consistent manner using the same PCK thresholds. For [4], we recomputed the AUC for the same interval based on the values provided by the authors.

5.3 Effect of Weak-Supervision

We first inspect how weak-supervision affects the performance of the model. We decompose the 3D prediction error on the validation set of FH in terms of its 2D (\(\mathbf {J}^{2D}\)) and depth component (\(\mathbf {Z}\)) via the pinhole camera model \(\mathbf {Z}^{-1}\mathbf {K}\mathbf {J}^{3D} = \mathbf {J}^{2D}\) and evaluate their individual error.
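As a sketch of this decomposition, assuming camera-space joints and intrinsics \(\mathbf {K}\) as NumPy arrays:

```python
import numpy as np

def decompose_error(j3d_pred, j3d_gt, K):
    # Project both poses with the pinhole model J^2D = (1/Z) K J^3D and report
    # the 2D pixel error and the depth (Z) error separately.
    def project(j3d):
        uvw = (K @ j3d.T).T                       # (21, 3)
        return uvw[:, :2] / uvw[:, 2:3]           # (21, 2) pixel coordinates
    err_2d = np.linalg.norm(project(j3d_pred) - project(j3d_gt), axis=-1).mean()
    err_z = np.abs(j3d_pred[:, 2] - j3d_gt[:, 2]).mean()
    return err_2d, err_z
```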

We train four models using different data sources: 1) full 3D supervision on both synthetic RHD and real FH (\(\mathbf {3D}_\mathrm {RHD}+\mathbf {3D}_\mathrm {FH}\)), which serves as an upper bound for when all 3D labels are available; 2) fully supervised on RHD, which constitutes our lower bound on accuracy (\(\mathbf {3D}_\mathrm {RHD}\)); 3) fully supervised on RHD with naive application of weakly-supervised FH (\(+\mathbf {2D}_\mathrm {FH}\)); 4) like setting 3), but adding our proposed constraints (\(\mathbf {+ \mathcal {L}_\mathrm {\mathbf {BMC}}}\)).

Table 2. The effect of weak-supervision on the validation split of FH. Training on synthetic data (RHD) leads to poor accuracy on real data (FH). Adding real 2D labeled data reduces 3D prediction error due to better alignment with the 2D projection. Adding our proposed \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) significantly reduces the 3D error due to more accurate \(\mathbf {Z}\).

Table 2 shows the results. The model trained with full 3D supervision from real and synthetic data reflects the best setting. Adding \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) during training slightly reduces the 3D error (8.78 mm to 8.6 mm), primarily due to a regularization effect. When the model is trained only on synthetic data (\(\mathbf {3D}_\mathrm {RHD}\)) we observe a significant rise in 3D error (8.78 mm to 30.82 mm) due to the poor generalization from synthetic data. When weak supervision is provided from the real data (\(+\mathbf {2D}_\mathrm {FH}\)), the error is reduced (30.82 mm to 20.92 mm). However, inspecting this more closely, we observe that the improvement comes mainly from the reduction in 2D error (12.35 px to 3.8 px), whereas the depth component improves only marginally (20.02 mm to 17.02 mm). Observing these samples qualitatively (Fig. 1), we see that many do not adhere to the biomechanical limits of the human hand. By penalizing such violations via our proposed losses \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) in the weakly supervised setting, we see a significant improvement in 3D error (20.92 mm to 13.78 mm), which is due to improved depth accuracy (down to 9.97 mm). Inspecting the predictions more closely (e.g. Fig. 1), we see that the model predicts the correct 3D pose in challenging settings such as heavy self- and object occlusion, despite having never seen such samples in 3D. Since \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) describes a valid range rather than a specific pose, slight deviations from the ground-truth 3D pose are to be expected, which explains the small remaining quantitative gap to the fully supervised model.

5.4 Ablation Study

We quantify the individual contributions of our proposals on the validation set of FH and reproduce these results on HO-3D in the supplementary material. Each error metric is computed for the root-relative 3D pose.

Table 3. Effect of \(Z^{root}\) refinement

Refinement Network. Table 3 shows the impact of the \(Z^{root}\) refinement (Sect. 3.2). We train two models that include (w. refinement) or omit (w/o refinement) the refinement step, using full supervision on FH (\(\mathbf {3D}_\mathrm {FH}\)). Using refinement, the mean error is reduced by 1.44 mm, which indicates that the refinement effectively reduces outliers.

Fig. 4.

Impact of our proposed losses. (a) All predicted 3D poses project to the same 2D pose. (b) Ground-truth pose. (c) \(\mathcal {L}_{\mathrm {BL}}\) results in poses that have correct bone lengths, but may have invalid angles and palm structure. (d) Including \(\mathcal {L}_{\mathrm {RB}}\) imposes a correct palm, but the fingers are still articulated wrong. (e) Adding \(\mathcal {L}_{\mathrm {A}}\) leads to the finger bones having correct angles. The resulting hand is plausible and close to the ground-truth.

Table 4. Effect of BMC components.

Components of BMC. In Table 4, we perform a series of experiments where we incrementally add each of the proposed constraints. For 3D guidance, we use the synthetic RHD and only the 2D labels of FH. We first run the baseline model trained only on this data (\(\mathbf {3D}_\mathrm {RHD} + \mathbf {2D}_\mathrm {FH}\)). Next, we add the bone length loss \(\mathcal {L}_{\mathrm {BL}}\), followed by the root bone loss \(\mathcal {L}_{\mathrm {RB}}\) and the angle loss \(\mathcal {L}_\mathrm {A}\). An upper bound is given by our model trained fully supervised on both datasets (\(\mathbf {3D}_\mathrm {RHD} + \mathbf {3D}_\mathrm {FH}\)). Each component contributes positively towards the final performance, totalling a decrease of 6.24 mm in mean error compared to our weakly-supervised baseline and significantly closing the gap to the fully supervised upper bound. A qualitative assessment of the individual losses can be seen in Fig. 4.

Table 5. Effect of angle constraints

Co-dependency of Angles. In Table 5, we show the importance of modeling the dependencies between the flexion and abduction angle limits (Sect. 3), instead of regarding them independently. Co-dependent angle limits yield a decrease in mean error of 1.40 mm.

Constraint Limits. In Table 6, we investigate the effect of the limits used on the final performance, as one may have to resort to approximations. For this, we instead take the hand parameters from RHD and perform the same weakly-supervised experiment as before (\(+\mathcal {L}_\mathrm {\mathbf {BMC}}\)). Approximating the limits from another dataset slightly increases the error, but still clearly outperforms the 2D baseline.

Table 6. Effect of limits

5.5 Bootstrapping with Synthetic Data

We validate \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) on the test sets of FH and HO-3D. We train the same four models as in Sect. 5.3, using fully supervised RHD and weakly-supervised real data R \(\in \) {FH, HO-3D}.

For all results here we perform training on the full dataset and evaluate on the official test split via the online submission system. Additionally, we evaluate the cross-dataset performance on the D+O dataset to show how our proposed constraints improve generalizability, and compare with prior work [4, 5, 18, 25, 46].

Table 7. Results on the respective test split, evaluated by the submission systems. Training on RHD leads to poor accuracy on both FH and HO-3D. Adding weakly-supervised data improves results, as expected. By including our proposed \(\mathbf {\mathcal {L}_\mathrm {\mathbf {BMC}}}\), our model incurs a significant boost in accuracy, especially evident for the INTERP score.

FH. The second column of Table 7 shows the dataset performance for R = FH. Training solely on RHD (\(\mathbf {3D_\mathrm {RHD}}\)) performs the worst. Adding real data (\(+ \mathbf {2D_\mathrm {FH}}\)) with 2D labels reduces the error, as we reduce the real/synthetic domain gap. Including the proposed \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) results in an accuracy boost.

HO-3D. The third column of Table 7 shows a similar trend for R = HO-3D. Most notably, our constraints yield a decrease of 14.85 mm for INTERP. This is significantly larger than the decrease obtained from the 2D data alone (8.41 mm). For EXTRAP, BMC yields an improvement of 1.15 mm, which is close to the 1.27 mm gained from the 2D data. This demonstrates that \(\mathcal {L}_\mathrm {\mathbf {BMC}}\) is beneficial in leveraging 2D data more effectively in unseen scenarios.

Table 8. Datasets used by prior work for evaluation on D+O. With solely fully-supervised synthetic and weakly-supervised real data, we outperform recent works and perform on par with [46]. All other works rely on full supervision from real and synthetic data. *These works report unaligned results.

D+O. In Table 8 we demonstrate the cross-dataset performance on D+O for R = FH. Most recent works have made use of MANO [4, 5, 46], which leverages a low-dimensional embedding of highly detailed hand scans, and require custom synthetic data [4, 5] to fit the shape. Using only fully supervised synthetic data and weakly-supervised real data in conjunction with \(\mathcal {L}_\mathrm {\mathbf {BMC}}\), we reach state-of-the-art performance.

5.6 Bootstrapping with Real Data

We study the impact of our biomechanical constraints on reducing the number of labeled samples required in scenarios where few real 3D labeled samples are available. We train a model in a setting where a fraction of the data contains full 3D labels and the remainder contains only 2D supervision.

Fig. 5.

Number of 3D samples required to reach a certain aligned AUC on FH.

Here we choose \(R = \) FH, use the entire training set and evaluate on the test set. For each fraction of fully labeled data we evaluate two models. The first is trained on both the fully and weakly labeled samples. The second is trained with the addition of our proposed constraints. We show the results in Fig. 5. For a given AUC, we plot the number of labeled samples required to reach it. We observe that for lower labeling percentages, the amount of labeled data required is approximately halved when using \(\mathcal {L}_\mathrm {\mathbf {BMC}}\). This showcases its effectiveness in low-label settings and demonstrates the reduced need for fully annotated training data.

6 Conclusion

We propose a set of fully differentiable biomechanical losses to more effectively leverage weakly supervised data. Our method consists of a novel procedure that encourages anatomically correct predictions from a backbone network via a set of novel losses that penalize invalid bone lengths, joint angles and palmar structures. Furthermore, we have experimentally shown that our constraints leverage weakly-supervised data more effectively, leading to improvements in both within- and cross-dataset performance. Our method reaches state-of-the-art performance on the aligned D+O objective using 3D synthetic and 2D real data, and reduces the need for fully labeled training data by half in low-label settings on FH.