Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Supančič, James Steven; Rogez, Grégory; Yang, Yi; Shotton, Jamie; Ramanan, Deva

doi:10.1007/s11263-018-1081-7

Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Published: 12 April 2018

Volume 126, pages 1180–1198, (2018)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

International Journal of Computer Vision Aims and scope Submit manuscript

Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Download PDF

James Steven Supančič III ORCID: orcid.org/0000-0003-1631-682X¹,
Grégory Rogez^2,3,
Yi Yang⁴,
Jamie Shotton⁵ &
…
Deva Ramanan⁶

2859 Accesses
50 Citations
1 Altmetric
Explore all metrics

Abstract

Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and have released software and evaluation code. We summarize important conclusions here: (1) Coarse pose estimation appears viable for scenes with isolated hands. However, high precision pose estimation [required for immersive virtual reality and cluttered scenes (where hands may be interacting with nearby objects and surfaces) remain a challenge. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define a consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.

Real-Time Hand Pose Estimation Using Depth Camera

Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation Under Hand-Object Interaction

Key Developments in Human Pose Estimation for Kinect

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Human hand pose estimation empowers many practical applications, for example sign language recognition (Keskin et al. 2012), visual interfaces (Melax et al. 2013), and driver analysis (Ohn-Bar and Trivedi 2014a). Recently introduced consumer depth cameras have spurred a flurry of new advances (Ren et al. 2011; Keskin et al. 2012; Tang and Kim 2013; Li and Kitani 2013; Melax et al. 2013; Xu and Cheng 2013; Tang et al. 2014; Tompson et al. 2014; Qian et al. 2014; Sridhar et al. 2015).

Motivation Recent methods have demonstrated impressive results. But differing (often in-house) testsets, varying performance criteria, and annotation errors impede reliable comparisons (Oberweger et al. 2015a). Indeed, a recent meta-level analysis of object tracking papers reveals that it is difficult to trust the “best” reported method in any one paper (Pang and Ling 2013). In the field of object recognition, comprehensive benchmark evaluation has been vital for progress (Fei-Fei et al. 2007; Deng et al. 2009; Everingham et al. 2010). Our goal is to similarly diagnose the state-of-affairs, and to suggest future strategic directions, for depth-based hand pose estimation.

Contributions Foremost, we contribute the most extensive evaluation of depth-based hand pose estimators to date. We evaluate 13 state-of-the-art hand-pose estimation systems across 4 testsets under uniform scoring criteria. Additionally, we provide a broad survey of contemporary approaches, introduce a new testset that addresses prior limitations, and propose a new baseline for pose estimation based on nearest-neighbor (NN) exemplar volumes. Surprisingly, we find that NN exceeds the accuracy of most existing systems (Fig. 1). We organize our discussion along three axes: test data (Sect. 2), training data (Sect. 3), and model architectures (Sect. 4). We survey and taxonomize approaches for each dimension, and also contribute novelty to each dimension (e.g. new data and models). After explicitly describing our experimental protocol (Sect. 5), we end with an extensive empirical analysis (Sect. 6).

Preview We foreshadow our conclusions here. When hands are easily segmented or detected, current systems perform quite well. However, hand “activities” involving interactions with objects/surfaces are still challenging (motivating the introduction of our new dataset). Moreover, in such cases even humans perform imperfectly. For reasonable error measures, annotators disagree 20% of the time (due to self and inter-object occlusions and low resolution). This has immediate implications for test benchmarks, but also imposes a challenge when collecting and annotating training data. Finally, our NN baseline illustrates some surprising points. Simple memorization of training data performs quite well, outperforming most existing systems. Variations in the training data often dwarf variations in the model architectures themselves (e.g. decision forests versus deep neural nets). Thus, our analysis offers the salient conclusion that “it’s all about the (training) data”.

Prior Work Our work follows in the rich tradition of benchmarking (Everingham et al. 2010; Dollar et al. 2012; Russakovsky et al. 2013) and taxiomatic analysis (Scharstein 2002; Erol et al. 2007). In particular, Erol et al. (2007) reviewed hand pose analysis in 2007. Contemporary approaches have considerably evolved, prompted by the introduction of commodity depth cameras. We believe the time is right for another look. We do extensive cross-dataset analysis, by training and testing systems on different datasets (Torralba and Efros 2011). Human-level studies in benchmark evaluation (Martin et al. 2004) inspired our analysis of human-performance. Finally, our NN-baseline is closely inspired by non-parametric approaches to pose estimation (Shakhnarovich et al. 2003). In particular, we use volumetric depth features in a 3D scanning-window (or volume) framework, similar to Song and Xiao (2014). But, our baseline does not need SVM training or multi-cue features, making it simpler to implement.

Table 1 Testing data sets. We group existing benchmark testsets into 3 groups based on challenges addressed - articulation, viewpoint, and/or background clutter

Full size table

2 Testing Data

Test scenarios for depth-based hand-pose estimation have evolved rapidly. Early work evaluated on synthetic data, while contemporary work almost exclusively evaluates on real data. However, because of difficulties in manual annotation (a point that we will revisit), evaluation was not always quantitative—instead, it has been common to show select frames to give a qualitative sense of performance (Delamarre and Faugeras 2001; Bray et al. 2004; Oikonomidis et al. 2011; Pieropan et al. 2014). We fundamentally assume that quantitative evaluation on real data will be vital for continued progress.

Test Set Properties We have tabulated a list of contemporary test benchmarks in Table 1, giving URLs on our website.^{Footnote 1} We refer the reader to the caption for a detailed summary of specific dataset properties. Per dataset, Fig. 2 visualizes the pose-space covered using multi-dimensional scaling (MDS). We embed both the camera viewpoint angles and joint angles (in a normalized coordinate frame that is centered, scaled and rotated to the camera viewpoint). We conclude that previous datasets make different assumptions about articulation, viewpoint, and perhaps most importantly, background clutter. Such assumptions are useful because they allow researchers to focus on particular aspects of the problem. However it is crucial to make such assumptions explicit (Torralba and Efros 2011), which much prior work does not. We do so below.

Articulation Many datasets focus on pose estimation with the assumption that detection and overall hand viewpoint is either given or limited in variation. Example datasets include MSRA-2014 (Qian et al. 2014), A-Star (Xu and Cheng 2013), and Dexter (Sridhar et al. 2013). While these test sets focus on estimating hand articulation, not all test sets contain the same amount of pose variation. For example, a sign language test set will exhibit a small number of discrete poses. To quantify articulation, we fit a multi-variate Gaussian distribution to a test set’s finger joint angles. Then we compute the differential entropy for the test set’s distribution:

$$\begin{aligned} h(\varSigma ) = .5 \log \left( (2 \pi e)^{N} \det (\varSigma ) \right) \end{aligned}$$

(1)

where $\varSigma $ is the covariance of the test set’s joint angles and N is the number of joint angles in each pose vector. This analysis suggests that our proposed test set contains greater pose variation (entropy, $h=89$) than the ICL ($h=34$), NYU ($h=82$), FORTH ($h=65$) or A-STAR ($h=79$) test sets. We focus on ICL (Tang et al. 2014) as a representative example for experimental evaluation because it has been used in multiple prior published works (Tang et al. 2014; Tang and Kim 2013; Oberweger et al. 2015a).

Art and Viewpoint Other testsets have focused on both viewpoint variation and articulation. FORTH (Oikonomidis et al. 2011) provides five test sequences with varied articulations and viewpoints, but these are unfortunately unannotated. The CVAR-EGO (Oberweger et al. 2016) dataset provides highly precise joint annotations but contains fewer frames and only one subject. In our experiments, we analyze the NYU dataset (Tompson et al. 2014) because of its wide pose variation (see Fig. 2), larger size, and accurate annotations (see Sect. 3).

Table 2 Training data sets. We broadly categorize training datasets by the method used to generate the data and annotations: real data + manual annotations, real data + automatic annotations, or synthetic data (and automatic annotations)

Full size table

Art. + View. + Clutter The most difficult datasets contain cluttered backgrounds that are not easy to segment away. These datasets tend to focus on “in-the-wild” hands performing activities and interacting with nearby objects and surfaces. The KTH Dataset (Pieropan et al. 2014) provides a rich set of 3rd person videos showing humans interacting with objects. Unfortunately, annotations are not provided for the hands (only the objects). Similarly, the LISA (Ohn-Bar and Trivedi 2014a) dataset provides cluttered scenes captured inside vehicles. However, joint positions are not annotated, only coarse gesture. The UCI-EGO (Rogez et al. 2014) dataset provides challenging sequences from an egocentric perspective with joint level annotations, and so is included in our benchmark analysis.

Our Testset Our empirical evaluation will show that in-the-wild hand activity is still challenging. To push research in this direction, we have collected and annotated our own testset of real images (labeled as Ours in Table 1, examples in Fig. 3). As far as we are aware, our dataset is the first to focus on hand pose estimation across multiple subjects and multiple cluttered scenes. This is important, because any practical application must handle diverse subjects, scenes, and clutter.

3 Training Data

Here we discuss various approaches for generating training data (ref. Table 2). Real annotated training data has long been the gold standard for supervised learning. However, the generally accepted wisdom (for hand pose estimation) is that the space of poses is too large to manually annotate. This motivates approaches to leverage synthetically generated training data, discussed further below.

Real Data + Manual Annotation Arguably, the space of hand poses exceeds what can be sampled with real data. Our experiments identify a second problem: perhaps surprisingly, human annotators often disagree on pose annotations. For example, in our testset, human annotators disagree on 20% of pose annotations (considering a 20 mm threshold) as plotted in Fig. 21. These disagreements arise from limitations in the raw sensor data, either due to poor resolution or occlusions. We found that low resolution consistently corresponds to annotation ambiguities, across test sets. See Sect. 5.2) for further discussion and examples. These ambiguities are often mitigated by placing the hand close to the camera (Xu and Cheng 2013; Tang et al. 2014; Qian et al. 2014; Oberweger et al. 2016). As an illustrative example, we evaluate the ICL training set (Tang et al. 2014).

Real Data + Automatic Annotation Data gloves directly obtain automatic pose annotations for real data (Xu and Cheng 2013). However, they require painstaking per-user calibration. Magnetic markers can partially alleviate calibration difficulties (Wetzler et al. 2015) but still distort the hand shape that is observed in the depth map. When evaluating depth-only systems, colored markers can provide ground-truth through the RGB channel (Sharp et al. 2015). Alternatively, one could use a “passive” motion capture system. We evaluate the larger NYU training set (Tompson et al. 2014) that annotates real data by fitting (offline) a skinned 3D hand model to high-quality 3D measurements. Finally, integrating model fitting with tracking lets one leverage a small set of annotated reference frames to annotate an entire video (Oberweger et al. 2016).

Quasi-Synthetic Data Augmenting real data with geometric computer graphics models provides an attractive solution. For example, one can apply geometric transformations (e.g. rotations) to both real data and its annotations (Tang et al. 2014). If multiple depth cameras are used to collect real data (that is then registered to a model), one can synthesize a larger set of varied viewpoints (Sridhar et al. 2015; Tompson et al. 2014). Finally, mimicking the noise and artifacts of real data is often important when using synthetic data. Domain transfer methods (Tang and Kim 2013) learn the relationships between a small real dataset and large synthetic one.

Synthetic Data Another hope is to use data rendered by a computer graphics system. Graphical synthesis sidesteps the annotation problem completely: precise annotations can be rendered along with the features. One can easily vary the size and shape of synthesized training hands, a fact which allows us to explore how user-specific training data impacts accuracy. Our experiments (ref. Sect. 6) verify that results may be optimistic when the training and test datasets contain the same individuals, as non-synthetic datasets commonly do (ref. Table 2). When synthesizing novel exemplars, it is important define a good sampling distribution. A common strategy for generating a sampling distribution is to collect pose samples with motion capture data (Castellini et al. 2011; Feix et al. 2013). The UCI-EGO training set (Rogez et al. 2014) synthesizes data with an egocentric prior over viewpoints and grasping poses.

3.1 libhand Training Set

To further examine the effect of training data, we created a massive custom training set of 25,000,000 RGB-D training instances with the open-source libhand model (some examples are shown in Fig. 7). We modified the code to include a forearm and output depth data, semantic segmentations, and keypoint annotations. We emphasize that this synthetic training set is distinct from our new test dataset of real images.

Table 3 Synthetic hand distribution. We render synthetic hands with joint angles sampled from the above uniform distributions

Full size table

Table 4 Summary of methods: we broadly categorize the pose estimation systems that we evaluate by their overall approach: decision forests, deep models, trackers, or others

Full size table

Synthesis Parameters To avoid biasing our synthetic training set away from unlikely, but possible, poses we do not use motion capture data. Instead, we take a brute-force approach based on rejection-sampling. We uniformly and independently sample joint angles (from a bounded range), and throw away invalid samples that yield self-intersecting 3D hand poses. Specifically, using the libhand joint identifiers shown in Fig. 4, we generate poses by uniformly sampling from bounded ranges, as shown in Table. 3.

Quasi-Synthetic Backgrounds Hand synthesis engines commonly under-emphasize the importance of image backgrounds (Šarić 2011; Oikonomidis et al. 2011; Tompson et al. 2014). For methods operating on pre-segmented images (Keskin et al. 2012; Sridhar et al. 2013; Qian et al. 2014), this is likely not an issue. However, for active hands “in-the-wild”, the choice of synthetic backgrounds, surfaces, and interacting objects becomes important. Moreover, some systems require an explicit negative set (of images not containing hands) for training. To synthesize a robust background/negative training set, we take a quasi-synthetic approach by applying random affine transformations to 5000 images of real scenes, yielding a total of 1,000,0000 pseudo-synthetic backgrounds. We found it useful to include human bodies in the negative set because faces are common distractors for hand models.

4 Methods

Next we survey existing approaches to hand pose estimation (summarized in Table 4). We conclude by introducing a novel volumetric nearest-neighbor (NN) baseline.

4.1 Taxonomy

Trackers Versus Detectors We focus our analysis on single-frame methods. For completeness, we also consider several tracking baselines (Oikonomidis et al. 2011; PrimeSense 2013; Intel 2013) needing ground-truth initialization. Manual initialization may provide an unfair advantage, but we will show that single-frame methods are still nonetheless competitive, and in most cases, outperform tracking-based approaches. One reason is that single-frame methods essentially “reinitialize” themselves at each frame, while trackers cannot recover from an error.

Discrete Versus Continuous Pose We further concentrate our analysis on the continuous pose regression problem. However historically, much prior work has tackled the problem from a discrete gesture classification perspective (Mo and Neumann 2006; PrimeSense 2013; Premaratne et al. 2010; Ohn-Bar and Trivedi 2014b). Yet, these perspectives are closely related because one can tackle continuous pose estimation using a large number of discrete classes. As such, we evaluate several discrete classifiers in our benchmark (Muja and Lowe 2014; Rogez et al. 2015a).

Data-Driven Versus Model-Driven Historic attempts to estimate hand pose optimized a geometric model to fit observed data (Delamarre and Faugeras 2001; Bray et al. 2004; Stenger et al. 2006). Recently, Oikonomidis et al. (2011) demonstrated hand tracking using GPU accelerated Particle Swarm Optimization (PSO). However, such optimizations remain notoriously difficult due to local minima in the objective function. As a result, model driven systems have historically found their successes mostly limited to the tracking domain, where initialization constrains the search space (Sridhar et al. 2013; Melax et al. 2013; Qian et al. 2014). For single image detection, various fast classifiers and regressors have obtained real-time speeds (Keskin et al. 2012; Intel 2013; Oberweger et al. 2015a, b; Tang et al. 2015; Sun et al. 2015; Li et al. 2015; Wan et al. 2016). Most of the systems we evaluate fall into this category. When these classifiers are trained with data synthesized from a geometric model, they can be seen as efficiently approximating model fitting.

Multi-stage Pipelines Systems commonly separate their work into discrete stages: detecting, posing, refining and validating hands. Some systems use special purpose detectors as a “pre-processing” stage (Girard and Maciejewski 1985; Oikonomidis et al. 2011; Keskin et al. 2012; Cooper 2012; Xu and Cheng 2013; Intel 2013; Romero et al. 2009; Tompson et al. 2014). A segmentation pre-processing stage has been historically popular. Typically, RGB skin classification (Vezhnevets et al. 2003) or morphological operations on the depth image (Premaratne et al. 2010) segment the hand from the background. Such segmentation allows computation of Zernike moment (Cooper 2012) or skeletonization (Premaratne et al. 2010) features. While RGB features compliment depth (Rogez et al. 2014; Gupta et al. 2014), skin segmentation appears difficult to generalize across subjects and scenes with varying lighting (Qian et al. 2014). We evaluate a depth-based segmentation system (Intel 2013) for completeness. Other systems use a model for inverse-kinematics/IK (Tompson et al. 2014; Xu and Cheng 2013), geometric refinement/validation (Melax et al. 2013; Tang et al. 2015), or collaborative filtering (Choi et al. 2015) during a “post-processing” stage. For highly precise hand pose estimation, recent hybrid pipelines compliment data-driven per-frame reinitialization with model-based refinement (Taylor et al. 2016; Ballan et al. 2012; Sridhar et al. 2015; Qian et al. 2014; Ye et al. 2016).

4.2 Architectures

In this section, we describe popular architectures for hand-pose estimation, placing in bold those systems that we empirically evaluate.

Decision Forests Decision forests constitute a dominant paradigm for estimating hand pose from depth. Hough Forests (Xu and Cheng 2013) take a two-stage approach of hand detection followed by pose estimation. Random Decision Forests (RDFs) (Keskin et al. 2012) and Latent Regression Forests (LRFs) (Tang et al. 2014) leave the initial detection stage unspecified, but both make use of coarse-to-fine decision trees that perform rough viewpoint classification followed by detailed pose estimation. We experimented with several detection front-ends for RDFs and LRFs, finally selecting the first-stage detector from Hough Forests for its strong performance.

Part Model Pictorial structure models have been popular in human body pose estimation (Yang and Ramanan 2013), but they appear somewhat rarely in the hand pose estimation literature. For completeness, we evaluate a deformable part model defined on depth image patches (Felzenszwalb et al. 2010). We specifically train an exemplar part model (EPM) constrained to model deformations consistent with 3D exemplars (Zhu et al. 2012).

Deep Models Recent systems have explored using deep neural nets for hand pose estimation. We consider three variants in our experiments. DeepJoint (Tompson et al. 2014) uses a three stage pipeline that initially detects hands with a decision forest, regresses joint locations with a deep network, and finally refines joint predictions with inverse kinematics (IK). DeepPrior (Oberweger et al. 2015a) is based on a similar deep network, but does not require an IK stage and instead relies on the network itself to learn a spatial prior. DeepSeg (Farabet et al. 2013) takes a pixel-labeling approach, predicting joint labels for each pixel, followed by a clustering stage to produce joint locations. This procedure is reminiscent of pixel-level part classification of Kinect (Shotton et al. 2013), but substitutes a deep network for a decision forest.

4.3 Volumetric Exemplars

We propose a nearest-neighbor (NN) baseline for additional diagnostic analysis. Specifically, we convert depth map measurements into a 3D voxel grid, and simultaneously detect and estimate pose by scanning over this grid with volumetric exemplar templates. We introduce several modifications to ensure an efficient scanning search.

Voxel Grid Depth cameras report depth as a function of pixel (u, v) coordinates: D(u, v). To construct a voxel grid, we first re-project these image measurements into 3D using known camera intrinsics $f_u,f_v$.

$$\begin{aligned} \left( x,y,z\right) = \left( \frac{u}{f_u} D(u,v), \frac{v}{f_v} D(u,v), D(u,v)\right) \end{aligned}$$

(2)

Given a test depth image, we construct a binary voxel grid V[x, y, z] that is ‘1’ if a depth value is observed at a quantized (x, y, z) location. To cover the rough viewable region of a camera, we define a coordinate frame of $M^3$ voxels, where $M=200$ and each voxel spans $10\,\text {mm}^3$. We similarly convert training examples into volumetric exemplars E[x, y, z], but instead use a smaller $N^3$ grid of voxels (where $N=30$), consistent with the size of a hand.

Occlusions When a depth measurement is observed at a position $(x',y',z')=1$, all voxels behind it are occluded $z > z'$. We define occluded voxels to be ‘1’ for both the test-time volume V and training exemplar E.

Distance Measure Let $V_j$ be the $j{\text {th}}$ subvolume (of size $N^3$) extracted from V, and let $E_i$ be the $i{\text {th}}$ exemplar. We simultaneously detect and estimate pose by computing the best match in terms of Hamming distance:

$$\begin{aligned} (i^*,j^*)&=\mathop {{{\mathrm{\hbox {argmin}}}}}\limits _{i,j} \text {Dist}(E_i,V_j) \qquad \text {where} \end{aligned}$$

(3)

$$\begin{aligned} \text {Dist}(E_i,V_j)&= \sum _{x,y,z} \mathcal {I} (E_i[x,y,z] \ne V_j[x,y,z]), \end{aligned}$$

(4)

such that $i^*$ is the best-matching training exemplar and $j^*$ is its detected position.

Efficient Search A naive search over exemplars and subvolumes is prohibitively slow. But because the underlying features are binary and sparse, there exist considerable opportunities for speedup. We outline two simple strategies. First, one can eliminate subvolumes that are empty, fully occluded, or out of the camera’s field-of-view. Song and Xiao (2014) refer to such pruning strategies as “jumping window” searches. Second, one can compute volumetric Hamming distances with 2D computations:

$$\begin{aligned}&\text {Dist}(E_i,V_j) =\sum _{x,y}\left| e_i[x,y] - v_j[x,y] \right| \ \qquad \text {where} \nonumber \\&e_i[x,y] = \sum _z E_i[x,y,z], \qquad v_j[x,y] = \sum _z V_j[x,y,z]. \end{aligned}$$

(5)

Intuition for Our Encoding Because our 3D volumes are projections of 2.5D measurements, they can be sparsely encoded with a 2D array (see Fig. 5). Taken together, our two simple strategies imply that a 3D volumetric search can be as practically efficient as a 2D scanning-window search. For a modest number of exemplars, our implementation still took tens of seconds per frame, which sufficed for our offline analysis. We posit faster NN algorithms could yield real-time speed (Moore et al. 2001; Muja and Lowe 2014).

Comparison Our volumetric exemplar baseline uses a scanning volume search and 2D depth encodings. It is useful to contrast this with a “standard” 2D scanning-window template on depth features (Janoch et al. 2013). First, our exemplars are defined in metric coordinates (Eq. 2). This means that they will not fire on the small hands of a toy figurine, unlike a scanning window search over scales. Second, our volumetric search ensures that the depth encoding from a local window contain features only within a fixed $N^3$ volume. This gives it the ability to segment out background clutter, unlike a 2D window (Fig. 6).

5 Protocols

5.1 Evaluation

Reprojection Error Following past work, we evaluate pose estimation as a regression task that predicts a set of 3D joint locations (Oikonomidis et al. 2011; Keskin et al. 2012; Qian et al. 2014; Taylor et al. 2014; Tang et al. 2014). Given a predicted and ground-truth pose, we compute both the average and max 3D reprojection error (in mm) across all joints. We use the skeletal joints defined by libhand (Šarić 2011). We then summarize performance by plotting the proportion of test frames whose average (or max) error falls below a threshold.

Error Thresholds Much past work considers performance at fairly low error thresholds, approaching 10 mm (Xu and Cheng 2013; Tang et al. 2014; Tompson et al. 2014). Interestingly, (Oberweger et al. 2015a) show that established benchmarks such as the ICL testset include annotation errors of above 10 mm in over a third of their frames. Ambiguities arise from manual labeling of joints versus bones and centroids versus surface points. We rigorously evaluate human-level performance through inter-annotator agreement on our new testset (Fig. 21). Overall, we find that max-errors of 20 mm approach the limit of human accuracy for closeby hands. We present a qualitative visualization of max error at different thresholds in Fig. 7. 50 mm appears consistent with a roughly correct pose, while an error within 100 mm appears consistent with a correct detection. Our qualitative analysis is consistent with empirical studies of human grasp (Bullock et al. 2013) and gesture communication (Stokoe 2005), which also suggest that a max-joint difference of 50 mm differentiates common gestures and grasps. But in general, precision requirements depend greatly on the application; So we plot each method’s performance across a broad range of thresholds (Fig. 8). We highlight 50 and 100 mm thresholds for additional analysis.

Vocabulary Size Versus Threshold To better interpret max-error-thresholds, we ask “for a discrete vocabulary of N poses, what max-joint-error precision will suffice?”. Intuitively, larger pose vocabularies require greater precision. To formalize this notion, we assume the user always perfectly articulates one of N poses from a discrete vocabulary $\varTheta $, with $|\varTheta | = N$. Given a fixed vocabulary $\varTheta $, a recognition system needs to be precise within $\texttt {prec}$ mm to avoid confusing any two poses from $\varTheta $:

$$\begin{aligned} \mathtt {prec} < \min _{\theta _1\in \varTheta ,\theta _2\in \varTheta } \frac{\text {dist}(P(\theta _1)-P(\theta _2))}{2} \end{aligned}$$

(6)

where $\theta _1$ and $\theta _2$ represent two poses in $\varTheta $, $P(\theta )$ projects the pose $\theta $’s joints into metric space, and $\text {dist}$ gives the maximum metric distance between the corresponding joints from each pose. To find the minimum precision required for each N, we construct a maximally distinguishable vocabulary $\varTheta $ by maximizing the value of $\mathtt {prec}$, subject to the kinematic constraints of libhand. Finding this most distinguishable pose vocabulary is an NP-hard problem. So, we take a greedy approach to optimize a vocabulary $\varTheta $ for each vocabulary size N.

Detection Issues Reprojection error is hard to define during detection failures: that is, false positive hand detections or missed hand detections. Such failures are likely in cluttered scenes or when considering scenes containing zero or two hands. If a method produced zero detections when a hand was present, or produced one if no hand was present, this was treated as a “maxed-out” reprojection error (of $\infty \ \text {mm}$). If two hands were present, we scored each method against both and took the minimum error. Though we have released our evaluation software, we give pseudocode in Algorithm. 1.

Missing Data Another challenge with reprojection error is missing data. First, some methods predict 2D screen coordinates for joints, not 3D metric coordinates (Premaratne et al. 2010; Intel 2013; Farabet et al. 2013; Tompson et al. 2014). Approximating $z \approx D(u,v)$, inferring 3d joint positions should be straightforward with Eq. 2. But, small 2D position errors can cause significant errors in the approximated depth, especially around the hand silhouette. To mitigate, we instead use the centroid depth of a segmented/detected hand when the measured depth lies outside the segmented volume. Past comparisons appear not to do this (Oberweger et al. 2015a), somewhat unfairly penalizing 2D approaches (Tompson et al. 2014). Second, some methods may predict a subset of joints (Intel 2013; Premaratne et al. 2010). To ensure a consistent comparison, we force such methods to predict the locations of visible joints with a post-processing inverse-kinematics (IK) stage (Tompson et al. 2014). We fit the libhand kinematic model to the predicted joints, and infer the location of missing ones. Third, ground-truth joints may be occluded. By convention, we only evaluate visible joints in our benchmark analysis.

Implementations We use public code when available (Oikonomidis et al. 2011; PrimeSense 2013; Intel 2013). Some authors responded to our request for their code (Rogez et al. 2014). When software was not available, we attempted to re-implement methods ourselves. We were able to successfully reimplement (Keskin et al. 2012; Xu and Cheng 2013; Oberweger et al. 2015a), matching the accuracy on published results (Tang et al. 2014; Oberweger et al. 2015a). In other cases, our in-house implementations did not suffice (Tompson et al. 2014; Tang et al. 2014). For these latter cases, we include published performance reports, but unfortunately, they are limited to their own datasets. This partly motivated us to perform a multi-dataset analysis. In particular, previous benchmarks have shown that one can still compare algorithms across datasets using head-to-head matchups (similar to approaches that rank sports teams which do not directly compete (Pang and Ling 2013)). We use our NN baseline to do precisely this. Finally, to spur further progress, we have made our implementations publicly available, together with our evaluation code.

5.2 Annotation

We now describe how we collect ground truth annotations. We present the annotator with cropped RGB and depth images. They then click semantic key-points, corresponding to specific joints, on either the RGB or depth images. To ease the annotator’s task and to get 3D keypoints from 2D clicks we invert the forward rendering (graphics) hand model provided by libhand which projects model parameters $\theta $ to 2D keypoints $P(\theta )$. While they label joints, an inverse kinematic solver minimizes the distance between the currently annotated 2D joint labels, $\forall _{j \in J}L_j$, and those projected from the libhand model parameters, $\forall _{j \in J}P_j(\theta )$.

$$\begin{aligned} \min _\theta \sum _{j \in J} ||L_j - P_j(\theta ) ||_2 \end{aligned}$$

(7)

The currently fitted libhand model, shown to the annotator, updates online as more joints are labeled. When the annotator indicates satisfaction with the fitted model, we proceed to the next frame. We give an example of the annotation process in Fig. 9.

Strengths Our annotation process has several strengths. First, kinematic constraints prevent some possible combination of keypoints: so it is often possible to fit the model by labeling only a subset of keypoints. Second, the fitted model provides annotations for occluded keypoints. Third and most importantly, the fitted model provides 3D (x,y,z) keypoint locations given only 2D (u,v) annotations.

Disagreements As shown in in Fig. 21, annotators disagree substantially on the hand pose, in a surprising number of cases. In applications, such as sign language (Stokoe 2005) ambiguous poses are typically avoided. We believe it is important to acknowledge that, in general, it may not be possible to achieve full precision. For our proposed test set (with an average hand distance of 1100 mm), we encountered an average annotation disagreement of about 20 mm. For only nearby hands ($\le 750$ mm from the camera, with an average distance of 550 mm) we encountered an average annotation disagreement of about 10 mm. The ICL dataset (Tang et al. 2014) exhibits similar annotation inconsistencies at similar ranges (Oberweger et al. 2015a). For hands at an average distance 235 mm from the camera, (Oberweger et al. 2016) reduced annotation disagreements to approximately 4 mm. This suggests that distance (which is inversely proportional to resolution) directly relates to annotation accuracy. Figure 10 illustrates two examples of annotator disagreement on our test set.

6 Results

We now report our experimental results, comparing datasets and methods. We first address the “state of the problem”: what aspects of the problem have been solved, and what remain open research questions? Fig. 11 qualitatively characterizes our results. We conclude by discussing the specific lessons we learned and suggesting directions for future systems.

Mostly-Solved (Distinct Poses) Figure 12 shows that coarse hand pose estimation is viable on datasets of uncluttered scenes where hands face the camera (i.e. ICL). Deep models, decision forests, and NN all perform quite well, both in terms of articulated pose estimation (85% of frames are within 50 mm max-error) and hand detection (100% are within 100 mm max-error). Surprisingly, NN outperforms decision forests by a bit. However, when NN is trained on other datasets with larger pose variation, performance is considerably worse. This suggests that the test poses remarkably resemble the training poses. Novel poses (those not seen in training data) account for most of the remaining failures. More training data (perhaps user-specific) or better model generalization should correct these. Yet, this may be reasonable for applications targeting sufficiently distinct poses from a small and finite vocabulary (e.g. a gaming interface). These results suggest that the state-of-the-art can accurately predict distinct poses (i.e. 50 mm apart) in uncluttered scenes.

Major Progress (Unconstrained Poses) The NYU testset still considers isolated hands, but includes a wider range of poses, viewpoints, and subjects compared to ICL (see Fig. 2). Figure 20 reveals that deep models perform the best for both articulated pose estimation (96% accuracy) and hand detection (100% accuracy). While decision forests struggle with the added variation in pose and viewpoint, NN still does quite well. In fact, when measured with average (rather than max) error, NN nearly matches the performance of (Tompson et al. 2014). This suggests that exemplars get most, but not all fingers, correct [see Fig. 13 and cf. Fig. 11(c, ii) vs. (d, ii)]. Overall, we see noticeable progress on unconstrained pose estimation since 2007 (Erol et al. 2007).

Unsolved (Low-Res, Objects, Occlusions, Clutter) When considering our testset (Fig. 21) with distant (low-res) hands and background clutter consisting of objects or interacting surfaces (Fig. 14), results are significantly worse. Note that many applications (Shotton et al. 2013) often demand hands to lie at distances greater than 750 mm. For such scenes, hand detection is still a challenge. Scanning window approaches (such as our NN baseline) tend to outperform multistage pipelines (Keskin et al. 2012; Farabet et al. 2013), which may make an unrecoverable error in the first (detection and segmentation) stage. We show some illustrative examples in Fig. 15. Yet, overall performance is still lacking, particularly when compared to human performance. Notably, human (annotator) accuracy also degrades for low-resolution hands far away from the camera (Fig. 21). This annotation uncertainty (“Human” in Fig. 21) makes it difficult to compare methods for highly precise pose estimation. As hand pose estimation systems become more precise, future work must make test data annotation more precise (Oberweger et al. 2016). Our results suggest that scenes of in-the-wild hand activity are still beyond the reach of the state-of-the-art.

Table 5 Cross-dataset generalization.

Full size table

Unsolved (Egocentric) The egocentric setting commonly presents (Fig. 16) the same problems discussed before, with the exception of low-res. While egocentric images do not necessarily contain clutter, most data in this area targets applications with significant clutter (see Fig. 17). And, in some sense, egocentric views make hand detection fundamentally harder. We cannot merely assume that the nearest pixel in the depth image corresponds to the hand, as we can with many 3rd person gesture test sets. In fact, the forearm often provides the primary salient feature. In Fig. 11(c–d, iii) both the deep and the 1-NN models need the arm to estimate the hand position. But, 1-NN wrongly predicts that the palm faces downwards, not towards the coffee maker. With such heavy occlusion and clutter, these errors are not surprising. The deep model’s detector (Tompson et al. 2014; Oberweger et al. 2015a) proved less robust in the egocentric setting. Perhaps it developed sensitivity to changes in noise patterns, between the synthetic training and real test datasets. But, the NN and deep detectors wrongly assume translation-invariance for egocentric hands. Hand appearance and position are linked by perspective effects coupled with the kinematic constraints imposed by the arm. As a result, an egocentric-specific whole volume classification model (Rogez et al. 2015a) outperformed both.

Training Data We use our NN-baseline to analyze the effect of training data in Table 5. Our NN model performed better using the NYU training set (Tompson et al. 2014) (consisting of real data automatically labeled with a geometrically-fit 3D CAD model) than with the libhand training set. While enlarging the synthetic training set increases performance (Fig. 18), computation fast becomes intractable. This reflects the difficulty in using synthetic data: one must carefully model priors (Oberweger et al. 2015a), sensor noise, (Gupta et al. 2014) and hand shape variations between users (Taylor et al. 2014; Khamis et al. 2015). In Fig. 19 we explore the impact of each of these factors to uncover two salient conclusions: First, training with the test-time user’s hand geometry (user-specific training data) showed modestly better performance, suggesting that results may be optimistic when using the same subjects for training and testing. Second, for synthetic hand data, modeling the pose-prior (i.e. choosing likely poses to synthesize) overshadows other considerations. Finally, in some cases, the variation in the performance of NN (dependent on the particular training set) exceeded the variation between model architectures (decision forests versus deep models)—Fig. 12. Our results suggest the diversity and realism of the training set is as important as the model learned from it.

Surprising NN Performance Overall, our 1-NN baseline proved to be surprisingly potent, outperforming or matching the performance of most prior systems. This holds true even for moderately-sized training sets with tens of thousands of examples (Tompson et al. 2014; Tang et al. 2014), suggesting that simple memorization outperforms much prior work. To demonstrate generalization, future work on learning based methods will likely benefit from more and better training data. One contribution of our analysis is the notion that NN-exemplars provides a vital baseline for understanding the behavior of a proposed system in relation to its training set.

NN Versus Deep Models In fact, DeepJoint (Tompson et al. 2014) and DeepPrior (Oberweger et al. 2015a) were the sole approaches to significantly outperform 1-NN (Figs. 12, 20). This indicates that deep architectures generalize well to novel test poses. Yet, the deep-model (Oberweger et al. 2015a) did show greater sensitivity to objects and clutter than the 1-NN model. We see this qualitatively in Fig. 11(c–d, iii–iv) and quantitatively in Figs. 21 and 16. But, we can understand the deep-model’s failures: we did not train it with clutter, so it “generalizes” that the bottle and hand are a single large hand. This may contrast with existing folk wisdom about deep models: that the need for large training sets suggests that these models essentially memorize. Our results indicate otherwise. Finally, the deep-model performed worse on more distant hands; this is understandable because it requires a larger canonical template (128 $\times $ 128) than the 1-NN model (30 $\times $ 30).

Conclusion The past several years have shown tremendous progress regarding hand pose: training sets, testing sets, and models. Some applications, such as gaming interfaces and sign-language recognition, appear to be well-within reach for current systems. Less than a decade ago, this was not true (Erol et al. 2007; Premaratne et al. 2010; Cooper 2012). Thus, we have made progress! But, challenges remain nonetheless. Specifically, when segmentation is hard due to active hands or clutter, many existing methods fail. To illustrate these realistic challenges we introduce a novel testset. We demonstrate that realism and diversity in training sets is crucial, and can be as important as the choice of model architecture. Thus, future work should investigate building large, realistic, and diverse training sets. In terms of model architecture, we perform a broad benchmark evaluation and find that deep models appear particularly well-suited for pose estimation. Finally, we demonstrate that NN using volumetric exemplars provides a startlingly potent baseline, providing an additional tool for analyzing both methods and datasets.

Notes

http://www.ics.uci.edu/~jsupanci/#HandData.

References

Ballan, L., Taneja, A., Gall, J., Gool, L. J. V., & Pollefeys, M. (2012). Motion capture of hands in action using discriminative salient points. In ECCV (6).
Bray, M., Koller-Meier, E., Müller, P., Van Gool, L., & Schraudolph, N. N. (2004). 3D hand tracking by rapid stochastic gradient descent using a skinning model. In 1st European conference on visual media production (CVMP).
Bullock, I. M., Member, S., Zheng, J. Z., Rosa, S. D. L., Guertler, C., & Dollar, A. M. (2013). IEEE transactions on grasp frequency and usage in daily household and machine shop tasks, Haptics.
Camplani, M., & Salgado, L. (2012). Efficient spatio-temporal hole filling strategy for kinect depth maps. In Proceedings of SPIE.
Castellini, C., Tommasi, T., Noceti, N., Odone, F., & Caputo, B. (2011). Using object affordances to improve object recognition. In IEEE transactions on autonomous mental development.
Choi, C., Sinha, A., Hee Choi, J., Jang, S., & Ramani, K. (2015). A collaborative filtering approach to real-time hand pose estimation. In Proceedings of the IEEE international conference on computer vision (pp. 2336–2344).
Cooper, H. (2012). Sign language recognition using sub-units. The Journal of Machine Learning Research, 13, 2205.
MATH Google Scholar
Delamarre, Q., & Faugeras, O. (2001). 3D articulated models and multiview tracking with physical forces. Computer Vision and Image Understanding., 81, 328.
Article MATH Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Computer vision and pattern recognition (CVPR). IEEE.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. In IEEE transactions on pattern analysis and machine intelligence.
Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., & Twombly, X. (2007). Vision-based hand pose estimation: A review. Computer Vision and Image Understanding., 108, 52.
Article Google Scholar
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303.
Article Google Scholar
Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchical features for scene labeling. In IEEE transactions on pattern analysis and machine intelligence.
Fathi, A., Ren, X., & Rehg, J. M. (2011). Learning to recognize objects in egocentric activities. In 2011 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3281–3288). IEEE.
Fei-Fei, L., Fergus, R., & Perona, P. (2007). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106, 59.
Article Google Scholar
Feix, T., Romero, J., Ek, C. H., Schmiedmayer, H., & Kragic, D. (2013). A metric for comparing the anthropomorphic motion capability of artificial hands. In IEEE transactions on robotics.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. In IEEE transactions on pattern analysis and machine intelligence.
Girard, M., & Maciejewski, A. A. (1985). Computational modeling for the computer animation of legged figures. ACM SIGGRAPH Computer Graphics, 19, 263.
Article Google Scholar
Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In European conference on computer vision (ECCV). Springer.
Intel. (2013). Perceptual computing SDK.
Janoch, A., Karayev, S., Jia, Y., Barron, J. T., Fritz, M., Saenko, K., et al. (2013). A category-level 3d object dataset: Putting the kinect to work. In Consumer depth cameras for computer vision. Springer, London
Keskin, C., Kıraç, F., Kara, Y. E., & Akarun, L. (2012). Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In European conference on computer vision (ECCV).
Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., & Fitzgibbon, A. (2015). Learning an efficient model of hand shape variation from depth images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2540–2548).
Li, C., & Kitani, K. M. (2013). Pixel-level hand detection in ego-centric videos. In Computer vision and pattern recognition (CVPR).
Li, P., Ling, H., Li, X., & Liao, C. (2015). 3d hand pose estimation using randomized decision forest with segmentation index points. In Proceedings of the IEEE international conference on computer vision (pp. 819–827).
Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. in IEEE transactions on pattern analysis and machine intelligence.
Melax, S., Keselman, L., & Orsten, S. (2013). Dynamics based 3D skeletal hand tracking. In Proceedings of the ACM SIGGRAPH symposium on interactive 3D graphics and games-I3D ’13.
Mo, Z., & Neumann, U. (2006). Real-time hand pose recognition using low-resolution depth images. In 2006 IEEE computer society conference on computer vision and pattern recognition (vol. 2, pp. 1499–1505). IEEE.
Moore, A. W., Connolly, A. J., Genovese, C., Gray, A., Grone, L., & Kanidoris, N, I. I., et al. (2001). Fast algorithms and efficient statistics: N-point correlation functions. In Mining the Sky. Springer.
Muja, M., & Lowe, D. G. (2014). Scalable nearest neighbor algorithms for high dimensional data. In IEEE transactions on pattern analysis and machine intelligence.
Oberweger, M., Riegler, G., Wohlhart, P., & Lepetit, V. (2016). Efficiently creating 3d training data for fine hand pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4957–4965).
Oberweger, M., Wohlhart, P., & Lepetit, V. (2015a). Hands deep in deep learning for hand pose estimation. In Computer vision winter workshop (CVWW).
Oberweger, M., Wohlhart, P., & Lepetit, V. (2015b). Training a feedback loop for hand pose estimation. In Proceedings of the IEEE international conference on computer vision (pp. 3316–3324).
Ohn-Bar, E., & Trivedi, M. M. (2014a). Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. In IEEE transactions on intelligent transportation systems.
Ohn-Bar, E., & Trivedi, M. M. (2014b). Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. IEEE Transactions on Intelligent Transportation Systems, 15(6), 2368–2377.
Article Google Scholar
Oikonomidis, I., Kyriazis, N., & Argyros, A. (2011). Efficient model-based 3D tracking of hand articulations using kinect. In British machine vision conference (BMVC).
Pang, Y., & Ling, H. (2013). Finding the best from the second bests-inhibiting subjective bias in evaluation of visual tracking algorithms. In International conference on computer vision (ICCV).
Pieropan, A., Salvi, G., Pauwels, K., & Kjellstrom, H. (2014). Audio-visual classification and detection of human manipulation actions. In International conference on intelligent robots and systems (IROS).
Premaratne, P., Nguyen, Q., & Premaratne, M. (2010). Human computer interaction using hand gestures. Berlin: Springer.
Book MATH Google Scholar
PrimeSense. (2013). Nite2 middleware, Version 2.2.
Qian, C., Sun, X., Wei, Y., Tang, X., & Sun, J. (2014). Realtime and robust hand tracking from depth. In Computer vision and pattern recognition (CVPR).
Ren, Z., Yuan, J., & Zhang, Z. (2011). Robust hand gesture recognition based on finger-earth mover’s distance with a commodity depth camera. In Proceedings of the 19th ACM international conference on Multimedia. ACM.
Rogez, G., Khademi, M., Supancic, III, J., Montiel, J. M. M., & Ramanan, D. (2014). 3D hand pose detection in egocentric RGB-D images. CDC4CV workshop, European conference on computer vision (ECCV).
Rogez, G., Supancic, III, J., & Ramanan, D. (2015a). First-person pose recognition using egocentric workspaces. In Computer vision and pattern recognition (CVPR).
Rogez, G., Supancic, J. S., & Ramanan, D. (2015b). Understanding everyday hands in action from RGB-D images. In Proceedings of the IEEE international conference on computer vision (pp. 3889–3897).
Romero, J., Kjellstr, H., & Kragic, D. (2009). Monocular real-time 3D articulated hand pose estimation. In International conference on humanoid robots.
Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: What have we done, and where are we going? In International conference on computer vision (ICCV). IEEE.
Šarić, M. (2011). Libhand: A library for hand articulation, Version 0.9.
Scharstein, D. (2002). A taxonomy and evaluation of dense two-frame stereo. International Journal of Computer Vision, 47, 7.
Article MATH Google Scholar
Shakhnarovich, G., Viola, P., & Darrell, T. (2003). Fast pose estimation with parameter-sensitive hashing. In International conference on computer vision (ICCV). IEEE.
Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., & Izadi, S. (2015). Accurate, robust, and flexible real-time hand tracking. In ACM conference on computer–human interaction.
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., et al. (2013). Real-time human pose recognition in parts from single depth images. Communications of the ACM., 56, 116.
Article Google Scholar
Song, S., & Xiao, J. (2014). Sliding shapes for 3D object detection in depth images. In European conference on computer vision (ECCV).
Sridhar, S., Mueller, F., Oulasvirta, A., & Theobalt, C. (2015). Fast and robust hand tracking using detection-guided optimization. In Computer vision and pattern recognition (CVPR).
Sridhar, S., Oulasvirta, A., & Theobalt, C. (2013). Interactive markerless articulated hand motion tracking using RGB and depth data. In International conference on computer vision (ICCV).
Stenger, B., Thayananthan, A., Torr, P. H. S., & Cipolla, R. (2006). Model-based hand tracking using a hierarchical Bayesian filter. In IEEE transactions on pattern analysis and machine intelligence.
Stokoe, W. C. (2005). Sign language structure: An outline of the visual communication systems of the American deaf. Journal of Deaf Studies and Deaf Education, 10, 3.
Article Google Scholar
Sun, X., Wei, Y., Liang, S., Tang, X., & Sun, J. (2015). Cascaded hand pose regression. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 824–832).
Tang, D., Chang, H. J., Tejani, A., & Kim, T.-K. (2014). Latent regression forest: Structured estimation of 3D articulated hand posture. In Computer vision and pattern recognition (CVPR).
Tang, D., Taylor, J., Kohli, P., Keskin, C., Kim, T.-K., & Shotton, J. (2015). Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In Proceedings of the IEEE international conference on computer vision (pp. 3325–3333).
Tang, D., Yu, T.H. & Kim, T.-K. (2013). Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In International conference on computer vision (ICCV).
Taylor, J., Stebbing, R., Ramakrishna, V., Keskin, C., Shotton, J., & Izadi, S., et al. (2014). User-specific hand modeling from monocular depth sequences. In Computer vision and pattern recognition (CVPR). IEEE.
Taylor, J., Bordeaux, L., Cashman, T., Corish, B., Keskin, C., Sharp, T., et al. (2016). Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics (TOG), 35(4), 143.
Article Google Scholar
Tompson, J., Stein, M., Lecun, Y., & Perlin, K. (2014). Real-time continuous pose recovery of human hands using convolutional networks. In ACM Transactions on Graphics.
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In Computer vision and pattern recognition (CVPR). IEEE.
Tzionas, D., Srikantha, A., Aponte, P., & Gall, J. (2014). Capturing hand motion with an RGB-D sensor, fusing a generative model with salient points. In German Conference on Pattern Recognition (GCPR). Lecture notes in computer science. Springer.
Vezhnevets, V., Sazonov, V., & Andreeva, A. (2003). A survey on pixel-based skin color detection techniques. In Proceedings of the Graphicon, Moscow, Russia.
Wan, C., Yao, A., & Van Gool, L. (2016). Hand pose estimation from local surface normals. In European conference on computer vision (pp. 554–569). Springer.
Wetzler, A., Slossberg, R., & Kimmel, R. (2015). Rule of thumb: Deep derotation for improved fingertip detection. In British machine vision conference (BMVC). BMVA Press.
Xu, C., & Cheng, L. (2013). Efficient hand pose estimation from a single depth image. InInternational conference on computer vision (ICCV).
Yang, Y., & Ramanan, D. (2013). Articulated pose estimation with flexible mixtures-of-parts. In IEEE transactions on pattern analysis and machine intelligence.
Ye, Q., Yuan, S., & Kim, T.-K. (2016). Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In European conference on computer vision (pp. 346–361). Springer.
Zhu, X., Vondrick, C., Ramanan, D., & Fowlkes, C. (2012). Do we need more training data or better models for object detection? British Machine Vision Conference (BMVC), 3, 5.
Google Scholar

Download references

Acknowledgements

National Science Foundation Grant 0954083, Office of Naval Research-MURI Grant N00014-10-1-0933, and the Intel Science and Technology Center-Visual Computing supported JS&DR. The European Commission FP7 Marie Curie IOF grant “Egovision4Health” (PIOF-GA-2012-328288) supported GR.

Author information

Authors and Affiliations

University of California, Irvine, USA
James Steven Supančič III
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP*, LJK, 38000, Grenoble, France
Grégory Rogez
Institute of Engineering Univ., Grenoble Alpes, France
Grégory Rogez
Baidu Institute of Deep Learning, Sunnyvale, USA
Yi Yang
Microsoft Research, Cambridge, UK
Jamie Shotton
Carnegie Mellon University, Pittsburgh, PA, USA
Deva Ramanan

Authors

James Steven Supančič III
View author publications
You can also search for this author in PubMed Google Scholar
Grégory Rogez
View author publications
You can also search for this author in PubMed Google Scholar
Yi Yang
View author publications
You can also search for this author in PubMed Google Scholar
Jamie Shotton
View author publications
You can also search for this author in PubMed Google Scholar
Deva Ramanan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to James Steven Supančič III.

Additional information

Communicated by J. Rehg.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Supančič, J.S., Rogez, G., Yang, Y. et al. Depth-Based Hand Pose Estimation: Methods, Data, and Challenges. Int J Comput Vis 126, 1180–1198 (2018). https://doi.org/10.1007/s11263-018-1081-7

Download citation

Received: 03 December 2015
Accepted: 09 March 2018
Published: 12 April 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s11263-018-1081-7

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Abstract

Similar content being viewed by others

Real-Time Hand Pose Estimation Using Depth Camera

Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation Under Hand-Object Interaction

Key Developments in Human Pose Estimation for Kinect

1 Introduction

2 Testing Data