Abstract
Accurately predicting 3D body joint positions from a depth image in real time is the cornerstone of many safety, biomedical, and entertainment applications. Despite the high quality of depth images, the accuracy of existing methods for human pose estimation from single depth images remains insufficient for some applications. To enhance the accuracy, we propose leveraging a rough orientation estimate to dynamically select a 3D joint position prediction model specialized for this orientation. This orientation estimate can be obtained in real time either from the image itself or from any other cue, such as tracking. We demonstrate the merits of this general principle on a pose estimation method similar to the one used with Kinect cameras. Our results show that the accuracy is improved by up to 45.1 % with respect to a method using the same model for all orientations.
1 Introduction
Markerless pose estimation has attracted much interest since the release of low-cost depth cameras like the Microsoft Kinect. Shotton et al. and Girshick et al. made an important step by presenting methods that infer a full-body pose reconstruction in real-time. Their details were explained, chronologically, in [3, 8, 9]. Despite this technological breakthrough, the accuracy of human pose estimation from single depth images remains insufficient for some applications.
The straightforward strategy to improve pose estimation is to substantially increase the size and the diversity of the learning set, but this is costly, impractical, and often impossible. Other ideas to improve the method of Shotton et al. have also been developed. Yeung et al. [11] presented a way to combine the predictions of two Kinect cameras in order to reduce the unwanted joint-position vibration and bone-length variation observed with the method described in [9]. Wei et al. [10] used a method equivalent to [8] in combination with a tracking algorithm and showed that it improved the robustness and the accuracy of the estimated joint positions. In this paper, we present a principle for improvement that can be used with any markerless pose estimation method based on machine learning techniques. Instead of taking advantage of additional cameras or filtering the predictions in a post-processing step, we start by estimating the orientation of the observed person.
Our contribution is to show how an estimate of the orientation of the observed person improves the accuracy of a pose estimation algorithm. Our idea is to slice the full orientation range into smaller ranges and to learn a different model for each of these smaller ranges. When the models are used to recover the pose, we use the estimated orientation of the observed person to select the appropriate model, which then predicts the joint positions. To take into account the uncertainty in the orientation estimate, we consider slightly overlapping orientation ranges when the models are learned. An illustration of our method is shown in Fig. 1.
2 Principle of Leveraging an Orientation Estimation
The intuition for having several models, each covering a smaller orientation range, is the following. In our experience, when it comes to analyzing silhouettes annotated with depth in each pixel (see Fig. 3), machine learning methods tend to grant a high importance to the information related to the external contour and not enough importance to the information related to the depth signal. The problem is that there are two different poses corresponding to the same silhouette shape [7] (when the small details of the silhouette corresponding to the perspective effects are neglected), and this ambiguity leads to large errors when an average solution is predicted. Note that with the arbitrary convention taken in this paper (see Fig. 2), one of the two possible poses is associated with an orientation of \(\theta \), while the other one is associated with an orientation of \(360^{\circ }-\theta \).
Therefore, except for the rare cases where the observed person has an orientation very close to \(0^{\circ }\) (seen from his right side) or \(180^{\circ }\) (seen from his left side), the knowledge of the orientation is sufficient to overcome the pose ambiguity, even if it is only roughly estimated. Our method is based on the idea that it is preferable to rely on an additional method that is specifically designed for orientation estimation instead of trying to recover the joint positions and disambiguate the silhouette orientation all at once. We observed that when a machine learning method does not have to simultaneously estimate the orientation and the pose, and can focus on the pose estimation given that a rough orientation estimation is provided to it, its task is eased and the accuracy of the predictions is improved.
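The argument above can be made concrete with a small sketch. It assumes only the mirror relation between \(\theta \) and \(360^{\circ }-\theta \) stated in the text; the function names and the error bound `max_error_deg` are hypothetical and are introduced here for illustration. An estimate resolves the ambiguity when its error is smaller than half the angular gap between the two candidate orientations, and this gap vanishes at \(0^{\circ }\) and \(180^{\circ }\).

```python
def circular_distance(a_deg, b_deg):
    """Smallest absolute angle, in degrees, between two orientations."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def mirror_orientation(theta_deg):
    """Orientation of the other pose that shares the same silhouette."""
    return (360.0 - theta_deg) % 360.0


def is_disambiguated(theta_deg, max_error_deg):
    """True if any estimate within max_error_deg of theta_deg is strictly
    closer to theta_deg than to its mirror orientation."""
    gap = circular_distance(theta_deg, mirror_orientation(theta_deg))
    return max_error_deg < gap / 2.0
```

For example, with a hypothetical error bound of \(10^{\circ }\), a person at \(90^{\circ }\) is disambiguated (the gap is \(180^{\circ }\)), whereas a person at \(2^{\circ }\) is not (the gap is only \(4^{\circ }\)).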
Several cues can be used to estimate the orientation. When the observed person is walking, his orientation is given by his velocity vector, and can therefore be estimated by tracking. This tracking can be done directly with the depth camera, or with laser range scanners [6]. The orientation can also be estimated directly from a single depth image [5].
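As a minimal sketch of the tracking-based option, the direction of the velocity vector can be computed from two tracked ground-plane positions. The coordinate system, the `offset_deg` parameter used to align the result with a given orientation convention, and the `min_displacement` threshold are all assumptions made for illustration; none of them is specified in the text.

```python
import math


def heading_from_track(positions, offset_deg=0.0, min_displacement=1e-6):
    """Estimate a walking direction in degrees, in [0, 360).

    positions: list of (x, z) ground-plane coordinates, oldest first.
    offset_deg: hypothetical constant that aligns the angle with the
        orientation convention in use (an assumption, not from the paper).
    Returns None when the displacement is too small to define a direction.
    """
    if len(positions) < 2:
        return None
    (x0, z0), (x1, z1) = positions[0], positions[-1]
    vx, vz = x1 - x0, z1 - z0
    if math.hypot(vx, vz) < min_displacement:
        return None
    return (math.degrees(math.atan2(vz, vx)) + offset_deg) % 360.0
```

Returning `None` for a nearly static person reflects the fact that the velocity vector only conveys the orientation while the person is walking; in that case another cue is needed.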
One way of forcing the pose estimation method to take the orientation into account is to consider several ranges of orientation and to learn a different model for each range. During the pose estimation step, given the orientation estimation, we use the appropriate model to predict the pose. Note that the overlap between consecutive ranges should be adapted to the maximum uncertainty of the selected orientation estimation method. In the case of the estimation from the depth image, Piérard et al. [5] showed that it is possible to achieve an average uncertainty of \(4.3^{\circ }\) (measured on synthetic, noise-free data), but no bound was given. In practice, the errors are larger, but the temporal variance can be filtered out, leading to reliable estimates as shown on the video on the author’s website. We take an overlap of \(20^{\circ }\) for this orientation estimation method.
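The range construction and model selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the placement of the range centers, the function names, and the nearest-center selection rule are hypothetical. The width of each range is taken as \(360^{\circ }/N\) plus the overlap, which matches the \(110^{\circ }\) ranges for 4 models and the \(50^{\circ }\) ranges for 12 models with an overlap of \(20^{\circ }\).

```python
def circular_distance(a_deg, b_deg):
    """Smallest absolute angle, in degrees, between two orientations."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def build_ranges(n_models, overlap_deg):
    """Return one (center, width) pair in degrees per model.

    Each range spans 360 / n_models degrees plus overlap_deg, so that
    consecutive ranges share overlap_deg degrees. Centering the first
    range at half a step is an arbitrary choice (assumption).
    """
    step = 360.0 / n_models
    width = step + overlap_deg
    return [((i + 0.5) * step, width) for i in range(n_models)]


def ranges_containing(theta_deg, ranges):
    """Learning time: indices of every range that contains theta_deg."""
    return [i for i, (center, width) in enumerate(ranges)
            if circular_distance(theta_deg, center) <= width / 2.0]


def select_model(theta_est_deg, ranges):
    """Prediction time: index of the range whose center is closest."""
    return min(range(len(ranges)),
               key=lambda i: circular_distance(theta_est_deg, ranges[i][0]))
```

With `build_ranges(4, 20.0)`, a learning sample at \(95^{\circ }\) falls in two overlapping ranges, so it contributes to two models, while at prediction time a single model is selected. As long as the estimation error stays below half the overlap, the true orientation lies inside the range of the selected model.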
3 Experiments
To assess the effectiveness of our principle, we implement a version of the pose estimation method described in Girshick et al. [3], simplified for practical reasons. The main differences are that we use a general regression random forest model (the ExtRaTrees [2]) instead of a custom one, that we use 500 features rather than 2,000 to describe the pixel environments, and that the models are learned from another, smaller, dataset.
To generate the learning and test datasets, we followed a method similar to the one described in [9], except that we used the open source software packages Blender and MakeHuman. Moreover, we used only one human model and did not add clothes to it. Although small, our dataset is sufficient to establish that our principle helps to improve the accuracy of the pose estimation. The poses used to generate the data were taken randomly from the CMU motion capture database [1]. A few unrealistic poses that do not correspond to a standing person were manually excluded (less than \(1\,\%\)). A total of 24,000 silhouettes annotated with depth were generated from the same number of poses for the learning set, and 10,000 for the test set. In the generated depth images, the distance from the human model to the camera varies from 1.5 to 5.74 meters. Note that we used the specifications of the Microsoft Kinect v2 to generate the depth images and we added Gaussian noise with the characteristics given in [4]. Some examples of our input depth images are shown in Fig. 3 with the projection of the ground truth body joint positions in green.
We report the results obtained with 1, 4, and 12 models specialized according to the orientation. We analyze 8 body joints: neck, head, shoulder, elbow, wrist, hip, knee, ankle. We only consider the right joints given that the prediction accuracy will be symmetrical for the left ones.
3.1 Improvement with a Constant Global Learning Dataset Size
Our first experiment shows what happens when we increase the number of models, each covering a smaller orientation range, while keeping a constant learning dataset size. Table 1 gives the mean Euclidean errors for 1, 4 and 12 models. We see a significant reduction of the error for all joints when going from 1 to 4 models. These results underline that using multiple models designed for narrow ranges of orientations is preferable to using a single model. However, going from 4 to 12 models slightly worsens the performance.
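For reference, a mean Euclidean error per joint can be computed as sketched below. The data layout (one dictionary per test image, mapping a joint name to a 3D position) and the function name are assumptions made for this illustration; the evaluation code used for Table 1 is not described in the text.

```python
import math


def mean_euclidean_errors(predicted, ground_truth):
    """Mean Euclidean distance between predicted and true joint positions.

    predicted, ground_truth: lists with one dict per test image, each
    mapping a joint name to an (x, y, z) tuple (hypothetical layout).
    Returns a dict mapping each joint name to its mean error, expressed
    in the same unit as the input coordinates.
    """
    totals, counts = {}, {}
    for pred, truth in zip(predicted, ground_truth):
        for joint, position in pred.items():
            error = math.dist(position, truth[joint])
            totals[joint] = totals.get(joint, 0.0) + error
            counts[joint] = counts.get(joint, 0) + 1
    return {joint: totals[joint] / counts[joint] for joint in totals}
```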
With a learning dataset whose size cannot be increased, there is a trade-off between, on the one hand, the improvement obtained by using specialized pose estimation models selected from an approximate orientation estimate and, on the other hand, the deterioration due to the reduction of the learning set size for each model. Nevertheless, the optimal solution takes advantage of a few models, and benefits from the knowledge of the orientation.
Note that the predictions for the head and the neck are less influenced by the number of models used. Indeed, the joints on the spine (that is, the person's rotation axis) are less affected by a change of orientation than those in the limbs. Moreover, we observe the largest errors on the wrist, as it is the joint that has the greatest freedom to move in space. The magnitude of the mean error is thus related to the variety of poses in the test set. The general trend is higher errors at limb extremities, and lower errors at joints close to the torso.
The curves of Fig. 4 depict the mean Euclidean errors (estimated with a Gaussian filter of \(\sigma =8^{\circ }\)) affecting the pose estimation at every joint with respect to the orientation of the observed person. The results obtained with a single \(360^{\circ }\)-model are shown in red, while those obtained with four \(110^{\circ }\)-models are shown in blue. As can be seen, the errors are anisotropic, and the largest improvement obtained thanks to our principle is for people facing the camera or seen from the back. Moreover, for all the joints of the right limbs, we observe larger errors when the person is seen from his left side, which is probably due to the fact that these joints have a higher chance of being occluded.
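One plausible way to obtain such curves is a Gaussian-weighted mean of the per-sample errors over the orientation axis, with the angular distance wrapped around \(360^{\circ }\). The sketch below is an interpretation of "estimated with a Gaussian filter of \(\sigma =8^{\circ }\)", not a description of the procedure actually used; the function name and arguments are hypothetical.

```python
import math


def smoothed_error(theta_deg, orientations, errors, sigma_deg=8.0):
    """Gaussian-weighted mean of errors around orientation theta_deg.

    orientations: orientation of each test sample, in degrees.
    errors: Euclidean error of each test sample for one joint.
    The angular distance is circular, so samples near 359 degrees
    contribute to the estimate near 0 degrees.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for orientation, error in zip(orientations, errors):
        diff = (orientation - theta_deg) % 360.0
        diff = min(diff, 360.0 - diff)
        weight = math.exp(-0.5 * (diff / sigma_deg) ** 2)
        weighted_sum += weight * error
        weight_total += weight
    if weight_total == 0.0:
        return float("nan")
    return weighted_sum / weight_total
```

Evaluating this function on a regular grid of orientations for each joint yields one curve per joint and per set of models.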
3.2 Improvement with a Constant Learning Dataset Size per Model
Figure 4 also shows the behavior when the same experiment is performed with all models learned from the same number of learning samples. The dark gray curves correspond to a single \(360^{\circ }\)-model, the purple ones to four \(110^{\circ }\)-models, and the light gray ones to twelve \(50^{\circ }\)-models. Each of these models has been learned from 2,000 samples. In contrast to our first experiment, we observe a systematic decrease of the error when the number of models is increased. However, the small difference between 4 and 12 models suggests that a plateau is reached after 4 models. Therefore, relying on too many models is useless. This suggests that a rough orientation estimate suffices to improve the performance of pose estimation.
4 Conclusion
This work presents the principle of using an estimate of the orientation of the observed person to improve the accuracy of a pose estimation algorithm. Instead of learning a single model over the \(360^{\circ }\)-range of orientation, we learn several models designed for smaller ranges of orientations. We tested this principle for different numbers of models and showed that the accuracy is significantly improved when the number of models increases while keeping a constant learning dataset size.
References
1. Carnegie Mellon University: Motion capture database. http://mocap.cs.cmu.edu
2. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
3. Girshick, R., Shotton, J., Kohli, P., Criminisi, A., Fitzgibbon, A.: Efficient regression of general-activity human poses from depth images. In: International Conference on Computer Vision (ICCV), Barcelona, Spain, pp. 415–422, November 2011
4. Kerl, C., Souiai, M., Sturm, J., Cremers, D.: Towards illumination-invariant 3D reconstruction using ToF RGB-D cameras. In: International Conference on 3D Vision (3DV), Tokyo, Japan, vol. 1, pp. 39–46, December 2014
5. Piérard, S., Leroy, D., Hansen, J.-F., Van Droogenbroeck, M.: Estimation of human orientation in images captured with a range camera. In: Blanc-Talon, J., Kleihorst, R., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2011. LNCS, vol. 6915, pp. 519–530. Springer, Heidelberg (2011)
6. Piérard, S., Pierlot, V., Barnich, O., Van Droogenbroeck, M., Verly, J.: A platform for the fast interpretation of movements and localization of users in 3D applications driven by a range camera. In: 3DTV Conference, Tampere, Finland, June 2010
7. Piérard, S., Van Droogenbroeck, M.: On the human pose recovery based on a single view. In: International Conference on Pattern Recognition Applications and Methods (ICPRAM), Vilamoura, Portugal, vol. 2, pp. 310–315, February 2012
8. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Providence, Rhode Island, USA, pp. 1297–1304, June 2011
9. Shotton, J., Girshick, R., Fitzgibbon, A., Sharp, T., Cook, M., Finocchio, M., Moore, R., Kohli, P., Criminisi, A., Kipman, A., Blake, A.: Efficient human pose estimation from single depth images. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2821–2840 (2013)
10. Wei, X., Zhang, P., Chai, J.: Accurate realtime full-body motion capture using a single depth camera. ACM Trans. Graph. 31(6), 188.1–188.12 (2012)
11. Yeung, K.-Y., Kwok, T.-H., Wang, C.: Improved skeleton tracking by duplex Kinects: a practical approach for real-time applications. J. Comput. Inf. Sci. Eng. 13(4), 041007-1–041007-10 (2013)
© 2016 Springer International Publishing Switzerland
Cite this paper
Azrour, S., Piérard, S., Van Droogenbroeck, M. (2016). Leveraging Orientation Knowledge to Enhance Human Pose Estimation Methods. In: Perales, F., Kittler, J. (eds) Articulated Motion and Deformable Objects. AMDO 2016. Lecture Notes in Computer Science(), vol 9756. Springer, Cham. https://doi.org/10.1007/978-3-319-41778-3_8
DOI: https://doi.org/10.1007/978-3-319-41778-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41777-6
Online ISBN: 978-3-319-41778-3