1 Introduction

Markerless pose estimation has attracted much interest since the release of low-cost depth cameras such as the Microsoft Kinect. Shotton et al. and Girshick et al. made an important step by presenting methods that reconstruct the full-body pose in real time; their approaches are described, in chronological order, in [3, 8, 9]. Despite this technological breakthrough, the accuracy of human pose estimation from single depth images remains insufficient for some applications.

The straightforward strategy to improve pose estimation is to substantially increase the size and diversity of the learning set, but this is costly, impractical, and often impossible. Other ideas to improve the method of Shotton et al. have also been developed. Yeung et al. [11] presented a way to combine the predictions of two Kinect cameras in order to reduce the unwanted joint position jitter and bone-length variation observed with the method described in [9]. Wei et al. [10] used a method equivalent to [8] in combination with a tracking algorithm and showed that it improves the robustness and accuracy of the estimated joint positions. In this paper, we present a principle for improvement that can be used with any markerless pose estimation method based on machine learning techniques. Instead of taking advantage of additional cameras or filtering the predictions in a post-processing step, we start by estimating the orientation of the observed person.

Fig. 1. Outline of our method. The orientation estimation can be obtained from the image itself or from any kind of sensor, for instance through a machine learning or a tracking algorithm. The last image shows the skeleton linking the estimated joints.

Our contribution is to show how an estimation of the orientation of the observed person improves the accuracy of a pose estimation algorithm. Our idea consists of slicing the full orientation range into smaller ranges and learning a different model for each of these smaller ranges. When the models are used to recover the pose, given the estimated orientation of the observed person, we use the appropriate model to predict the joint positions. To take the uncertainty of the orientation estimation into account, we consider slightly overlapping orientation ranges when the models are learned. An illustration of our method is shown in Fig. 1.
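As a minimal sketch of this selection step (the function and container names below are hypothetical, and the underlying regressor can be any pose estimation model exposing a `predict` method), the estimated orientation simply routes each observed silhouette to the model learned for the range that contains it:

```python
def model_index(theta_deg, n_models):
    """Index of the orientation range (of width 360/n_models degrees)
    that contains the estimated orientation theta_deg."""
    width = 360.0 / n_models
    return int((theta_deg % 360.0) // width)

def predict_pose(features, theta_deg, models):
    """Route the features describing one silhouette to the pose model
    specialized for the estimated orientation of the observed person."""
    return models[model_index(theta_deg, len(models))].predict(features)
```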

2 Principle of Leveraging an Orientation Estimation

The intuition for having several models, each dedicated to a smaller orientation range, is the following. From our experience, when analyzing silhouettes annotated with depth in each pixel (see Fig. 3), machine learning methods tend to grant a high importance to the information related to the external contour and not enough importance to the information carried by the depth signal. The problem is that two different poses correspond to the same silhouette shape [7] (when the small details of the silhouette caused by perspective effects are neglected), and this ambiguity leads to large errors when an average solution is predicted. Note that, with the arbitrary convention taken in this paper (see Fig. 2), one of the two possible poses is associated with an orientation of \(\theta \), while the other one is associated with an orientation of \(360^{\circ }-\theta \).

Fig. 2. The configuration considered in this paper. The person can be anywhere in the area (a square of 3 m side) with any pose and any orientation. The camera is placed 1 m above the floor; its optical axis is horizontal.

Therefore, except for the rare cases where the observed person has an orientation very close to \(0^{\circ }\) (seen from his right side) or \(180^{\circ }\) (seen from his left side), the knowledge of the orientation is sufficient to overcome the pose ambiguity, even if it is only roughly estimated. Our method is based on the idea that it is preferable to rely on an additional method specifically designed for orientation estimation instead of trying to recover the joint positions and disambiguate the silhouette orientation all at once. We observed that when a machine learning method is given a rough orientation estimate, and can thus focus on pose estimation alone instead of estimating orientation and pose simultaneously, its task is eased and the accuracy of its predictions improves.

Several clues can be used to estimate the orientation. When the observed person is walking, his orientation is given by his velocity vector and can therefore be estimated by tracking. This tracking can be done directly from the depth camera, or from laser range scanners [6]. The orientation can also be estimated directly from a single depth image [5].
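For instance, a minimal sketch of the tracking-based estimate is the angle of the displacement vector between the last two tracked ground-plane positions (variable names are ours, and mapping the resulting angle to the \(0^{\circ }\)/\(180^{\circ }\) convention of Fig. 2 depends on the chosen ground-plane axes):

```python
import numpy as np

def orientation_from_track(positions_xy):
    """Rough orientation estimate (degrees, in [0, 360)) from the last two
    tracked ground-plane positions; assumes the person is walking and
    faces the direction of motion."""
    (x0, y0), (x1, y1) = positions_xy[-2], positions_xy[-1]
    return float(np.degrees(np.arctan2(y1 - y0, x1 - x0)) % 360.0)
```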

One way of forcing the pose estimation method to take the orientation into account is to consider several ranges of orientation and to learn a different model for each range. During the pose estimation step, given the orientation estimation, we use the appropriate model to predict the pose. Note that the overlap between consecutive ranges should be adapted to the maximum uncertainty of the selected orientation estimation method. In the case of the estimation from the depth image, Piérard et al. [5] showed that it is possible to achieve an average uncertainty of \(4.3^{\circ }\) (measured on synthetic, noise-free data), but no bound was given. In practice, the errors are larger, but the temporal variance can be filtered out, leading to reliable estimates as shown on the video on the author’s website. We take an overlap of \(20^{\circ }\) for this orientation estimation method.
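One possible way to assign learning samples to the overlapping ranges is sketched below; we interpret the \(20^{\circ }\) overlap as each range being enlarged by \(10^{\circ }\) on both sides, which is consistent with the \(110^{\circ }\) and \(50^{\circ }\) ranges reported in Sect. 3 for 4 and 12 models (the function name and data layout are ours):

```python
import numpy as np

def samples_in_range(orientations_deg, range_index, n_models, overlap_deg=20.0):
    """Boolean mask selecting the learning samples whose ground-truth
    orientation falls inside range `range_index`, enlarged by half of the
    overlap on each side, so that consecutive ranges share overlap_deg degrees."""
    width = 360.0 / n_models                    # nominal range width
    center = range_index * width + width / 2.0  # center of the range
    half = width / 2.0 + overlap_deg / 2.0      # enlarged half-width
    theta = np.asarray(orientations_deg, dtype=float)
    dist = np.abs((theta - center + 180.0) % 360.0 - 180.0)  # circular distance
    return dist <= half
```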

3 Experiments

To assess the effectiveness of our principle, we implement a simplified (for practical reasons) version of the pose estimation method described by Girshick et al. [3]. The main differences are that we use a general-purpose regression random forest model (the ExtRaTrees [2]) instead of a custom one, that we use 500 features rather than 2,000 to describe the pixel environments, and that the models are learned from another, smaller dataset.
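As an illustration of how such a per-range regressor could be set up (a sketch only: scikit-learn's ExtraTreesRegressor is used here as a stand-in for the ExtRaTrees model of [2], and the feature and target layout is an assumption, not the exact one used in our implementation):

```python
from sklearn.ensemble import ExtraTreesRegressor

def train_range_model(X, Y, n_trees=100):
    """Train one pose model for a single orientation range.
    X: (n_samples, 500) depth-based features describing the pixel environments.
    Y: (n_samples, 3 * n_joints) regression targets related to the joint positions."""
    model = ExtraTreesRegressor(n_estimators=n_trees, n_jobs=-1, random_state=0)
    model.fit(X, Y)
    return model
```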

To generate the learning and test datasets, we followed a method similar to the one described in [9], except that we used the open-source software Blender and MakeHuman. Moreover, we used only one human model and did not add clothes to it. Without loss of generality, our small dataset is sufficient to establish that our principle helps to improve the accuracy of pose estimation. The poses used to generate the data were taken randomly from the CMU motion capture database [1]. A few unrealistic poses that do not correspond to a standing person have been manually excluded (less than \(1\,\%\)). A total of 24,000 silhouettes annotated with depth have been generated from the same number of poses for the learning set, and 10,000 for the test set. In the generated depth images, the distance from the human model to the camera varies from 1.5 to 5.74 m. Note that we used the specifications of Microsoft's Kinect v2 to generate the depth images, and we added Gaussian noise with the characteristics given in [4]. Some examples of our input depth images are shown in Fig. 3, with the projection of the ground-truth body joint positions in green.

Fig. 3. Examples of generated depth images used in our experiments. The ground truth body joints are displayed in green. (Color figure online)
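The noise addition mentioned above can be sketched as follows; the value of sigma_m below is a placeholder, as the actual noise characteristics (which depend on the Kinect v2 and on the distance to the camera) are those given in [4]:

```python
import numpy as np

def add_depth_noise(depth_m, sigma_m=0.01, rng=None):
    """Add zero-mean Gaussian noise to a synthetic depth image (in meters),
    leaving background pixels (encoded as 0) untouched. sigma_m is a placeholder."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = depth_m + rng.normal(0.0, sigma_m, size=depth_m.shape)
    return np.where(depth_m > 0.0, noisy, 0.0)
```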

We report the results obtained with 1, 4, and 12 models specialized according to the orientation. We analyze 8 body joints: neck, head, shoulder, elbow, wrist, hip, knee, and ankle. We only consider the right-side joints, given that the prediction accuracy is symmetrical for the left ones.

3.1 Improvement with a Constant Global Learning Dataset Size

Our first experiment shows what happens when we increase the number of models, with smaller orientation ranges, while keeping the total learning dataset size constant. Table 1 gives the mean Euclidean errors for 1, 4, and 12 models. We see a significant reduction of the error for all joints when going from 1 to 4 models. These results underline that using multiple models designed for narrow ranges of orientations is preferable to using a unique model. However, going from 4 to 12 models slightly worsens the performance.

Table 1. Mean errors on the positions of the considered body joints for different numbers of models used with a constant learning dataset size. There is an optimal number of models (4 in this experiment) for a constant learning dataset size (8,000 samples).

When the size of the learning dataset cannot be increased, there is a trade-off between, on the one hand, the improvement obtained from the knowledge of an approximate orientation estimate through the use of specialized pose estimation models and, on the other hand, the deterioration due to the reduction of the learning set size per model. Nevertheless, the optimal solution takes advantage of a few models and benefits from the knowledge of the orientation.

Note that the predictions for the head and the neck are less influenced by the number of models used. Indeed, the joints on the spine (that is, the person's rotation axis) are less affected than those in the limbs by a change of orientation. Moreover, we observe the largest errors on the wrist, as it is the joint with the highest freedom to move in space. The magnitude of the mean error is thus related to the variety of poses in the test set. The general trend is higher errors at limb extremities and lower errors at joints close to the torso.

The curves of Fig. 4 depict the mean Euclidean errors (estimated with a Gaussian filter of \(\sigma =8^{\circ }\)) affecting the pose estimation at every joint with respect to the orientation of the observed person. The results obtained with a single \(360^{\circ }\)-model are shown in red, while those obtained with four \(110^{\circ }\)-models are shown in blue. As can be seen, the errors are anisotropic, and the best improvement brought by our principle is for people facing the camera or seen from their back. Moreover, for all the joints of the right limbs, we observe larger errors when the person is seen from his left side, which is probably due to the fact that these joints then have a higher chance of being occluded.

Fig. 4. Mean errors (in cm) on the positions of the considered body joints for different numbers of models specialized for reduced ranges of orientation. By convention, a person with an orientation of \(0^{\circ }\) is seen from the right side. (Color figure online)
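The smoothing used for the curves of Fig. 4 can be reproduced, for instance, with a kernel estimate on the circular orientation axis (a sketch; the \(1^{\circ }\) evaluation grid is our choice for illustration):

```python
import numpy as np

def smoothed_error_curve(orientations_deg, errors_cm, sigma_deg=8.0, step_deg=1.0):
    """Mean error as a function of the person's orientation, estimated with a
    Gaussian kernel of standard deviation sigma_deg on the circular axis."""
    theta = np.asarray(orientations_deg, dtype=float)
    err = np.asarray(errors_cm, dtype=float)
    grid = np.arange(0.0, 360.0, step_deg)
    d = np.abs(theta[None, :] - grid[:, None]) % 360.0
    d = np.minimum(d, 360.0 - d)             # circular distance to the grid points
    w = np.exp(-0.5 * (d / sigma_deg) ** 2)  # Gaussian weights
    return grid, (w @ err) / w.sum(axis=1)
```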

3.2 Improvement with a Constant Learning Dataset Size per Model

Figure 4 also shows the behavior when the same experiment is performed with all models learned from the same number of samples. The dark gray curves correspond to a single \(360^{\circ }\)-model, the purple ones to four \(110^{\circ }\)-models, and the light gray ones to twelve \(50^{\circ }\)-models. Each of these models has been learned from 2,000 samples. Contrary to our first experiment, we observe a systematic decrease of the error when the number of models is increased. However, the small difference between 4 and 12 models suggests that a plateau is reached after 4 models; relying on many more models is therefore of little use. This suggests that a rough orientation estimation suffices to improve the performance of pose estimation.

4 Conclusion

This work presents the principle of using an estimation of the orientation of the observed person to improve the accuracy of a pose estimation algorithm. Instead of learning a unique model over the \(360^{\circ }\) range of orientations, we learn several models designed for smaller ranges of orientations. We tested this principle for different numbers of models and showed that the accuracy is significantly improved when going from a single model to a few orientation-specific models, even when the total learning dataset size is kept constant.