Abstract
Head pose estimation systems have quickly evolved from simple classifiers estimating a few yaw angles to the most recent regression approaches, which provide precise 3D face orientations in images acquired “in-the-wild”. Accurate evaluation of these algorithms is an open issue. Although the most recent approaches are tested using a few challenging annotated databases, their published results are not comparable. In this paper we review these works, define a common evaluation methodology, and establish a new state-of-the-art for this problem.
1 Introduction
We define head pose as the yaw, pitch and roll angles that determine the orientation of the head in the camera reference system [13]. Head pose estimation has attracted much research because it is a relevant pre-processing step for many face analysis tasks, such as facial landmark alignment [2, 20] and facial expression recognition [4]. It is also used in video-surveillance [11] and is intrinsically linked with human-computer interaction through social communication [12], gaze [18] and focus of attention [1] estimation.
There are many approaches for image-based head pose estimation. Some of them use very low resolution images [11] or 3D range data [5]. In this paper we only consider methods that use 2D images of average or high resolution. Among these, manifold embedding and non-linear regression techniques are possibly the most popular ones. The former assume that separated continuous head pose sub-spaces exist according to appearance [14]. Non-linear regression methods learn a mapping from image features to pose angles. Random Forests [5, 19] and Convolutional Neural Networks (CNNs) [6, 10, 15] are among the most prevalent.
At present, the best performing approaches are based on CNNs. Yang et al. [20] use a small CNN with 3 convolutional layers, 3 pooling layers and 2 fully connected layers to regress the yaw, pitch and roll angles. Ranjan et al. [15] fuse intermediate feature layers at different resolutions, and use a multi-task approach to detect faces and estimate facial landmarks, head pose and gender. The H-CNN architecture [10] uses an inception module [17] that pools and concatenates features from intermediate layers, and is jointly trained on the visibility, facial landmarks and head pose estimation parameters. In Table 1 we show the performance of these approaches. Although they use the same databases, their results cannot be immediately compared. This will be further discussed in Sect. 2.
In this paper we review the problem of estimating head pose by regressing the yaw, pitch and roll head angles from medium/high resolution images acquired “in-the-wild”, i.e. in realistic unrestricted conditions. Our contributions are:
- A brief survey of the best head pose estimation algorithms.
- Definition of an evaluation methodology and publicly available benchmark to precisely compare the performance of head pose estimation algorithms.
- The establishment of the state-of-the-art on this benchmark.
2 Benchmarking Head Pose
There are many public databases with labeled face data. However, very few of them provide ground truth head pose, because of the difficulty in accurately estimating these angles. Traditionally, pose estimation algorithms have been evaluated with databases acquired in laboratory conditions and with imprecise angular information [13]. Later, more realistic and accurate data-sets such as AFLW [8] emerged. They contain images of challenging real-world situations acquired without any position, illumination or quality restriction.
Here we propose the use of three databases:
- AFLW [8]. It contains a collection of 25993 faces acquired in an uncontrolled scenario with head poses ranging between ±120\(^{\circ }\) for yaw and ±90\(^{\circ }\) for pitch and roll angles. It provides a mean face 3D structure and manual annotations for 21 face landmarks. We compute the pose angles from the labeled landmarks using the POSIT algorithm [3], assuming each face has the 3D structure of the mean face. We found several annotation errors and, consequently, removed the affected faces from our benchmark. From the remaining faces we randomly choose 21074, 2068 and 1000 instances for training, validation and testing respectively. These images will be available after publication.
- AFW [21]. This small database has traditionally been used only for testing purposes. It has 250 images with 468 faces in quite challenging settings. It provides discrete yaw labels ranging from −90\(^{\circ }\) to 90\(^{\circ }\) at 15\(^{\circ }\) intervals, plus the facial bounding box. These labels were manually annotated, hence they are often not very accurate.
- 300W (see Note 1). It includes 689 challenging faces obtained from the testing subsets of other databases (HELEN, LFPW and IBUG). This is the most popular face alignment benchmark. It provides face bounding boxes and 68 manually annotated landmarks. It does not provide any pose information. We again use the AFLW mean 3D face and the POSIT algorithm [3] to estimate the three pose angles for each face instance. This data-set will also be publicly available.
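As an illustration of how three pose angles can be read from the rotation matrix that a pose algorithm such as POSIT returns, the following minimal sketch converts between a rotation matrix and (yaw, pitch, roll). The text does not state which Euler convention is used, so the composition order R = Ry(yaw) · Rx(pitch) · Rz(roll) and the function names are assumptions made only for this example. This order is chosen here because it recovers yaw with atan2, so yaw values beyond ±90\(^{\circ }\) remain representable while pitch stays within ±90\(^{\circ }\).

```python
import math


def rotation_from_angles(yaw, pitch, roll):
    """Build R = Ry(yaw) * Rx(pitch) * Rz(roll); angles in degrees.

    The composition order is an assumed convention, not taken from the paper.
    """
    b, a, g = (math.radians(v) for v in (yaw, pitch, roll))
    cb, sb = math.cos(b), math.sin(b)
    ca, sa = math.cos(a), math.sin(a)
    cg, sg = math.cos(g), math.sin(g)
    return [
        [cb * cg + sb * sa * sg, -cb * sg + sb * sa * cg, sb * ca],
        [ca * sg, ca * cg, -sa],
        [-sb * cg + cb * sa * sg, sb * sg + cb * sa * cg, cb * ca],
    ]


def angles_from_rotation(R):
    """Recover (yaw, pitch, roll) in degrees under the same assumed convention."""
    # Clamp to [-1, 1] to guard against rounding noise before asin.
    pitch = math.asin(max(-1.0, min(1.0, -R[1][2])))
    yaw = math.atan2(R[0][2], R[2][2])
    roll = math.atan2(R[1][0], R[1][1])
    return tuple(math.degrees(v) for v in (yaw, pitch, roll))
```

The conversion degenerates when pitch reaches exactly ±90\(^{\circ }\) (gimbal lock), where yaw and roll can no longer be separated.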
In Table 1 we show the published results of the best head pose estimation algorithms. AFLW figures are not comparable among any of the cited works. Some select 1000 test images at random and use the rest for training [10, 15]. Valle et al. [19] chose 10% of the images for testing and the rest for training. Gao et al. [6] use 15561 randomly chosen image faces for training and the remaining 7848 for testing. Moreover, none of these AFLW subsets are publicly available, hence it is impossible to make a fair comparison among any of these approaches.
Similarly, the results for AFW are not comparable. Some approaches test on the whole database [15, 19]. However, each was trained on a different subset of AFLW. Moreover, Kumar et al. [10] test on the 341 images whose height is larger than 150 pixels. Peng et al. [14] test on a different set of 459 faces.
Finally, the head pose labels for 300W are not available. Yang et al. [20] compute them from an average face composed of 49 3D points. Unfortunately, this information is not public.
In summary, to obtain comparable results all algorithms should use the same train, validation and test data-sets. For our benchmark we propose a single train data-set and a single validation data-set composed of, respectively, 21074 and 2068 face images randomly chosen from AFLW. For testing we have three data-sets: the AFLW test is performed on the remaining 1000 images, and when testing with AFW and 300W we use, respectively, all 468 and 689 faces of their test sets.
Note that our labels may also have small errors caused by the assumption that all faces have the same 3D structure.
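A fixed partition of this kind can be reproduced by shuffling the face identifiers with a seeded generator. The sketch below is only an illustration of that idea: the function name, the seed and the use of integer identifiers are assumptions, and it does not reproduce the actual benchmark partition, which is defined by the released image lists.

```python
import random


def split_benchmark(face_ids, n_train=21074, n_val=2068, n_test=1000, seed=0):
    """Split face identifiers into disjoint train/validation/test subsets.

    The default sizes follow the benchmark; `seed` is a hypothetical choice
    that only makes this illustrative partition repeatable.
    """
    needed = n_train + n_val + n_test
    if len(face_ids) < needed:
        raise ValueError("not enough faces for the requested split sizes")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(face_ids)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:needed]
    return train, val, test
```

Publishing the resulting identifier lists, rather than only the seed, is what makes a partition usable by others, since shuffling results can differ across languages and library versions.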
3 Experiments
3.1 Methodology
Following the models used by the best published results [6, 10, 15, 20], we use a distributed face representation extracted from a deep CNN. Training such a model from scratch requires a large amount of data and computing power. The usual approach in computer vision is to use a general architecture already trained on a related problem and fine-tune it for the task at hand (see Fig. 1).
To build our baseline regressors we use trained AlexNet [9], GoogLeNet [17], VGG [16] and ResNet [7] architectures, top performers in the image classification task of the ILSVRC competition. AlexNet was also used by Ranjan et al. [15], GoogLeNet by Kumar et al. [10], and VGG-Net (see Note 2) by Gao et al. [6]. In each architecture we replace the last 1000-unit Softmax classification layer with a Euclidean Loss layer with three units that model the yaw, pitch and roll angles.
For fine-tuning and evaluation we use the Caffe framework with a GeForce GTX 1080 (8 GB) graphics processor. We followed the same procedure for each model. We use the Nesterov Accelerated Gradient (NAG) method, initialize the learning rate to \(\alpha =10^{-5}\) and reduce it by a factor of \(\gamma =0.1\) after every “step size” iterations (see Table 2). Momentum was set to \(\mu =0.9\). Table 2 reports the remaining optimization parameters for each architecture. We optimize GPU memory occupation by setting the batch size and the number of iterations according to the network size, so large networks use a smaller batch and a larger number of iterations (see Table 2). The network weights used for the tests are those at the last iteration. They will be publicly available after publication.
Fine-tuning the parameters of the largest net, ResNet-152, takes 8 h, and the resulting model processes test images at an average rate of 4 FPS. In Fig. 2 we show a pair of learning curves for the VGG-19 and ResNet-152 architectures. Validation curves are more stable because we always process all test images. However, depending on the batch, the training performance has a larger variance. Vertical dashed red lines mark the number of iterations required to complete an epoch.
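For reference, the Euclidean loss used for this kind of regression head is, as documented for Caffe's EuclideanLoss layer, the sum of squared differences between predictions and targets divided by twice the batch size. The sketch below computes that quantity in plain Python; the function name and the list-of-triples input layout are assumptions for illustration.

```python
def euclidean_loss(predictions, targets):
    """Return 1/(2N) * sum_n ||pred_n - target_n||^2 over a batch of N samples.

    Each sample is a (yaw, pitch, roll) triple; the layout is an assumption.
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    total = 0.0
    for pred, target in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / (2.0 * len(predictions))
```

Because the three angles share one loss, an error of a given size is penalized equally whether it occurs in yaw, pitch or roll.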
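The step-wise learning rate decay described above can be written as \(\alpha \cdot \gamma ^{\lfloor t/s \rfloor }\), where \(t\) is the iteration and \(s\) the step size, which matches the "step" policy documented for Caffe solvers. The following sketch evaluates it; the function name is hypothetical, and the step size must be supplied by the caller because its per-architecture values are given in Table 2 and are not reproduced here.

```python
def step_learning_rate(iteration, step_size, base_lr=1e-5, gamma=0.1):
    """Learning rate at `iteration` under step decay.

    base_lr and gamma follow the values quoted in the text; step_size is
    architecture dependent and is passed in rather than assumed.
    """
    if iteration < 0 or step_size <= 0:
        raise ValueError("iteration must be >= 0 and step_size must be > 0")
    return base_lr * gamma ** (iteration // step_size)
```

For example, with a purely illustrative step size of 1000 iterations, the rate stays at \(10^{-5}\) for iterations 0 to 999 and drops to \(10^{-6}\) at iteration 1000.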
In Table 3 we present the results of the baseline classifiers for each network architecture. In general, these results confirm that the deeper the representation, the better the performance. This is a well-known fact in the deep learning literature [7].
In AFLW we use the Mean Absolute Error (MAE) of each angle as the evaluation metric. With this metric, the baseline model using AlexNet achieves better performance than Ranjan et al. [15]. Similarly, the GoogLeNet results improve on those of Kumar et al. [10]. For VGG-16, the results are only marginally better than those of Gao et al. [6], although our net was trained on the more general ImageNet data-set.
In AFW, since it provides discrete labels, we use the classification success rate as the metric. Here, although the results are again not strictly comparable, the models by Kumar et al. [10] and Ranjan et al. [15] improve on those achieved by our baseline classifiers. This is surprising since in the more precise AFLW regression case the result is the opposite. Perhaps in this case the discretization played against our models or, since AFW was manually labeled, the annotation error is higher. Hence, the MAE differences are less significant.
The MAEs of Yang et al. [20] in 300W, although not strictly comparable, are better than those of our baseline classifiers. This may be caused by the fact that they train their CNN on the 300W training data-set and, perhaps, over-fit to it.
Finally, in Fig. 3 we present some representative face images with head pose estimation errors greater than 15\(^{\circ }\) obtained using the ResNet-152 architecture. As can be noticed, sometimes the estimation seems to be more accurate than the annotation. This may be caused by the manual annotation error.
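A minimal sketch of the per-angle MAE follows. It assumes predictions and ground truth are given as (yaw, pitch, roll) triples in degrees and takes the plain absolute difference, without wrapping angles around ±180\(^{\circ }\); both the function name and these choices are assumptions for illustration.

```python
def mean_absolute_error(predictions, ground_truth):
    """Return the (yaw, pitch, roll) mean absolute errors in degrees."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("inputs must be non-empty and equal in length")
    sums = [0.0, 0.0, 0.0]
    for pred, truth in zip(predictions, ground_truth):
        for k in range(3):
            # Plain absolute difference per angle; no wrap-around handling.
            sums[k] += abs(pred[k] - truth[k])
    n = len(predictions)
    return tuple(s / n for s in sums)
```

Reporting the three values separately, rather than one average, shows which angle a model estimates worst.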
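The text does not specify how continuous yaw predictions are mapped to the discrete AFW labels. One plausible reading, sketched below as an assumption, snaps each predicted yaw to the nearest multiple of 15\(^{\circ }\) within [−90\(^{\circ }\), 90\(^{\circ }\)] and counts exact matches with the annotated label. The function names are hypothetical.

```python
import math


def discretize_yaw(yaw, bin_width=15.0, limit=90.0):
    """Snap a yaw angle in degrees to the nearest bin centre, clipped to +/- limit."""
    snapped = math.floor(yaw / bin_width + 0.5) * bin_width
    return max(-limit, min(limit, snapped))


def success_rate(predicted_yaws, labels):
    """Fraction of faces whose discretized predicted yaw equals the discrete label."""
    if not predicted_yaws or len(predicted_yaws) != len(labels):
        raise ValueError("inputs must be non-empty and equal in length")
    hits = sum(1 for p, y in zip(predicted_yaws, labels) if discretize_yaw(p) == y)
    return hits / len(labels)
```

Under this reading, a prediction only 8\(^{\circ }\) away from the label can still count as a failure if it falls in the neighbouring bin, which is one way discretization may penalize a regressor.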
4 Conclusions
We have surveyed the state-of-the-art on face pose estimation “in-the-wild”. Although some of the best performing approaches use the same train and test data-sets, their results are not comparable.
In this paper we have defined an evaluation procedure and benchmark data-sets with images captured in unrestricted settings. We have also trained a set of CNN-based classifiers that provide baseline results for our benchmark. The results in Table 3 represent the reproducible state-of-the-art for this problem.
The model based on the deepest network architecture, ResNet, provides the best overall performance. This confirms that deeper representations have better generalization capabilities. When compared with the best published results in the literature, although not strictly comparable, the ResNet model achieves better performance in the challenging AFLW dataset.
By making the baseline classifiers and the benchmark data-sets publicly available, we expect that future algorithms will be compared on fair grounds.
Notes
- 1.
- 2. They used VGG-Face, a VGG-16 architecture trained on the VGG face database.
References
1. Ba, S.O., Odobez, J.M.: Multiperson visual focus of attention from head pose and meeting contextual cues. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 101–116 (2011)
2. Dantone, M., Gall, J., Fanelli, G., Gool, L.V.: Real-time facial feature detection using conditional regression forests. In: Proceedings of Conference on Computer Vision and Pattern Recognition (2012)
3. DeMenthon, D., Davis, L.S.: Model-based object pose in 25 lines of code. Int. J. Comput. Vis. 15(1–2), 123–141 (1995)
4. Demirkus, M., Precup, D., Clark, J.J., Arbel, T.: Soft biometric trait classification from real-world face videos conditioned on head pose estimation. In: Proceedings of Conference on Computer Vision and Pattern Recognition Workshops (2012)
5. Fanelli, G., Dantone, M., Gall, J., Fossati, A., Van Gool, L.: Random forests for real time 3D face analysis. Int. J. Comput. Vis. 101(3), 437–458 (2013)
6. Gao, B.B., Xing, C., Xie, C.W., Wu, J., Geng, X.: Deep label distribution learning with label ambiguity. IEEE Trans. Image Process. 26(6), 2825–2838 (2016)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of Conference on Computer Vision and Pattern Recognition (2016)
8. Koestinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In: IEEE International Workshop on Benchmarking Facial Image Analysis Technologies (2011)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of Neural Information Processing Systems (NIPS) (2012)
10. Kumar, A., Alavi, A., Chellappa, R.: KEPLER: keypoint and pose estimation of unconstrained faces by learning efficient H-CNN regressors. In: Proceedings of International Conference on Automatic Face and Gesture Recognition (2017)
11. Lee, D., Yang, M., Oh, S.: Fast and accurate head pose estimation via random projection forests. In: Proceedings of Conference on Computer Vision and Pattern Recognition (2015)
12. Marín-Jiménez, M.J., Zisserman, A., Eichner, M., Ferrari, V.: Detecting people looking at each other in videos. Int. J. Comput. Vis. 106(3), 282–296 (2014)
13. Murphy-Chutorian, E., Trivedi, M.M.: Head pose estimation in computer vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 31(4), 607–626 (2009)
14. Peng, X., Huang, J., Hu, Q., Zhang, S., Metaxas, D.N.: Three-dimensional head pose estimation in-the-wild. In: Proceedings of International Conference on Automatic Face and Gesture Recognition (2015)
15. Ranjan, R., Patel, V.M., Chellappa, R.: HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR abs/1603.01249 (2016)
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
17. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of Conference on Computer Vision and Pattern Recognition (2015)
18. Valenti, R., Sebe, N., Gevers, T.: Combining head pose and eye location information for gaze estimation. IEEE Trans. Image Process. 21(2), 802–815 (2012)
19. Valle, R., Buenaposada, J.M., Valdés, A., Baumela, L.: Head-pose estimation in-the-wild using a random forest. In: Proceedings of Articulated Motion and Deformable Objects (AMDO) (2016)
20. Yang, H., Mou, W., Zhang, Y., Patras, I., Gunes, H., Robinson, P.: Face alignment assisted by head pose estimation. In: Proceedings of British Machine Vision Conference (2015)
21. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: Proceedings of Conference on Computer Vision and Pattern Recognition (2012)
Acknowledgments
The authors gratefully acknowledge computer resources provided by the Super-computing and Visualization Center of Madrid (CeSViMa) and funding from the Spanish Ministry of Economy and Competitiveness under projects TIN2013-47630-C2-2-R and TIN2016-75982-C2-2-R.
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Amador, E., Valle, R., Buenaposada, J.M., Baumela, L. (2018). Benchmarking Head Pose Estimation in-the-Wild. In: Mendoza, M., Velastín, S. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2017. Lecture Notes in Computer Science(), vol 10657. Springer, Cham. https://doi.org/10.1007/978-3-319-75193-1_6
DOI: https://doi.org/10.1007/978-3-319-75193-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75192-4
Online ISBN: 978-3-319-75193-1
eBook Packages: Computer Science (R0)