1 Introduction

Tracking people and localizing their head orientation or gaze direction provides valuable support to human behavior studies. For example, people nod to indicate that they understand what is being said, and computing the direction of a person's head or eye orientation makes it possible to identify the intended target of a conversation. However, most contemporary methods that estimate a person's focus of attention from head-pose and eye-gaze cues rely on high-resolution images, close-range cameras and a highly constrained context.

From the visual analysis viewpoint, estimating the focus of attention (FOA) from head pose in a dynamic environment is challenging due to unstructured behavior, the possibility of occlusion, low-resolution images, the nonlinear relationship between the head-pose angle and the target location, and so on.

In this work, a multi-camera system is used to overcome the occlusion problem. We then propose robust Kullback-Leibler divergence [21] and covariance descriptor [29] features for representing low-resolution images. Finally, we combine a tracker-based pose estimator and an appearance-based head pose predictor to perform head pose estimation, which would eventually allow us to estimate the personality of the participants.

Moreover, in a dynamic and unstructured setting, people move freely, which causes changes in their appearance. The feature distribution changes dramatically with a subject's location relative to the far-field cameras. Therefore, the appearance-based model of the human head pose needs to be updated according to the subject's location relative to the camera. Conventional machine learning algorithms perform poorly under such a varying distribution. Consequently, we introduce the transfer learning concept to tackle this challenging problem: we only need to collect a small number of samples at a few locations in the room and can then use transfer learning to estimate the head pose at all other locations.

The rest of this proposal is organized as follows. In Section 2, we review the state of the art on head pose estimation, classification and transfer learning. We discuss our method in detail in Section 3. Section 4 illustrates some results. Section 5 draws the concluding remarks.

2 Related work

Over the past decades, researchers have modeled human behaviour using multimodal approaches based on video and audio. Generally speaking, most works first extract discriminant features, especially visual features of the head/face and audio features of speaking activity, and then fuse the data to model specific behaviours such as interest, puzzlement and frustration.

2.1 Head pose estimation

For human head pose estimation, the goal is to estimate the three head rotations with respect to the cameras, represented by the yaw (or pan) angle, pitch (or tilt) angle and roll angle (see Fig. 1).

Fig. 1 Three degrees of freedom of a human head

Several popular families of methods [7], such as appearance template methods, tracking methods and geometric methods, are used for human head pose estimation. Appearance template methods compare a new head image to a set of exemplars and select the most similar view, usually the exemplar at the shortest distance from the new image. Tracking methods track the person and estimate the head pose simultaneously from video frames. Geometric methods use the locations of particular features such as the eyes and mouth to determine the pose from their relative configuration.

For high-resolution images, there has already been considerable effort on head pose estimation and gaze extraction. Perez and Cordoba [1] investigated gaze recognition through tracking of the eyeballs. Gee and Cipolla's [15] gaze recognition method was based on the 3D geometric relationship of facial features. However, the features these methods extract require high-resolution input. Most cameras in public areas operate under complex conditions and produce low-resolution images, which makes these methods unreliable in practice.

For low-resolution images, Robertson and Reid [25] proposed a skin- and hair-color feature based on color histograms for head pose estimation. However, this approach relies critically on a good segmentation of the skin and non-skin regions of the head image. Ba and Odobez [3, 4, 26] proposed a method for low-resolution head pose estimation; however, they used a fixed camera, which cannot solve the occlusion problem, and they coupled tracking and pose classification in a single framework, so the head pose estimation accuracy is limited by the accuracy of the tracking results. Tosato et al. [28] proposed the array-of-variances feature and classified it on Riemannian manifolds. This feature describes visual objects at low resolution better than other methods, but classification on a Riemannian manifold rather than in a Euclidean space is time-consuming and probably unsuitable for real-time systems. Yan et al. [32, 33, 37] proposed multi-task learning and transfer learning for head-pose estimation.

2.2 Classification

Appearance-based head pose estimation can be cast as a machine learning problem. Several basic learning algorithms are widely used in the machine learning area, such as K-Nearest-Neighbor (KNN), boosting and Support Vector Machines (SVMs).

The K-Nearest-Neighbor (KNN) algorithm classifies objects based on the closest training examples in the feature space. The training examples are vectors in a multi-dimensional feature space, each with a class label. The training phase consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is assigned the label that is most frequent among the k training samples nearest to it.
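As an illustration, a minimal KNN classifier fits in a few lines of NumPy; the Euclidean metric and plain majority vote below are common default choices rather than requirements of the method.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # distances from the query to every stored training vector
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote over their class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```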

Boosting refers to an effective method of producing an accurate prediction rule by combining rough and moderately inaccurate rules of thumb. Freund and Schapire [14] proposed AdaBoost, which solved many practical difficulties of earlier boosting algorithms. AdaBoost calls a given weak (or base) learning algorithm repeatedly in a series of rounds t = 1,...,T. Once the weak hypothesis \(h_{t}\) has been received, AdaBoost chooses a parameter \(\alpha_{t}\) according to its error. The final hypothesis H is a weighted majority vote of the T weak hypotheses, where \(\alpha_{t}\) is the weight assigned to \(h_{t}\).
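The sketch below mirrors this loop, assuming a generic weak_learner(X, y, w) that trains a weak hypothesis on weighted data (an assumption; any weighted base learner would do); the re-weighting rule and the formula for \(\alpha_{t}\) are the standard binary AdaBoost choices.

```python
import numpy as np

def adaboost_train(X, y, weak_learner, T=50):
    # y must be in {-1, +1}; weak_learner(X, y, w) -> h with h.predict(X)
    n = len(y)
    w = np.full(n, 1.0 / n)              # start from uniform sample weights
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, w)
        pred = h.predict(X)
        err = w[pred != y].sum()         # weighted error of h_t
        if err >= 0.5:                   # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # weight alpha_t
        w *= np.exp(-alpha * y * pred)   # up-weight the mistakes of h_t
        w /= w.sum()
        hs.append(h); alphas.append(alpha)
    return hs, np.array(alphas)

def adaboost_predict(X, hs, alphas):
    # final hypothesis H: weighted majority vote of the weak hypotheses
    return np.sign(sum(a * h.predict(X) for h, a in zip(hs, alphas)))
```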

Support Vector Machines (SVMs) [13] consider a d-dimensional feature space F, a subset of \(R^{d}\), spanned by a mapping φ. In a support vector (SV) setting, any φ corresponds to a Mercer kernel \(k(x, x^{\prime}) = \langle\varphi(x) \cdot \varphi(x^{\prime})\rangle\) that implicitly computes the dot product in F. The goal of SVMs is to find a separating hyperplane described by a vector w in the feature space F. Given training pairs \((x_{i}, y_{i})\), i = 1,…,l, with \(x_{i} \in R^{d}\) and \(y_{i} \in \{-1, 1\}\), classification requires the solution of the following optimization problem:

$$ \begin{array}{l} \min\limits_{w,b,\xi } \frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{l} {\xi_{i} } \\ \\ s.t. \quad y_{i} (w^{T} \varphi (x_{i} ) + b) \ge 1 - \xi_{i} \\ \qquad\ \ \xi_{i} \ge 0 \end{array} $$
(1)

where \(\xi_{i}\) is a slack variable and \(\sum \limits _{i = 1}^{l} {\xi _{i} }\) measures the total classification error. The objective seeks a decision boundary that achieves a small classification error while creating a large margin, with the two goals balanced by the scalar cost factor C.
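In practice, problem (1) is solved with an off-the-shelf solver; the following sketch uses scikit-learn's SVC on toy data (the library, the RBF kernel and the synthetic blobs are illustrative assumptions, not part of the original pipeline), with C playing the role of the cost factor above.

```python
import numpy as np
from sklearn.svm import SVC

# toy data: two Gaussian blobs standing in for head-pose feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C trades the margin term (1/2)w'w against the total slack sum(xi_i) in (1)
clf = SVC(C=1.0, kernel='rbf', gamma='scale').fit(X, y)
print(clf.predict(X[:5]))
```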

2.3 Transfer learning

Traditional machine learning approaches have already achieved significant success in computer vision, including classification, regression and clustering. However, they rest on the assumption that training and testing data share the same feature space and the same distribution. When the training and testing distributions differ, classification accuracy drops significantly. In this case, transfer learning between different domains is desirable. Transfer learning allows training and testing data to come from different domains and distributions. It is motivated by the fact that people can intelligently apply previously learned knowledge to solve new problems faster. The goal of transfer learning is to find some common property shared between the training (or source) domain and the test (or target) domain. Representative applications include event detection, egocentric activity recognition and multi-view action recognition [34–36].

Transfer learning raises three main research issues: (1) what to transfer, (2) how to transfer, and (3) when to transfer [22]. "What to transfer" asks which part of the knowledge can be transferred across domains or tasks. Once that knowledge is identified, learning algorithms are developed to address "how to transfer". "When to transfer" asks in which situations knowledge should be transferred at all, since negative transfer can even hurt performance on the target domain.

There are several approaches to transfer learning. Instance transfer [8, 17, 23, 27, 39, 40] re-weights some labeled data in the source domain for use in the target domain, under the assumption that certain parts of the source data can be reused for the target domain. Feature-representation transfer [2, 9] finds a "good" feature representation that reduces both the difference between the source and target domains and the error of classification and regression models. Parameter transfer [5, 12] discovers shared parameters or priors between the source and target domain models. Relational-knowledge transfer [20] builds a mapping of relational knowledge between the source domain and the target domain.

In our setting, we wish to minimize the differences among appearance features caused by people standing at different locations relative to the cameras. We use an adaptive multiple kernel learning method, which belongs to the instance-transfer category. Specifically, for each type of local feature, we train a set of SVM classifiers on a combined training set from the two domains using multiple base kernels of different types and parameters, which are then fused with equal weights to obtain an average classifier. The objective function simultaneously minimizes the structural risk functional and the mismatch of data distributions between the source and target domains. The next section presents our solution in detail.

3 Transfer learning for head pose estimation

The head-pose estimation process in the party scenario involves two main steps: (i) head feature representation and (ii) head pose classification.

3.1 Head feature representation

Face crops from the four camera views, obtained by the head localization procedure, are used for head pose prediction. We focus on where the persons are looking, especially the horizontal rotation of the head. We therefore discretize the space of possible head rotations into 24 classes: 8 classes for pan (horizontal rotation) and 3 classes for tilt (vertical rotation). We resize each head crop to 20×20 pixels and combine the four crops into one panorama image, which is fed to a template matching method for head pose estimation. Combining the four camera views into a panorama before extracting features alleviates the occlusion problem and improves estimation accuracy compared with using a single camera output.
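A minimal sketch of this discretization and panorama assembly is given below; the exact bin boundaries and the use of OpenCV for resizing are illustrative assumptions.

```python
import numpy as np
import cv2  # assumption: any image resize routine would work here

def pose_class(pan_deg, tilt_deg):
    # 8 pan bins of 45 degrees and 3 tilt bins -> 24 classes;
    # the bin edges are illustrative, not necessarily the paper's exact ones
    pan_bin = int(((pan_deg + 180) % 360) // 45)            # 0..7
    tilt_bin = int(np.clip((tilt_deg + 90) // 60, 0, 2))    # 0..2
    return tilt_bin * 8 + pan_bin                           # 0..23

def panorama(crops):
    # concatenate the four 20x20 camera crops into one 80x20 template
    return np.hstack([cv2.resize(c, (20, 20)) for c in crops])
```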

For low-level feature representation, there are two kinds of methods. The first category is sparse representations, consisting of a set of representative local regions obtained by an interest point detection algorithm. Reliable interest points should carry valuable information about the local image content and remain stable under changes such as viewpoint and illumination. Histogram-based representations of gradients, such as scale-invariant feature transform (SIFT) descriptors [19] and shape contexts, belong to this category.

The second category is dense representations, consisting of a set of representative regions obtained inside a detection window: the entire image is scanned densely and a learned classifier or object model is evaluated. Intensity templates and principal component analysis (PCA) coefficients belong to this category.

We present two low-level features for human head representation: the Kullback-Leibler divergence [21] and the covariance descriptor [29].

A good head pose representation relies on separating the background, hair and skin/non-skin pixels well. The idea is to compare each input image pixel to a set of per-pose mean appearance templates. We compute the Kullback-Leibler (KL) divergence [21] between the input test image and the mean template image for every pose, and choose the maximum KL value over the pose classes and RGB channels of each pixel as the feature:

$$ \theta_{i,j} = \max\limits_{c} \left\{ {\max\limits_{RGB} \left\{ {p_{i,j}^{c} * \log \frac{{p_{i,j}^{c} }}{{q_{i,j}^{c} }}} \right\}} \right\} $$
(2)

where i,j indexes the pixels, \(p_{i,j}^{c}\) and \(q_{i,j}^{c}\) are the mean-image and test-image pixel intensity values, and \(\theta_{i,j}\) is the maximum coefficient over all 24 classes and RGB channels for each pixel.
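A vectorized sketch of (2) follows; the normalization of intensities to (0, 1] and the small epsilon guarding the logarithm are implementation assumptions.

```python
import numpy as np

def kl_feature(test_img, mean_imgs, eps=1e-8):
    # test_img: (H, W, 3); mean_imgs: (24, H, W, 3); intensities in (0, 1]
    p = mean_imgs + eps                  # p_{i,j}^c: per-pose mean templates
    q = test_img[None] + eps             # q_{i,j}^c: the test image
    kl = p * np.log(p / q)               # elementwise KL term of (2)
    return kl.max(axis=(0, 3))           # max over classes c and RGB channels
```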

We also investigate another feature representation, the covariance descriptor [29]. In low-resolution images, the number of features that can be extracted is relatively small, and the features are quite unreliable.

Let I be a digital image and F the W × H × d-dimensional feature image extracted from I:

$$ F(x,y) = \varphi (I,x,y) $$
(3)

where φ can be any mapping such as intensity, color, gradients or filter responses. A region R in the image can then be represented by the d × d covariance matrix of its feature points:

$$ C_{R} = \frac{1}{{S - 1}}\sum\limits_{i = 1}^{S} {(z_{i} - \mu )(z_{i} - \mu )^{T} } $$
(4)

where \(\{z_{i}\}_{i=1..S}\) are the d-dimensional feature points inside the region and μ is the mean of the points. Figure 2 illustrates the covariance descriptor of an image region.

Fig. 2 Covariance descriptor
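A direct implementation of (3)-(4) might look as follows; the particular mapping φ (pixel coordinates, intensity and gradient magnitudes) is one common choice in the spirit of [29], used here as an assumption.

```python
import numpy as np

def covariance_descriptor(patch):
    # phi in (3): map each pixel of a grayscale patch to [x, y, I, |Ix|, |Iy|]
    H, W = patch.shape
    ys, xs = np.mgrid[0:H, 0:W]
    gy, gx = np.gradient(patch.astype(float))
    F = np.stack([xs, ys, patch, np.abs(gx), np.abs(gy)], axis=-1)
    z = F.reshape(-1, F.shape[-1])       # the S feature points z_i
    return np.cov(z, rowvar=False)       # (4): 1/(S-1) normalization
```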

Integral images are intermediate image representations used for the fast calculation of region sums [30]. Let P be the W × H × d tensor of integral images

$$ P(x^{\prime},y^{\prime},i) = \sum\limits_{x \le x^{\prime},y \le y^{\prime}} {F(x,y,i)} \qquad i = 1...d $$
(5)

and Q be the W × H × d × d tensor of second-order integral images

$$ Q(x^{\prime},y^{\prime},i,j) = \sum\limits_{x \le x^{\prime},y \le y^{\prime}} {F(x,y,i)F(x,y,j)} \qquad i,j = 1...d $$
(6)

Then \(P_{x,y}\) is a d-dimensional vector and \(Q_{x,y}\) is a d × d matrix

$$ P_{x,y} = [P(x,y,1)...P(x,y,d)]^{T} $$
(7)
$$ Q_{x,y} = \left[ \begin{array}{l} Q(x,y,1,1)...Q(x,y,1,d) \\ ... \\ Q(x,y,d,1)...Q(x,y,d,d) \end{array} \right] $$
(8)

for a region R. Using the integral images, the covariance of a region \(R(x^{\prime},y^{\prime};x^{\prime\prime},y^{\prime\prime})\) can be computed quickly as

$$ \begin{array}{l} C_{R(x^{\prime},y^{\prime};x^{\prime\prime},y^{\prime\prime})} \\ = \frac{1}{{S - 1}}[Q_{x^{\prime\prime},y^{\prime\prime}} + Q_{x^{\prime} - 1,y^{\prime} - 1} - Q_{x^{\prime\prime},y^{\prime} - 1} - Q_{x^{\prime} - 1,y^{\prime\prime}} \\ \quad - \frac{1}{S}(P_{x^{\prime\prime},y^{\prime\prime}} + P_{x^{\prime} - 1,y^{\prime} - 1} - P_{x^{\prime\prime},y^{\prime} - 1} - P_{x^{\prime} - 1,y^{\prime\prime}} ) \\ \quad (P_{x^{\prime\prime},y^{\prime\prime}} + P_{x^{\prime} - 1,y^{\prime} - 1} - P_{x^{\prime\prime},y^{\prime} - 1} - P_{x^{\prime} - 1,y^{\prime\prime}} )^{T} ] \\ \end{array} $$
(9)

where \((x^{\prime},y^{\prime})\) and \((x^{\prime\prime},y^{\prime\prime})\) are the upper-left and bottom-right coordinates of the region and \(S = (x^{\prime\prime} - x^{\prime} + 1)\cdot(y^{\prime\prime} - y^{\prime} + 1)\). Therefore, after constructing the integral images, the covariance of any rectangular region can be computed in \(O(d^{2})\) time.

The benefit of the covariance descriptor is that it combines several different features, such as color and texture, in a single descriptor.
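The fast computation in (5)-(9) reduces to two-dimensional prefix sums; below is a NumPy sketch assuming an (H, W, d) feature image and inclusive rectangle coordinates.

```python
import numpy as np

def integral_tensors(F):
    # P in (5) and Q in (6) as cumulative sums over x <= x', y <= y'
    P = F.cumsum(0).cumsum(1)                                 # (H, W, d)
    Q = np.einsum('hwi,hwj->hwij', F, F).cumsum(0).cumsum(1)  # (H, W, d, d)
    return P, Q

def region_cov(P, Q, y0, x0, y1, x1):
    # covariance of the rectangle (x0, y0)-(x1, y1) via (9), in O(d^2)
    def rect(T):
        s = T[y1, x1].copy()
        if y0 > 0: s = s - T[y0 - 1, x1]
        if x0 > 0: s = s - T[y1, x0 - 1]
        if y0 > 0 and x0 > 0: s = s + T[y0 - 1, x0 - 1]
        return s
    S = (y1 - y0 + 1) * (x1 - x0 + 1)
    p, q = rect(P), rect(Q)
    return (q - np.outer(p, p) / S) / (S - 1)
```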

3.2 Transfer learning on head-pose estimation

Assume Pr(x,y) and \({\Pr}^{\prime}(x,y)\) are two different distributions. The objective of a learning method is to minimize the expected risk

$$ R[\Pr ,\theta ,l(x,y,\theta )] = E_{(x,y) \sim \Pr } [l(x,y,\theta )] $$
(10)

of a loss function l(x,y,𝜃) that depends on a parameter 𝜃. Here, the notation (x,y) ∼ Pr means that (x,y) is drawn from the distribution Pr(x,y).

In practice, we only observe examples (x,y) drawn from Pr(x,y), so we use the empirical average

$$ R_{emp} [Pr,\theta ,l(x,y,\theta )] = \frac{1}{m}\sum\limits_{i = 1}^{m} {l(x_{i} ,y_{i} ,\theta )} $$
(11)

To avoid overfitting, we add a regularizer Ω(𝜃) and minimize the following equation

$$ R_{reg} [Pr,\theta ,l(x,y,\theta )] = R_{emp} [Pr,\theta ,l(x,y,\theta )] + \lambda {\Omega} (\theta ) $$
(12)

where λ is the trade-off coefficient between the loss function and the regularizer.

Importance sampling is a general technique for estimating properties of a particular distribution using only samples generated from a different distribution. Applying importance sampling,

$$ \begin{array}{l} R[{\mathrm{Pr^{\prime}}},\theta,l(x,y,\theta )] = E_{(x,y)\sim{\mathrm{Pr^{\prime}}}} [l(x,y,\theta )] \\ \\ = E_{(x,y)\sim\Pr}\left[\frac{{\mathrm{Pr^{\prime}}}(x,y)}{\Pr(x,y)}l(x,y,\theta)\right] \\ \\ = R[\Pr,\theta,\beta(x,y)l(x,y,\theta )] \end{array} $$
(13)

where \(\beta (x,y) = \frac {{{\mathrm {Pr^{\prime }}} (x,y)}}{{\Pr (x,y)}}\) is a re-weighting factor for each training example. However, the coefficients β(x,y) are usually unknown and must themselves be estimated.

Sugiyama et al. [27] propose a least-squares approach to directly estimate the importance coefficients β(x,y), modeling them by the linear model

$$ \hat \beta (x) = \sum\limits_{i = 1}^{m} {\alpha_{i} } \phi_{i} (x) $$
(14)

where \(\alpha = (\alpha_{1},...,\alpha_{m})\) are parameters learned from data samples and \(\phi_{i}(x)\) are basis functions to be chosen. They use a least-squares approach to minimize \(J(\alpha ) = \frac {1}{2}\int {\left ({\hat \beta (x) - \beta (x)} \right )}^{2} \Pr (x)dx\).

They then formulate this as the optimization problem

$$ \min\limits_{\alpha} \frac{1}{2}\alpha^{T} \hat H\alpha - \hat h^{T} \alpha + \lambda \mathbf{1}^{T}\alpha, \ \ s.t.\ \alpha \ge 0 $$
(15)

where \(\hat H_{i,j} = \frac {1}{{n_{tr} }}\sum \limits _{k = 1}^{n_{tr} } {\phi _{i} (x_{k}^{tr} )} \phi _{j} (x_{k}^{tr} )\) and \(\hat h_{l} = \frac {1}{{n_{te} }}\sum \limits _{k = 1}^{n_{te} } {\phi _{l} (x_{k}^{te} )}\), with \(n_{tr}\) and \(n_{te}\) the numbers of training and test samples. This method gives a closed-form solution.
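A compact sketch of this estimator follows, using the unconstrained variant: solve the regularized linear system in closed form, then clip negative coefficients to enforce α ≥ 0. Gaussian basis functions centered on the test points are an assumption in the spirit of [27].

```python
import numpy as np

def importance_weights(X_tr, X_te, sigma=1.0, lam=1e-3):
    C = X_te                                  # basis centers: the test points
    def phi(X):                               # (n, m) design matrix, cf. (14)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    Phi_tr, Phi_te = phi(X_tr), phi(X_te)
    H = Phi_tr.T @ Phi_tr / len(X_tr)         # H_hat from training samples
    h = Phi_te.mean(axis=0)                   # h_hat from test samples
    a = np.linalg.solve(H + lam * np.eye(len(C)), h)
    a = np.maximum(a, 0)                      # enforce alpha >= 0 by clipping
    return Phi_tr @ a                         # beta_hat on the training points
```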

Huang et al. [17] propose the Kernel Mean Matching (KMM) method, which does not require estimating the densities directly. Let Φ: X → F be a mapping into a feature space F and μ: P → F the expectation operator

$$ \mu (\Pr ) = E_{x \sim \Pr (x)} [{\Phi} (x)] $$
(16)

Then we can infer a suitable β by solving the following optimization problem

$$ \begin{array}{l} \min_{\beta} \left\| {\mu ({Pr}^{\prime} ) - E_{x \sim \Pr (x)} [\beta (x){\Phi} (x)]} \right\| \\ s.t. \quad \beta(x) \ge 0 \\ \\ \qquad E_{x \sim \Pr (x)} [\beta (x)] =1 \end{array} $$
(17)

In practice, we use empirical means instead of the density distributions, which gives

$$ \begin{array}{l} \quad \left\| {\frac{1}{m}\sum\limits_{i = 1}^{m} {\beta_{i} {\Phi} (} x_{i} ) - \frac{1}{{m^{\prime}}}\sum\limits_{i = 1}^{m^{\prime}} {\Phi (x_{i}^{\prime} )} } \right\|^{2} \\ \\ = \frac{1}{{m^{2} }}\beta^{T} K\beta - \frac{2}{{m^{2} }}\kappa^{T} \beta + C \end{array} $$
(18)

Here \(K_{ij} = k(x_{i}, x_{j})\) and \(\kappa _{i} = \frac {m}{{m^{\prime }}}\sum \limits _{j = 1}^{m^{\prime }} {k(x_{i} ,x_{j}^{\prime } } )\), and C is a constant that does not depend on β.

The optimization problem can be reformulated as the following quadratic program:

$$ \begin{array}{l} \min\limits_{\beta} \frac{1}{2}\beta^{T} K\beta - \kappa^{T} \beta \\ s.t. \quad \beta_{i} \in [0,B], \\ \qquad \left| {\sum\limits_{i = 1}^{m} {\beta_{i} } - m} \right| \le m\epsilon \end{array} $$
(19)

where B is an upper bound on the weights β and ε is a small tolerance that keeps the average weight close to one.
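A sketch of the quadratic program (19) using CVXOPT (a tooling assumption) is shown below; the tolerance heuristic ε = B/√m follows Huang et al. [17].

```python
import numpy as np
from cvxopt import matrix, solvers

def gaussian_kernel(X, Y, gamma):
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def kmm_weights(X_src, X_tgt, gamma=1.0, B=10.0, eps=None):
    m = X_src.shape[0]
    if eps is None:
        eps = B / np.sqrt(m)                  # tolerance heuristic from [17]
    K = gaussian_kernel(X_src, X_src, gamma)
    kappa = (m / X_tgt.shape[0]) * gaussian_kernel(X_src, X_tgt, gamma).sum(1)
    # (19): min 1/2 b'Kb - kappa'b  s.t. 0 <= b_i <= B, |sum(b) - m| <= m*eps
    G = np.vstack([np.ones((1, m)), -np.ones((1, m)), np.eye(m), -np.eye(m)])
    h = np.hstack([m * (1 + eps), m * (eps - 1), B * np.ones(m), np.zeros(m)])
    sol = solvers.qp(matrix(K), matrix(-kappa), matrix(G), matrix(h))
    return np.array(sol['x']).ravel()         # the estimated weights beta_i
```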

Because choosing the kernel parameters of an SVM model is difficult, Rakotomamonjy et al. [24] proposed a multiple kernel method that simultaneously learns a kernel and the associated predictor in a supervised learning setting. They address the multiple kernel learning problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations.

They define the kernel as a linear combination of M base kernels as

$$ \begin{array}{l} K(x^{\prime},x) = \sum\limits_{m = 1}^{M} {d_{m} K_{m} (x,x^{\prime})} \\ s.t. \quad d_{m} \ge 0,\sum\limits_{m = 1}^{M} {d_{m} } = 1 \end{array} $$
(20)

Then they formulate the optimization problem as

$$ \begin{array}{l} \min\limits_{d} \ T(d) \\ s.t. \quad d_{m} \ge 0,\sum\limits_{m = 1}^{M} {d_{m} } = 1 \end{array} $$
(21)

where

$$ T(d) = \left\{ \begin{array}{l} \min\limits_{\{ f\} ,b,\xi } \frac{1}{2}\sum\limits_{m} {\frac{1}{{d_{m} }}\left\| {f_{m} } \right\|^{2} } + C\sum\limits_{i} {\xi_{i} } \\ \\ s.t. \quad y_{i} \left(\sum\limits_{m} {f_{m} (x_{i} )} + b\right) \ge 1 - \xi_{i} \\ \\ \qquad\ \ \xi_{i} \ge 0 \end{array} \right. $$
(22)

Here \(f_{m}\) is the m-th decision function, \(\sum \limits _{i} {\xi _{i} }\) measures the total classification error, and b is the bias term. T(d) has the form of a standard SVM problem, so the overall optimization can be solved with a gradient method.

Recently, several adaptation methods for the support vector machine (SVM) classifier have been proposed in the video retrieval literature [18, 31, 38]. To make the SVM classifier adapt to a new domain, we formulate the target decision function for any sample x as

$$ f^{T} (x) = \sum\limits_{p = 1}^{P} {\gamma_{p} f_{p} (x)} + \sum\limits_{m = 1}^{M} {d_{m} w^{\prime}_{m} \phi_{m} (x)} + b $$
(23)

where \(f_{p}(x)\) are the pre-learned classifiers trained on the labeled data from both domains, and \(\gamma_{p}\) and \(d_{m}\) are the coefficients of the pre-learned classifiers and of the multiple kernels, respectively.

For transfer learning [10, 11], the first objective is to reduce the mismatch between the source and the target domain. Gretton et al. [16] propose a way to measure the discrepancy between two distributions: the Maximum Mean Discrepancy (MMD), based on the distance between the means of the samples from the source and target domains in a Reproducing Kernel Hilbert Space (RKHS), namely:

$$ DIST(D^{S} ,D^{T} ) = {\Omega} (d) = \left\| \frac{1}{{n_{S} }}\sum\limits_{i = 1}^{n_{S} } {\phi ({x_{i}^{S}} )} - \frac{1}{{n_{T} }}\sum\limits_{i = 1}^{n_{T} } {\phi \left( {x_{i}^{T}} \right)} \right\|_{H} $$
(24)

where \({x_{i}^{S}}\) and \({x_{i}^{T}}\) are the samples from the source and target domains, respectively.
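Empirically, the MMD in (24) is evaluated with the kernel trick rather than explicit feature maps φ; a small sketch with an RBF kernel (an illustrative choice) follows.

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd(X_src, X_tgt, kernel=rbf):
    # ||mean phi(x^S) - mean phi(x^T)||_H^2 expanded into kernel evaluations
    Kss = kernel(X_src, X_src).mean()
    Ktt = kernel(X_tgt, X_tgt).mean()
    Kst = kernel(X_src, X_tgt).mean()
    return np.sqrt(max(Kss + Ktt - 2 * Kst, 0.0))
```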

The second objective is to minimize the structural risk functional. If we combine these two objectives, the optimization problem is given by

$$ \min\limits_{d} G(d) = \frac{1}{2}{\Omega}^{2} (d) + \theta J(d) $$
(25)

where

$$ \begin{array}{l} J(d) = \min\limits_{w_{m} ,\gamma ,b,\xi_{i} } \frac{1}{2}\left( \sum\limits_{m = 1}^{M} {d_{m} } \left\| {w_{m} } \right\|^{2} + \lambda \left\| \gamma \right\|^{2} \right) + C\sum\limits_{i = 1}^{n} {\xi_{i} } \\ \\ \qquad\qquad s.t. \quad y_{i} f^{T} (x_{i} ) \ge 1 - \xi_{i}, \ \xi_{i} \ge 0 \end{array} $$
(26)

Here, \(\gamma = [\gamma_{1},...,\gamma_{P}]\) and λ,C ≥ 0 are the regularization parameters. Defining \(\widetilde {w_{m} } = \left [w_{m}^{\prime } ,\sqrt \lambda \gamma '\right ]'\), \(\widetilde {v_{m}}=d_{m} \widetilde {w_{m}} \) and \(\widetilde {{\Phi }_{m} }(x_{i} ) = \left [{\Phi }_{m} (x_{i} )',\frac {1}{{\sqrt \lambda }}f(x_{i} )'\right ]'\), where \(f(x_{i}) = [f_{1}(x_{i}),...,f_{P}(x_{i})]\), we can derive the following equation

$$ \begin{array}{l} J(d) = \min\limits_{\widetilde{v_{m}} ,b,\xi_{i} } \frac{1}{2}\sum\limits_{m = 1}^{M} {\frac{{\left\| {\widetilde{v_{m} }} \right\|^{2} }}{{d_{m} }}} + C\sum\limits_{i = 1}^{n} {\xi_{i} } \\ \\ s.t.\quad y_{i} \left( \sum\limits_{m = 1}^{M} {\widetilde{v_{m}}^{\prime}} \widetilde{\Phi}_{m} (x_{i} ) + b\right) \ge 1 - \xi_{i} ,\ \xi_{i} \ge 0 \end{array} $$
(27)

By introducing Lagrange multipliers α, the dual form of the optimization is:

$$ J(d) = \max\limits_{\alpha} \mathbf{1}^{T}\alpha - \frac{1}{2}(\alpha \cdot y)'\left( \sum\limits_{m = 1}^{M} {d_{m} } \widetilde{K_{m} }\right)(\alpha \cdot y) $$
(28)

This has the same form as the dual of the standard SVM with kernel matrix \(\sum \limits _{m = 1}^{M} {d_{m} } \widetilde {K_{m} }\), so the optimization problem can be solved by an existing SVM solver [6].

It was proven in [24] that this optimization problem is jointly convex with respect to \(d, \widetilde {v_{m} },b\) and \(\xi_{i}\). The multiple kernel learning parameter d can be updated by \(d_{t+1} = d_{t} - \eta_{t} g_{t}\), where \(g_{t} = ({\nabla _{t}^{2}} G)^{- 1} \nabla {}_{t}G\), according to the alternating coordinate descent method shown in Table 1.

Table 1 Adaptive Multiple Kernel Learning Algorithm
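The sketch below illustrates one alternating iteration of this scheme under simplifying assumptions: a plain projected gradient step replaces the Newton-type step above, the gradient of J(d) is taken via Danskin's theorem at the current SVM solution, and scikit-learn's precomputed-kernel SVC stands in for the SVM solver of [6].

```python
import numpy as np
from sklearn.svm import SVC

def amkl_step(Ks, y, d, s, theta=1.0, C=1.0, lr=0.1):
    # Ks: (M, n, n) base kernel matrices; y in {-1, +1}; d: kernel weights;
    # s: MMD coefficient vector (+1/n_S on source, -1/n_T on target samples)
    K = np.tensordot(d, Ks, axes=1)            # combined kernel sum_m d_m K_m
    svm = SVC(C=C, kernel='precomputed').fit(K, y)
    alpha = np.zeros(len(y))
    alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())
    ay = alpha * y
    # grad of G(d) = 0.5*Omega^2(d) + theta*J(d), using
    # Omega^2(d) = sum_m d_m s'K_m s and dJ/dd_m = -0.5 (a.y)'K_m (a.y)
    grad = np.array([0.5 * s @ Km @ s - 0.5 * theta * ay @ Km @ ay for Km in Ks])
    d = np.maximum(d - lr * grad, 0)
    return d / d.sum()                         # project back onto the simplex
```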

4 Results

In this section, we evaluate our proposed method by testing head-pose estimation accuracy and transfer learning between two datasets.

4.1 Head pose estimation accuracy

We evaluated the performance of the pose estimation framework on the UcoHead and DPOSE datasets, for which pan, tilt and roll head rotation measurements are available. After resizing the face crops to 20×20 pixels and computing the appearance templates (of size 80×20) for each class, we performed three-fold cross-validation, with two folds used for training and one for testing. The mean accuracy over the three runs is presented in Table 2. We observe that the covariance descriptor feature outperforms the KL-distance feature on both datasets.

Table 2 Head-pose classification accuracy using SVM.

4.2 Transfer learning between two datasets

We evaluate the benefit of transfer learning between the UcoHead dataset and the DPOSE dataset. We discretize the space of possible head horizontal rotations (pan) into 8 classes, as proposed above. For each pan class, we randomly select images from the dataset and repeat the experiments 5 times to compute mean results. We evaluate our method with two experiments. The first uses the UcoHead dataset as the source domain (471 images) and the DPOSE dataset (60 images) as the target domain for training; 610 DPOSE images not included in the training set are then used for testing. The second uses the party data we recorded (610 images) as the source domain and UcoHead (80 images) as the target domain for training; 629 UcoHead images not included in the training set are then used for testing. In both cases the source domain is nearly 8 times larger than the target domain in the training set, which is a reasonable setting for transfer learning. We use 5 base Gaussian kernels (i.e., \(K(x_{i}, x_{j}) = \exp(-\gamma D^{2}(x_{i}, x_{j}))\)) with different kernel parameters γ = {−2,−1,0,1,2}. Table 3 reports the head-pose classification results between the two datasets with and without transfer learning.

Table 3 Head-pose classification results of using transfer learning technique. (A - UcoHead, B - DPOSE)

From Table 3, we observe that the classification accuracy is relatively low when no transfer learning is used between the two datasets. With transfer learning, useful information extracted from the source dataset helps classification on the target dataset, which substantially increases the accuracy.

4.3 Comparison

Finally, we compare our approach with other low-resolution head-pose estimation methods. Table 4 shows the comparison with state-of-the-art methods.

Table 4 Comparison with other low-resolution headpose estimation methods

5 Conclusion

Estimating the human head pose is a first step toward understanding human behavior. We propose a framework for head pose estimation in low-resolution images, together with an adaptive multiple kernel transfer learning technique that overcomes the weakness of appearance-based feature representations under changing viewpoints. Experimental results on public datasets demonstrate the effectiveness of our proposed method.