1 Introduction

Tracking people and localizing their head orientation or gaze direction provides valuable support to human behavior studies. For example, people nod to indicate that they understand what is being said, and computing the direction of a person's head or eye orientation makes it possible to identify the intended target of a conversation. However, most contemporary methods that estimate a person's focus of attention from head-pose and eye-gaze cues rely on high-resolution images, close-range cameras and a highly constrained context.

From the visual analysis viewpoint, estimating the focus of attention (FOA) from head pose in a dynamic environment is challenging due to unstructured behavior, the possibility of occlusion, low-resolution images, the nonlinear relationship between the head-pose angle and the target location, and so on.

In this work, a multi-camera system is used to overcome the occlusion problem. We then propose robust Kullback-Leibler divergence [21] and covariance descriptor [29] features for representing low-resolution images. Finally, we combine a tracker-based pose estimator and an appearance-based head pose predictor to perform head pose estimation, which would eventually allow us to estimate the personality of the participants.

Moreover, in a dynamic and unstructured setting, people move freely, which causes changes in their appearance. The feature distribution changes dramatically with a subject's location relative to the far-field cameras. Therefore, the appearance-based model of the human head pose needs to be updated according to the subject's location relative to the camera. Conventional machine learning algorithms perform poorly under such a varying distribution. Consequently, we introduce the transfer learning concept to tackle this challenging problem: we only need to collect a small number of samples at a few locations in the room and can then use transfer learning to estimate the head pose at all other locations.

The rest of this proposal is organized as follows. In Section 2, we review the state of the art on head pose estimation, classification and transfer learning. We discuss our method in detail in Section 3. Section 4 illustrates some results. Section 5 draws the concluding remarks.

2 Related work

Over the past decades, researchers have modeled human behaviour using multimodal approaches based on video and audio. Generally speaking, most works first extract discriminant features, especially visual features of the head/face and audio features of speaking activity, and then fuse the data to model specific behaviours such as interest, puzzlement and frustration.

2.1 Head pose estimation

For human head pose estimation, the goal is to estimate the three head rotations with respect to the cameras, represented by the yaw (or pan) angle, pitch (or tilt) angle and roll angle (see Fig. 1).

Fig. 1 Three degrees of freedom of a human head

Several popular families of methods [7], such as appearance template methods, tracking methods and geometric methods, are used for human head pose estimation. Appearance template methods compare a new head image to a set of exemplars and select the most similar view, usually the exemplar at the shortest distance from the new image. Tracking methods track the person and estimate the head pose simultaneously from video frames. Geometric methods use the locations of particular features such as the eyes and mouth to determine the pose from their relative configuration.

For high-resolution images, there has already been considerable effort on head pose estimation and gaze extraction. Perez and Cordoba [1] investigated gaze recognition through tracking of the eyeballs. Gee and Cipolla's [15] gaze recognition method was based on the 3D geometric relationship of facial features. However, the features these methods extract require high-resolution input. Most cameras in public areas operate under complex conditions and produce low-resolution images, which makes these methods unreliable in practice.

For low-resolution images, Robertson and Reid [25] proposed a skin- and hair-color feature based on color histograms for head pose estimation. However, this approach relies critically on a good segmentation of the skin and non-skin regions of the head image. Ba and Odobez [3, 4, 26] proposed a method for low-resolution head pose estimation; however, they used a fixed camera, which cannot solve the occlusion problem, and they coupled tracking and pose classification in a single framework, so the head pose estimation accuracy is limited by the accuracy of the tracking results. Tosato et al. [28] proposed the array-of-variances feature and classified it on Riemannian manifolds. This feature describes visual objects at low resolution better than other methods, but classification on a Riemannian manifold rather than in a Euclidean space is time-consuming and probably unsuitable for real-time systems. Yan et al. [32, 33, 37] proposed multi-task learning and transfer learning for head-pose estimation.

2.2 Classification

Appearance-based head pose estimation can be cast as a machine learning problem. Several basic learning algorithms are widely used in the machine learning area, such as K-Nearest-Neighbor (KNN), boosting and Support Vector Machines (SVMs).

The K-Nearest-Neighbor (KNN) algorithm classifies objects based on the closest training examples in the feature space. The training examples are vectors in a multi-dimensional feature space, each with a class label. The training phase consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is assigned the label that is most frequent among the k training samples nearest to it.
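As an illustration, a minimal KNN classifier fits in a few lines of NumPy; the Euclidean metric and plain majority vote below are common default choices rather than requirements of the method.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # distances from the query to every stored training vector
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote over their class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```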

Boosting refers to an effective method of producing an accurate prediction rule by combining rough and moderately inaccurate rules of thumb. Freund and Schapire [14] proposed AdaBoost, which solved many practical difficulties of earlier boosting algorithms. AdaBoost calls a given weak (or base) learning algorithm repeatedly in a series of rounds t = 1,...,T. Once the weak hypothesis \(h_{t}\) has been received, AdaBoost chooses a parameter \(\alpha_{t}\) according to its error. The final hypothesis H is a weighted majority vote of the T weak hypotheses, where \(\alpha_{t}\) is the weight assigned to \(h_{t}\).
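The sketch below mirrors this loop, assuming a generic weak_learner(X, y, w) that trains a weak hypothesis on weighted data (an assumption; any weighted base learner would do); the re-weighting rule and the formula for \(\alpha_{t}\) are the standard binary AdaBoost choices.

```python
import numpy as np

def adaboost_train(X, y, weak_learner, T=50):
    # y must be in {-1, +1}; weak_learner(X, y, w) -> h with h.predict(X)
    n = len(y)
    w = np.full(n, 1.0 / n)              # start from uniform sample weights
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, w)
        pred = h.predict(X)
        err = w[pred != y].sum()         # weighted error of h_t
        if err >= 0.5:                   # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # weight alpha_t
        w *= np.exp(-alpha * y * pred)   # up-weight the mistakes of h_t
        w /= w.sum()
        hs.append(h); alphas.append(alpha)
    return hs, np.array(alphas)

def adaboost_predict(X, hs, alphas):
    # final hypothesis H: weighted majority vote of the weak hypotheses
    return np.sign(sum(a * h.predict(X) for h, a in zip(hs, alphas)))
```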

Support Vector Machines (SVMs) [13] consider a d-dimensional feature space F, a subset of \(R^{d}\), spanned by a mapping φ. In a support vector (SV) setting, any φ corresponds to a Mercer kernel \(k(x, x^{\prime}) = \langle\varphi(x) \cdot \varphi(x^{\prime})\rangle\) that implicitly computes the dot product in F. The goal of SVMs is to find a separating hyperplane described by a vector w in the feature space F. Given training pairs \((x_{i}, y_{i})\), i = 1,…,l, with \(x_{i} \in R^{d}\) and \(y_{i} \in \{-1, 1\}\), classification requires the solution of the following optimization problem:

$$ \begin{array}{l} \min\limits_{w,b,\xi } \frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{l} {\xi_{i} } \\ \\ s.t. \quad y_{i} (w^{T} \varphi (x_{i} ) + b) \ge 1 - \xi_{i} \\ \qquad\ \ \xi_{i} \ge 0 \end{array} $$
(1)

where \(\xi_{i}\) is a slack variable and \(\sum \limits _{i = 1}^{l} {\xi _{i} }\) measures the total classification error. The objective seeks a decision boundary that achieves a small classification error while creating a large margin, with the two goals balanced by the scalar cost factor C.
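In practice, problem (1) is solved with an off-the-shelf solver; the following sketch uses scikit-learn's SVC on toy data (the library, the RBF kernel and the synthetic blobs are illustrative assumptions, not part of the original pipeline), with C playing the role of the cost factor above.

```python
import numpy as np
from sklearn.svm import SVC

# toy data: two Gaussian blobs standing in for head-pose feature vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C trades the margin term (1/2)w'w against the total slack sum(xi_i) in (1)
clf = SVC(C=1.0, kernel='rbf', gamma='scale').fit(X, y)
print(clf.predict(X[:5]))
```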

2.3 Transfer learning

Traditional machine learning approaches have already achieved significant success in computer vision, including classification, regression and clustering. However, they rest on the assumption that training and testing data share the same feature space and the same distribution. When the training and testing distributions differ, classification accuracy drops significantly. In this case, transfer learning between different domains is desirable. Transfer learning allows training and testing data to come from different domains and distributions. It is motivated by the fact that people can intelligently apply previously learned knowledge to solve new problems faster. The goal of transfer learning is to find some common property shared between the training (or source) domain and the test (or target) domain. Representative applications include event detection, egocentric activity recognition and multi-view action recognition [34–36].

Transfer learning raises three main research issues: (1) what to transfer, (2) how to transfer, and (3) when to transfer [22]. "What to transfer" asks which part of the knowledge can be transferred across domains or tasks. Once that knowledge is identified, learning algorithms are developed to address "how to transfer". "When to transfer" asks in which situations knowledge should be transferred at all, since negative transfer can even hurt performance on the target domain.

There are several approaches to transfer learning. Instance transfer [8, 17, 23, 27, 39, 40] re-weights some labeled data in the source domain for use in the target domain, under the assumption that certain parts of the source data can be reused for the target domain. Feature-representation transfer [2, 9] finds a "good" feature representation that reduces both the difference between the source and target domains and the error of classification and regression models. Parameter transfer [5, 12] discovers shared parameters or priors between the source and target domain models. Relational-knowledge transfer [20] builds a mapping of relational knowledge between the source domain and the target domain.

In our setting, we wish to minimize the differences among appearance features caused by people standing at different locations relative to the cameras. We use an adaptive multiple kernel learning method, which belongs to the instance-transfer category. Specifically, for each type of local feature, we train a set of SVM classifiers on a combined training set from the two domains using multiple base kernels of different types and parameters, which are then fused with equal weights to obtain an average classifier. The objective function simultaneously minimizes the structural risk functional and the mismatch of data distributions between the source and target domains. The next section presents our solution in detail.

3 Transfer learning for head pose estimation

The head-pose estimation process in the party scenario involves two main steps: (i) head feature representation and (ii) head pose classification.

3.1 Head feature representation

Face crops from the four camera views, obtained by the head localization procedure, are used for head pose prediction. We focus on where the persons are looking, especially the horizontal rotation of the head. We therefore discretize the space of possible head rotations into 24 classes: 8 classes for pan (horizontal rotation) and 3 classes for tilt (vertical rotation). We resize each head crop to 20×20 pixels and combine the four crops into one panorama image, which is fed to a template matching method for head pose estimation. Combining the four camera views into a panorama before extracting features alleviates the occlusion problem and improves estimation accuracy compared with using a single camera output.
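A minimal sketch of this discretization and panorama assembly is given below; the exact bin boundaries and the use of OpenCV for resizing are illustrative assumptions.

```python
import numpy as np
import cv2  # assumption: any image resize routine would work here

def pose_class(pan_deg, tilt_deg):
    # 8 pan bins of 45 degrees and 3 tilt bins -> 24 classes;
    # the bin edges are illustrative, not necessarily the paper's exact ones
    pan_bin = int(((pan_deg + 180) % 360) // 45)            # 0..7
    tilt_bin = int(np.clip((tilt_deg + 90) // 60, 0, 2))    # 0..2
    return tilt_bin * 8 + pan_bin                           # 0..23

def panorama(crops):
    # concatenate the four 20x20 camera crops into one 80x20 template
    return np.hstack([cv2.resize(c, (20, 20)) for c in crops])
```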

For low-level feature representation, there are two kinds of methods. The first category is sparse representations, consisting of a set of representative local regions obtained by an interest point detection algorithm. Reliable interest points should carry valuable information about the local image content and remain stable under changes such as viewpoint and illumination. Histogram-based representations of gradients, such as scale-invariant feature transform (SIFT) descriptors [19] and shape contexts, belong to this category.

The second category is dense representations, consisting of a set of representative regions obtained inside a detection window: the entire image is scanned densely and a learned classifier or object model is evaluated. Intensity templates and principal component analysis (PCA) coefficients belong to this category.

We present two low-level features for human head representation: the Kullback-Leibler divergence [21] and the covariance descriptor [29].

A good head pose representation relies on separating the background, hair and skin/non-skin pixels well. The idea is to compare each input image pixel to a set of per-pose mean appearance templates. We compute the Kullback-Leibler (KL) divergence [21] between the input test image and the mean template image for every pose, and choose the maximum KL value over the pose classes and RGB channels of each pixel as the feature:

$$ \theta_{i,j} = \max\limits_{c} \left\{ {\max\limits_{RGB} \left\{ {p_{i,j}^{c} * \log \frac{{p_{i,j}^{c} }}{{q_{i,j}^{c} }}} \right\}} \right\} $$
(2)

where i,j indexes the pixels, \(p_{i,j}^{c}\) and \(q_{i,j}^{c}\) are the mean-image and test-image pixel intensity values, and \(\theta_{i,j}\) is the maximum coefficient over all 24 classes and RGB channels for each pixel.
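A vectorized sketch of (2) follows; the normalization of intensities to (0, 1] and the small epsilon guarding the logarithm are implementation assumptions.

```python
import numpy as np

def kl_feature(test_img, mean_imgs, eps=1e-8):
    # test_img: (H, W, 3); mean_imgs: (24, H, W, 3); intensities in (0, 1]
    p = mean_imgs + eps                  # p_{i,j}^c: per-pose mean templates
    q = test_img[None] + eps             # q_{i,j}^c: the test image
    kl = p * np.log(p / q)               # elementwise KL term of (2)
    return kl.max(axis=(0, 3))           # max over classes c and RGB channels
```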

We also investigate another feature representation, the covariance descriptor [29]. In low-resolution images, the number of features that can be extracted is relatively small, and the features are quite unreliable.

Let I be a digital image and F the W × H × d-dimensional feature image extracted from I:

$$ F(x,y) = \varphi (I,x,y) $$
(3)

where φ can be any mapping such as intensity, color, gradients or filter responses. A region R in the image can then be represented by the d × d covariance matrix of its feature points:

$$ C_{R} = \frac{1}{{S - 1}}\sum\limits_{i = 1}^{S} {(z_{i} - \mu )(z_{i} - \mu )^{T} } $$
(4)

where \(\{z_{i}\}_{i=1..S}\) are the d-dimensional feature points inside the region and μ is the mean of the points. Figure 2 illustrates the covariance descriptor of an image region.

Fig. 2 Covariance descriptor
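A direct implementation of (3)-(4) might look as follows; the particular mapping φ (pixel coordinates, intensity and gradient magnitudes) is one common choice in the spirit of [29], used here as an assumption.

```python
import numpy as np

def covariance_descriptor(patch):
    # phi in (3): map each pixel of a grayscale patch to [x, y, I, |Ix|, |Iy|]
    H, W = patch.shape
    ys, xs = np.mgrid[0:H, 0:W]
    gy, gx = np.gradient(patch.astype(float))
    F = np.stack([xs, ys, patch, np.abs(gx), np.abs(gy)], axis=-1)
    z = F.reshape(-1, F.shape[-1])       # the S feature points z_i
    return np.cov(z, rowvar=False)       # (4): 1/(S-1) normalization
```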

Integral images are intermediate image representations used for the fast calculation of region sums [30]. Let P be the W × H × d tensor of integral images

$$ P(x^{\prime},y^{\prime},i) = \sum\limits_{x \le x^{\prime},y \le y^{\prime}} {F(x,y,i)} \qquad i = 1...d $$
(5)

and Q be the W × H × d × d tensor of second-order integral images

$$ Q(x^{\prime},y^{\prime},i,j) = \sum\limits_{x \le x^{\prime},y \le y^{\prime}} {F(x,y,i)F(x,y,j)} \qquad i,j = 1...d $$
(6)

Then \(P_{x,y}\) is a d-dimensional vector and \(Q_{x,y}\) is a d × d matrix

$$ P_{x,y} = [P(x,y,1)...P(x,y,d)]^{T} $$
(7)
$$ Q_{x,y} = \left[ \begin{array}{l} Q(x,y,1,1)...Q(x,y,1,d) \\ ... \\ Q(x,y,d,1)...Q(x,y,d,d) \end{array} \right] $$
(8)

for a region R. Using the integral images, the covariance of a region \(R(x^{\prime},y^{\prime};x^{\prime\prime},y^{\prime\prime})\) can be computed quickly as

$$ \begin{array}{l} C_{R(x^{\prime},y^{\prime};x^{\prime\prime},y^{\prime\prime})} \\ = \frac{1}{{S - 1}}[Q_{x^{\prime\prime},y^{\prime\prime}} + Q_{x^{\prime} - 1,y^{\prime} - 1} - Q_{x^{\prime\prime},y^{\prime} - 1} - Q_{x^{\prime} - 1,y^{\prime\prime}} \\ \quad - \frac{1}{S}(P_{x^{\prime\prime},y^{\prime\prime}} + P_{x^{\prime} - 1,y^{\prime} - 1} - P_{x^{\prime\prime},y^{\prime} - 1} - P_{x^{\prime} - 1,y^{\prime\prime}} ) \\ \quad (P_{x^{\prime\prime},y^{\prime\prime}} + P_{x^{\prime} - 1,y^{\prime} - 1} - P_{x^{\prime\prime},y^{\prime} - 1} - P_{x^{\prime} - 1,y^{\prime\prime}} )^{T} ] \\ \end{array} $$
(9)

where \((x^{\prime},y^{\prime})\) and \((x^{\prime\prime},y^{\prime\prime})\) are the upper-left and bottom-right coordinates of the region and \(S = (x^{\prime\prime} - x^{\prime} + 1)\cdot(y^{\prime\prime} - y^{\prime} + 1)\). Therefore, after constructing the integral images, the covariance of any rectangular region can be computed in \(O(d^{2})\) time.

The benefit of the covariance descriptor is that it combines several different features, such as color and texture, in a single descriptor.
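The fast computation in (5)-(9) reduces to two-dimensional prefix sums; below is a NumPy sketch assuming an (H, W, d) feature image and inclusive rectangle coordinates.

```python
import numpy as np

def integral_tensors(F):
    # P in (5) and Q in (6) as cumulative sums over x <= x', y <= y'
    P = F.cumsum(0).cumsum(1)                                 # (H, W, d)
    Q = np.einsum('hwi,hwj->hwij', F, F).cumsum(0).cumsum(1)  # (H, W, d, d)
    return P, Q

def region_cov(P, Q, y0, x0, y1, x1):
    # covariance of the rectangle (x0, y0)-(x1, y1) via (9), in O(d^2)
    def rect(T):
        s = T[y1, x1].copy()
        if y0 > 0: s = s - T[y0 - 1, x1]
        if x0 > 0: s = s - T[y1, x0 - 1]
        if y0 > 0 and x0 > 0: s = s + T[y0 - 1, x0 - 1]
        return s
    S = (y1 - y0 + 1) * (x1 - x0 + 1)
    p, q = rect(P), rect(Q)
    return (q - np.outer(p, p) / S) / (S - 1)
```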

3.2 Transfer learning on head-pose estimation

Assume Pr(x,y) and \({\Pr}^{\prime}(x,y)\) are two different distributions. The objective of a learning method is to minimize the expected risk

$$ R[\Pr ,\theta ,l(x,y,\theta )] = E_{(x,y) \sim \Pr } [l(x,y,\theta )] $$
(10)

of a loss function l(x,y,𝜃) that depends on a parameter 𝜃. Here, the notation (x,y) ∼ Pr means that (x,y) is drawn from the distribution Pr(x,y).

In practice, we only observe examples (x,y) drawn from Pr(x,y), so we use the empirical average

$$ R_{emp} [Pr,\theta ,l(x,y,\theta )] = \frac{1}{m}\sum\limits_{i = 1}^{m} {l(x_{i} ,y_{i} ,\theta )} $$
(11)

To avoid overfitting, we add a regularizer Ω(𝜃) and minimize the following equation

$$ R_{reg} [Pr,\theta ,l(x,y,\theta )] = R_{emp} [Pr,\theta ,l(x,y,\theta )] + \lambda {\Omega} (\theta ) $$
(12)

where λ is the trade-off coefficient between the loss function and the regularizer.

Importance sampling is a general technique for estimating properties of a particular distribution using only samples generated from a different distribution. Applying importance sampling,

$$ \begin{array}{l} R[{\mathrm{Pr^{\prime}}},\theta,l(x,y,\theta )] = E_{(x,y)\sim{\mathrm{Pr^{\prime}}}} [l(x,y,\theta )] \\ \\ = E_{(x,y)\sim\Pr}\left[\frac{{\mathrm{Pr^{\prime}}}(x,y)}{\Pr(x,y)}l(x,y,\theta)\right] \\ \\ = R[\Pr,\theta,\beta(x,y)l(x,y,\theta )] \end{array} $$
(13)

where \(\beta (x,y) = \frac {{{\mathrm {Pr^{\prime }}} (x,y)}}{{\Pr (x,y)}}\) is a re-weighting factor for each training example. However, the coefficients β(x,y) are usually unknown and must themselves be estimated.

Sugiyama et al. [27] propose a least-squares approach to directly estimate the importance coefficients β(x,y), modeling them by the linear model

$$ \hat \beta (x) = \sum\limits_{i = 1}^{m} {\alpha_{i} } \phi_{i} (x) $$
(14)

where \(\alpha = (\alpha_{1},...,\alpha_{m})\) are parameters learned from data samples and \(\phi_{i}(x)\) are basis functions to be chosen. They use a least-squares approach to minimize \(J(\alpha ) = \frac {1}{2}\int {\left ({\hat \beta (x) - \beta (x)} \right )}^{2} \Pr (x)dx\).

They then formulate this as the optimization problem

$$ \min\limits_{\alpha} \frac{1}{2}\alpha^{T} \hat H\alpha - \hat h^{T} \alpha + \lambda \mathbf{1}^{T}\alpha, \ \ s.t.\ \alpha \ge 0 $$
(15)

where \(\hat H_{i,j} = \frac {1}{{n_{tr} }}\sum \limits _{k = 1}^{n_{tr} } {\phi _{i} (x_{k}^{tr} )} \phi _{j} (x_{k}^{tr} )\) and \(\hat h_{l} = \frac {1}{{n_{te} }}\sum \limits _{k = 1}^{n_{te} } {\phi _{l} (x_{k}^{te} )}\), with \(n_{tr}\) and \(n_{te}\) the numbers of training and test samples. This method gives a closed-form solution.
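A compact sketch of this estimator follows, using the unconstrained variant: solve the regularized linear system in closed form, then clip negative coefficients to enforce α ≥ 0. Gaussian basis functions centered on the test points are an assumption in the spirit of [27].

```python
import numpy as np

def importance_weights(X_tr, X_te, sigma=1.0, lam=1e-3):
    C = X_te                                  # basis centers: the test points
    def phi(X):                               # (n, m) design matrix, cf. (14)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    Phi_tr, Phi_te = phi(X_tr), phi(X_te)
    H = Phi_tr.T @ Phi_tr / len(X_tr)         # H_hat from training samples
    h = Phi_te.mean(axis=0)                   # h_hat from test samples
    a = np.linalg.solve(H + lam * np.eye(len(C)), h)
    a = np.maximum(a, 0)                      # enforce alpha >= 0 by clipping
    return Phi_tr @ a                         # beta_hat on the training points
```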

Huang et al. [17] propose the Kernel Mean Matching (KMM) method, which does not require estimating the densities directly. Let Φ: X → F be a mapping into a feature space F and μ: P → F the expectation operator

$$ \mu (\Pr ) = E_{x \sim \Pr (x)} [{\Phi} (x)] $$
(16)

Then we can infer a suitable β by solving the following optimization problem

$$ \begin{array}{l} \min_{\beta} \left\| {\mu ({Pr}^{\prime} ) - E_{x \sim \Pr (x)} [\beta (x){\Phi} (x)]} \right\| \\ s.t. \quad \beta(x) \ge 0 \\ \\ \qquad E_{x \sim \Pr (x)} [\beta (x)] =1 \end{array} $$
(17)

In practice, we use empirical means instead of the density distributions, which gives

$$ \begin{array}{l} \quad \left\| {\frac{1}{m}\sum\limits_{i = 1}^{m} {\beta_{i} {\Phi} (} x_{i} ) - \frac{1}{{m^{\prime}}}\sum\limits_{i = 1}^{m^{\prime}} {\Phi (x_{i}^{\prime} )} } \right\|^{2} \\ \\ = \frac{1}{{m^{2} }}\beta^{T} K\beta - \frac{2}{{m^{2} }}\kappa^{T} \beta + C \end{array} $$
(18)

Here \(K_{ij} = k(x_{i}, x_{j})\) and \(\kappa _{i} = \frac {m}{{m^{\prime }}}\sum \limits _{j = 1}^{m^{\prime }} {k(x_{i} ,x_{j}^{\prime } } )\), and C is a constant that does not depend on β.

The optimization problem can be reformulated as the following quadratic program:

$$ \begin{array}{l} \min\limits_{\beta} \frac{1}{2}\beta^{T} K\beta - \kappa^{T} \beta \\ s.t. \quad \beta_{i} \in [0,B], \\ \qquad \left| {\sum\limits_{i = 1}^{m} {\beta_{i} } - m} \right| \le m\epsilon \end{array} $$
(19)

where B is an upper bound on the weights β and ε is a small tolerance that keeps the average weight close to one.
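A sketch of the quadratic program (19) using CVXOPT (a tooling assumption) is shown below; the tolerance heuristic ε = B/√m follows Huang et al. [17].

```python
import numpy as np
from cvxopt import matrix, solvers

def gaussian_kernel(X, Y, gamma):
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def kmm_weights(X_src, X_tgt, gamma=1.0, B=10.0, eps=None):
    m = X_src.shape[0]
    if eps is None:
        eps = B / np.sqrt(m)                  # tolerance heuristic from [17]
    K = gaussian_kernel(X_src, X_src, gamma)
    kappa = (m / X_tgt.shape[0]) * gaussian_kernel(X_src, X_tgt, gamma).sum(1)
    # (19): min 1/2 b'Kb - kappa'b  s.t. 0 <= b_i <= B, |sum(b) - m| <= m*eps
    G = np.vstack([np.ones((1, m)), -np.ones((1, m)), np.eye(m), -np.eye(m)])
    h = np.hstack([m * (1 + eps), m * (eps - 1), B * np.ones(m), np.zeros(m)])
    sol = solvers.qp(matrix(K), matrix(-kappa), matrix(G), matrix(h))
    return np.array(sol['x']).ravel()         # the estimated weights beta_i
```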

Because choosing the kernel parameters of an SVM model is difficult, Rakotomamonjy et al. [24] proposed a multiple kernel method that simultaneously learns a kernel and the associated predictor in a supervised learning setting. They address the multiple kernel learning problem through a weighted 2-norm regularization formulation with an additional constraint on the weights that encourages sparse kernel combinations.

They define the kernel as a linear combination of M base kernels as

$$ \begin{array}{l} K(x^{\prime},x) = \sum\limits_{m = 1}^{M} {d_{m} K_{m} (x,x^{\prime})} \\ s.t. \quad d_{m} \ge 0,\sum\limits_{m = 1}^{M} {d_{m} } = 1 \end{array} $$
(20)

Then they formulate the optimization problem as

$$ \begin{array}{l} \min\limits_{d} \ T(d) \\ s.t. \quad d_{m} \ge 0,\sum\limits_{m = 1}^{M} {d_{m} } = 1 \end{array} $$
(21)

where

$$ T(d) = \left\{ \begin{array}{l} \min\limits_{\{ f\} ,b,\xi } \frac{1}{2}\sum\limits_{m} {\frac{1}{{d_{m} }}\left\| {f_{m} } \right\|^{2} } + C\sum\limits_{i} {\xi_{i} } \\ \\ s.t. \quad y_{i} \left(\sum\limits_{m} {f_{m} (x_{i} )} + b\right) \ge 1 - \xi_{i} \\ \\ \qquad\ \ \xi_{i} \ge 0 \end{array} \right. $$
(22)

Here \(f_{m}\) is the m-th decision function, \(\sum \limits _{i} {\xi _{i} }\) measures the total classification error, and b is the bias term. T(d) has the form of a standard SVM problem, so the overall optimization can be solved with a gradient method.

Recently, several adaptation methods for the support vector machine (SVM) classifier have been proposed in the video retrieval literature [18, 31, 38]. To make the SVM classifier adapt to a new domain, we formulate the target decision function for any sample x as

$$ f^{T} (x) = \sum\limits_{p = 1}^{P} {\gamma_{p} f_{p} (x)} + \sum\limits_{m = 1}^{M} {d_{m} w^{\prime}_{m} \phi_{m} (x)} + b $$
(23)

where \(f_{p}(x)\) are the pre-learned classifiers trained on the labeled data from both domains, and \(\gamma_{p}\) and \(d_{m}\) are the coefficients of the pre-learned classifiers and of the multiple kernels, respectively.

For transfer learning [10, 11], the first objective is to reduce the mismatch between the source and the target domain. Gretton et al. [16] propose a way to measure the discrepancy between two distributions: the Maximum Mean Discrepancy (MMD), based on the distance between the means of the samples from the source and target domains in a Reproducing Kernel Hilbert Space (RKHS), namely:

$$ DIST(D^{S} ,D^{T} ) = {\Omega} (d) = \left\| \frac{1}{{n_{S} }}\sum\limits_{i = 1}^{n_{S} } {\phi ({x_{i}^{S}} )} - \frac{1}{{n_{T} }}\sum\limits_{i = 1}^{n_{T} } {\phi \left( {x_{i}^{T}} \right)} \right\|_{H} $$
(24)

where \({x_{i}^{S}}\) and \({x_{i}^{T}}\) are the samples from the source and target domains, respectively.
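Empirically, the MMD in (24) is evaluated with the kernel trick rather than explicit feature maps φ; a small sketch with an RBF kernel (an illustrative choice) follows.

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd(X_src, X_tgt, kernel=rbf):
    # ||mean phi(x^S) - mean phi(x^T)||_H^2 expanded into kernel evaluations
    Kss = kernel(X_src, X_src).mean()
    Ktt = kernel(X_tgt, X_tgt).mean()
    Kst = kernel(X_src, X_tgt).mean()
    return np.sqrt(max(Kss + Ktt - 2 * Kst, 0.0))
```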

The second objective is to minimize the structural risk functional. If we combine these two objectives, the optimization problem is given by

$$ \min\limits_{d} G(d) = \frac{1}{2}{\Omega}^{2} (d) + \theta J(d) $$
(25)

where

$$ \begin{array}{l} J(d) = \min\limits_{w_{m} ,\gamma ,b,\xi_{i} } \frac{1}{2}\left( \sum\limits_{m = 1}^{M} {d_{m} } \left\| {w_{m} } \right\|^{2} + \lambda \left\| \gamma \right\|^{2} \right) + C\sum\limits_{i = 1}^{n} {\xi_{i} } \\ \\ \qquad\qquad s.t. \quad y_{i} f^{T} (x_{i} ) \ge 1 - \xi_{i}, \ \xi_{i} \ge 0 \end{array} $$
(26)

Here, \(\gamma = [\gamma_{1},...,\gamma_{P}]\) and λ,C ≥ 0 are the regularization parameters. Defining \(\widetilde {w_{m} } = \left [w_{m}^{\prime } ,\sqrt \lambda \gamma '\right ]'\), \(\widetilde {v_{m}}=d_{m} \widetilde {w_{m}} \) and \(\widetilde {{\Phi }_{m} }(x_{i} ) = \left [{\Phi }_{m} (x_{i} )',\frac {1}{{\sqrt \lambda }}f(x_{i} )'\right ]'\), where \(f(x_{i}) = [f_{1}(x_{i}),...,f_{P}(x_{i})]\), we can derive the following equation

$$ \begin{array}{l} J(d) = \min\limits_{\widetilde{v_{m}} ,b,\xi_{i} } \frac{1}{2}\sum\limits_{m = 1}^{M} {\frac{{\left\| {\widetilde{v_{m} }} \right\|^{2} }}{{d_{m} }}} + C\sum\limits_{i = 1}^{n} {\xi_{i} } \\ \\ s.t.\quad y_{i} \left( \sum\limits_{m = 1}^{M} {\widetilde{v_{m}}^{\prime}} \widetilde{\Phi}_{m} (x_{i} ) + b\right) \ge 1 - \xi_{i} ,\ \xi_{i} \ge 0 \end{array} $$
(27)

By introducing Lagrange multipliers α, the dual form of the optimization is:

$$ J(d) = \max\limits_{\alpha} \mathbf{1}^{T}\alpha - \frac{1}{2}(\alpha \cdot y)'\left( \sum\limits_{m = 1}^{M} {d_{m} } \widetilde{K_{m} }\right)(\alpha \cdot y) $$
(28)

This has the same form as the dual of the standard SVM with kernel matrix \(\sum \limits _{m = 1}^{M} {d_{m} } \widetilde {K_{m} }\), so the optimization problem can be solved by an existing SVM solver [6].

It was proven in [24] that this optimization problem is jointly convex with respect to \(d, \widetilde {v_{m} },b\) and \(\xi_{i}\). The multiple kernel learning parameter d can be updated by \(d_{t+1} = d_{t} - \eta_{t} g_{t}\), where \(g_{t} = ({\nabla _{t}^{2}} G)^{- 1} \nabla {}_{t}G\), according to the alternating coordinate descent method shown in Table 1.

Table 1 Adaptive Multiple Kernel Learning Algorithm
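The sketch below illustrates one alternating iteration of this scheme under simplifying assumptions: a plain projected gradient step replaces the Newton-type step above, the gradient of J(d) is taken via Danskin's theorem at the current SVM solution, and scikit-learn's precomputed-kernel SVC stands in for the SVM solver of [6].

```python
import numpy as np
from sklearn.svm import SVC

def amkl_step(Ks, y, d, s, theta=1.0, C=1.0, lr=0.1):
    # Ks: (M, n, n) base kernel matrices; y in {-1, +1}; d: kernel weights;
    # s: MMD coefficient vector (+1/n_S on source, -1/n_T on target samples)
    K = np.tensordot(d, Ks, axes=1)            # combined kernel sum_m d_m K_m
    svm = SVC(C=C, kernel='precomputed').fit(K, y)
    alpha = np.zeros(len(y))
    alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())
    ay = alpha * y
    # grad of G(d) = 0.5*Omega^2(d) + theta*J(d), using
    # Omega^2(d) = sum_m d_m s'K_m s and dJ/dd_m = -0.5 (a.y)'K_m (a.y)
    grad = np.array([0.5 * s @ Km @ s - 0.5 * theta * ay @ Km @ ay for Km in Ks])
    d = np.maximum(d - lr * grad, 0)
    return d / d.sum()                         # project back onto the simplex
```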

4 Results

In this section, we evaluate our proposed method by testing head-pose estimation accuracy and transfer learning between two datasets.

4.1 Head pose estimation accuracy

We evaluated the performance of the pose estimation framework on the UcoHead and DPOSE datasets, for which pan, tilt and roll head rotation measurements are available. After resizing the face crops to 20×20 pixels and computing the appearance templates (of size 80×20) for each class, we performed three-fold cross-validation, with two folds used for training and one for testing. The mean accuracy over the three runs is presented in Table 2. We observe that the covariance descriptor feature outperforms the KL-distance feature on both datasets.

Table 2 Head-pose classification accuracy using SVM.

4.2 Transfer learning between two datasets

We evaluate the benefit of transfer learning between the UcoHead dataset and the DPOSE dataset. We discretize the space of possible head horizontal rotations (pan) into 8 classes, as proposed above. For each pan class, we randomly select images from the dataset and repeat the experiments 5 times to compute mean results. We evaluate our method with two experiments. The first uses the UcoHead dataset as the source domain (471 images) and the DPOSE dataset (60 images) as the target domain for training; 610 DPOSE images not included in the training set are then used for testing. The second uses the party data we recorded (610 images) as the source domain and UcoHead (80 images) as the target domain for training; 629 UcoHead images not included in the training set are then used for testing. In both cases the source domain is nearly 8 times larger than the target domain in the training set, which is a reasonable setting for transfer learning. We use 5 base Gaussian kernels (i.e., \(K(x_{i}, x_{j}) = \exp(-\gamma D^{2}(x_{i}, x_{j}))\)) with different kernel parameters γ = {−2,−1,0,1,2}. Table 3 reports the head-pose classification results between the two datasets with and without transfer learning.

Table 3 Head-pose classification results of using transfer learning technique. (A - UcoHead, B - DPOSE)

From Table 3, we observe that the classification accuracy is relatively low when no transfer learning is used between the two datasets. With transfer learning, useful information extracted from the source dataset helps classification on the target dataset, which substantially increases the accuracy.

4.3 Comparison

Finally, we compare our approach with other low-resolution head-pose estimation methods. Table 4 shows the comparison with state-of-the-art methods.

Table 4 Comparison with other low-resolution headpose estimation methods

5 Conclusion

Estimating the human head pose is a first step toward understanding human behavior. We propose a framework for head pose estimation in low-resolution images, together with an adaptive multiple kernel transfer learning technique that overcomes the weakness of appearance-based feature representations under changing viewpoints. Experimental results on public datasets demonstrate the effectiveness of our proposed method.