
1 Introduction

Human facial appearance changes substantially over time, posing great challenges for the face recognition task. The age estimation task, on the other hand, aims to capture exactly this age-related information. To some extent, the two tasks can therefore be considered as competing with each other. Most existing studies investigate the two tasks independently, while in this paper we explore the coordination between them.

Existing research on age-invariant face recognition (AIFR) falls into two categories: generative methods and discriminative methods. The generative methods [1,2,3] focus on synthesizing an aging pattern; they rely on accurate age estimation and are computationally expensive. The discriminative methods focus on extracting age-invariant features for face recognition [4,5,6,7], but they ignore the aging effects on facial appearance.

The age estimation task is usually performed in two steps: feature extraction and classification (or regression). Most existing methods extract geometry and texture features and combine them with the Active Appearance Model (AAM) to simultaneously model the shape and texture of face images [1, 8]. Nevertheless, it is difficult to design effective features for age estimation because the mechanism of the aging effect on face images is not well understood [9].

The method proposed in [10] observes that jointly modeling two tasks with conflicting goals can help suppress the features that are irrelevant to each task and hence improve performance. Since age-invariant face recognition expects age-insensitive features while age estimation expects age-sensitive features, it is desirable to improve each of them by incorporating information from the other. Gong et al. [11, 12] propose a probabilistic model with two latent factors and learn an identity space and an age space simultaneously, but the model assumes that the latent factors and the random noise follow Gaussian distributions. The method proposed in [13] decomposes a face image into a class-specific part and a common part, and splits a dictionary into two corresponding dictionaries. However, it ignores the discriminative ability of the coding coefficients.

In this paper, a novel unified framework is proposed for face recognition and age estimation. The two tasks are modeled together in a joint framework: identity and age dictionaries are introduced to encode the identity part and the age part of a face image onto two separate subspaces. The learned identity subspace captures only the identity information of the face images and can be used for face recognition, while the age subspace captures the age-sensitive features and can be used for age estimation.

2 The Joint Model for Face Recognition and Age Estimation

2.1 Face Images Decomposition

Suppose the training set is composed of \( c \) classes, and the training samples of the \( i \)-th class are \( F_{i} = [f_{1} , \ldots ,f_{{n_{i} }} ] \in R^{{d \times n_{i} }} \), where \( f_{j} \in R^{d} , j = 1, \ldots ,n_{i} \). Let \( F = [F_{1} , \ldots , F_{c} ] \in R^{d \times n} \), \( n = n_{1} + \cdots + n_{c} \), be the training sample matrix. Each training sample \( f_{j} \in R^{d \times 1} ,j = 1, \ldots ,n \), can be decomposed into four parts: the person-specific part \( I_{j} \in R^{d \times 1} \), the age-related part \( A_{j} \in R^{d \times 1} \), the mean face feature \( m_{j} \in R^{d \times 1} \) and the random noise part \( \varepsilon_{j} \in R^{d \times 1} \). The face image decomposition can be formulated as:

$$ f_{j} = m_{j} + I_{j} + A_{j} + \varepsilon_{j} ,\quad j = 1, \ldots n $$
(1)
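As a quick sanity check, the additive decomposition in Eq. (1) can be simulated with synthetic data. This is a minimal NumPy sketch; all sizes and values are hypothetical stand-ins, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5  # hypothetical feature dimension and sample count

# Synthesize the four parts of Eq. (1) for n samples.
m = rng.normal(size=(d, 1))            # mean face feature (shared)
I_part = rng.normal(size=(d, n))       # person-specific parts I_j
A_part = rng.normal(size=(d, n))       # age-related parts A_j
eps = 0.01 * rng.normal(size=(d, n))   # random noise eps_j

# Each column is f_j = m + I_j + A_j + eps_j, as in Eq. (1).
F = m + I_part + A_part + eps

# Removing the mean, identity, and age parts leaves only the noise.
residual = F - m - I_part - A_part
assert np.allclose(residual, eps)
```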

2.2 Collaboration Coding with Identity and Age Dictionaries

In this paper, two dictionaries are introduced to encode the identity part and the age part of the face features: the identity dictionary \( U = [U_{1} ,U_{2} \ldots U_{i} , \ldots U_{c} ] \in R^{d \times p} \), \( U_{i} \in R^{{d \times p_{i} }} \), and the age dictionary \( V = [V_{1} , \ldots V_{s} , \ldots V_{t} ] \in R^{d \times q} \), \( V_{s} \in R^{{d \times q_{s} }} \). The coding coefficient matrix \( X_{i} = \left[ {x_{i}^{1} , \ldots ,x_{i}^{j} , \ldots x_{i}^{{n_{i} }} } \right] \in R^{{p \times n_{i} }} \) denotes the identity component of the \( i \)-th class training samples \( I_{i} \in R^{{d \times n_{i} }} \) coded over the identity dictionary \( U \in R^{d \times p} \), where \( x_{i}^{j} \in R^{p \times 1} \) denotes the identity coding coefficient of \( f_{i}^{j} \). Similarly, \( Y_{i} = \left[ {y_{i}^{1} , \ldots ,y_{i}^{j} , \ldots y_{i}^{{n_{i} }} } \right] \in R^{{q \times n_{i} }} \) denotes the age component of the \( i \)-th class training samples \( A_{i} \in R^{{d \times n_{i} }} \) coded over the age dictionary \( V \in R^{d \times q} \), where \( y_{i}^{j} \in R^{q \times 1} \) denotes the age coding coefficient of \( f_{i}^{j} \).

In our model, an over-complete dictionary is difficult to acquire since the number of training images per class is limited. Here, we introduce collaboration coding [15] and combine it with the reconstruction error to form a unified objective function:

$$ { \hbox{min} }\left\| {F - M - UX - VY} \right\|_{F}^{2} + \lambda_{1} \left\| X \right\|_{F}^{2} + \lambda_{2} \left\| Y \right\|_{F}^{2} , $$
(2)

where \( UX \) and \( VY \) denote the person-specific component \( I \) and the age-related component \( A \), respectively. The first term \( \left\| {F - M - UX - VY} \right\|_{F}^{2} \) is based on the idea of data reconstruction. The mean face matrix is \( M = [m,m, \ldots ,m] \in R^{d \times n} \), \( m = \frac{1}{n}\sum\limits_{j = 1}^{n} {f_{j} } \in R^{d \times 1} \), where \( n \) is the size of the total training set. The parameters \( \lambda_{1} ,\lambda_{2} \) balance the reconstruction error and the collaboration constraints.
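The mean face matrix \( M \) and the objective value of Eq. (2) can be sketched in NumPy as follows. All dimensions are toy values and the helper `objective` is illustrative, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, p, q = 8, 6, 4, 3  # hypothetical dimensions

F = rng.normal(size=(d, n))   # training sample matrix

# Mean face m: average over all n training samples, tiled into M.
m = F.mean(axis=1, keepdims=True)
M = np.tile(m, (1, n))

U = rng.normal(size=(d, p))   # identity dictionary
V = rng.normal(size=(d, q))   # age dictionary
X = rng.normal(size=(p, n))   # identity coding coefficients
Y = rng.normal(size=(q, n))   # age coding coefficients
lam1 = lam2 = 0.005

def objective(F, M, U, X, V, Y, lam1, lam2):
    """Value of Eq. (2): reconstruction error plus collaboration constraints."""
    recon = np.linalg.norm(F - M - U @ X - V @ Y, 'fro') ** 2
    penalty = (lam1 * np.linalg.norm(X, 'fro') ** 2
               + lam2 * np.linalg.norm(Y, 'fro') ** 2)
    return recon + penalty

val = objective(F, M, U, X, V, Y, lam1, lam2)
assert val >= 0.0  # a sum of squared norms is non-negative
```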

2.3 Discriminative Coefficients

The collaboration coefficients of a face image can be regarded as a feature vector representing the data in a new space, which should be discriminative for each class. Here, two label-supervised constraints are introduced [14]. The identity label supervised term \( \left\| {H_{1} - AX} \right\|_{F}^{2} \) ensures that samples from the same identity get similar identity coding coefficients \( x \), and \( \left\| {H_{2} - BY} \right\|_{F}^{2} \) ensures that samples from the same age group get similar age coding coefficients \( y \). The unified objective function is given as follows:

$$ \begin{aligned} & { \hbox{min} }\left\| {F - M - UX - VY} \right\|_{F}^{2} + \alpha \left\| {H_{1} - AX} \right\|_{F}^{2} + \beta \left\| {H_{2} - BY} \right\|_{F}^{2} \\ & \quad + \lambda_{1} \left\| X \right\|_{F}^{2} + \lambda_{2} \left\| Y \right\|_{F}^{2} , \\ \end{aligned} $$
(3)

where \( H_{1} \in R^{c \times n} \) and \( H_{2} \in R^{t \times n} \) denote the identity and age label matrices, with \( t \) being the number of age groups. If the \( l \)-th training sample belongs to the \( k \)-th identity class, \( H_{1} (k,l) \) equals 1; otherwise \( H_{1} (k,l) \) equals 0. \( H_{2} \) is defined similarly with respect to the age groups.
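The one-hot construction of the label matrices described above can be sketched as follows. The helper `label_matrix` and the label vectors are hypothetical illustrations:

```python
import numpy as np

def label_matrix(labels, num_classes):
    """Build a one-hot label matrix H with H[k, l] = 1 iff
    sample l belongs to class k (identity class or age group)."""
    n = len(labels)
    H = np.zeros((num_classes, n))
    H[labels, np.arange(n)] = 1.0
    return H

# Hypothetical identity labels for 5 samples over 3 identity classes,
# and age-group labels over 2 age groups.
identity_labels = np.array([0, 0, 1, 2, 2])
age_labels = np.array([0, 1, 1, 0, 1])

H1 = label_matrix(identity_labels, 3)  # identity label matrix
H2 = label_matrix(age_labels, 2)       # age label matrix

assert H1.sum(axis=0).tolist() == [1.0] * 5  # exactly one class per sample
```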

2.4 Optimization of the Joint Model

The objective function in formula (3) is not jointly convex in \( U,V,X,Y,A,B \). However, it is convex with respect to each of them when the others are fixed.

  • Firstly, fix \( V,B,X,Y \) and update \( U,A \). Formula (3) can be rewritten as:

$$ J_{{U^{*} ,A^{*} }} = { \arg }\,{ \hbox{min} }_{U,A} \left\| {\left[ {\begin{array}{*{20}c} {F - M - VY} \\ {\sqrt \alpha H_{1} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} U \\ {\sqrt \alpha A} \\ \end{array} } \right]X} \right\|_{F}^{2} , $$
(4)

Let \( D = \left[ {\begin{array}{*{20}c} U \\ {\sqrt \alpha A} \\ \end{array} } \right] \); then \( D \) can be updated class by class according to [16].

  • Secondly, fix \( U,A,X,Y \) and update \( V,B \).

The update process of \( V,B \) is similar to the optimization of \( U,A \).

  • Thirdly, fix \( U,V,A,B,Y \) and update \( X \). Formula (3) can be rewritten as:

$$ J_{{X^{*} }} = \arg \,\min_{X} \left\| {\left[ {\begin{array}{*{20}c} {F - M - VY} \\ {\sqrt \alpha H_{1} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} U \\ {\sqrt \alpha A} \\ \end{array} } \right]X} \right\|_{F}^{2} + \lambda_{1} \left\| X \right\|_{F}^{2} , $$
(5)

\( X \) can be calculated directly as in formula (6):

$$ X = \left( {\left[ {\begin{array}{*{20}c} {U^{T} } & {\sqrt \alpha A^{T} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {U} \\ {\sqrt \alpha A} \\ \end{array} } \right] + \lambda_{1} I} \right)^{ - 1} \left[ {\begin{array}{*{20}c} {U^{T} } & {\sqrt \alpha A^{T} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {F - M - VY} \\ {\sqrt \alpha H_{1} } \\ \end{array} } \right]. $$
(6)
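The closed-form update of Eq. (6) is a ridge-regression solve over the stacked data and label terms. A minimal NumPy sketch with hypothetical sizes and random stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, p, q, c = 8, 6, 4, 3, 3  # hypothetical sizes
alpha, lam1 = 2.3, 0.005

F = rng.normal(size=(d, n))
M = np.tile(F.mean(axis=1, keepdims=True), (1, n))
U = rng.normal(size=(d, p)); V = rng.normal(size=(d, q))
A = rng.normal(size=(c, p)); Y = rng.normal(size=(q, n))
H1 = np.zeros((c, n)); H1[rng.integers(0, c, n), np.arange(n)] = 1.0

# Stack the data term and the sqrt(alpha)-weighted label term, as in Eq. (5),
# then solve the ridge-regression normal equations of Eq. (6).
D = np.vstack([U, np.sqrt(alpha) * A])                # stacked dictionary
R = np.vstack([F - M - V @ Y, np.sqrt(alpha) * H1])   # stacked target
X = np.linalg.solve(D.T @ D + lam1 * np.eye(p), D.T @ R)

# X minimizes ||R - D X||_F^2 + lam1 ||X||_F^2, so the gradient vanishes.
grad = D.T @ (D @ X - R) + lam1 * X
assert np.allclose(grad, 0.0, atol=1e-8)
```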
  • Fourthly, fix \( U,V,A,B,X \) and update \( Y \).

The update process of \( Y \) is similar to the optimization of \( X \).
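A simplified sketch of the alternating scheme follows, updating only the coefficients \( X \) and \( Y \) with the dictionaries held fixed; the full method also updates \( U,A \) and \( V,B \) class by class as in [16], which is omitted here. Since each step solves its subproblem exactly, the objective cannot increase:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, p, q, c, t = 8, 10, 4, 3, 3, 2  # hypothetical sizes
alpha, beta, lam1, lam2 = 2.3, 1.2, 0.005, 0.005

F = rng.normal(size=(d, n))
M = np.tile(F.mean(axis=1, keepdims=True), (1, n))
U = rng.normal(size=(d, p)); V = rng.normal(size=(d, q))
A = rng.normal(size=(c, p)); B = rng.normal(size=(t, q))
H1 = np.zeros((c, n)); H1[rng.integers(0, c, n), np.arange(n)] = 1.0
H2 = np.zeros((t, n)); H2[rng.integers(0, t, n), np.arange(n)] = 1.0
X = np.zeros((p, n)); Y = np.zeros((q, n))

def ridge(D, R, lam):
    """Closed-form ridge solve, as in Eq. (6)."""
    return np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ R)

def obj():
    """Objective of formula (3)."""
    return (np.linalg.norm(F - M - U @ X - V @ Y, 'fro') ** 2
            + alpha * np.linalg.norm(H1 - A @ X, 'fro') ** 2
            + beta * np.linalg.norm(H2 - B @ Y, 'fro') ** 2
            + lam1 * np.linalg.norm(X, 'fro') ** 2
            + lam2 * np.linalg.norm(Y, 'fro') ** 2)

values = [obj()]
for _ in range(5):
    # Step 3: update X with everything else fixed (Eq. (6)).
    X = ridge(np.vstack([U, np.sqrt(alpha) * A]),
              np.vstack([F - M - V @ Y, np.sqrt(alpha) * H1]), lam1)
    # Step 4: update Y analogously with X fixed.
    Y = ridge(np.vstack([V, np.sqrt(beta) * B]),
              np.vstack([F - M - U @ X, np.sqrt(beta) * H2]), lam2)
    values.append(obj())

# Exact coordinate minimization: the objective is non-increasing.
assert all(b <= a + 1e-9 for a, b in zip(values, values[1:]))
```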

2.5 Face Matching

In testing, with the two learned dictionaries \( U,V \), any unseen face \( f \in R^{d} \) can be encoded as follows (assuming \( \lambda_{1} = \lambda_{2} \)):

$$ J_{{x^{*} ,y^{*} }} = { \arg }\,{ \hbox{min} }_{x,y} \left\| {f - m - \left[ {\begin{array}{*{20}c} U & V \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} x \\ y \\ \end{array} } \right]} \right\|_{2}^{2} + \lambda_{1} \left\| {\left[ {\begin{array}{*{20}c} x \\ y \\ \end{array} } \right]} \right\|_{2}^{2} . $$
(7)

The coding coefficient \( x \) can be directly used as an age-invariant feature descriptor for face recognition, and the coding coefficient \( y \) can be used for age estimation. Cosine distance with the Nearest Neighbor (NN) classifier is used for both tasks.
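Test-time coding (Eq. (7)) followed by cosine-distance nearest-neighbor matching can be sketched as below. The dictionaries, gallery, and probe are synthetic, and `encode` and `cosine_nn` are hypothetical helpers:

```python
import numpy as np

rng = np.random.default_rng(4)
d, p, q = 8, 4, 3   # hypothetical sizes
lam = 0.005
U = rng.normal(size=(d, p))  # learned identity dictionary (stand-in)
V = rng.normal(size=(d, q))  # learned age dictionary (stand-in)
m = rng.normal(size=(d,))    # mean face

def encode(f):
    """Solve Eq. (7): ridge coding of a face over the joint dictionary [U V]."""
    D = np.hstack([U, V])
    z = np.linalg.solve(D.T @ D + lam * np.eye(p + q), D.T @ (f - m))
    return z[:p], z[p:]   # identity code x, age code y

def cosine_nn(x, gallery_codes):
    """Index of the nearest neighbour under cosine distance."""
    sims = [x @ g / (np.linalg.norm(x) * np.linalg.norm(g) + 1e-12)
            for g in gallery_codes]
    return int(np.argmax(sims))

# Hypothetical gallery of 3 subjects; for this toy check the probe
# is subject 1 itself, so its identity code matches gallery entry 1.
gallery = [rng.normal(size=(d,)) for _ in range(3)]
gallery_x = [encode(g)[0] for g in gallery]
probe = gallery[1]
x_probe, y_probe = encode(probe)
match = cosine_nn(x_probe, gallery_x)
assert match == 1
```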

3 Experiment

3.1 Implementation Details

Following the settings in [12], the HOG feature is used as the feature descriptor. Feature slicing and PCA are used for dimension reduction. The PCA feature dimension \( d \), the number of slices \( p \), the label parameters \( \alpha ,\beta \) and the regularization parameters \( \lambda_{1} ,\lambda_{2} \) are set as follows: \( d = 900 \), \( p = 6 \), \( (\alpha ,\beta ,\lambda_{1} ,\lambda_{2} ) = (2.3, 1.2, 0.005, 0.005) \) for the MORPH database, and \( (\alpha ,\beta ,\lambda_{1} ,\lambda_{2} ) = (2.3, 0.01, 0.002, 0.002) \) for the FGNET database.
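The PCA stage of this pipeline can be sketched in NumPy as follows. HOG extraction itself is assumed to have been done beforehand (e.g. with an off-the-shelf implementation); the matrix `H` here is random stand-in data and all sizes are toy values rather than the paper's \( d = 900 \):

```python
import numpy as np

rng = np.random.default_rng(5)
n, raw_dim, d = 20, 50, 10  # toy sizes; the paper uses d = 900

# Hypothetical HOG descriptors stacked as columns (one per face image).
H = rng.normal(size=(raw_dim, n))

# PCA dimension reduction to d components.
mean = H.mean(axis=1, keepdims=True)
Hc = H - mean                            # center the data
# SVD of the centered data yields the principal directions.
Up, S, _ = np.linalg.svd(Hc, full_matrices=False)
W = Up[:, :d]                            # projection onto top-d components
F = W.T @ Hc                             # d x n reduced features for the joint model

assert F.shape == (d, n)
```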

3.2 Experiment on the FGNET Database

The FGNET database contains 1002 face images of 82 different individuals. Each individual has 13 images on average, collected at ages ranging from 0 to 69. To train the joint model, the face images in the FGNET database are divided into 9 age groups with a 5-year gap.
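One plausible mapping from a raw age to its group index, consistent with the 5-year gap described above, is sketched below. The exact group boundaries used in the paper are not specified, so the clipping of older ages into the final group is an assumption:

```python
def age_group(age, gap=5, num_groups=9):
    """Map an age to its group index with a fixed 5-year gap;
    ages beyond the last boundary fall into the final group (assumed)."""
    return min(age // gap, num_groups - 1)

assert age_group(0) == 0
assert age_group(4) == 0
assert age_group(23) == 4
assert age_group(69) == 8   # clipped into the last (9th) group
```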

Following the training and testing split scheme used in [5, 11], the leave-one-person-out (LOPO) scheme is used for validation. Table 1 reports the rank-1 recognition rates of these methods. It shows that our method achieves an accuracy higher than the best published result (76.2%). Moreover, the proposed model outperforms the HOG feature baseline by a clear margin.

Table 1 Comparison of age invariant face recognition methods on FGNET database

3.3 Experiment on the MORPH Database

The MORPH Album2 database contains 55,134 facial images of 13,618 subjects. Following the experimental settings in [17], we randomly select 10,000 subjects from MORPH Album2. For each subject, the image at the youngest age is selected for the gallery set and the image at the oldest age is used for the probe set. Another 1000 subjects are then randomly selected from the dataset as the training set.

The experimental results are shown in Table 2. Note that the age span of a subject in the MORPH dataset is relatively small, while the face images contain more expression and pose variations. Nevertheless, our approach still achieves good results on both tasks. Some failure cases of age estimation are shown in Fig. 1, and some failure cases of face matching are shown in Fig. 2.

Table 2 Performance of different methods on the MORPH database
Fig. 1

Some failed age estimation results in MORPH Album2

Fig. 2

Some failed retrieval results in MORPH Album2. The first row shows the probe images, the second row shows the incorrect rank-1 matches returned by the proposed model, and the third row shows the correct matches in the gallery

4 Conclusions

In this paper, we develop a joint model to tackle age-invariant face recognition and age estimation at the same time. The results of experiments on the MORPH and FGNET databases indicate the effectiveness of the proposed method.