
1 Introduction

Age estimation, which aims to predict a person's age from his or her facial image, is a challenging and active research topic. It has many potential applications, including demographic statistics collection, commercial user management, and video security surveillance. However, numerous internal and external factors affect the estimation results, including race, illumination, and image quality. Besides, facial images of the same person at adjacent ages, especially for adults, usually look similar, resulting in label ambiguity.

Fig. 1. The motivation of the proposed method. In each subfigure, the age probability distribution in the lower part corresponds to the middle image in the upper part. The images above the dotted line belong to the same person, and so do the images below the dotted line. On the one hand, by comparing (a) with (b) or (c) with (d), we can see that the facial appearance variation between adjacent ages of the same person varies at different ages; correspondingly, the variance of the age probability distribution should differ across ages. On the other hand, by comparing (b) with (c), we can see that even at the same age, the aging process differs between persons, so the variance also varies across persons.

Recently, several deep learning methods have been proposed to improve the performance of facial age estimation. The most common methods model facial age prediction as a classification or a regression problem. Classification-based methods treat each age as an independent class, which ignores the adjacency relationship between classes. Considering the continuity of age, regression methods predict the age directly from the extracted features. However, as shown in previous work [31, 33], regression methods suffer from overfitting, caused by the randomness of the human aging process and the ambiguous mapping between facial appearance and actual age. In addition, ranking-based methods have been proposed to achieve more accurate age estimation; these approaches exploit ordinal information and employ multiple binary classifiers to determine the final age of the input image. Furthermore, Geng et al. [8, 13] propose the label distribution learning (LDL) method, which assumes that the real age can be represented by a discrete distribution. As their experiments show, using the Kullback-Leibler (K-L) divergence to measure the similarity between the predicted and ground-truth distributions helps improve age estimation.

For label distribution learning methods, the mean of the distribution is the ground-truth age. However, the variance of the distribution is usually unknown for a face image. Previous methods often treat the variance as a hyper-parameter and simply set it to a fixed value for all images. We argue that these methods are suboptimal because the variance is highly related to the correlation between adjacent ages and should vary across different ages and different persons, as illustrated in Fig. 1. The assumption that all images share the same variance potentially degrades the model performance.

To tackle the above issues, in this paper we propose a novel adaptive variance based distribution learning (AVDL) method for age estimation. Specifically, we introduce meta-learning, which uses a validation set as the meta-objective and is applicable to online hyper-parameter adaptation [28], to model sample-specific variance and thus better approximate the true age probability distribution. As Fig. 2 shows, we first select a small validation set. In each iteration, with a perturbing variable added to the variance, we use the K-L loss as the training loss to update the training model parameters. We then share the updated parameters with the validation model and compute an L1 loss between the predicted expectation age and the ground truth on the validation set as the meta-objective. Guided by this meta-objective, the perturbing variable is updated by gradient descent and adaptively finds a proper variance with which the model performs better on the validation set. The main contributions of this work can be summarized as follows:

  • We propose a novel adaptive variance based distribution learning (AVDL) method for facial age estimation. AVDL can effectively model the correlation between adjacent ages and better approximate the age label distribution.

  • Unlike existing deep models, which assume the variance is the same across ages and identities, we introduce a data-driven meta-learning method to learn the sample-specific variance. To our knowledge, ours is the first deep model that uses meta-learning to adaptively learn different variances for different samples.

  • Extensive experiments on the FG-NET and MORPH II datasets show the superiority of our proposed approach over existing state-of-the-art methods.

2 Related Work

2.1 Facial Age Estimation

In recent years, with the rapid development of convolutional neural networks (CNNs) in computer vision tasks such as facial landmark detection [23], face recognition [3, 38], pedestrian attribute recognition [35] and semantic segmentation [45, 46], deep learning methods have also improved the performance of age estimation. Here we briefly review some representative works in the facial age estimation field. The DEX method [30] regarded facial age estimation as a classification problem and predicted ages as the expectation of ages weighted by the classification probabilities. Tan et al. [33] proposed an age group classification method called age group-n-encoding. However, these classification methods ignore the adjacency relationship between classes or groups. To overcome this, Niu et al. [24] proposed a multiple-output CNN learning algorithm that takes into account the ordinal information of ages. Shen et al. [32] proposed Deep Regression Forests, extending differentiable decision trees to regression. Furthermore, Li et al. [22] proposed BridgeNet, which consists of local regressors and gating networks, to effectively explore the continuous relationship between age labels. Tan et al. [34] proposed a complex Deep Hybrid-Aligned Architecture (DHAA) that consists of global, local and global-local branches optimized jointly with complementary information. Besides, Xie et al. [39] proposed two ensemble learning methods, both of which utilize ordinal regression modeling for age estimation.

2.2 Distribution Learning

Distribution learning is a learning paradigm proposed to address label ambiguity [10], and it has been utilized in a number of recognition tasks, such as head pose estimation [8, 12] and age estimation [20, 41]. Geng et al. [11, 13] proposed two adaptive label distribution learning (ALDL) algorithms, IIS-ALDL and BFGS-ALDL, which iteratively learn the estimation function parameters and the label distribution variance. Although ALDL also learns an adaptive variance, our proposed method differs in three ways. Firstly, ALDL relies on traditional optimization methods such as BFGS, while ours uses deep CNNs. Secondly, ALDL selects better samples in the current training iteration to estimate the new variance, while our method uses meta-learning to obtain the adaptive variance. Thirdly, ALDL updates the variance only from the training samples, which may cause overfitting, whereas our adaptive variance is supervised by a validation set and is therefore more general. Label distribution learning has also been used to remedy the shortage of training data with exact ages. Hou et al. [20] proposed a semi-supervised adaptive label distribution learning method that uses unlabeled data to enhance the label distribution adaptation and find a proper variance for each age. However, aging tendencies vary, and the variances of different people at the same age can differ. Gao et al. [9] jointly used LDL and expectation regression to alleviate the inconsistency between training and testing. Moreover, Pan et al. [25] proposed a mean-variance loss for robust age estimation. Li et al. [21] proposed a label distribution refinery to adaptively estimate age distributions without assumptions about the form of the label distribution, which barely takes into account the correlation between adjacent ages. In contrast, our method uses a Gaussian label distribution with an adaptively meta-learned variance, which pays more attention to neighboring ages and ordinal information.

2.3 Meta-learning

Our proposed AVDL is an instantiation of meta-learning [1, 36], i.e., learning to learn. According to the type of leveraged meta data, this concept can be classified into several types [37], including transferring knowledge from empirically similar tasks, transferring trained model parameters between tasks, building meta-models to learn data characteristics, and learning purely from model evaluations. Model-Agnostic Meta-Learning (MAML) [7] learns a model parameter initialization that performs well on target tasks; with the guidance of meta information, MAML takes one gradient descent step on the meta-objective to update the model parameters [16]. The idea of using the validation loss as the meta-objective has been applied in few-shot learning [27]. Building on this, Ren et al. [28] proposed a reweighting method (L2RW) guided by a validation set, which addresses training sets that contain both data imbalance and label noise. The crucial requirement of L2RW is a small, unbiased, clean validation set, which serves as the supervisor for learning sample weights. Since validation performance measures the quality of hyper-parameters, taking it as the meta-objective applies not only to sample reweighting but also to any other online hyper-parameter adaptation task. Inspired by this, we propose AVDL, which combines validation set based meta-learning with label distribution learning to adaptively learn the label variance.

3 Methodology

In this section, we first describe the label distribution learning (LDL) method for age estimation. Then we introduce our adaptive variance based distribution learning (AVDL) method built on a meta-learning framework.

3.1 The Label Distribution Learning Problem Revisited

Let X denote an input image with ground-truth label y, \(y \in \{0,1,...,100\}\). The model is trained to predict a value as close to the ground-truth label as possible. For traditional age estimation methods, the ground truth is an integer, while in LDL, to express the ambiguity of labels, Gao et al. [8] transform the real value y into a normal distribution \(\mathbf{p} (y, \sigma )\) that serves as the new ground truth. The mean is set to the ground-truth label y and \(\sigma \) is the variance of the normal distribution. Here we use boldface lowercase letters like \(\mathbf{p} (y, \sigma )\) to denote vectors, and \(p_k(y, \sigma )\) (\( k \in [0, 100] \)) to represent the k-th element of \(\mathbf{p} (y, \sigma )\):

$$\begin{aligned} {p_k}(y, \sigma ) = \frac{1}{{\sqrt{2\pi } \sigma }}\exp ( - \frac{{{{(k - {y})}^2}}}{{2{\sigma ^2}}}) \end{aligned}$$
(1)

where \(p_k\) is the probability that the true age is k years old. It captures the connection between the class k and the label y from a normal distribution view.
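To make this concrete, below is a minimal sketch of how the ground-truth distribution of Eq. (1) could be constructed in PyTorch (the framework used in our implementation). The function name and the choice to renormalize the discretized density over the 101 age classes are illustrative assumptions; renormalization makes the vector a valid probability distribution, with the \(1/\sqrt{2\pi }\sigma \) factor cancelling out.

```python
import torch

def gaussian_label_distribution(y, sigma, num_classes=101):
    """Discretized Gaussian ground-truth distribution of Eq. (1),
    renormalized so that the 101 class probabilities sum to one."""
    k = torch.arange(num_classes, dtype=torch.float32)   # candidate ages 0..100
    p = torch.exp(-(k - y) ** 2 / (2.0 * sigma ** 2))    # unnormalized Gaussian density
    return p / p.sum()

# e.g. the label distribution for a 35-year-old face with variance 2.0
p = gaussian_label_distribution(35.0, 2.0)
```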

In the training process, let \(G(\cdot ,\theta )\) denote the classification function of the estimation model, where \(\theta \) represents the model parameters; \(\mathbf{z} _{}(X, \theta ) = G(X, \theta )\) maps the input X to the classification vector \(\mathbf{z} _{}(X, \theta )\). A softmax function is used to transform \(\mathbf{z} _{}(X, \theta )\) into a probability distribution \(\hat{\mathbf{p }}(X, \theta )\), whose k-th element is:

$$\begin{aligned} {{\hat{p}_k}(X,\theta )} = \frac{{\exp ({z_k}_{}(X, \theta ))}}{{\sum \nolimits _n {\exp ({z_n}(X, \theta ))} }} \end{aligned}$$
(2)

where \(z_k(X, \theta )\) is the k-th element of \(\mathbf{z} (X, \theta )\).

LDL tries to make the predicted softmax probability distribution as similar to the ground-truth distribution as possible, so the Kullback-Leibler (K-L) divergence is employed to measure the difference between the predicted and ground-truth distributions [8]:

$$\begin{aligned} {L_{KL}(X,y,\theta ,\sigma )} = \sum \limits _k {{p_k(y,\sigma )}\ln \frac{{{p_k(y,\sigma )}}}{{{{\hat{p}}_k(X,\theta )}}}} \end{aligned}$$
(3)

The K-L loss is then used to update the model parameters with an SGD optimizer.
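A minimal sketch of this training loss in PyTorch is given below. Since \(\sum _k p_k \ln p_k\) does not depend on \(\theta \), only the cross-entropy term of the K-L divergence contributes gradients, which the sketch exploits; the commented-out training step and all identifiers are illustrative assumptions rather than our exact implementation.

```python
import torch.nn.functional as F

def kl_loss(logits, target_dist):
    """K-L divergence of Eq. (3), up to the constant entropy of the target.

    log_softmax gives the log of the predicted distribution of Eq. (2);
    the entropy term sum_k p_k * log p_k is constant w.r.t. theta,
    so minimizing this cross-entropy is equivalent to minimizing Eq. (3).
    """
    log_p_hat = F.log_softmax(logits, dim=-1)
    return -(target_dist * log_p_hat).sum(dim=-1).mean()

# one ordinary LDL training step (no meta update yet):
# logits = model(images)                 # shape (batch, 101)
# loss = kl_loss(logits, target_dists)   # targets built as in Eq. (1)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```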

The LDL method aims to construct a normal distribution around the ground truth to approximate the real distribution, and the key to this construction is the variance \(\sigma \). In most LDL methods, this hyper-parameter is simply set to a fixed value, 2.0 in most cases. In fact, however, the variances of different people, or of people at different ages, cannot all be the same. We therefore propose a method to search for a proper variance for each image.

3.2 Adaptive Distribution Learning Based on Meta-learning

In machine learning, the loss on a validation set is one of the guides for adjusting hyper-parameters toward better generalization; a clean, unbiased validation set can therefore help train a more general model. However, in the traditional training pipeline, hyper-parameters are usually tuned manually. Inspired by the meta-learning work in [28], we propose the adaptive variance based distribution learning (AVDL) algorithm guided by a validation set, which offers an effective strategy for learning the sample-specific variance.

As mentioned in Sect. 3.1, the most important hyper-parameter of LDL is the variance \(\sigma \). Because our goal is to search for a proper \(\sigma \) for each image during training, in this section we use \(\sigma \) to denote the variance vector of a batch of training data. The optimal \(\sigma \) in each iteration depends on the optimal model parameter \(\theta \):

$$\begin{aligned} {\theta ^*(\sigma )} = \arg \min \limits _{\theta } {{L_{KL}}(X_{tr},y_{tr},\theta ,\sigma )} \end{aligned}$$
(4)
$$\begin{aligned} {\sigma ^*} = \arg \min \limits _{\sigma ,\sigma {\ge } 0} {L_1(X_{val},y_{val},\theta ^*,\sigma )} \end{aligned}$$
(5)

where \(L_1(X_{val},y_{val},\theta ^*,\sigma )\) denotes the validation loss, \(X_{tr}\) is the training input image with label \(y_{tr}\), and \(X_{val}\) is the validation input image with label \(y_{val}\). To solve this optimization problem, we divide each training iteration into several processes. Figure 2 shows the computation graph of our proposed method.

Fig. 2. Computation graph of AVDL in one iteration. The ground truth of each input image is transformed into a normal distribution. The model on top is for training and the other is for validation; the two models share the network architecture and parameters. The training loss is the K-L loss, while the validation loss is the L1 loss. Processes 1, 2 and 3 are the traditional training steps: the perturbing variable \(\xi \) is added to the initial distribution variance to get the variance \(\sigma \), and by taking the training gradient descent step \(-\bigtriangledown \theta \), the training model parameter \(\theta \) is updated to \(\theta '\) and assigned to the validation model. Process 4 uses the gradient of the validation loss with respect to \(\xi \) to obtain the modified \(\xi '\) and \(\sigma '\). Processes 5 and 6 show the improved forward and backward computation with the proper variance \(\sigma '\).

We choose a fixed number of correctly labeled images from each class in the training set of n images to form a small unbiased validation set of m images, \(m \ll n\). We use \(\sigma _i\) to denote the variance of the i-th image and initialize the variances of all images to a fixed value \({\sigma _i}_0\). To search for a proper variance, we perturb each \(\sigma _i\) by \(\xi _i\):

$$\begin{aligned} {\sigma _i} = {{\sigma _i}_0} + {\xi _i} \end{aligned}$$
(6)

where \(\xi _i\) is the i-th component of the perturbing vector \(\xi \), which is initialized to 0. Clearly, searching for a proper \(\sigma \) is equivalent to searching for a proper \(\xi \).
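Before turning to the update steps, we note that the class-balanced validation set described above could be assembled with a few lines of Python. The sketch below is an assumption about the sampling procedure (the text only specifies a fixed number of correctly labeled images per class); three images per class matches the ablation in Sect. 4.5.

```python
import random
from collections import defaultdict

def build_validation_indices(labels, per_class=3):
    """Pick a fixed number of training images per age class for the meta validation set."""
    by_age = defaultdict(list)
    for idx, age in enumerate(labels):      # labels: list of integer ages
        by_age[age].append(idx)
    return [i for idxs in by_age.values()
            for i in random.sample(idxs, min(per_class, len(idxs)))]
```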

First, as processes 1, 2 and 3 in Fig. 2 show, in the t-th iteration the training batch computes the K-L loss described in Sect. 3.1 with the perturbed \(\sigma \), and the model parameter \(\theta _t\) is updated with SGD to obtain \(\hat{\theta }_{t+1}\):

$$\begin{aligned} \hat{\theta }_{t+1} = \theta _t - \alpha \bigtriangledown _{\theta } L_{KL}(X_{tr},y_{tr},\theta _t,\sigma ) \end{aligned}$$
(7)

where \(\alpha \) is the descent step size.

The training loss operates on distributions. To compensate for the lack of constraint on the final predicted age value, we adopt an L1 loss on the validation set that measures the distance between the expectation age of the prediction and the validation ground truth [9]:

$$\begin{aligned} {L_1(X_{val},y_{val},\hat{\theta }_{t+1},\xi )} = \left| {{{\hat{y}}^*}(X_{val},\hat{\theta }_{t+1},\xi ) - {y_{val}}} \right| \end{aligned}$$
(8)
$$\begin{aligned} {{\hat{y}}^ * }(X_{val},\hat{\theta }_{t+1},\xi ) = \sum \nolimits _k {{\hat{p}_k}(X_{val},\hat{\theta }_{t+1},\xi ){l_k}} \end{aligned}$$
(9)

where \(\hat{p}_k\) is the k-th element of the prediction vector for the validation input \(X_{val}\) and \(l_k\) denotes the age value of the k-th class, i.e. \(l_k\) \(\in \) \(\mathcal {Y}\). The same expectation age computation is also used for estimating test images in Sect. 4.
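A minimal sketch of the expectation age of Eq. (9) and the validation L1 loss of Eq. (8) in PyTorch could look as follows; the helper names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def expected_age(logits):
    """Expectation age of Eq. (9): sum_k p_hat_k * l_k, with l_k = 0..100."""
    ages = torch.arange(logits.size(-1), dtype=logits.dtype, device=logits.device)
    p_hat = F.softmax(logits, dim=-1)
    return (p_hat * ages).sum(dim=-1)

def l1_validation_loss(logits, y_val):
    """Validation meta-objective of Eq. (8)."""
    return (expected_age(logits) - y_val.float()).abs().mean()
```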

Algorithm 1. Step-by-step pseudo code of AVDL

A better hyper-parameter yields better validation performance. Accordingly, we update the perturbation \(\xi \) with a gradient descent step:

$$\begin{aligned} \hat{\xi } = \xi -\beta \bigtriangledown _{\xi }L_1(X_{val},y_{val},\hat{\theta }_{t+1},\xi ) \end{aligned}$$
(10)

where \(\beta \) is the descent step size. This step corresponds to process 4 in Fig. 2. Due to the non-negativity restriction on \(\sigma \), we normalize \(\xi \) into the range [−1, 1] using the mapping \(\xi _i \rightarrow \frac{2\xi _i - \max (\xi ) - \min (\xi )}{\max (\xi ) - \min (\xi )}\), and then update the variance \(\sigma \) according to Eq. (6). In the third step of training, with the modified variance, we compute the K-L loss of the training input in a forward pass and update the model parameters with the SGD optimizer, as processes 5 and 6 in Fig. 2 show.

We list step-by-step pseudo code in Algorithm 1. According to step 9 of Algorithm 1, computing the gradient with respect to \(\xi \) involves a two-stage derivative through the virtual parameter update; the PyTorch autograd mechanism handles this operation handily.
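To illustrate how the whole iteration fits together, below is a hedged PyTorch sketch of one AVDL step under several stated assumptions: it targets PyTorch 2.x (for torch.func.functional_call), re-initializes \(\xi \) to zero for every batch, adds a small epsilon and a variance floor to keep the Gaussian non-degenerate, and omits device handling for brevity. It is a sketch of the procedure in Fig. 2 and Algorithm 1, not our exact implementation.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch 2.x

def gaussian_targets(y, sigma, num_classes=101):
    """Batched, differentiable version of Eq. (1), renormalized per sample."""
    k = torch.arange(num_classes, dtype=torch.float32)
    p = torch.exp(-(k[None, :] - y[:, None].float()) ** 2
                  / (2.0 * sigma[:, None] ** 2))
    return p / p.sum(dim=-1, keepdim=True)

def kl_loss(logits, targets):
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def avdl_iteration(model, optimizer, x_tr, y_tr, x_val, y_val,
                   sigma0=1.0, alpha=0.01, beta=0.01):
    params = dict(model.named_parameters())

    # Eq. (6): perturb the initial variance; xi re-initialized to 0 (assumption)
    xi = torch.zeros(x_tr.size(0), requires_grad=True)
    sigma = sigma0 + xi

    # Processes 1-3 / Eq. (7): virtual SGD step, keeping the graph w.r.t. xi
    loss_tr = kl_loss(functional_call(model, params, (x_tr,)),
                      gaussian_targets(y_tr, sigma))
    grads = torch.autograd.grad(loss_tr, tuple(params.values()), create_graph=True)
    theta_hat = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}

    # Process 4 / Eqs. (8)-(10): validation L1 loss, second-order gradient w.r.t. xi
    val_logits = functional_call(model, theta_hat, (x_val,))
    ages = torch.arange(val_logits.size(-1), dtype=val_logits.dtype)
    y_hat = (F.softmax(val_logits, dim=-1) * ages).sum(dim=-1)
    xi_grad = torch.autograd.grad((y_hat - y_val.float()).abs().mean(), xi)[0]

    with torch.no_grad():
        xi_new = xi - beta * xi_grad
        # normalize into [-1, 1]; epsilon guards max == min (assumption)
        rng = xi_new.max() - xi_new.min() + 1e-8
        xi_new = (2 * xi_new - xi_new.max() - xi_new.min()) / rng
        # small floor keeps the Gaussian non-degenerate (assumption)
        sigma_new = (sigma0 + xi_new).clamp_min(1e-2)

    # Processes 5-6: the real update of theta_t with the adapted variance
    optimizer.zero_grad()
    loss = kl_loss(model(x_tr), gaussian_targets(y_tr, sigma_new))
    loss.backward()
    optimizer.step()
    return loss.item(), sigma_new
```

Note that the virtually updated parameters theta_hat are discarded after the meta step; only the adapted variance \(\sigma '\) carries over to the real update of \(\theta _t\), mirroring the L2RW scheme [28].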

4 Experiments

In this section, we first introduce the datasets used in the experiments, i.e., MORPH II [29], FG-NET [26] and IMDB-WIKI [31]. Then we detail the experiment settings. Next, we validate the superiority of our approach with comparisons to the state-of-the-art facial age estimation methods. Finally, we conduct some ablation studies on our method.

4.1 Datasets

MORPH II is the most popular dataset for age estimation. It contains 55,134 color facial images of 13,000 individuals whose ages range from 16 to 77. On this dataset, we employ three typical evaluation protocols. Setting I: 80-20 protocol. We randomly divide the dataset into two non-overlapping parts, 80% for training and 20% for testing. Setting II: Partial 80-20 protocol. Following the experimental setting in [33], we extract a subset of 5,493 facial images of people of Caucasian descent and split them into two parts: 80% for training and 20% for testing. Setting III: S1-S2-S3 protocol. Similar to [22, 33], the MORPH II dataset is split into three non-overlapping subsets S1, S2 and S3, and each experiment is run twice: first training on S1 and testing on S2+S3, then training on S2 and testing on S1+S3. We report the MAE of each run as well as their average.

FG-NET contains 1,002 color or gray facial images of 82 individuals whose ages range from 0 to 69. We follow the widely used leave-one-person-out (LOPO) protocol [4, 25] in our experiments and report the average performance over the 82 splits.

IMDB-WIKI is the largest facial image dataset with age labels, consisting of 523,051 images in total. It comprises two parts: IMDB (460,723 images) and WIKI (62,328 images). We follow the practice in [22] and use this dataset to pretrain our model. Specifically, we remove non-face images and some multi-face images; finally, about 270,000 images are retained.

4.2 Implementation Details

We use the detection algorithm in [44] to obtain the face bounding box and five facial landmark coordinates, which are then used to align the input facial image for the network. The input image is resized to 224 \(\times \) 224.

Following the settings in [9], we augment the face images with random horizontal flipping, scaling, rotation and translation during training. For testing, we feed both the image and its flipped version to the network and average their predictions as the final result.
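A minimal sketch of this test-time flip averaging is shown below; whether the averaging happens over probability distributions or over final ages is not specified in the text, so averaging the softmax outputs here is an assumption.

```python
import torch

@torch.no_grad()
def predict_age(model, image):
    """Average a test image and its horizontal flip, then take the expectation age."""
    batch = torch.stack([image, torch.flip(image, dims=[-1])])  # original + mirrored
    probs = torch.softmax(model(batch), dim=-1).mean(dim=0)     # averaged distribution
    ages = torch.arange(probs.numel(), dtype=probs.dtype)
    return (probs * ages).sum()                                 # expectation age, Eq. (9)
```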

We adopt ResNet-18 [19] as the backbone network and pretrain it on the IMDB-WIKI dataset for better initialization. We optimize the network with the SGD optimizer and a batch size of 32; the weight decay and momentum are set to 0.0005 and 0.9, respectively. The initial learning rate is 0.01 and is decayed by a factor of 0.1 every 20 epochs. We set the initial variance of all images to 1 and train the deep convolutional neural network with PyTorch on 4 GTX TITAN X GPUs.

Table 1. The comparisons between the proposed method and other state-of-the-art methods on MORPH II under Setting I. Bold indicates the best (\(^*\)indicates the model was pre-trained on the IMDB-WIKI dataset; \(^\dagger \)indicates the model was pre-trained on the MS-Celeb-1M dataset [17])

4.3 Evaluation Criteria

Following previous works [31, 33], we measure the performance of age estimation with the Mean Absolute Error (MAE), i.e., the average of the absolute errors between the estimated and the chronological ages.
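For completeness, a short sketch of the metric (identifiers are illustrative):

```python
import torch

def mean_absolute_error(pred_ages, true_ages):
    """MAE over a test set: mean |estimated age - chronological age|."""
    return (pred_ages.float() - true_ages.float()).abs().mean().item()
```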

4.4 Comparisons with State-of-the-Arts

On MORPH II. We first compare the proposed method with other state-of-the-art methods on the MORPH II dataset under Setting I, as shown in Table 1. We achieve the second-best performance, behind DHAA [34] by only 0.03. It is worth noting that DHAA is a large and complex model with around 10 times more parameters than ours, although it uses no additional face dataset for pre-training. Moreover, using the same pre-training dataset, we surpass M-V Loss by a significant margin of 0.22.

Table 2 shows the test results under Setting II. We achieve the best performance, surpassing BridgeNet [22] by a slight margin of 0.01, while using fewer parameters; in other words, we reach comparable performance with significantly lower model complexity. As Table 3 shows, we achieve an MAE of 2.53 under Setting III, performing much better than the current state-of-the-art. All of the above comparisons consistently demonstrate the effectiveness of the proposed method.

Table 2. The comparisons between the proposed method and other state-of-the-art methods on MORPH II dataset (Setting II) and FG-NET dataset. Bold indicates the best (\(^*\)indicates the model was pre-trained on the IMDB-WIKI dataset)

On FG-NET. As shown in Table 2, we compare our model with state-of-the-art models on FG-NET. Our method achieves the lowest MAE of 2.32, improving on the state-of-the-art by a large margin of 0.24. These results show that our method is effective even when only a few training images are available.

Table 3. The comparisons between the proposed method and other state-of-the-art methods on MORPH II under Setting III. Bold indicates the best (\(^*\)indicates the model was pre-trained on the IMDB-WIKI dataset)

4.5 Ablation Study

In this subsection, we conduct ablation studies on the MORPH II dataset under Setting I.

The Superiority of Adaptive Variance over a Fixed Variance Value. We train a set of baseline models that all adopt ResNet-18 and the K-L divergence loss but use different fixed variance values. Theoretically, a larger variance yields a smoother distribution, implying a stronger correlation within that age group, while a smaller variance yields a sharper distribution and a weaker correlation. If the variance is set too high, i.e., the label distribution is too smooth, age estimation may not perform well; as Fig. 3 shows, the MAE increases as the variance grows beyond 3, indicating worse performance. When the variance is reduced to 0, the model assumes no correlation between ages, similar to treating age estimation as a pure classification problem. However, considering the gradual change of the face during aging, properly exploiting the age correlation can help age estimation. As illustrated in Fig. 3, when the fixed variance is less than 3, the MAE fluctuates, which validates that a fixed variance is suboptimal: the age correlation cannot be the same for different people at different ages. The best baseline performance is achieved with a variance of 3, yet it is still much worse than our proposed AVDL. Fig. 3 also shows the performance of ResNet-18 trained with the cross-entropy loss, our baseline that treats age estimation as a classification task. In summary, Fig. 3 demonstrates the superiority of the adaptive variance. For each dataset and experiment setting, our approach is compared with the fixed-variance baseline. Since Fig. 3 shows that varying the fixed variance within a certain range has little impact on performance, and due to limited time, we only search for the best variance on MORPH II (Setting I) and apply it to the other experiments. In addition, the fixed-variance baseline MAEs on MORPH II (Setting II), MORPH II (Setting III) and FG-NET are 2.66, 2.79 and 2.64, respectively.

Fig. 3. The MAE results on MORPH II under Setting I. The blue line denotes the results of the baseline model trained with different fixed variances. The red line is the result of the baseline model trained with the cross-entropy loss, and the green one is the result of AVDL.

Table 4. Performance comparison when selecting different numbers of facial images per age to form the validation set

The Influence of the Number of Samples in the Validation Set. As [28] shows, a balanced meta dataset provides balanced class knowledge, so we likewise choose an unbiased validation set as the meta dataset. To study its composition, we try different validation set sizes, randomly selecting 1, 2 or 3 images from each class in the training set. From Table 4, we can see that the larger the validation set, the better the model performs. However, since the whole validation set is used in each iteration, more time and memory are required as its size increases. Considering the time and space costs, for each dataset setting we randomly choose three images from each class to form the validation set.

4.6 Visualization and Discussion

To provide intuition and credibility, we display some visual results of AVDL for age estimation and variance adaptation.

We use the learned variances of samples to show the effectiveness of AVDL and to justify our motivation. Under Setting I on MORPH II, each age from 16 to 60 has a group of face images belonging to different identities, but no single person's images cover the full age range. We therefore select images of several persons at different ages, together with their adapted variances, in Fig. 4(a). As [11] mentions, the age variances of younger or older people tend to be smaller than those of middle-aged people, and the variances also differ between people in the same age group. Besides, Fig. 4(b) visualizes the adjusted variances in a mini-batch. The initial variance of each sample, as stated in Sect. 4.2, is set to 1. The learned variances are shown in the upper horizontal band, in which each block represents a sample and its color indicates the magnitude of the variance; the blocks are arranged from left to right by the ages of the samples. The band below is the legend mapping block color to variance magnitude. Consistent with Fig. 4(a), the variances at young and old ages are smaller. Moreover, the variances within the band fluctuate slightly, demonstrating that the variance differs across people.

Fig. 4. Examples of age estimation results by AVDL. (a) shows some samples at different ages on MORPH II with their adapted variances; the Gaussian curves indicate that the variances tend to be smaller for younger and older people and larger for middle-aged people. (b) uses a heat map to visualize the adaptively learned variances \(\sigma \) corresponding to different ages.

5 Conclusions

In this paper, we propose a novel method for age estimation, named adaptive variance based distribution learning (AVDL). AVDL introduces meta-learning to adaptively adjust the variance for each image within a single iteration, and it achieves better performance than existing methods on multiple age estimation datasets. Our experiments also show that AVDL can guide the variance toward the real facial aging patterns. The idea of using meta-learning to guide key hyper-parameters is inspiring, and we will explore more of its possibilities.