1 Introduction

Age estimation has many real-world applications including social robotic interaction, biometrics, demographics, business intelligence, online advertising, item recommendation, identity verification, video surveillance, access control, human-computer interaction, privacy and security, crowd behavior, law enforcement, and many more [1,2,3]. Single facial image age prediction is highly challenging [4,5,6,7], due to the variability in how individuals age based on their “ageotype” [8]. Everyone ages at different rates and biological age is influenced by genetics, diet, exercise, stress and environment. Moreover, visual cues about an individual’s chronological age can vary due to pose, lighting, gender, scale, cosmetics, accoutrements, race, height, weight, health, emotion, occlusion, etc. [1, 3, 9, 10]. Facial age feature embeddings can also be used to improve face recognition [11] and distinguish between real and synthetic (Deepfake) faces [12].

In the field of facial age estimation, large and reliably annotated datasets are scarce because ground-truth ages are difficult to establish. The LAP 2016 dataset [13] is reliable but contains only 7,591 images. Large datasets such as IMDB-Wiki [14], CACD [15], and UTK [16] are annotated with age information obtained by crawling the web and social networks, so their reliability is not guaranteed. Some datasets for face aging prediction (prediction of a person’s appearance at a younger or older age) lack diversity because many pictures show the same individual at different times; for example, the FG-NET [17] dataset contains 1,002 images of only 82 people. MORPH Album 2 [18] is another longitudinal dataset that contains 55,134 images of 13,618 subjects, but with a limited age range of 16 to 77. The CAIP Guess the Age (GTA) Contest [19] uses the VGG-Face2 MIVIA Age Dataset [2], which consists of 575,073 images of more than 9,000 identities collected at different ages. The images are extracted from VGGFace2 [20] and annotated with the person’s age by means of a knowledge distillation technique [2]. The VGG-Face2 MIVIA Age Dataset is the most accurate facial age dataset currently available at this scale in terms of sample size and heterogeneity. Despite the lack of precise age data, several machine learning and data-driven age estimation models have emerged [1]. DLDL-v2 (ThinAgeNet) [21] currently stands as the state of the art on the MORPH Album 2 and ChaLearn 2015 and 2016 [22] datasets.

The Guess the Age (GTA) Contest considers the biometric task of estimating a person’s age using only their facial image as input [1, 2]. Although the VGG-Face2 MIVIA Age Dataset contains over 575K age-labeled images covering gender, ethnicity, and varying pose, scale and illumination, it has a high degree of age class imbalance. Of the eight age categories, the two youngest (ages 1 to <20) and the two oldest (ages \(\ge \)60) together constitute less than 10% of the data; the youngest and oldest age categories make up less than 1%.

In this paper, we propose a novel age estimation approach that uses a two-layer classification-plus-regression random forest trained on deep feature embeddings from the ResNeXt50 architecture [23]. We show that an ensemble of weak decision trees trained on deep features has smaller variance than a pure deep neural model with end-to-end optimization.

Fig. 1. Sample face-alignment using FANet [24] applied to face images from the VGG-Face2 MIVIA Age Dataset. Note that not all faces are warped when they are side profiles or have up-down tilts. Our final age estimates using ResNeXt+TLRF (vs actual) for these subjects from left to right and top to bottom are: 27 (27), 27 (30), 29 (29), 34 (35), 31 (31), 56 (57), 27 (29), 58 (59), 28 (28).

2 Deep Learning Methods for Age Estimation

2.1 Pre-processing

We used z-normalization to normalize the intensity values of the pre-cropped input face images. The VGG-Face2 MIVIA Age Dataset contains images with already cropped single faces, and hence a face detection step was not necessary.

Incorporating a face-alignment step using FANet [24] produced mixed results. FANet was used to estimate 68 facial key-points in the cropped face image. These extracted key-points are matched with a template set of key-points (a standard face pose) to estimate the 2D alignment transformation matrix. We then apply this transformation to warp the original face image and realign the face. Sample results from this step are shown in Fig. 1.
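The key-point matching step can be illustrated with a least-squares fit. The following is a minimal NumPy sketch, assuming a 2D similarity transform (rotation, uniform scale, translation); the text does not state which transform family was used. The names `estimate_similarity` and `apply_transform` and the synthetic key-points are hypothetical, and warping the image pixels themselves would be left to an image library.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares 2x3 similarity transform mapping src points onto dst.

    src, dst: arrays of shape (K, 2), e.g. K = 68 facial key-points.
    Model: x' = a*x - b*y + tx,  y' = b*x + a*y + ty
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    k = src.shape[0]
    # Linear system A @ [a, b, tx, ty] = rhs, two rows per key-point.
    A = np.zeros((2 * k, 4))
    A[0::2, 0] = src[:, 0]
    A[0::2, 1] = -src[:, 1]
    A[0::2, 2] = 1.0
    A[1::2, 0] = src[:, 1]
    A[1::2, 1] = src[:, 0]
    A[1::2, 3] = 1.0
    rhs = dst.reshape(-1)  # interleaved x0', y0', x1', y1', ...
    a, b, tx, ty = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

def apply_transform(M, pts):
    """Apply a 2x3 transform to an array of (K, 2) points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]

# Synthetic check: rotate, scale and shift a random "template", then realign it.
rng = np.random.default_rng(0)
template = rng.uniform(0, 224, size=(68, 2))
theta, scale, shift = np.deg2rad(15.0), 1.2, np.array([5.0, -3.0])
R = scale * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta), np.cos(theta)]])
detected = template @ R.T + shift
M = estimate_similarity(detected, template)
aligned = apply_transform(M, detected)
```

A similarity transform preserves facial proportions, which is one reason it is a common choice for alignment; an affine or projective model would need a different parameterization.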

We evaluated the potential benefit of face-alignment since this can reduce the learning complexity for age estimation when faces are in similar poses. However, we found that face-alignment reduced the diversity in the training dataset which could lead to overfitting, and reduce the performance of deep neural networks. For this reason, we trained the deep neural models on non-aligned faces to ensure better generalizability. Although face-alignment of training images had limited benefit in our initial testing, several approaches are being studied to better incorporate face-alignment as a data augmentation approach to improve performance during inference.

2.2 ResNeXt CNN

Architecture. We use the ResNeXt architecture [23] for extracting feature descriptors due to its advantages over the classical ResNet architecture. ResNet uses residual blocks [25] that make use of sequential convolution layers with an added skip connection. This simple modification led to a breakthrough in performance when compared to classical CNNs (such as VGG [26]). The ResNeXt architectural insight was the notion of cardinality, that many parallel small convolutions are better than a single deep sequence of convolutions with wider kernels. This is done by using parallel convolution streams with fewer channels instead of a single sequential stream with more channels. Using the cardinality property, they experimentally demonstrated an improvement in accuracy on the ImageNet benchmark [27] by simply increasing the cardinality without adding more parameters. This is crucial when dealing with smaller class sizes where over-fitting is more likely.
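The claim that cardinality can be raised without adding parameters can be checked with simple arithmetic. The sketch below counts convolution weights in one bottleneck block, assuming the commonly cited first-stage configuration (256-channel input and output, width 64 for ResNet-50, 32 paths of width 4 for ResNeXt-50); these settings and the function names are illustrative assumptions, and biases and batch normalization are ignored.

```python
def bottleneck_params(in_ch, width, out_ch, kernel=3):
    """Weights in a 1x1 -> kxk -> 1x1 bottleneck (biases and batch norm ignored)."""
    return in_ch * width + kernel * kernel * width * width + width * out_ch

def resnext_block_params(in_ch, cardinality, group_width, out_ch, kernel=3):
    """The same bottleneck split into `cardinality` parallel paths of `group_width` channels."""
    return cardinality * bottleneck_params(in_ch, group_width, out_ch, kernel)

resnet = bottleneck_params(256, 64, 256)          # one path, width 64
resnext = resnext_block_params(256, 32, 4, 256)   # 32 paths, width 4 each
print(resnet, resnext)  # 69632 70144
```

Both blocks use roughly 70k weights, so the extra parallel paths come at essentially no parameter cost.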

Hyper-parameters. We train a single-output regression version of ResNeXt using the Adam optimizer [28], a variant of the stochastic gradient descent algorithm [29], with an initial learning rate \(\alpha =10^{-4}\). The model weights are initialized by transfer learning from weights pre-trained on the ImageNet classification dataset. Additionally, we adopt warm restart scheduling during training using the cosine annealing method [30]. The learning rate is one of the most important hyper-parameters in training neural networks; for this reason, adaptively selecting a learning rate and/or scheduling it is crucial for more robust training [31,32,33].
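As an illustration of cosine annealing with warm restarts, the sketch below computes the learning rate for a given epoch. Only the peak rate \(10^{-4}\) comes from the text; the restart period, the minimum rate, and the period multiplier are hypothetical values chosen for the example.

```python
import math

def cosine_warm_restart_lr(step, lr_max=1e-4, lr_min=0.0, period=10, period_mult=1):
    """Learning rate at integer `step` under cosine annealing with warm restarts.

    period and period_mult (length of the first cycle and its growth factor)
    and lr_min are assumptions; the paper does not report them.
    """
    t_cur, t_i = step, period
    while t_cur >= t_i:       # skip over completed cycles
        t_cur -= t_i
        t_i *= period_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

# Learning rate for the first 25 epochs: decays within a cycle, jumps back at each restart.
schedule = [cosine_warm_restart_lr(s) for s in range(25)]
```

In a deep learning framework the same schedule would normally be supplied by a built-in scheduler rather than computed by hand.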

Loss. We define a new loss function \(\mathcal {L}_{AAR}\) inspired by the Age Accuracy and Regularity (AAR) metric from the GTA contest. For a set of predicted ages \(\mathbf {\hat{y}}\) and real ages \(\mathbf {y}\) of size N, the loss function equation is given as:

$$\begin{aligned} \mathcal {L}_{AAR}(\mathbf {y}, \mathbf {\hat{y}}) = \gamma \mathcal {L}_1(\mathbf {y}, \mathbf {\hat{y}}) + \lambda \sigma \end{aligned}$$
(1)

where:

$$\begin{aligned} \mathcal {L}_1(\mathbf {y}, \mathbf {\hat{y}}) = \frac{1}{N}\sum _{i=1}^N \ell _1(y_i, \hat{y}_i) \end{aligned}$$
(2)

with:

$$\begin{aligned} \ell _1(y, \hat{y}) = {\left\{ \begin{array}{ll} \frac{1}{2\beta } (y - \hat{y})^2 ,&{} \text {if } |y - \hat{y}| < \beta \\ |y - \hat{y}| - \frac{1}{2}\beta , &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

and:

$$\begin{aligned} \sigma = \sqrt{\frac{1}{8} \sum _{j=1}^{8}{[\mathcal {L}_1(y^j, \hat{y}^j) - \mathcal {L}_1(y, \hat{y})]^2}} \end{aligned}$$
(4)

where y is the set of true ages, \(\hat{y}\) is the set of predicted ages, and \(y^j\) and \(\hat{y}^j\) are the true and predicted ages, respectively, that belong to the \(j^{th}\) age group. \(\mathcal {L}_1\) is the mean smooth L1 loss (a smoothed variant of the mean absolute error), and \(\sigma \) is a regularization term that reduces the model’s sensitivity to dataset imbalance. \(\gamma \) and \(\lambda \) are weighting coefficients for the two parts of the loss function. The loss parameters used in this work are \(\gamma = 0.7\), \(\lambda = 0.3\), and \(\beta = 1.0\).

Note that there are two main differences between our loss function and the AAR metric. First, we use the smooth L1 distance \(\ell _1\) as opposed to MAE. Smooth L1 has been shown to be less sensitive to outliers and less prone to exploding gradients [34, 35]. Second, we do not clip the \(\mathcal {L}_1\) and \(\sigma \) components to a maximum value; instead, we give them different weights to emphasize one over the other.
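Equations (1) to (4) can be written out directly. The following NumPy sketch is for illustration only: it is not differentiable, it operates on ages in years rather than the normalized targets used for training, and it assumes eight decade-wide age groups. Skipping groups that are absent from a batch is also an assumption, since Eq. (4) does not say how empty groups are handled.

```python
import numpy as np

def age_group(ages):
    """Map ages to group indices 0..7 (decade-wide boundaries are an assumption)."""
    return np.minimum(np.asarray(ages) // 10, 7).astype(int)

def smooth_l1(y, y_hat, beta=1.0):
    """Element-wise smooth L1 distance, Eq. (3)."""
    diff = np.abs(np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float))
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

def aar_loss(y, y_hat, gamma=0.7, lam=0.3, beta=1.0):
    """Eq. (1): gamma * mean smooth L1 + lambda * spread of per-group losses."""
    per_sample = smooth_l1(y, y_hat, beta)
    overall = per_sample.mean()                       # Eq. (2)
    groups = age_group(y)
    present = np.unique(groups)                       # groups seen in this batch
    group_means = np.array([per_sample[groups == g].mean() for g in present])
    # Eq. (4), averaged over the groups present (equals 1/8 when all eight occur).
    sigma = np.sqrt(np.mean((group_means - overall) ** 2))
    return gamma * overall + lam * sigma

y_true = np.array([5.0, 25.0, 25.0, 65.0])
y_pred = np.array([9.0, 25.5, 24.0, 64.0])
loss = aar_loss(y_true, y_pred)
```

In this toy batch the youngest sample has the largest error, so the \(\sigma \) term adds a penalty beyond what the mean error alone would give.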

Label Distribution Smoothing (LDS). To tackle the challenge of age imbalance in the dataset, the Label Distribution Smoothing (LDS) method was evaluated [36]. LDS convolves a symmetric 1-D Gaussian smoothing kernel k with the label distribution (histogram) p(y) to produce a kernel-smoothed version that interpolates information from data samples with nearby labels. A symmetric kernel is a kernel that satisfies: \(k(y+\varDelta {y}) = k(y-\varDelta {y})\) and \(\nabla _y k(y+\varDelta {y}) + \nabla _y k(y-\varDelta {y}) = 0, \forall y \in Y\). The smoothed label distribution, \(p'(y)\), is a convolution between the distribution p(y) and the kernel k(y):

$$\begin{aligned} p'(y) = k(y) * p(y) \end{aligned}$$
(5)

where \(*\) is the convolution operator. The loss function is then reweighted by scaling each sample’s loss by the inverse of the smoothed label density at that sample’s label:

$$\begin{aligned} \mathcal {L} = \frac{1}{N}\sum _{i=1}^N \frac{\ell (y_i, \hat{y}_i)}{p'(y_i)} . \end{aligned}$$
(6)
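The smoothing and reweighting steps can be sketched in a few lines of NumPy. The kernel size, its standard deviation, and the maximum age below are hypothetical choices, and the sketch assumes the smoothed density is looked up at each sample's ground-truth label. Rescaling the weights to a mean of one is an added convenience, not something stated in the text.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=2.0):
    """Symmetric 1-D Gaussian kernel normalised to sum to one (size must be odd)."""
    half = size // 2
    x = np.arange(-half, half + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def lds_weights(labels, max_age=100, size=5, sigma=2.0):
    """Per-sample loss weights proportional to 1 / p'(y_i)."""
    labels = np.asarray(labels, dtype=int)
    hist = np.bincount(labels, minlength=max_age + 1).astype(float)          # p(y)
    smoothed = np.convolve(hist, gaussian_kernel(size, sigma), mode="same")  # Eq. (5)
    weights = 1.0 / smoothed[labels]
    return weights / weights.mean(), smoothed   # rescaled so the mean weight is 1

# Imbalanced toy labels: many adults, very few children and elderly.
ages = np.array([30] * 50 + [31] * 40 + [5] * 2 + [80] * 1)
weights, smoothed = lds_weights(ages)
```

Samples with rare labels (ages 5 and 80 here) receive much larger weights than samples from the dense region around age 30.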

2.3 Two-Layer Random Forest (TLRF)

ResNeXt takes an RGB input image \(x_i\) and uses a series of convolutional blocks to produce a feature embedding of 2,048 dimensions \(f_i\in \mathbb {R}^{2,048}\). It then uses a single-layer perceptron (fully connected neural regressor) with learned weights to make a final prediction of age, \(\hat{y}_i \in \mathbb {R}\). We replace the neural regressor with our two-layer random forest (TLRF) combining classification-plus-regression to make a final prediction. In the first stage, TLRF uses ResNeXt features as input to the first layer (random forest classifier) to make a classification of the given sample’s age group in the form of a probability vector \(p_i(G) \in \mathbb {R}^8\) where:

$$G\in \{[1, 9], [10, 19], [20, 29], [30, 39], [40, 49], [50, 59], [60, 69], [70, +\infty )\}$$

We concatenate the 8-dimensional predicted probability vector over the eight age groups with the learned 2048-dimensional deep embedding feature vector \(f_i\) to form an augmented vector. We then use it as input to the second layer of our TLRF, a random forest regressor. The final regression output is then rounded up to the nearest integer. A visual diagram of our approach is shown in Fig. 2.

Fig. 2. Proposed ResNeXt+TLRF facial age estimation pipeline using ResNeXt-50 feature embedding vector with ImageNet transfer learning plus VGG-Face2 MIVIA Age training. A dual stage random forest estimates both class labels and age estimates.

Our experiments showed that TLRF improves the performance and stability of ResNeXt. For each layer of TLRF, we utilize a random forest of 100 decision trees trained in parallel on the ResNeXt embedding feature vectors, and each decision tree uses a maximum of 128 randomly selected features.
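The two-layer design can be sketched with scikit-learn, which is an assumed tool here since the text does not name an implementation. Random vectors stand in for ResNeXt embeddings, and a 256-dimensional feature size and 10 trees keep the demo fast (the text uses 2,048 dimensions and 100 trees). Further assumptions: decade-wide age groups, mapping "a maximum of 128 randomly selected features" to `max_features=128`, training the second layer on in-sample first-layer probabilities, and rounding to the nearest integer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def age_group(ages):
    """Map ages to group indices 0..7 (decade-wide boundaries are an assumption)."""
    return np.minimum(np.asarray(ages) // 10, 7).astype(int)

def fit_tlrf(features, ages, n_trees=100, max_features=128, seed=0):
    """Layer 1: age-group classifier. Layer 2: regressor on [features, group probabilities]."""
    clf = RandomForestClassifier(n_estimators=n_trees, max_features=max_features,
                                 n_jobs=-1, random_state=seed)
    clf.fit(features, age_group(ages))
    probs = clf.predict_proba(features)            # shape (N, number of groups seen)
    reg = RandomForestRegressor(n_estimators=n_trees, max_features=max_features,
                                n_jobs=-1, random_state=seed)
    reg.fit(np.hstack([features, probs]), ages)    # augmented input vector
    return clf, reg

def predict_tlrf(clf, reg, features):
    """Predict integer ages (np.rint is one reading of the rounding step in the text)."""
    probs = clf.predict_proba(features)
    return np.rint(reg.predict(np.hstack([features, probs]))).astype(int)

# Synthetic stand-in for deep embeddings; feature 0 carries a weak age signal.
rng = np.random.default_rng(0)
ages = (np.arange(400) % 80 + 1).astype(float)
features = rng.normal(size=(400, 256))
features[:, 0] += ages / 10.0
clf, reg = fit_tlrf(features, ages, n_trees=10)
preds = predict_tlrf(clf, reg, features)
```

Feeding the group probabilities to the regressor gives its trees a coarse age estimate to split on, in addition to the raw embedding dimensions.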

2.4 Training the Deep Architectures

Dataset Split. The VGG-Face2 MIVIA Age Dataset consists of 575,073 example cases [1, 2]. We used 90% of this dataset (517,562) for training and the remaining 10% for evaluation (57,511). As the dataset is not uniformly distributed in terms of age, we sample 10% from each age group j for evaluation, rather than sampling 10% uniformly across the entire set. We then divide the training data further into training and validation splits of 90% and 10%, respectively.
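Per-group sampling of this kind can be sketched with the standard library alone. The function name, the decade-wide group boundaries, the toy age list, and the choice to stratify the second (training/validation) split as well are all assumptions made for the example.

```python
import random
from collections import defaultdict

def age_group(age):
    """Map an integer age to a group index 0..7 (boundaries are an assumption)."""
    return min(age // 10, 7)

def stratified_split(ages, holdout_frac=0.10, seed=0):
    """Return (kept_indices, holdout_indices), drawing holdout_frac from each age group."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, age in enumerate(ages):
        by_group[age_group(age)].append(idx)
    kept, holdout = [], []
    for g in sorted(by_group):
        members = by_group[g][:]
        rng.shuffle(members)
        n_hold = round(holdout_frac * len(members))
        holdout.extend(members[:n_hold])
        kept.extend(members[n_hold:])
    return kept, holdout

# Toy imbalanced dataset of 1,000 samples.
ages = [5] * 20 + [25] * 500 + [35] * 400 + [75] * 80
trainval_idx, eval_idx = stratified_split(ages)

# Second split of the remaining data into training and validation subsets.
sub_train, sub_val = stratified_split([ages[i] for i in trainval_idx], seed=1)
train_idx = [trainval_idx[i] for i in sub_train]
val_idx = [trainval_idx[i] for i in sub_val]
```

Sampling within each group keeps rare age groups represented in the evaluation set in proportion to their size, which uniform sampling over the whole set does not guarantee.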

Pre-processing. Face images are normalized as explained in the pre-processing section. In addition, we resize the images to \(224 \times 224\) resolution to match the expected input of the ResNeXt network. For better network stability, we also normalize the age to range between 0 and 1.
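A minimal sketch of the two normalization steps follows. It assumes per-image statistics for the z-normalization and a hypothetical maximum age of 100 for scaling the targets; the text specifies neither. Resizing to \(224 \times 224\) is omitted because it depends on the image library in use.

```python
import numpy as np

MAX_AGE = 100.0  # hypothetical upper bound used to scale ages into [0, 1]

def z_normalize(image):
    """Zero-mean, unit-variance intensities (per-image statistics are an assumption)."""
    image = np.asarray(image, dtype=np.float32)
    return (image - image.mean()) / (image.std() + 1e-8)

def encode_age(age):
    """Scale an age in years to the [0, 1] training target."""
    return np.asarray(age, dtype=float) / MAX_AGE

def decode_age(target):
    """Map a network output in [0, 1] back to years."""
    return np.asarray(target, dtype=float) * MAX_AGE

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(224, 224, 3))   # stand-in for a cropped face image
normalized = z_normalize(face)
```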

Data Augmentation. For data augmentation, we use random horizontal flipping. Since our experiments showed that face-alignment hurts training, random rotations and distortions could also be applied in future work.

3 Experimental Results

In addition to the mean absolute error (MAE), we report the Age Accuracy and Regularity (AAR) metric. The GTA contest defines the \(\text {AAR}\) performance measure as:

$$\begin{aligned} \text {AAR} = \max (0, 7 - \text {MAE}) + \max (0, 3 - \sigma ) \end{aligned}$$
(7)

with a maximum score of 10, where \( \sigma =\sqrt{\frac{1}{8}{\sum _{j=1}^8(\text {MAE}_j-\text {MAE})^2}}\), \(\text {MAE}\) is the mean absolute error, and \(\text {MAE}_j\) is the mean absolute error for the \(j^{th}\) age group. The mean absolute error is given by \(\text {MAE} = \frac{1}{N} \sum _i |y_i - \hat{y}_{i}|\), where i indexes the samples across all age categories. All evaluations are performed on the evaluation set described in Sect. 2.4 unless specified otherwise.
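The metric can be computed directly from its definition. The sketch below assumes decade-wide age groups and that every group contains at least one sample; the function names are hypothetical.

```python
import numpy as np

def age_group(ages):
    """Map ages to group indices 0..7 (decade-wide boundaries are an assumption)."""
    return np.minimum(np.asarray(ages) // 10, 7).astype(int)

def aar_score(y, y_hat):
    """Return (AAR, MAE, sigma); requires samples in all eight age groups."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = np.abs(y - y_hat)
    mae = err.mean()
    groups = age_group(y)
    mae_j = np.array([err[groups == j].mean() for j in range(8)])  # per-group MAE
    sigma = np.sqrt(np.mean((mae_j - mae) ** 2))
    aar = max(0.0, 7.0 - mae) + max(0.0, 3.0 - sigma)
    return aar, mae, sigma

# One sample per group, each prediction off by exactly two years.
y_true = np.array([5, 15, 25, 35, 45, 55, 65, 75], dtype=float)
aar, mae, sigma = aar_score(y_true, y_true + 2.0)
print(aar, mae, sigma)  # 8.0 2.0 0.0
```

A uniform two-year error gives MAE = 2 and \(\sigma = 0\), hence AAR = 5 + 3 = 8; the same MAE concentrated in a few groups would raise \(\sigma \) and lower the score.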

3.1 ResNeXt

A performance comparison of ResNet vs. ResNeXt in terms of MAE is given in Table 1. Additionally, the table shows the difference in performance between our custom soft AAR loss and the mean squared error (MSE) loss \(\mathcal {L}_{MSE}\) where, for a given batch of size n:

$$\begin{aligned} \mathcal {L}_{MSE} = \frac{1}{n} \sum _{i=1}^n ||y_i - \hat{y}_i||^2 \end{aligned}$$
(8)

Table 1 shows that ResNeXt improves performance over its predecessor ResNet. Additionally, our custom AAR loss consistently improves the AAR metric in both networks. The LDS-trained ResNeXt network was trained using both the MSE and AAR loss functions, and the variant trained with the AAR loss performs better of the two. However, LDS did not help in improving the performance overall; hence, we choose to move forward with ResNeXt trained using the AAR loss.

Table 1. Accuracy comparison of ResNet, ResNeXt and ResNeXt with label distribution smoothing LDS using the evaluation data.

3.2 Two-Layer Random Forest (TLRF)

TLRF Classifier Module. Although classifying a face’s age group is an easier task than estimating the exact age, it is still challenging due to age class imbalance. Table 2 shows the performance of the TLRF classifier module on our evaluation set. The F1 measure is much lower for underrepresented age groups due to lower recall.

Table 2. TLRF age group classifier module performance (using ResNeXt descriptor) on evaluation data. Support is the subset of data in each of the eight age categories used for evaluation.

TLRF Regressor Module. Several regression random forest topologies were evaluated against our proposed TLRF. First, a traditional regression random forest (RRF) was trained on the ResNeXt 2048-dimensional feature descriptor and evaluated with different numbers of trees. We then compared this single-layer random forest approach to the proposed TLRF. Table 3 summarizes the RF ablation study results for ResNeXt in combination with different RF configurations. For this part, a ResNeXt-50 was trained using MSE loss. The ResNeXt-50 residual deep network with a fully connected final regression layer performed well, with an AAR of 8.01 on the held-out evaluation set and 8.16 on the combined training and evaluation (T+E) sets. Incorporating a regression random forest with 100 trees raises the AAR to 8.20 on the combined T+E sets but lowers it to 7.96 on the held-out evaluation set. Increasing the number of trees to 200 did not improve performance on the evaluation set and decreased the AAR to 8.17 on the combined T+E sets. Using the two-layer classification-plus-regression random forest with the same ResNeXt-50 feature embedding vector gives the best AAR of 8.21 on the combined T+E sets and 7.98 on the held-out evaluation set, an improvement over the traditional regression random forests. This model also had the smallest class standard deviation (\(\sigma \)) on both the held-out evaluation set and the combined set.

Table 3. Experimental results showing accuracy on training+evaluation (T+E) and evaluation (E) sets with different random forest learning methods (number of trees and number of layers). Last row ResNeXt+TLRF is our final result. RRF refers to Regression Random Forest. All ResNeXt networks were trained using the MSE loss.

3.3 Generalizability Performance Using the Withheld GTA Data

Based on the results described previously, we submitted the ResNeXt+TLRF as our single official submission to the GTA Challenge competition. After our official submission to the GTA contest, we continued to explore the generalization capability of the different architectures on the unseen hidden dataset with assistance from the MIVIA Lab at the University of Salerno.

Experimental results in Table 4 show that the proposed custom AAR loss function consistently improves the generalization of facial age estimation, in terms of MAE, on both our evaluation set and the GTA hidden test set for all methods. ResNeXt trained using the AAR loss function has the highest AAR score: 8.12 on the held-out evaluation data and 7.02 on the hidden test set. The submitted ResNeXt+TLRF method also generalizes well to unseen faces and has the lowest age-group standard deviation (\(\sigma \)) of 0.98. Apart from the two underrepresented age groups (MAE\(_1\) and MAE\(_8\) with \(<1.0\%\) samples), the MAE scores are quite consistent between our evaluation split and the GTA challenge’s hidden test data. It is important to note that the standard deviation, \(\sigma \), is more than eight times higher on the hidden dataset than on the held-out evaluation data due to larger deviations in MAE\(_1\) and MAE\(_8\). The method with the lowest variation on the hidden (withheld) data is ResNeXt (MSE)+TLRF, which we submitted to the GTA contest; it is italicized in Table 4.

Table 4. Results on the Guess the Age (GTA) contest hidden (or withheld) test dataset. Column labeled D indicates dataset used: T, for our separate evaluation set (see Sect. 2.4); H, for the unseen hidden GTA challenge test set.

Additionally, using MSE loss, our TLRF method outperformed LDS. Although ResNeXt (AAR) without a TLRF module achieves a slightly better AAR score than ResNeXt+TLRF, it has a higher (worse) \(\sigma \) score, which indicates less generalizability across underrepresented age groups. Other augmentation methods may enhance the generalization capability of the architectures, such as pretraining with automatic face aging methods that provide a large amount of ground truth across age categories [37], selectively augmenting the least represented groups more heavily, or incorporating augmentation in feature space during random forest training.

4 Conclusions

Accurate unconstrained age estimation or categorization, using images or video, is useful in a number of applications including face recognition, age-appropriate advertising and retail, venue access, detecting deep fakes, health and exercise monitoring, emotion analysis, forensics, and privacy and security applications [38]. Our proposed two-stage supervised learning pipeline for facial age estimation, a ResNeXt deep learning stage followed by a two-layer random forest (TLRF), was able to estimate age with a mean absolute error of about 2 years across all eight age categories with a standard deviation of less than one year. Despite the significant class imbalance in the training data, we were able to achieve an AAR score of 6.97\(\pm 0.98\) (ResNeXt+TLRF) and 7.02\(\pm 1.16\) (ResNeXt) out of 10.0 on the hidden test data of the VGG-Face2 MIVIA Age Dataset as part of the Guess the Age (GTA) contest. The most challenging age categories are the youngest and oldest groups at the two extremes of the age distribution, for which there was the least amount of training data (less than 1%). The experimental results demonstrate that a distribution-adaptive (AAR) loss function is effective for training with class imbalance. Face alignment did not improve performance and test-time data augmentation had limited benefit. For facial age estimation, an ensemble of weak learners trained on deep features is less sensitive to under-represented age groups than a purely deep neural regression model trained in an end-to-end fashion.