Keywords

1 Introduction

Age and gender prediction has become one of the more recognized fields in deep learning, due to the increased rate of image uploads on the internet in today’s data driven world. Humans are inherently good at determining one’s gender, recognizing each other and making judgements about ethnicity but age estimation still remains a formidable problem. To emphasize more on the difficulty of the problem, consider this - the most common metric used for evaluating age prediction of a person is mean absolute error (MAE). A study reported that humans can predict the age of a person above 15 years of age with a MAE of 7.2–7.4 depending on the database conditions [1]. This means that on average, humans make predictions off by 7.2–7.4 years. The question is, can we do better? Can we automate this problem in a bid to reduce human dependency and to simultaneously obtain better results?

One must acknowledge that aging of face is not only determined by genetic factors but it is also influenced by lifestyle, expression, and environment [1]. Different people of similar age can look very different due to these reasons. That is why predicting age is such a challenging task inherently. The non-linear relationship between facial images and age/gender coupled with the huge paucity of large and balanced datasets with correct labels further contribute to this problem. Very few such datasets exist, majority datasets available for the task are highly imbalanced with a huge chunk of people lying in the age group of 20 to 75 [3,4,5] or are biased towards one of the genders. Use of such biased datasets is not prudent as it would create a distribution mismatch when deployed for testing on real-time images, thereby giving poor results.

This field of study has a huge amount of underlying potential. There has been an ever-growing interest in automatic age and gender prediction because of the huge potential it has in various fields of computer science such as HCI (Human Computer Interaction). Some of the potential applications include forensics, law enforcement [1], and security control [1]. Another very practical application involves incorporating these models into IoT. For example, a restaurant can change its theme by estimating the average age or gender of people that have entered so far.

The remaining part of the paper is organized as follows. Section 2 talks about the background and work done before in this field and how it inspired us to work. Section 3 contains the exact technical details of the project and is further divided into three subsections. Section 4 talks about the evaluation metric used. Section 5 presents the various experiments we performed along with the results we obtained, and finally Sect. 6 wraps up the paper with conclusion and future work.

2 Related Work

Initial works of age and gender prediction involved techniques based on ratios of different measurements of facial features such as size of eye, nose, distance of chin from forehead, distance between the ears, angle of inclination, angle between locations [8]. Such methods were known as anthropometric methods.

Early methods were based on manual extraction of features such as PCA, LBP, Gabor, LDA, SFP. These extracted features were then fed to classical ML models such as SVMs, decision trees, logistic regression. Hu et al. [9] used the method of ULBP, PCA & SVM for age estimation. Guo et al. [10] proposed a locally adjusted robust regression (LARR) algorithm, which combines SVM and SVR when estimating age by first using SVR to estimate a global age range, and then using SVM to perform exact age estimation. The obvious down-side of such methods was that not only getting anthropometric measurements was difficult but the models were not able to generalize because people of different age and gender could have the same anthropometric measurements.

Recently the use of CNN for age and gender prediction has been widely adopted as CNNs are pretty robust and give outstanding results when tested on face images with occlusion, tilt, altered brightness. Such results have been attributed to its good ability to extract features. This happens by convolving over the given image to generate invariant features which are passed onto the next layer in a sequential fashion. It is this continual passing of information from one layer to the next that leads to CNNs being so robust and supple to occlusions, brightness changes etc.

The first application of CNNs was the Le-Net-5 [11]. However, the actual boom in using CNNs for age and gender prediction started after D-CNN [12] was introduced for image classification tasks. Rothe et al. [13] proposed DEX: Deep EXpectation of Apparent Age for age classification using an ensemble of 20 networks on the cropped faces of IMDB-Wiki dataset. Another popular work includes combining features from Deep CNN to features obtained from PCA done by Wang et al. [14].

3 Methodology

3.1 Dataset

In this paper, we use the UTKFace dataset [2] (aligned and cropped) consists of over 20,000 face images with annotations of age, gender, and ethnicity. It has a total of 23708 images of which 6 were missing age labels. The images cover large variations in facial expression, illumination, pose, resolution and occlusion. We chose this dataset because of its relatively more uniform distributions, the diversity it has in image characteristics such as brightness, occlusion and position and also because it involves images of the general public.

Some sample images from the UTKFace dataset can be seen in Fig. 1. Each image is labeled with a 3-element tuple, with age (in years), gender (Male-0, Female-1) and races (White-0, Black-1, Asian-2, Indian-3 and Others-4) respectively.

Fig. 1.
figure 1

Sample images from the UTKFace dataset.

For both our approaches (custom CNN and transfer learning based models), we used the same set of images for training, testing and validation, to have standardized results.

This was done by dividing the data sets into train, test and validation in 80: 10: 10 ratios. This division was done while ensuring that the data distribution in each division remains roughly the same, so that there is no distribution mismatch while training and testing the models. The Table 1 and Table 2 show the composition of training, validation and test data with respect to gender and age respectively.

Table 1. Composition of sets by gender
Table 2. Composition of sets by age
Table 3. Network architecture for age estimation

3.2 Deep CNNs

Network Architecture.

The tasks tackled using the deep CNN approach include age and gender classification and age estimation. The basic structure of each of the 3 models includes a series of convolutional blocks, followed by a set of FC (fully connected) layers for classification and regression. An RGB image is fed to the model and is resized to 180 × 180 × 3. Every architecture comprises convolutional blocks that are a stack of convolutional layers (filter size is 3 × 3) followed by non-linear activation ‘ReLU’, max pooling (2 × 2) and batch normalization to mitigate the problem of covariate shift. The deeper layers here also have spatial dropout (drop value of 0.15–0.2) which drops entire feature maps to promote independence between them. Following the convolutional blocks, the output is flattened before feeding that into FC layers. These FC layers have activation function of ReLU, dropout (value between 0.2 & 0.4) and batch normalization. Table 3 shows the architecture used for age estimation.

The architectures for age classification and gender classification differ in the fact that they have 3 & 2 blocks with 256 filters respectively (in convolutional layer) and the output layer has 5 and 2 neurons respectively with softmax activation function (being classification tasks).

Training and Testing.

For age classification, our model classifies ages into 5 groups (0–24, 25–49, 50–74, 75–99, and 100–124). For this, we had to perform integer division (by 25) on the age values and later one hot encodes them before feeding them into the model. Similarly, gender also had to be one hot encoded for gender classification into male and female. The loss function chosen for age estimation was mean-squared error (MSE) as it is a regression task, whereas for age and gender classification it was categorical-cross entropy. For training, each model was trained using a custom data generator that allows training in mini-batches. Learning rate decay was used during training as it allowed the learning rate to decrease after a fixed number of epochs. This is essential as the learning rate becomes very precarious during the latter stages of training, when approaching convergence. Various experiments with different optimizers were conducted, the results of which have been summarized in Sect. 5.

Each model was trained between 30 to 50 epochs on average. Initial learning rate was set of the order 1e–3. Batch size of 32 was used. The learning rate was changed to 0.6 times the current learning rate after about 9 epochs (on average) to ensure that by the end of training, the learning rate is small enough for the model to converge to the local minimum. Figure 2 showcases the training plots of our models. In all graphs, the blue line denotes the training and the red line denotes the validation result. It is very evident that the training for gender classification was the noisiest whereas the training for age estimation was the smoothest.

Fig. 2.
figure 2

Training plots depicting loss by the epoch of a) Age Estimation b) Age Classification c) Gender Classification

Table 4 shows the lowest loss value to which our model could converge while training.

Table 4. Minimum loss value

The next subsection explores our work using transfer learning.

3.3 Transfer Learning

Transfer learning is one of the most powerful ideas in deep learning. It allows knowledge learned on one task to be applied to another. A lot of low-level features that determine the basic structure of the object can be very well learned from a bigger available dataset and knowledge of those transferred low-level features can help learn faster and improve performance on limited data by reducing generalization error.

The UTKFace dataset is a very small dataset to capture the complexity involved in age and gender estimation, so we focused our attention further on leveraging transfer learning. One study [6] has already compared performance of fine-tuning and pre-training state-of-the-art models for ILSVRC for age estimation on UTKFace. We take it a step further by using convolutional blocks of VGG16 pretrained on VGGFace [4] and ResNet50 and SE-ResNet-50 (SENet50 in short) pre-trained on VGGFace2 [5], as feature extractors. These models are originally proposed for facial recognition, thus can be used for higher level of feature extraction. To avoid any confusion, in this paper we denote these models as VGG_f, ResNet50_f and SENet50_f respectively where f denotes pre-trained using facial images of respective datasets.

Network Architecture.

The tasks tackled using this transfer learning approach include age estimation and gender classification. Following is the network architecture we used in our models to train on top of features extracted.

For the gender classification, for convenience, we chose custom model names VGG_f_gender, ResNet50_f_gender and SENet50_f_gender whose design as follows. VGG_f_gender comprises of 2 blocks, each containing layers in order of batch normalization, spatial dropout with drop probability of 0.5, separable convolutions layers with 512 filters of size 3 × 3 with keeping padding same to reduce loss of information during convolution operations followed by max pooling with kernel size 2 × 2. The fully connected system consisted of batch norm layers, followed by alpha dropout, and 128 neurons with ReLU activation and He uniform initialized followed by another batch norm layer and finally the output layer with 1 neuron with sigmoid activation. Batch size chosen was 64. ResNet50_f_gender comprises of just the fully connected system with batch norm, dropout with probability of 0.5, and followed by 128 units with exponential linear units (ELU) activation, with He uniform initialization and having max norm weight constraint of magnitude 3. The output layer had single neuron with sigmoid activation. The batch size we chose for this was 128. For, SENet50_f_gender we kept the same model as for ResNet50_f_gender.

For the age estimation the models have been named VGG_f_age, ResNet50_f_age and SENet50_f_age. VGG_f_age consists of 2 convolution blocks each containing in order, a batch norm layer, spatial dropout with keep probability of 0.8 and 0.6 respectively, separable convolution layer with 512 filters of size 3 × 3, padding same so that dimension doesn’t change (and information loss is curtailed), with ReLU activation function and He initialization. Each convolution block was followed by max pooling with kernel size 2 × 2. The fully connected system consisted of 3 layers with 1024, 512, 128 neurons respectively, with a dropout keep probability of 0.2, 0.2, and 1. Each layer had ELU activation function with He uniform initialization. The output layer had one unit, ReLU activation function with He uniform initialization and batch normalization. Batch size of 128 was chosen. ResNet50_f_age consists of a fully-connected system of 5 layers with 512, 512, 512, 256, 128 units with dropout with keep probability of 0.5, 0.3, 0.3, 0.3 and 0.5 respectively. Each of the layers contains batch normalization and has Scaled Exponential Linear Unit (SELU) as the activation function. Like previously, for SENet50_f_age we kept the same model as for ResNet50_f_age.

Training and Testing.

In order to save training time each set was separately forward passed via each model to get corresponding 9 Numpy ndarrays as extracted input feature vectors and saved. Since the faces were already aligned and cropped no further preprocessing was carried out and input dimensions are kept same as original RGB photos i.e., 200 × 200 × 3. For gender classification, the loss is binary cross-entropy function. Class weights were also taken into account while training to make up for slight class imbalance as there are roughly 48% female and 52% male in the both the training and validation set. For age estimation, being a regression task, the loss function was mean squared error. The optimizer used in both cases is the AMSGrad variant of Adam [15] with an initial learning rate of 0.001 which is halved in the ending phase of training for better convergence. The choice of optimizer was based on the experiments carried out while training our custom CNN architecture and theory [15].

4 Evaluation

The performance of the age estimation algorithms is evaluated based on the closeness of the predicted value to the actual value. The metrics widely used for the age estimation as a regression task is the mean absolute error or MAE which captures the average magnitude of error in a set of predictions. MAE calculates the absolute error between actual age and predicted age as defined by the Eq. (1).

$$ MAE = \frac{1}{N}\sum\nolimits_{j = 1}^{n} {\left| {y_{j} - \hat{y}_{j} } \right|} $$
(1)

Where n is the number of testing samples, \( y_{j} \) denotes the ground truth age and \( \hat{y}_{j} \) is the predicted age of the j-th sample.

For classification tasks (age and gender), the evaluation metric used was accuracy which denotes the fraction of correctly classified samples over the total number of samples.

5 Experimentation and Results

In this section we summarize the results obtained via the extensive experiments performed in the study and compare different methods from work of other researchers.

5.1 Deep CNNs

We experiment our models in 3 distinct steps. Each successive step uses the model performing the best in the previous step.

First, we tried two of the most popular layer types for convolutional layers. We trained and tested the performance of all - age estimation, age classification and gender classification on 2 types of fundamental convolutional layers - the simple convolutional layer (Conv2D) and separable convolutional layer (Separable Conv2D) with spatial dropout being present in both cases, for increased regularization effect. Rest all hyper parameters were kept the same (Table 5).

Table 5. Comparison of layer type

It is apparent that separable convolution coupled with spatial dropout (in the convolutional layers) helped the model in converging faster and generalize better. This is because, separable convolutions consist of first performing a depth wise spatial convolution (which acts on each input channel separately) followed by a pointwise convolution which mixes the resulting output channels. Basically, separable convolutions factorize a kernel into 2 smaller kernels, leading to lesser computations, thereby helping the model to converge faster.

Then we experimented with other arguments associated with the namely the type of weight initialization and weight constraints which determine the final weights of our model and hence its performance. Table 6 summarizes the results of this experiment.

Table 6. Comparison of models based on the arguments of the best performing layer

‘He’ initialization resulted in better performance when ReLU activation function was used than Xavier initialization.

Again, for each task we chose the configuration that gave the best result on the validation set and tried a bunch of different optimizers in order to maximize performance. Optimizer plays a very crucial role in deciding the model performance as it decides the converging ability of a model. Ideally, we want to have an optimizer that not only converges to the minimum fast, but also helps the model generalize well (Table 7).

Table 7. Effect of various optimizers on results

These results show Adam and its variant (Adamax) provide the best results. Adam and its variants were observed to converge faster. On the other hand, it was observed that models trained using SGD were learning very slowly and saturated much earlier especially when dealing with age.

5.2 Transfer Learning

Table 8 compares the performance based on the different extracted features, on which our models were trained.

Table 8. Comparison based on feature extractors

It is clear that the features extracted using SENet-50_f performed best for both the tasks compared to ResNet50_f and VGG_f even though we trained more layers for VGG_f.

In the study [7], a linear regression model and ResNeXt-50 (32 × 4d) architecture was trained from scratch on the same dataset for age estimation using Adam. In another study [6], various state-of-the-art models pre-trained on ImageNet were used where the authors trained two new layers while freezing the deep CNN layers which acted as feature extractor followed by fine-tuning the whole network with a smaller learning rate later using SGD with momentum. Both studies had their models evaluated on 10% size of the dataset, utilizing remaining for training or validation (Table 9).

Table 9. Comparison with others’ work

Since we got best performance, from features extracted via SENet50_f, for both tasks of age and gender classification in Table 10 and Table 11, we further provide baseline performance on them for various machine learning algorithms on the same splits of the dataset. Validation set is not used since we haven’t tuned these models, default hyper parameters of Sci-kit learn and XGBoost libraries have been used for this.

Table 10. Untuned baseline for gender classification
Table 11. Untuned baseline for age estimation

Clearly, even simple linear regression outperformed training our custom CNN model for age estimation and logistic regression came remarkably close to the custom CNN architecture for gender classification on the features extracted using SENet50_f.

As expected, our model performs relatively poorly while predicting ages for people above 70 years of age. This is quite evident from Table 2. where it can be seen that there are only 5.78% images in the dataset belonging to people above 70 (albeit the dataset is quite evenly balanced when it comes to gender). We believe much better results can be attained using a more balanced and larger dataset.

6 Conclusion

Inspired by the recent developments in this field, in this paper we proposed two ways to deal with the problem of age estimation, age and gender classification - a custom CNN architecture and transfer learning based pre-trained models. These pre-trained models helped us combat overfitting to a large extent. It was found that our models generalized very well with minimal overfitting, when tested on real-life images.

We plan to extend our work on a larger and more balanced dataset with which we can study biases and experiment with more things in order to improve the generalizability of our models. In future research, we hope to use this work of ours as a platform to improvise and innovate further and contribute to the deep learning community.