
1 Introduction

The use of biometric features to determine a person's age is described as age estimation and can be accomplished using biometric traits such as eye color, skin, and facial emotions [1]. In recent years, numerous studies have investigated age estimation from facial images for real-world applications such as access monitoring and control and enhanced communication. These applications have contributed significantly to emotional intelligence, security control, and entertainment, among several others [1,2,3].

Many studies have been carried out to implement age estimation effectively, although several limitations and drawbacks still exist [6]. Previous approaches face real-world limitations, mainly due to variations in the conditions under which some images are taken, which can in turn affect the accuracy of the estimations. These conditions include lighting, background, and facial expression, which, cumulatively, can reduce the quality of the images to be processed [4, 5].

Several techniques have been previously employed, including hand-crafted methods for extracting visual features [5, 6], different age modeling approaches [7] for image representation, and machine-learning approaches for solving classification [14] and regression [15] problems. Most of these approaches have one or two weaknesses, such as considering only the distance between facial landmarks while neglecting facial appearance, being unable to encode wrinkles on the face, and being unable to match boundaries in an image [6].

In this work, we propose an efficient deep learning approach that serves as an end-to-end model to increase the accuracy of deep neural networks on the age estimation task. Convolutional Neural Networks are known for their ability to learn from raw pixels and exploit spatial correlations in data, among other capabilities, with proven high performance in computer vision and image processing. The proposed model is trained on a large benchmark dataset of unconstrained faces, and its accuracy is compared with other state-of-the-art approaches.

This paper is structured as follows: in Sect. 2, we describe the background and related work; in Sect. 3, we describe the proposed Convolutional Neural Network method and its application to age estimation; in Sect. 4, we provide a detailed analysis of the experimental results; and finally, the conclusions are presented in Sect. 5.

2 Background and Related Work

The possibility of performing age estimation using facial images, based on biometric features extracted from a person's face, has been previously established. A survey carried out by Angulu, R. et al. [6] provided extensive details of recent research on aging and age estimation, covering popular algorithms of existing models with comparative analyses of their performance.

Deep learning methods such as deep convolutional neural networks have performed excellently on different computer vision and image recognition tasks, such as age and gender classification [9] and diagnosing tuberculosis [12], because of their ability to learn features without explicit feature engineering and to handle extensive data [4, 13]. In 2016, Zhang, Z. [2] studied apparent age estimation with a CNN based on the GoogLeNet architecture [20]. A batch normalization layer was added to the model after each rectified linear unit operation. However, their method did not work well on older people, blurry images, and misaligned faces.

Anand, A. et al. [1] modeled a CNN method to cumulatively perform feature-level fusion, dimension reduction of the feature space, and individual age estimation. The WIKI dataset [21] was employed to model Feed Forward Neural Networks (FFNNs) for parameter estimation prior to training. The Adience benchmark dataset was used to compare the exact accuracy with state-of-the-art techniques.

Mualla, N. et al. [10], in their study, used PCA for feature extraction and reduction. Thereafter, a Deep Belief Network (DBN) classifier was compared with K-nearest neighbor (KNN) and Support Vector Machine (SVM) classifiers. Their results show that the DBN significantly outperforms the KNN and SVM classifiers in age estimation accuracy.

Niu, Z. et al. [16] presented a multiple-output CNN learning algorithm with ordinal regression to solve the classification problem for age estimation. Yi, D. et al. [17] proposed the use of a multi-scale convolutional network for age estimation. In their study, the CNN model was not very deep: it consisted of only one convolution layer, one pooling layer, one local layer, and one fully connected layer, for a total of four layers, and only a subset of MORPH2 was used to train the model.

Sendik, O. et al. [8] applied a Convolutional Neural Network (CNN) and a Support Vector Regressor (SVR) to their face-based age estimation task. They trained the CNN for representation and metric learning, and the SVR was applied to the learned features. The work demonstrated that, by retraining the SVR layer, small datasets can be analysed.

Huang, J. et al. [11] proposed a deep learning technique for age group classification using Deep Identification features, with an architecture optimized for face representation. They analysed their architecture on both constrained and unconstrained databases, where it performed well.

The study carried out by Das, A. et al. [19] analyzed the classification of gender, age, and race as a multi-task problem. They proposed a multi-task Convolutional Neural Network (MTCNN) that uses unconnected features of the fully connected layers, splitting these layers to perform multi-task learning for more efficient face attribute analysis. In doing so, they took advantage of the hierarchical distribution of information present in CNN features. The experiment was carried out using two datasets: UTKFace and Bias Estimation in Face Analytics (BEFA). Their approach provided promising outcomes on both datasets.

3 Methods and Techniques

3.1 Datasets

The following publicly available databases were used for the experiments in this work.

UTKFace

The UTKFace dataset is a large face dataset consisting of over 20,000 face images in the wild, with a long age span ranging from 0 to 116 years. Each facial image is annotated with age, gender, and ethnicity, and some samples provide corresponding aligned and cropped faces. The images cover large variations in pose, facial expression, illumination, occlusion, and resolution, among others. The UTKFace dataset has been used for different tasks, e.g., age estimation [19] and age progression/regression [22]. In this study, the ages of the face images are grouped as shown in Table 1.

The dataset can be accessed at https://www.kaggle.com/jangedoo/utkface-new.
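
To illustrate how the age labels can be obtained, the following sketch parses the UTKFace filename convention ([age]_[gender]_[race]_[date&time].jpg) and maps each age to a class index. The group boundaries used here are placeholders; the grouping actually used follows Table 1.

```python
import glob
import os

def parse_utkface_age(path):
    # UTKFace encodes labels in the filename as [age]_[gender]_[race]_[date&time].jpg
    return int(os.path.basename(path).split("_")[0])

def age_to_group(age):
    # Illustrative boundaries only; the grouping actually used is given in Table 1.
    if age <= 15:
        return 0
    elif age <= 30:
        return 1
    elif age <= 80:
        return 2
    else:
        return 3

image_paths = glob.glob("UTKFace/*.jpg")
labels = [age_to_group(parse_utkface_age(p)) for p in image_paths]
```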

Table 1. Ages from the dataset as grouped

FGNET

The FGNET dataset contains only 1002 images of 82 individuals in the wild; the ages in the dataset range from 0 to 69 years, with about 12 images per person. It is a small dataset in which the constituent images of each person represent different ages.

3.2 Preprocessing

Working with a deep neural network requires a large amount of data to achieve high accuracy in age estimation. We therefore carried out data augmentation on our dataset to balance the number of images in each age class. In Table 1, the age class \( \ge \) 80, compared with the age class 31–80, illustrates the differences in the number of images per class. These differences could adversely affect the accuracy of our model. For the augmentation process, we applied different operations: flip left-right (0.5), skew (0.4, 0.5), and zoom (probability = 0.2, min factor = 1.1, max factor = 1.5).
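
A minimal sketch of this augmentation step is shown below, assuming the Augmentor library is the tool behind the listed probabilities and factors; the source folder, the interpretation of the second skew value as a magnitude, and the number of generated samples are illustrative assumptions.

```python
import Augmentor

# Build an augmentation pipeline over one under-represented age-class folder
# so that it can be topped up to a balanced number of images.
pipeline = Augmentor.Pipeline("UTKFace_grouped/age_80_plus")  # assumed folder layout

pipeline.flip_left_right(probability=0.5)
pipeline.skew(probability=0.4, magnitude=0.5)   # assumes (0.4, 0.5) = (probability, magnitude)
pipeline.zoom(probability=0.2, min_factor=1.1, max_factor=1.5)

# Generate augmented samples for this class (target count is illustrative).
pipeline.sample(2000)
```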

Fig. 1. Sample of the preprocessed images for the flip, skew, and zoom.

Fig. 2. The architecture of the proposed age estimation model.

3.3 Proposed Model

Convolutional Neural Networks are deep neural networks widely used for classification tasks. The novel CNN architecture in this work consists of four convolutional layers and two fully connected layers. The input image, of shape 200 \(\times \) 200 \(\times \) 1, is passed to the first convolutional layer, which learns 32 filters. Apart from the second convolutional layer, which learns 64 filters with a 5 \(\times \) 5 kernel, the other convolutional layers use a 3 \(\times \) 3 kernel, the same padding, and a rectified linear unit (ReLU) activation function. Each of the four convolutional layers is followed by max pooling with a 2 \(\times \) 2 pooling size, a dropout of 0.25, and a batch normalization layer. The first fully connected layer consists of 512 neurons with a ReLU activation and a dropout ratio of 0.5. The second fully connected layer has four (4) neurons with softmax as the activation function. After features have been extracted from the input images by the convolutional layers, the softmax classifier determines the probability of each class from the extracted features and returns a confidence score for each class. The mathematical representation of the softmax function is as follows (Figs. 1 and 2):

$$ S(x_i) = \frac{e^{x_i}}{\sum _{j=1}^{k} e^{x_j}} $$

where each element \( x_i \) of the input vector is exponentiated and divided by the sum of the exponentials of all the inputs.
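
A minimal Keras sketch of this architecture is shown below. Layer sizes follow the description above; details the text does not state explicitly, such as the filter counts of the third and fourth convolutional layers, are assumptions.

```python
from tensorflow.keras import layers, models

def build_model(num_classes=4):
    """Four convolutional blocks followed by two fully connected layers."""
    return models.Sequential([
        layers.Input(shape=(200, 200, 1)),

        # Block 1: 32 filters, 3x3 kernel
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.BatchNormalization(),

        # Block 2: 64 filters, 5x5 kernel
        layers.Conv2D(64, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.BatchNormalization(),

        # Blocks 3 and 4: filter counts are assumptions, following the
        # rule of increasing the number of filters in deeper layers.
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.BatchNormalization(),

        layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.BatchNormalization(),

        # Classifier head: 512-neuron dense layer, then a 4-way softmax.
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
```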

Our model was trained using the Adam optimizer with an initial learning rate of 0.0001, and we used the ReduceLROnPlateau callback to monitor the validation accuracy and loss, setting the patience to ten (10) and the minimum learning rate to 0.001. The Adam update rule is given below:

$$ \theta _{t+1} = \theta _{t} - \frac{ \eta \, \hat{m}_t }{ \sqrt{\hat{v}_t} + \epsilon } $$

where

$$ \hat{m}_{t} = \frac{ m_t }{ 1-\beta _1^t }, \qquad \hat{v}_{t} = \frac{ v_t }{ 1-\beta _2^t } $$

and where

$$ m_t = (1-\beta _1) g_t + \beta _1 m_{t-1}, \qquad v_t = (1-\beta _2) g_t^2 + \beta _2 v_{t-1} $$

where \( \eta \) is the learning rate, \( g_t \) is the gradient at step \( t \), \( \beta _1 \) and \( \beta _2 \) are the exponential decay rates of the moment estimates, and \( \epsilon \) is a small constant for numerical stability (Table 2).
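
A sketch of the corresponding training configuration in Keras is given below; the batch size, number of epochs, and data variables are placeholders, since the actual training details are given in Table 2.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

model = build_model(num_classes=4)
model.compile(
    optimizer=Adam(learning_rate=1e-4),   # initial learning rate 0.0001
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Reduce the learning rate when the validation loss stops improving.
reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",
    patience=10,
    min_lr=0.001,   # value as reported above
)

# x_train, y_train, x_val, y_val are assumed to be the preprocessed images
# and one-hot encoded age-group labels; batch size and epochs are placeholders.
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=64,
    epochs=50,
    callbacks=[reduce_lr],
)
```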

Table 2. The training details with UTKFace dataset

During implementation, fewer convolutional filters are used in the initial layers and the number of filters is increased in subsequent layers. Furthermore, a wider window size is used in the initial layers compared to the later layers. Finally, the use of dropout helps to avoid over-fitting (Table 3).

Table 3. The details of the architecture for the proposed method

4 Result and Discussion

The overall performance of the proposed model is evaluated using the accuracy metric, and the system achieves an accuracy of 89.75% on the classifier. Figure 3 shows the accuracy achieved with this method; the model accuracy curve approaches its optimum before five epochs. The categorical cross-entropy used as the loss function of the classifier reaches a loss of 0.3389 on the validation dataset, as shown in Fig. 3. Our model correctly estimates the age group of faces in many instances, while in some instances misclassifications were recorded.
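
For reference, a minimal sketch of how the reported validation accuracy and categorical cross-entropy loss can be obtained in Keras (variable names are assumptions):

```python
# Evaluate the trained classifier on the held-out validation set.
val_loss, val_accuracy = model.evaluate(x_val, y_val, verbose=0)
print(f"Validation loss: {val_loss:.4f}, accuracy: {val_accuracy * 100:.2f}%")
```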

Our model demonstrated relatively higher accuracy when implemented on the UTKFace dataset, representing the state of the art when compared with the accuracy reported by Sithungu et al. [23] and Das et al. [19] (see Table 4). Sithungu et al. [23] employed a lightweight CNN model for age estimation, training and testing its performance on the UTKFace dataset. Only 7700 out of 23718 images were used, which presents a limitation of their study, since training a model with little data causes overfitting. In contrast, we applied data augmentation in our study, which enabled us to use a large dataset to train and test our classifier. Das et al. [19] proposed the Multi-Task Convolutional Neural Network (MTCNN) with joint dynamic weight-loss alteration to classify gender, age, and race jointly. While they achieved high gender and race precision, their age estimate was less accurate than that of our model.

Fig. 3. Model accuracy and loss.

Table 4. Proposed model as compared

5 Conclusion

The major challenge associated with age classification, as deduced from models applied in previous studies, is the stochastic nature of aging among individuals and the uncontrollable age progression information displayed on faces. This validates the need for a more efficient model that can accurately extract features from input images, which is the major underlying process in image classification. Although our CNN approach is a custom one, with the introduction and application of a large facial image dataset we consider that our model achieves a meaningful improvement. In this study, the proposed model is implemented on the UTKFace dataset to extract distinctive features, which are then classified into the appropriate classes. The efficiency of the model is validated on the FGNET dataset, and its performance is compared with other studies, as shown in Table 4. The proposed CNN architecture performed very well and demonstrated state-of-the-art results on the UTKFace dataset. For future work, we will consider the effect of gender information on age estimation and a deeper CNN model for age, gender, and ethnicity. We will also explore applying pretrained deep learning models in our study.