1 Introduction

Age classification from the face plays a significant role in social life. The human face conveys important information about individual traits, and this information plays a pivotal role in face-to-face interaction. Such interaction depends heavily on our ability to estimate these traits from facial appearance: age, gender, facial expression, pose, and ethnic group (Gallagher and Chen 2008). Although these facial attributes matter in everyday life, the ability to predict them correctly and dependably from a facial image still falls short of the requirements of commercial applications (Badame and Jamadagni 2017).

Despite continuous research in age estimation, in which both academia and industry have devoted effort to algorithm design, modeling, data collection, system performance testing, and valid evaluation protocols, determining the accurate age of an individual remains challenging (Angulu et al. 2018). Facial aging is influenced not only by intrinsic factors such as changes in face size, facial feature shape, wrinkles, face contour, and the distribution of facial features, but also by extrinsic factors such as lifestyle, health, eating habits, sociality, environment, and weather conditions (Farkas et al. 2013).

In automatic age estimation, obtaining the facial features needed for face representation has been very challenging with the conventional methods employed by most existing techniques (Gurpinar et al. 2016). Many of those methods are hand-crafted and require strong prior knowledge to engineer; they cannot be relied on to predict a person’s age accurately. Recently, Convolutional Neural Network (CNN) methods have been applied to age estimation and have yielded better accuracy. As learning-based feature representation methods, deep learning and CNNs learn discriminative feature representations directly from raw pixels and acquire the linear feature filters needed to project face images into another feature space (Liu et al. 2015b). CNN architectures possess the ability to extract facial features distinctively and robustly from the face image (Antipov et al. 2016).

However, because facial age estimation is a challenging task, efforts to improve its accuracy continue, and researchers keep studying different CNN approaches to further improve results. We therefore present a survey of the state-of-the-art deep learning approaches for facial age estimation.

Our contributions are summarized below:

  1. We present a survey of different state-of-the-art CNN architectures for facial age estimation, describing their strengths and weaknesses.

  2. We also study the current facial aging benchmark databases and suggest their suitability to different age estimation techniques.

  3. We outline different state-of-the-art algorithms and techniques used for age estimation, highlighting their strengths and weaknesses.

  4. We review the standard age estimation accuracy metrics and their areas of applicability in age estimation.

  5. We also present a concise report of the different CNN methodologies employed in the current state-of-the-art and their performance.

A typical age estimation system using the CNN method is presented in Fig. 1. To the best of our knowledge, this is the first paper to present a comprehensive review of CNN methods in age estimation; others, such as Angulu et al. (2018) and Fu et al. (2010), focused on hand-crafted and other machine learning methods.

Fig. 1

A typical CNN age estimation system. An age estimation system follows a general process that includes face detection, image preprocessing (landmark detection and face alignment), feature extraction (extracting the useful features from the input image), and the classification itself

The rest of this work is arranged as follows. Section 2 presents different CNN architectures associated with age estimation, Sect. 3 discusses the popular facial aging datasets, Sect. 4 discusses different age estimation algorithms, Sect. 5 reviews previous works in age estimation using the CNN approach, and Sect. 6 presents the discussion. We then summarize our contributions and conclusions in Sect. 7.

2 A review of state-of-the-art CNN architectures

In recent times, deep learning and CNNs have demonstrated promising performance in feature learning and face recognition. They can learn discriminative trait descriptors directly from image pixels (Liu et al. 2017a), and these traits are needed to estimate a person’s age correctly. The AlexNet, GoogLeNet, VGGNet, ResNet, SqueezeNet and Xception CNN architectures are generally considered the most common because of their state-of-the-art performance on different benchmarks, including the age estimation task. The architectures are described below:

2.1 AlexNet architecture

One of the earliest CNN architectures used for age estimation was presented by Krizhevsky et al. (2017). AlexNet, the winner of the 2012 edition of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), is regarded as the first successful CNN architecture; it was trained on the “ImageNet” dataset with about 1.2 million labeled images of objects. AlexNet has a simple layout of eight layers: five convolutional and three fully-connected. The architecture is similar to that of LeNet but deeper, with stacked convolutional layers and more filters. The depth of the AlexNet model contributed to the performance of the works in Levi and Hassncer (2015), Anand et al. (2017) and Agbo-Ajala and Viriri (2020). It is mainly employed to solve challenging facial analysis problems, including age estimation and gender recognition. However, AlexNet is computationally expensive and is outperformed by deeper models such as ResNet and GoogLeNet.

Figure 2 shows a representation of the network architecture.

Fig. 2

The AlexNet architecture (Krizhevsky et al. 2017)

2.2 Visual Geometry Group (VGGNet) architecture

The authors in Karen and Andrew (2015) proposed an improved CNN architecture, shown in Fig. 3. VGGNet is among the most popular CNN architectures in the literature. It uses very small convolution filters and was trained at increasing depths (16–19 layers) to obtain a state-of-the-art result on the “ImageNet” classification challenge. The architecture is often used for transfer learning (feature extraction and fine-tuning), as it shows an above-average ability to generalize to datasets it was not trained on. VGGNet has proven effective in age estimation, as observed in Rothe et al. (2015), Malli et al. (2016), Agustsson et al. (2017), Anand et al. (2017), Shara and Shemitha (2018), Qawaqneh et al. (2017), Rothe et al. (2018), Nam et al. (2020), Li et al. (2019), Liu et al. (2017a) and Antipov et al. (2016). However, training VGG from scratch is time-consuming and requires high computational power: the network is very slow to train, and its weights are quite large, mainly because of the “fully-connected” layers. VGG architectures also struggle to learn when randomly initialized and trained from scratch, because the networks are too deep for basic random initialization. To address this, Karen and Andrew (2015) developed a “pre-training” approach in which networks with fewer weight layers are trained first and their converged weights are used to initialize the deeper networks. However, training smaller variations of the architecture and then using the converged weights as initializations for the deeper versions is itself time-consuming, especially for deep networks with many fully-connected layers like VGG; it requires training and tuning the hyperparameters to achieve a good result. Nevertheless, the VGG architecture has proven suitable for generalization tasks. A smaller variation of VGG called “MiniVGGNet” is well suited to smaller datasets.
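The claim that the fully-connected layers dominate the weight size of VGG can be checked with a short parameter count. The sketch below assumes the standard 16-layer VGG configuration with 3 x 3 filters, a 224 x 224 input (so the last feature map is 7 x 7 x 512) and 1000 ImageNet classes; the helper names are our own and are not taken from any library.

```python
# Parameter count for a VGG-16 style network (assumed standard configuration).
# Each tuple is (input channels, output channels) of one 3x3 convolution layer.
CONV_LAYERS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (inputs, outputs): flattened 7x7x512 feature map -> 4096 -> 4096 -> 1000 classes.
FC_LAYERS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one kernel x kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


conv_total = sum(conv_params(i, o) for i, o in CONV_LAYERS)
fc_total = sum(fc_params(i, o) for i, o in FC_LAYERS)
total = conv_total + fc_total
fc_share = fc_total / total

print(f"convolutional parameters:   {conv_total:,}")
print(f"fully-connected parameters: {fc_total:,}")
print(f"share held by FC layers:    {fc_share:.1%}")
```

Under these assumptions, roughly nine out of ten parameters sit in the three fully-connected layers, which is consistent with the large weight files reported for VGG.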

Fig. 3

The VGGNet architecture (Karen and Andrew 2015)

2.3 GoogLeNet architecture

The authors in Szegedy et al. (2015) proposed GoogLeNet, an architecture deeper and wider than AlexNet. Its model weight is 28MB, which is lighter than VGG-16 (16 layers) and VGG-19 (19 layers). GoogLeNet uses “global average pooling” instead of the “fully-connected” layers found in earlier architectures, which dramatically reduces its weight size. Although GoogLeNet is a tiny architecture compared with AlexNet and VGGNet, it outperformed the VGG model in the 2014 edition of the “ILSVRC”. The model uses a “network in network” or “micro-architecture” when constructing the overall “macro-architecture”. Goodman et al. (2015) proposed an improved version of Inception (Inception V3) to further boost “ImageNet” classification accuracy, and this also influenced the performance of the works in Liao et al. (2018) and Liu et al. (2015b). For a smaller dataset (with smaller image spatial dimensions), which might require fewer network parameters, a simplified version of the “inception module” (with lighter layers) can be used. Figure 4 presents a diagrammatic representation of the GoogLeNet network architecture.
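To illustrate why “global average pooling” shrinks the classifier head, the following sketch pools a toy feature map and then compares the size of a dense layer fed by a flattened map with one fed by pooled channels. The 7 x 7 x 1024 feature map and the 1000 output classes are illustrative assumptions, and the function names are hypothetical.

```python
def global_average_pool(feature_maps):
    """Reduce each channel (an H x W grid of numbers) to its mean value."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def dense_params(n_inputs, n_outputs):
    """Weights plus biases of one fully-connected layer."""
    return n_inputs * n_outputs + n_outputs


# Toy feature map with 2 channels of size 2 x 2.
toy_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pool(toy_maps)  # one number per channel

# Assumed final feature map: 1024 channels of 7 x 7, classified into 1000 classes.
channels, height, width, classes = 1024, 7, 7, 1000
flatten_head = dense_params(channels * height * width, classes)
gap_head = dense_params(channels, classes)

print(pooled)
print(f"dense layer after flattening: {flatten_head:,} parameters")
print(f"dense layer after pooling:    {gap_head:,} parameters")
```

With these assumed sizes, pooling first cuts the head from about 50 million parameters to about 1 million.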

Fig. 4

The Inception module in GoogLeNet (Szegedy et al. 2015)

2.4 ResNet architecture

The ResNet architecture was developed by He et al. (2016a), primarily to improve on the performance of existing CNN architectures such as VGGNet, GoogLeNet and AlexNet. It introduces the concepts of the “residual module” and “identity mappings”, which were absent from earlier architectures, and achieves state-of-the-art results. ResNet is deeper than the VGG network but has a substantially smaller model size because it uses “global average pooling” in place of the “fully-connected” layers found in VGGNet. The strength of ResNet is the “residual module” introduced in He et al. (2016a). The module consists of two branches: the first is a shortcut that carries the input unchanged, and the second is a series of convolutions and activations; the outputs of the two branches are added together. It was found that the “bottleneck”, an extension of the “residual module”, performs better when training a deeper network. Furthermore, He et al. (2016a), in their updated publication, experimented with the ordering of the convolution, activation, and “batch normalization” layers within the “residual module” and found that updating the module to use “identity mappings” yields higher accuracy (Zhang et al. 2017). However, ResNet is computationally expensive because of its large depth, and it takes a long time to train. Shallower variations such as “ResNet-10”, “ResNet-18”, and “ResNet-34” also generalize on less difficult tasks but do not reach the accuracy of the deeper ones. ResNet is often used for the challenging task of facial age estimation. Figure 5a, b shows representations of the architecture.
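The two-branch structure of the “residual module” can be sketched in a few lines. This is a minimal illustration on plain vectors rather than images; `toy_branch` is a hypothetical stand-in for the stack of convolutions and activations and is not the layer stack used by He et al.

```python
def relu(vector):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in vector]


def residual_module(x, residual_branch):
    """Return x + F(x): the shortcut branch added to the residual branch."""
    fx = residual_branch(x)
    if len(fx) != len(x):
        raise ValueError("both branches must produce the same shape")
    return [a + b for a, b in zip(x, fx)]


def toy_branch(x):
    # Hypothetical stand-in for convolution -> activation layers:
    # scale every element by 0.5, then apply ReLU.
    return relu([0.5 * v for v in x])


x = [1.0, -2.0, 4.0]
y = residual_module(x, toy_branch)

# If the residual branch outputs zeros, the module reduces to an identity mapping.
identity_out = residual_module(x, lambda v: [0.0] * len(v))

print(y)
print(identity_out)
```

The second call shows the property that makes very deep stacks trainable: a module can fall back to passing its input through unchanged.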

Fig. 5

The original and updated residual module in ResNet architecture (He et al. 2016b)

2.5 SqueezeNet architecture

Iandola et al. (2017) proposed a lighter CNN architecture that is often used when a tiny footprint is needed. It is a tiny network compared with VGGNet, ResNet, AlexNet and GoogLeNet, with a model size of 4.9MB that can be further reduced to 0.5MB by model compression. SqueezeNet is often used when networks need to be trained and then deployed over a “network” and/or to “resource-constrained” devices. The network can be trained with a reduced number of parameters and still obtain a high level of accuracy. SqueezeNet uses the “fire module”, which relies on a reduce phase followed by an expand phase, using only \( 1 \times 1 \) and \( 3 \times 3 \) convolutions. The module keeps the volume size in the network small through its relatively low number of filters, and the network ends with “global average pooling”. Using “global average pooling” in place of “fully-connected (FC)” layers has the added benefit of greatly reducing the number of parameters required by the network. However, SqueezeNet is not included in the Keras core library, and the tiny nature of the architecture can affect its generalization performance. The fire module is shown in Fig. 6.
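The saving obtained from the “fire module” can be made concrete by counting parameters. The sketch below compares one fire module with a plain 3 x 3 convolution that produces the same number of output channels. The filter counts (96 input channels, 16 reduce filters, 64 + 64 expand filters) are illustrative assumptions, and the helper names are our own.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of one kernel x kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def fire_module_params(c_in, squeeze, expand1x1, expand3x3):
    """Parameters of a fire module: 1x1 reduce, then parallel 1x1 and 3x3 expand."""
    reduce_phase = conv_params(c_in, squeeze, 1)
    expand_phase = conv_params(squeeze, expand1x1, 1) + conv_params(squeeze, expand3x3, 3)
    return reduce_phase + expand_phase


# Assumed sizes: 96 input channels, 16 reduce filters, 64 + 64 = 128 output channels.
fire = fire_module_params(c_in=96, squeeze=16, expand1x1=64, expand3x3=64)

# A plain 3x3 convolution mapping the same 96 channels to 128 channels.
plain = conv_params(96, 128, 3)

print(f"fire module:      {fire:,} parameters")
print(f"plain 3x3 conv:   {plain:,} parameters")
print(f"reduction factor: {plain / fire:.1f}x")
```

The reduce phase is what makes the difference: the expensive 3 x 3 filters only see 16 channels instead of 96.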

Fig. 6

The fire module in SqueezeNet architecture (Iandola et al. 2017)

2.6 Xception architecture

Recently, Chollet (2017) proposed the “Xception” (“Extreme Inception”) network, an extension of the “Inception” architecture that replaces the Inception modules with “depthwise separable convolutions”. “Xception” is a convolutional neural network architecture based entirely on “depthwise separable convolution” layers, with a “weight serialization” of only 91MB. As presented in Fig. 7, the network has 36 “convolutional layers” structured into 14 “modules”, all of which have linear “residual connections” around them except for the first and last modules. Because “Xception” is a linear stack of layers with “residual connections”, it is easier to define and modify than the “Inception” architecture. On the “ImageNet” dataset, “Xception” outperforms not only “Inception V3” but also “ResNet-50”, “ResNet-101” and “ResNet-152” (He et al. 2016a). However, “Xception” is marginally slower than “Inception”.
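A “depthwise separable convolution” factorizes a standard convolution into a per-channel spatial filter followed by a 1 x 1 pointwise convolution. The following sketch counts the weights of both variants for one layer; the channel sizes (128 in, 256 out) are illustrative assumptions, biases are ignored, and the function names are hypothetical.

```python
def standard_conv_weights(c_in, c_out, kernel=3):
    """Every output channel has its own kernel x kernel filter over all input channels."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_weights(c_in, c_out, kernel=3):
    """One kernel x kernel filter per input channel, then a 1x1 pointwise mixing step."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumed layer: 128 input channels, 256 output channels, 3x3 kernels.
standard = standard_conv_weights(128, 256)
separable = depthwise_separable_weights(128, 256)

print(f"standard convolution:            {standard:,} weights")
print(f"depthwise separable convolution: {separable:,} weights")
print(f"reduction factor:                {standard / separable:.1f}x")
```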

Fig. 7

The Xception architecture (Chollet 2017)

A description of the current state-of-the-art CNN architectures is presented in Table 1 while a typical CNN structure for age estimation is shown in Fig. 8.

Fig. 8

A typical CNN structure for age estimation

Table 1 Description of state-of-the-art CNN architectures used in age estimation

3 An overview of facial aging databases

The availability of appropriate facial aging databases plays an important role in the evolution of age estimation research, because it allows researchers to begin research activities quickly. In research related to soft biometrics, good-quality age-separated face images are also of utmost importance, especially for the age estimation task (Panis et al. 2015). Some of the available facial aging databases are briefly described below:

3.1 Burt’s Caucasian database

Burt’s Caucasian database (Burt and Perrett 1995) was developed by Burt and Perrett to investigate “visual cues” to age by blending shape and colour information from many facial components. The database is made up of 147 face images of adult European males between the ages of 20 and 62 years. The images show a neutral expression, with shaved beards and without glasses or makeup. The facial images have 208 feature points that encode facial shape.

3.2 3D morphable database

The 3D morphable database (Vetter 1999) is a powerful representation of human faces. The database contains 3D scans of 200 adult faces (half male and half female) and 238 teenage faces (125 male and 113 female) aged between 8 and 16 years, with a resolution of 76,800 vertices per scan (320 by 240).

3.3 Face and Gesture Recognition Network (FG-NET) database

FG-NET (Panis et al. 2015) is a publicly-available aging database that was collected in 2004 to support research activities in human facial aging. The database contains 1,002 face images of 82 different individuals whose ages are between 0 and 69 years. These facial images were produced by scanning photographs found in the personal collections of different subjects; as such, they reflect significant variability in illumination, image resolution, and expression, among others. The images also show some degree of occlusion of different forms.

3.4 PAL database

The PAL face database (Minear and Park 2004) was collected at the University of Michigan by Meredith Minear and Denise Park to support research in the field of facial aging. It contains a total of 1142 face images of 576 subjects aged between 18 and 93. The database has four age groups: 18–29 (218), 30–49 (76), 50–69 (123), and 70+ (158).

3.5 Face Recognition Grand Challenge (FRGC) database

The FRGC database (Phillips et al. 2005) was collected in 2005 at the University of Notre Dame as a contribution to “multi-modal” data collection for biometric applications. The database contains about 50,000 face images of 568 different subjects with an age span between 18 and 70 years.

3.6 MORPH database

The MORPH database (Ricanek and Tesafaye 2006) is a publicly-available aging database collected by the face aging group at the University of North Carolina at Wilmington. The database records linked attributes such as age, gender, ethnicity, weight, height, and ancestry. It is divided into two albums: Album I and Album II. Album I contains a total of 1,724 face images of 515 individuals aged between 27 and 68 years. There are 1430 images of males and 294 images of females, with age gaps ranging from 46 days to 29 years. Album II includes 55,134 facial images gathered from more than 13,000 individuals.

3.7 Waseda human–computer Interaction Technology (WIT-DB) database

The WIT-DB database (Ueki et al. 2006) contains images of approximately 5500 different Japanese subjects, about 3000 males and 2500 females. There are 14,214 male face images and 12,008 female face images in total, with 1 to 14 image samples from each subject. WIT-DB contains “unoccluded frontal views” with a normal expression. The face images exhibit different degrees of variability in appearance, reflecting real-world images. The age labels range from 3 to 85 years and are divided into 11 age groups.

3.8 The AI & R Asian database

The AI & R Asian database (Fu and Zheng 2006) was collected for Asian face rendering and synthesis. It contains four different databases: “AI and R V1.0” for expression, “AI and R V2.0” for aging, “AI and R V3.0” for viewing, and “AI and R V4.0” for illumination. “AI and R V2.0”, the database for age estimation, has a total of 34 face images of 17 subjects aged from 22 to 61 years. All the face images are “640 by 480” in dimension with “24-bit” colour depth.

3.9 Iranian face database

The Iranian face database (Bastanfard et al. 2007) was collected in 2007 by the department of engineering, “the university of karaj”, for research activities in race detection, age estimation, and facial surgery, among others. The database contains 3600 coloured face images of 616 Iranian subjects, comprising 487 men and 129 women aged between 2 and 85. The facial images were captured against a white background in a controlled environment with a high-resolution digital camera.

3.10 UIUC-IFP database

The UIUC-IFP database (Guo et al. 2008b) is a non-publicly available database for age estimation. It contains a total of 8000 high-resolution colour face images of subjects aged between 0 and 93 years. It covers 1600 different Asian subjects (800 males and 800 females), with each subject contributing a minimum of “five frontal images”. The frontal images were collected in an uncontrolled environment, with varying illumination, makeup, and facial expression.

3.11 Lotus Hill Research Institute (LHI) face database

The LHI face database (Suo et al. 2008) contains 8,000 coloured face images of Asian individuals, half female and half male, with one image per subject. The dataset covers an age range between 9 and 89 and contains an average of 100 images per age. The face images have a resolution of about “120 by 160” pixels, with small pose and illumination variations.

3.12 Gallagher’s web-collected Database

Gallagher’s web-collected database (Gallagher and Chen 2009) is a collection of images from “flickr.com image search engine”. It contains 5080 images of 28,231 faces that are labeled with gender and age. The database has a total of seven different age groups: 0–2, 3–7, 8–12, 13–19, 20–36, 37–65, and 66+.

3.13 Ni’s web-collected database

Ni’s web-collected database (Ni et al. 2009) is one of the largest reported for facial aging. It is a collection of 219,892 faces in 77,021 images crawled from the Google image search engine and “flickr.com”. The database has age labels from 1 to 80 years.

3.14 Human & Object Interaction Processing (HOIP) database

The HOIP face database (Fu et al. 2010) contains 306,600 images of 300 different subjects aged between 15 and 64 years. The database has 10 age groups, each spanning five years, distributed from 15 to 64 years.

3.15 Biometric Engineering Research Center (BERC) Database

The BERC database (Choi et al. 2011) is a facial aging database collected and developed by the “Biometric Engineering Research Center” (BERC). The database consists of face images of 390 individuals whose ages range between 3 and 83 years. The images are of high resolution (\(3648 \times 2736\ \hbox {pixels}\)), with no variations in facial expression and illumination. The subjects are uniformly distributed across age and gender.

3.16 Kyaw’s web-collected Database

Kyaw’s web-collected database (Kyaw et al. 2013) was created from images gathered from the internet using the Microsoft image search application programming interface. The images were prepared by aligning the eye corner points and cropping to an image size of 65 by 75. The database contains a total of 963 different images divided into four age groups: 3–13, 23–33, 43–53, and 63–73.

3.17 OIU-Adience database

The OIU-Adience database (Eidinger et al. 2014) is a collection of face images from real-life, unconstrained environments. It reflects the features expected of images collected in challenging real-world scenarios: the face images were uploaded to the “flickr” website from smartphones without any filtering. “Adience” images therefore display a high level of variation in noise, pose, and appearance, among others. The database is used in studying age and gender classification systems. The entire collection comprises about 26,580 face images of 2284 subjects, labeled with eight age groups: 0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 48–53, 60+.

3.18 Cross-Age Celebrity Database (CACD)

CACD is a publicly-available cross-age face dataset, collected by Chen et al. (2015) in 2015. It is a collection of over 160,000 face images of “2000 celebrities” with their age ranging from 16 to 62 years.

3.19 Asian Face Age Database (AFAD)

AFAD, a large and publicly-available dataset, was collected by Niu et al. (2016). The dataset comprises 164,432 facial images of people of Asian descent with exact age ground truths. It consists of 63,680 images of females and 100,752 images of males, with ages ranging from 15 to 40 years. AFAD images were collected from the “RenRen social network”, a network on which students connect with each other and share pictures.

3.20 AmI-Face database

The AmI-Face dataset (Anand et al. 2017) consists of face images captured in a laboratory at different distances using a Microsoft LifeCam HD-3000. The face images were collected at rotations of 0\(^{\circ }\), 22\(^{\circ }\), 45\(^{\circ }\), 75\(^{\circ }\), and 90\(^{\circ }\), and were acquired under different non-ideal conditions. The AmI-Face dataset is composed of 4535 face images of 16 different individuals.

3.21 IMDb-WIKI Database

The IMDb-WIKI database (Zhang et al. 2017) is the largest publicly available dataset for age estimation of people in the wild, containing more than half a million images with accurate age labels between 0 and 100 years. The images were crawled from IMDb and Wikipedia: the IMDb part contains 460,723 images of 20,284 celebrities, and the Wikipedia part contains 62,328 images. Because the images are obtained directly from the websites, the dataset contains many low-quality images, such as “human comic” images, sketch images, severely masked faces, full-body images, multi-person images, and blank images.

3.22 APPA-REAL database

The APPA-REAL database (Agustsson et al. 2017) is the first state-of-the-art database with both real and apparent age labels. The images were collected using a labeling application, “crowd-sourcing” data collection, data from the “AgeGuess platform”, and the assistance of “Amazon Mechanical Turk” (AMT) workers. APPA-REAL contains a total of 7591 images with real and apparent age annotations. The subjects are aged between 0 and 95, and the images were taken under different conditions, which makes the database more challenging for recognition purposes.

3.23 AgeDB database

AgeDB (Moschoglou et al. 2017) was collected for research in age estimation, age-invariant face verification and face age progression. The manually collected face images reflect the variability expected of images captured under completely uncontrolled, in-the-wild conditions; the variations appear in expression, pose, occlusion, etc. AgeDB contains a total of 16,488 images of actors, actresses, politicians, among others. All the images are labeled with identity, age and gender attributes, covering a total of 568 different subjects aged between 1 and 101 years.

3.24 UTKFace database

UTKFace (UTKFace 2017) is a large face dataset with about 20,000 face images labeled with age, gender, and ethnicity. It has an age range of 0 to 116 years. The images cover large variations in facial expression, pose, illumination, resolution, and occlusion, among others. The dataset could be useful for research in age estimation, face detection, landmark localization, age progression/regression, etc.

3.25 Summary

The MORPH-II, IMDb-WIKI, OIU-Adience, CACD, AFAD, WIT-DB, HOIP, Gallagher’s web-collected, Ni’s web-collected, AgeDB, and UTKFace databases are the most suitable for age estimation using CNN techniques because of their large size. Fig. 9 shows that MORPH-II is the most frequently used database in age estimation with CNN techniques; it is the most appropriate dataset for estimating the ages of faces taken in a controlled environment, while OIU-Adience and IMDb-WIKI are well suited to classifying the age of face images from an uncontrolled real-world environment. Samples of the images in the IMDb-WIKI, OIU-Adience and MORPH-II datasets are presented in Fig. 10. Table 2 also presents a summary of the suitability of all the datasets studied in this paper.

Fig. 9

A graph of aging databases by the number of publications

Fig. 10

Samples of images in IMDb-WIKI, Adience and MORPH-II datasets

Table 2 Summary of the Popular Facial Aging Databases

4 Description of age estimation algorithms

In this section, we present different algorithms and techniques used in facial age estimation. As shown in Fig. 11, most of these techniques fall into five categories: age estimation can be modeled as multi-class classification (MC), metric regression (MR), ranking, deep label distribution learning (DLDL), or a hybrid (a combination of two or more techniques). We describe these algorithms and suggest the approach that is most effective in our opinion.

Fig. 11

Classification of facial age estimation algorithms. The typical age estimation methods are categorized into five different algorithms

4.1 Multi-class Classification (MC)

The multi-class classification approach treats each age value or age group as an independent class label and learns an age classifier to infer a person’s age (Feng et al. 2017; Zhu et al. 2015; Malli et al. 2016). The MC algorithm maximizes the probability of the ground-truth class label without considering the other classes. Moreover, the limited training samples and the class imbalance in most facial aging datasets can result in overfitting (Gao et al. 2018).
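The consequence of ignoring the other classes can be seen with the cross-entropy loss that is commonly used to train such classifiers (its use here is our assumption, not a detail given above). In the sketch below, two hypothetical predictions assign the same probability to the true age class but place the remaining mass very differently, and the loss cannot tell them apart.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probabilities[true_class])


# Five ordered age classes (index 0 = youngest); the true class is index 1.
true_class = 1

# Hypothetical prediction A puts most of the wrong mass on the neighbouring class.
near_miss = [0.05, 0.40, 0.50, 0.05, 0.00]
# Hypothetical prediction B puts the same mass on the most distant class.
far_miss = [0.05, 0.40, 0.00, 0.05, 0.50]

loss_near = cross_entropy(near_miss, true_class)
loss_far = cross_entropy(far_miss, true_class)

print(loss_near, loss_far)
print("losses identical:", loss_near == loss_far)
```

Both predictions receive the same penalty even though the second one is far worse as an age estimate, which is the motivation for the regression, ranking and distribution-based formulations that follow.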

4.2 Metric Regression (MR)

The metric regression-based algorithm treats age as a linearly progressing value and therefore does not capture the diversity of the aging process. It learns the mapping from the feature space to the age-value space using an appropriate regularization method. It is quite natural to address age estimation as an MR problem, since this directly minimizes the Mean Absolute Error (MAE) and can improve estimation accuracy. However, MR can produce unstable training, with large error terms that affect accuracy. Typical regression methods include the Gaussian Process (Zhang and Yeung 2010), quadratic regression (Lanitis et al. 2004), and Support Vector Regression (Guo et al. 2008a).
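For reference, the Mean Absolute Error mentioned above is the average of the absolute differences between the true and the predicted ages. A minimal sketch with made-up ages:

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages, in years."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("both lists must have the same length")
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)


# Made-up example values, not taken from any dataset.
true_ages = [25, 40, 31, 60]
predicted_ages = [27, 38, 31, 65]

mae = mean_absolute_error(true_ages, predicted_ages)
print(f"MAE = {mae} years")
```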

4.3 Deep Label Distribution Learning (DLDL)

The DLDL approach converts a real-valued age into a discrete age distribution and fits the entire age distribution. It is an end-to-end learning model that addresses the shortage of training images experienced in most age estimation tasks. By converting each real age value into a discrete age distribution, it relaxes the demand for a large number of training images and copes with uneven data distribution: the number of training instances associated with each class label increases without any increase in the number of training samples (Shen et al. 2017; Gao et al. 2018). However, the training objective is often inconsistent with the evaluation metric employed, which can lead to unsatisfactory results.
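The conversion from a single age label to a label distribution can be sketched as follows. We assume a normalized Gaussian centred on the true age with a standard deviation of 2 years; both the shape and the width are illustrative choices, and the function name is hypothetical.

```python
import math


def gaussian_label_distribution(true_age, ages, sigma=2.0):
    """Spread a single age label over neighbouring ages with a normalized Gaussian."""
    weights = [math.exp(-((a - true_age) ** 2) / (2 * sigma ** 2)) for a in ages]
    total = sum(weights)
    return [w / total for w in weights]


ages = list(range(0, 101))  # assumed label space: 0 to 100 years
dist = gaussian_label_distribution(30, ages)

# One image labeled 30 now also contributes to the classes for ages 28, 29, 31, 32, ...
for age in range(27, 34):
    print(age, round(dist[age], 4))
```

Because each image now carries non-zero weight for the ages adjacent to its label, sparsely populated age classes receive training signal from their neighbours.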

4.4 Ranking

The ranking-based algorithm treats ages as ordered positions along the age axis and exploits the relative order of ages. It uses relative age ranks instead of real age labels and ranks age class labels in descending order of their relevance to the presented face image, which avoids making a separate decision for each age label and thereby simplifies the problem (Chang et al. 2010; Li et al. 2012; Liu et al. 2017b). Nonetheless, a ranking algorithm can generate suboptimal results, especially when the training objective and the evaluation metric are not consistent.
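One common way to exploit the order of ages is to replace the age label with a series of binary questions of the form "is this person older than k?". The sketch below shows such an encoding and its decoding; it is an illustrative formulation under our own assumptions (age range, 0.5 threshold, function names) and not necessarily the exact scheme of the cited works.

```python
def encode_ordinal(age, min_age, max_age):
    """One binary label per threshold k in [min_age, max_age): 1 if age > k, else 0."""
    return [1 if age > k else 0 for k in range(min_age, max_age)]


def decode_ordinal(probabilities, min_age, threshold=0.5):
    """Predicted age = min_age + number of thresholds judged to be exceeded."""
    return min_age + sum(1 for p in probabilities if p > threshold)


# Assumed age range 15 to 20, so the thresholds are 15, 16, 17, 18, 19.
labels = encode_ordinal(18, min_age=15, max_age=20)

# Hypothetical outputs of the five binary classifiers for some face image.
outputs = [0.97, 0.91, 0.66, 0.30, 0.05]
predicted_age = decode_ordinal(outputs, min_age=15)

print(labels)
print(predicted_age)
```

A mistake on one binary question shifts the prediction by only one year, so errors stay close to the true age.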

4.5 Hybrid

A hybrid algorithm is built by combining two or more algorithms in a parallel or hierarchical manner to produce better performance. It exploits the strengths of each individual algorithm to obtain a more robust system (Choi et al. 2011; Dib and El-saban 2010; Guo et al. 2008a). Unfortunately, combining two or more algorithms can result in large storage overhead and computational cost, which limits its applicability on resource-constrained machines.

4.6 Summary of age estimation algorithms

In this section, we summarize the main strengths and weaknesses of the different age estimation algorithms. As presented in Table 4, most of the existing state-of-the-art methods used multi-class classification and hybrid algorithms. The hybrid algorithm combines two or more algorithms, which gives a better and more robust model by compensating for the weak points of each algorithm with the strengths of the others. The ranking algorithm, on the other hand, addresses a problem peculiar to the classification algorithm by using the ordinal information of the ages to convert the task into a series of binary classification problems. Metric regression makes predictions by learning the relationship between the features of the data and the target values. With deep label distribution learning, however, we can obtain a better model by using the adjacent ages to generate a label distribution for each age, even when the label distribution of the dataset is uneven. Table 3 provides a summary of these state-of-the-art algorithms, stating their strengths and weaknesses.

Table 3 Description of state-of-the-art algorithms used in age estimation

5 A review of age estimation studies using CNN

Levi and Hassncer (2015) developed a shallow CNN architecture that used “three convolutional layers and two fully connected layers” to learn feature representations. The simple architecture was motivated by the desire to reduce overfitting, which can occur when training data is limited owing to the scarcity of large facial aging databases. They also applied “dropout regularization” and “data augmentation” to further limit the risk of overfitting. In spite of the simplicity of the network layout, it outperformed the existing state-of-the-art methods when evaluated on the challenging “Adience benchmark for age and gender estimation”.

Liu et al. (2015b) proposed “AgeNet”, an “end-to-end learning approach” for apparent age estimation. The network addressed the apparent age estimation problem by fusing “Gaussian label distribution based” classification models and “real value based regression” models. For both models, a large-scale deep CNN was used to learn age feature representations. They also developed a deep transfer learning scheme to mitigate overfitting. The approach won second place at the “ChaLearn 2015 apparent age competition”, demonstrating that AgeNet achieved state-of-the-art performance in apparent age estimation.
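To make the notion of a Gaussian label distribution concrete, the following sketch converts a single age label into a discrete distribution over all candidate ages, so that neighbouring ages also receive some probability mass. It is a minimal illustration under assumed settings (a 0 to 100 age range and a standard deviation of 2 years) and is not the exact encoding used by the cited authors.

```python
import math


def gaussian_label_distribution(age, sigma=2.0, min_age=0, max_age=100):
    """Return a discrete Gaussian centred on `age`, normalized to sum to 1.
    `sigma`, `min_age`, and `max_age` are illustrative assumptions."""
    weights = [
        math.exp(-((k - age) ** 2) / (2.0 * sigma ** 2))
        for k in range(min_age, max_age + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


dist = gaussian_label_distribution(30)
# The peak sits at the labeled age; adjacent ages share the remaining mass.
print(max(range(len(dist)), key=dist.__getitem__))  # 30
```

A network trained against such a target is penalized less for predicting 29 or 31 than for predicting 60, which reflects the ambiguity of apparent age labels.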

To obtain a more reliable system for unconstrained environments, Rothe et al. (2015) employed a deep learning method, “Deep EXpectation (DEX) of apparent age”, to estimate apparent age from still face images. The architecture is a VGG-16 network that was initially pretrained on the ImageNet dataset for image classification. They developed the largest public facial aging database by collecting about 500,000 face images of celebrities from the IMDb website and Wikipedia. They also fine-tuned the model on face images labeled with apparent age for better accuracy. Their method poses age regression as a deep classification problem followed by a “softmax expected value refinement”, which improved over direct regression training of CNNs. The DEX network ensembles the predictions of “20 networks” on the cropped face image and does not explicitly use facial landmarks. The method won 1st place at the ChaLearn LAP 2015 challenge on apparent age estimation, outperforming other state-of-the-art methods.
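The “softmax expected value refinement” can be sketched as follows: the classifier outputs one score per age class, the scores are turned into probabilities with a softmax, and the predicted age is the probability-weighted mean of the class ages. The code is a minimal sketch of that computation only; the function names and the toy inputs are hypothetical, and the network that produces the scores is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(logits, ages):
    """Predicted age as the expectation of `ages` under softmax(logits)."""
    probs = softmax(logits)
    return sum(p * a for p, a in zip(probs, ages))


# Toy example with three age classes and equal scores.
print(expected_age([0.0, 0.0, 0.0], [20, 30, 40]))
```

Compared with taking the single most probable class, the expectation yields a continuous estimate and uses the information in the neighbouring classes.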

Ranjan et al. (2015) approached age estimation from unconstrained images with deep convolutional neural networks (DCNNs). The approach consists of four steps: “face detection”, “alignment”, “deep feature extraction”, and “3-layer neural network regression”. The method obtained the needed features from the pool of a pretrained DCNN model and adopted a “gaussian loss function” with a “3-layer neural network regression” model for the age estimation task, before adopting a “hierarchical learning” approach to enhance the method. The results showed that the “gaussian loss function” and the proposed “3-layer neural network regression” model outperformed the conventional “linear” model for age estimation.

Huo et al. (2016) proposed a method using deep CNNs with “distribution-based (KL divergence) loss functions”. The architecture consists of two deep CNNs of different designs as two streams: VGG-16 and a novel architecture. The VGG-16 stream was fine-tuned on three different datasets, while the novel CNN was trained on several types of inputs produced by different augmentation methods. Each deep CNN model was pre-trained on different datasets before being fine-tuned on the competition dataset. The results of the two models were fused to obtain the final predicted ages. Their approach achieved an \( \epsilon \)-error of 0.3057, which earned fourth place at the “ChaLearn 2015 apparent age competition”. The model achieved a stronger prediction because it used an extra 119,539 face images and other public facial datasets for training.
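A distribution-based loss of this kind compares the predicted age distribution with a target label distribution. The sketch below computes the Kullback-Leibler divergence between two discrete distributions; it is a generic illustration of the loss, not the authors' implementation, and the small constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions of equal length.
    Terms with zero target probability contribute nothing to the sum."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


target = [0.1, 0.8, 0.1]      # label distribution peaked at the middle age
good = [0.15, 0.7, 0.15]      # prediction close to the target
bad = [0.7, 0.2, 0.1]         # prediction peaked at the wrong age
print(kl_divergence(target, good) < kl_divergence(target, bad))  # True
```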

Gurpinar et al. (2016) proposed a deep learning method for apparent age estimation from facial images. The method classified samples into different “overlapping age-groups”. Each age-group was then estimated with “local regressors”, and the group-level outputs were fused for the final estimate. They utilized “Kernel extreme learning” machines for the classification. The proposed model was evaluated on the apparent age estimation dataset of the “ChaLearn LAP 2015 challenge” and achieved 7th position with a normal score of 0.374 on the “sequestered test set”. This showed that “local regressors” can perform better than a “global regressor” for almost all groups.

Antipov et al. (2016) developed a solution using a pretrained VGG-16 convolutional neural network. They trained the network on the huge IMDb-WIKI dataset and then fine-tuned it on the small dataset provided at the competition. They showed that accurate age estimation of children is the keystone of the competition. In view of that, they developed a separate VGG-16 network for children between 0 and 12 and trained it for apparent age estimation; the “children network” was separated from the “general” one. They employed separate “age encoding” strategies for training the two networks: the strict one for the “children” network and “label distribution encoding” for the “general” network. The approach won first place at the 2016 edition of the ChaLearn LAP apparent age estimation challenge.

Liu et al. (2015a) proposed a “Multi-Region Convolutional Neural Network” (MRCNN) for facial age estimation. The proposed method makes use of “multiple sub-regions” of the face that contain rich age information, and joins these “multiple face sub-regions” together for age estimation. The method utilized “8 networks” and constructed “8 sub-network” structures before fusing them at the feature level. The proposed model has two benefits: the “8 sub-networks” learn the unique age characteristics of the corresponding sub-regions, and the “8 networks” are packaged together to complement one another with age-related information. The method achieved state-of-the-art performance when evaluated on the MORPH-II database.

Malli et al. (2016) developed an ensemble of deep learning networks for apparent age estimation. They employed a fine-tuned VGG-16 convolutional neural network architecture that was pretrained on the IMDb-WIKI database. They observed that the apparent age differs from the real age of individuals: the real age is associated with a single age label, while the apparent age has multiple age labels associated with the face image. To address this, they grouped together the face images that fall within a defined age range. They trained an ensemble of deep learning models using these age groups and their age-shifted groupings, then combined the outputs of those models to obtain the final estimate. They addressed the imbalance in the age distribution of the dataset by using “adaptive data augmentation”.

Based on ordinal regression and deep learning, Niu et al. (2016) proposed an “end-to-end learning” approach to address the difficulty associated with “ordinal regression problems”; the first work to address “ordinal regression” problems through CNN. The approach employed deep CNN which could conduct “feature learning” and “regression” modeling at the same time. The proposed method is a multiple output CNN learning algorithm that collectively solves the series of “ordinal regression” sub-problems. As part of the solutions, they developed a dataset “Asian Face Age Dataset (AFAD)” with about 160K facial images with exact age. The approach achieved the state-of-the-art performance when evaluated on MORPH-II and AFAD datasets.

Agustsson et al. (2017) proposed a “deep Residual Deep Expectation” (DEX) method that can improve the performance of the original “DEX regressors” on age estimation tasks. The “original regressors” give a rough estimate of the age by extracting robust features from the input face image, whereas the proposed “residual regressors” model the residuals between the “rough DEX estimation” and the “ground truth labels” through a “specialized” model. The new “regressor” model corrects the rough estimate and improves on the performance of the “original DEX” in the age estimation task. As part of their solution, they developed a large face image dataset, “APPA-REAL”, with both real and apparent age annotations.

Anand et al. (2017) applied “post-processing” approaches to improve the performance of pre-trained deep networks. The method extracted features from the input face image using pre-trained CNNs. It then implemented a “feature level fusion” and reduced the dimension of the feature space before finally estimating the age of the individual using a “Feed-Forward Neural Network” (FFNN). The age estimation method achieved better results than the state-of-the-art techniques when evaluated on the “Adience benchmark of unfiltered faces for gender and age estimation” and on a private (AmI-Face) dataset.

Aydogdu and Demirci (2017) proposed an “optimized deep CNN” architecture for the age estimation task. The proposed CNN architecture consists of four convolutional layers and two fully connected layers. The architecture was evaluated on the MORPH-II database, where it outperformed the other CNN architectures in their study on the “exact success”, “top-3”, and “1-off” criteria and on “standard deviation” values.

Based on the ranking approach, Chen et al. (2017) developed a novel CNN-based architecture, “ranking-CNN”, for age estimation. The architecture has a series of basic CNNs trained on “ordinal age labels”. The binary outputs of those CNNs are aggregated for the final age estimation. Through extensive empirical experiments, they demonstrated that their proposed method resulted in smaller estimation errors than “multi-class” classification techniques. The “ranking-CNN” method outperformed other state-of-the-art age estimation models on benchmark datasets.

To overcome the problem of model invariance caused by datasets containing data at various scales, Liu et al. (2017a) proposed a “Group-Aware Deep Feature Learning” (GA-DFL) technique for facial age estimation. The “GA-DFL” method extracted the features required for face description by learning a “discriminative feature descriptor” directly from the raw pixels. To smooth the transition between adjacent age groups, they introduced an overlapped coupled learning method. They also employed a “multi-path” deep CNN architecture to integrate multiple-scale information into the learned face representation, which further improved the performance of the method. They assessed the proposed method on three publicly available facial age estimation datasets obtained in both controlled and uncontrolled conditions, and it performed better than most state-of-the-art facial age estimation methods.

Liu et al. (2018, 2019) proposed an “Ordinal Deep Feature Learning” (ODFL) method for facial age estimation. ODFL uses a deep CNN to learn “age-adaptive” face descriptors that exploit the “topology-aware ordinal relation” for face description. To achieve this, they ensured that the “topology-aware ordinal relation” of face images is maintained in the learned feature space, and that the age distinction information of the embedded feature representation is employed in a “ranking-preserving” way. They evaluated the method on four publicly available facial age estimation datasets, and it showed encouraging performance compared with the current state-of-the-art methods.

Qawaqneh et al. (2017) employed a VGG-Face network model that had been trained on a database for the face identification task. The deep CNN architecture consists of 11 layers: eight convolutional layers and three fully connected layers, with each “convolutional” layer followed by a “rectification” layer, and a “max-pool” layer at the end of each convolutional block. The study also investigated a “GoogLeNet” architecture trained on a very large database with millions of training images, but it could not outperform the proposed VGG model. The VGG-Net CNN was further fine-tuned and modified to perform the age estimation task.

Zhang et al. (2017) proposed a new CNN-based method, “Residual Networks of Residual Networks (RoR)”, for age group and gender estimation in the wild. The RoR model was first pretrained on the ImageNet dataset and then fine-tuned on the IMDb-WIKI-101 and Adience databases to improve its ability to learn from face images. They evaluated the effectiveness of the proposed RoR method for age and gender estimation on the popular Adience benchmark.

To overcome the problem of insufficient training data, Gao et al. (2018) proposed a CNN-based method built on label distribution. They designed a “lightweight” network architecture with few parameters (0.9M), which reduces computation cost and storage overhead. The proposed method is a “unified” structure that jointly learns the age distribution and regresses the age. The model was designed by unifying two existing state-of-the-art age estimation methods into a single DLDL framework. They also proposed a “DLDL-v2” framework, which eases the discrepancy between the training and evaluation stages by jointly learning the age distribution and regressing a single age, with both shallow and deep network structures. The approach achieved results comparable to or better than the state-of-the-art methods on apparent and real age estimation tasks when evaluated on the LAP2016 and MORPH-II datasets.

Duan et al. (2018a) introduced a “hybrid structure” of CNN and “Extreme Learning Machine” (ELM) in a “hierarchical” style for age estimation. The “hybrid architecture” used the CNN to extract features from the input images, while the ELM classified the “intermediate results”. They evaluated the hybrid structure on two popular datasets, MORPH-II and the Adience benchmark. The experiments showed that the hybrid framework performed better than other results reported on the same face aging datasets.

Further, Duan et al. (2018b) proposed an ensemble structure referred to as “CNN2ELM”, which includes CNN and “Extreme Learning Machine” (ELM) for age estimation. The model is a modification of the method used by Duan et al. (2018a). It is a three-level model comprising “feature extraction” and fusion, age grouping through an “ELM classifier”, and age estimation through an “ELM regressor”. They trained three networks to extract features related to age, gender, and race from the same image of a person during the test and validation stages. Features associated with the age property were improved by fusing the race and gender features. Then, to obtain a narrow age span, the ELM classifies the fusion results into one of the age groups. Subsequently, the age is determined using an “ELM regressor”. They pretrained the network on the “ImageNet” database and then fine-tuned it on the “IMDb-WIKI” database. They evaluated the proposed network on the Adience benchmark, ChaLearn LAP2016, and MORPH-II, where it outperformed the current state-of-the-art methods on age estimation tasks. It achieved sixth position in the final results of the “ChaLearn Looking At People 2016 apparent age estimation challenge”.

Liao et al. (2018) proposed an “AgeNet” and “divide-and-rule” architecture to estimate age. The “AgeNet” is a CNN-based network used to extract a face age descriptor, while the divide-and-rule strategy was employed for face age estimation. The “AgeNet” model used an approach based on regression and classification to construct a deep CNN for age estimation. The network is a robust face age feature extractor with a superior image representation capacity. The proposed “divide-and-rule” learning model was designed to resolve the “ordinal regression” problem associated with the age estimation task. Experimental results on FG-NET, MORPH-II, and IMDb-WIKI showed that the “AgeNet” method and the “divide-and-rule” age estimator achieved better results than conventional age estimation methods.

Also in Shara and Shemitha (2018), the authors proposed a multiple deep CNN that is based on VGG-face network for the facial age estimation. The age estimation method involved three different phases: a “training” phase, “feature extraction” phase, and “testing” phase. They also collected more than 10,000 age-labeled face images. They exploited the age information from the face images of age difference through the deep CNN-based model. The method utilized the “symmetric Kullback–Leibler divergence loss function” at the top layer of the model and used the “label distribution” for the loss function. The performance of the method was evaluated on the privately-collected images.

Later, Rothe et al. (2018), used “Deep EXpectation” (DEX); a deep learning solution that is based on VGG-16 architecture, to solve real and apparent age estimation from a single face image without the use of facial landmarks. They also introduced the IMDb-WIKI dataset, the largest public dataset of face images with age and gender annotations. The DEX model was initially pre-trained on both “ImageNet” and “IMDb-WIKI” datasets in order to achieve better performance. DEX learned from large data, utilizing a robust face alignment before formulating an expected value for age regression. They validated the DEX method on standard benchmarks: MORPH-II, FG-NET, and LAP2015 and it achieved a state-of-the-art result for both real and apparent age estimation.

Liu et al. (2018) developed a CNN architecture based on the “multi-class focal loss function” to improve age estimation performance. Specifically, they addressed class imbalance by reshaping the standard “cross entropy loss” so that it down-weights the loss assigned to well-classified samples; this targets the severe class imbalance among different age categories. They validated the approach on the Adience benchmark, where the proposed model showed a significant improvement in age estimation performance.
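The down-weighting effect of a focal loss can be illustrated with the commonly used form \(-\alpha (1-p_t)^{\gamma } \log p_t\), where \(p_t\) is the predicted probability of the true class. The sketch below assumes this standard form and illustrative values for \(\gamma \) and \(\alpha \); the exact variant and hyperparameters of the cited work are not reproduced here.

```python
import math


def cross_entropy(probs, true_class):
    """Standard cross-entropy loss for one sample."""
    return -math.log(probs[true_class])


def focal_loss(probs, true_class, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: cross entropy scaled by (1 - p_t) ** gamma.
    `gamma` and `alpha` are illustrative defaults, not values from the source."""
    p_t = probs[true_class]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [0.9, 0.05, 0.05]   # well-classified sample (true class 0)
hard = [0.2, 0.5, 0.3]     # misclassified sample (true class 0)
# Fraction of the cross-entropy loss that survives the focal scaling:
print(focal_loss(easy, 0) / cross_entropy(easy, 0))
print(focal_loss(hard, 0) / cross_entropy(hard, 0))
```

With \(\gamma = 2\), the easy sample keeps about 1% of its cross-entropy loss while the hard sample keeps 64%, so training is dominated by the poorly classified, often under-represented, age groups.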

Li et al. (2019) proposed a CNN based technique, BridgeNet, for age estimation. The proposed model comprises two components; local regressors and gating networks that can jointly be learned in an end-to-end way. The first component (local regressors) addressed heterogeneous data by partitioning the data space. In contrast, the second one (gating networks), employed a bridge-tree structure that learns the continuity-aware weights used by the local regressors. Experimental results on the MORPH II, FG-NET, and Chalearn LAP 2015 datasets proved the CNN model to be effective, outperforming the state-of-the-art methods.

Liu et al. (2019) then extended their earlier work (Liu et al. 2018). The work is an end-to-end ordinal deep learning (ODL) framework that includes two ordinal regression loss functions: square loss and cross-entropy loss. The proposed ranking-based ordinal deep feature learning (ODFL) method learns the features needed for face representation directly from raw image pixels and then learns the procedures of feature extraction and age estimation independently. The work was evaluated on state-of-the-art face aging datasets and achieved superior performance compared with the state-of-the-art methods in age estimation.

Zhang et al. (2019) also proposed a novel method, recurrent age estimation (RAE). The CNN-based method makes use of the appearance features and the personalized aging patterns of input face images. RAE used an architecture that combines CNN and Long Short-Term Memory (LSTM) networks: the CNN is trained to extract discriminative appearance features from face images, while the LSTM network learns the personalized aging patterns from sequences of face features. Furthermore, to exploit the ambiguity between the real age and adjacent ages, the authors employed label distribution learning (LDL), which improved the experimental results by mitigating the overfitting caused by small datasets. The experimental results show that RAE outperformed the existing approaches when evaluated on the MORPH-II and FG-NET datasets.

Nam et al. (2020) addressed age estimation from low-resolution facial images with a deep CNN-based model that reconstructs low-resolution faces as high-resolution faces. The solution uses a conditional generative adversarial network (GAN) to pre-process the low-resolution facial images before they are used as input. The model then applies state-of-the-art CNN architectures such as ResNet, VGG, and DEX to estimate the age from the reconstructed facial images. Experimental results on the PAL, MORPH, and FG-NET databases demonstrate the effectiveness of the proposed method in high-resolution reconstruction, and it achieves state-of-the-art results in age estimation from low-resolution images.

Further, Liu et al. (2020) developed a lightweight CNN (ShuffleNetV2) based on a mixed attention mechanism (MA-SFV2). The model, Mixed Attention-ShuffleNetV2, transforms the output layer so that age estimation is modeled as a classification problem (which treats each age as a separate label), a regression problem (which ranks the ages of human faces in a particular order), and a distribution learning problem (which considers the correlation between adjacent ages). The model includes image pre-processing that reduces the influence of noise vectors, and data augmentation methods such as filtering, sharpening, and histogram enhancement that enlarge the training set and alleviate overfitting of the network. The model thus combines classification, regression, and distribution learning for the age estimation task. The experimental results on the MORPH-II and FG-NET datasets support the applicability of the model in real-life situations, especially on mobile terminals.

Recently, Agbo-Ajala and Viriri (2020) developed a CNN-based model to classify unconstrained real-life face images by age and gender. The approach included an image preprocessing algorithm that prepares the input images and a CNN architecture that performs feature extraction and classifies the images into age and gender groups. The approach was evaluated on the OUI-Adience dataset, where the results confirmed its effectiveness and outperformed other studies on the same dataset.

Table 4 Summary of state-of-the-art methods and performance

6 Discussion

Age estimation can be addressed as exact age estimation or age-group estimation. Exact age estimation assigns an exact age label to a face image, while age-group estimation predicts the age range into which a facial image falls. As presented in Table 3, the current techniques for age (group) estimation fall into five classes: “multi-class” classification, “metric regression”, “ranking”, “Deep Label Distribution Learning” (DLDL), and “hybrid” (combining two or more modeling methods). Although the choice among these approaches may be guided by the complexity of the problem, the size of the dataset, and its age distribution, it is evident from the literature that no firm conclusion can be drawn on the best class of deep learning approach for the age estimation task. For huge and “evenly-distributed” datasets, any of the approaches can be employed. For datasets with imbalanced (uneven) age label distribution, age-group labels, or insufficient training images, ranking and DLDL based approaches may be more suitable. Combining two or more modeling approaches in a parallel or hierarchical manner tends to produce better performance than a single CNN method. The hybrid method exploits the strengths of each technique used and is expected not only to outperform the individual approaches but also to be more robust.

It is important to use established metrics to measure the extent to which alternative solutions, models, or approaches are able to meet expectations or goals through direct measurements of their strengths, deficiencies, and trade-offs. The standard evaluation metrics associated with facial age estimation are:

Mean Absolute Error (MAE) is one of the most commonly used metrics for evaluating age estimation methods, and it is the most suitable when images for some ages are missing from the training data. The MAE is the average of the absolute errors between the estimated ages and the ground truth ages (Onifade and Akinyemi 2014). As shown in Eq. (1), it measures the average performance of the age estimation technique. The smaller the MAE, the better the age estimator.

$$\begin{aligned} MAE = \sum _{k=1}^N \frac{|l_k-l_k^*|}{N} \end{aligned}$$
(1)

where \(l_k\): the estimated age, \(l_k^*\): the ground truth age for the test image k, N: the total number of test images.
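As a worked illustration of Eq. (1), the following sketch computes the MAE for a small set of hypothetical predictions. The function name and the sample ages are made up for the example.

```python
def mean_absolute_error(estimated, ground_truth):
    """MAE as in Eq. (1): the mean of |l_k - l_k^*| over the N test images."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - g) for e, g in zip(estimated, ground_truth)) / len(estimated)


# Three test images with absolute errors of 2, 0, and 4 years.
print(mean_absolute_error([25, 30, 41], [23, 30, 45]))  # 2.0
```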

Cumulative Score (CS) is a useful metric when measuring the performance of an age estimator (Lu et al. 2015). It is most suitable when the training data has images at nearly every age. It can also be used as an indicator of the efficiency of an age-group classifier. Eq. (2) presents the mathematical function for CS. The larger the CS, the better the age estimator effectiveness.

$$\begin{aligned} CS(j)=\frac{N_{e\le j}}{N} \times 100\% \end{aligned}$$
(2)

where j: error level, N: the total number of test images, \(N_{e\le j} \): the amount of test images on which the age estimation performs an absolute error not greater than j (years).
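Eq. (2) can be illustrated in the same way. The sketch below counts the test images whose absolute error does not exceed the error level j and reports the result as a percentage; the sample ages are hypothetical.

```python
def cumulative_score(estimated, ground_truth, j):
    """CS(j) as in Eq. (2): percentage of test images whose absolute
    error is not greater than j years."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for e, g in zip(estimated, ground_truth) if abs(e - g) <= j)
    return 100.0 * hits / len(estimated)


estimated = [25, 30, 41, 60]
ground_truth = [23, 30, 45, 52]   # absolute errors: 2, 0, 4, 8
print(cumulative_score(estimated, ground_truth, 2))  # 50.0
```

Evaluating CS(j) over a range of error levels gives a curve that rises toward 100% as j grows, so two estimators with the same MAE can still be distinguished by how quickly their curves rise.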

Exact accuracy metric is also used to define the effectiveness of an age estimator (Levi and Hassncer 2015). It is calculated as the percentage of face images that were classified into correct age-groups. Eq. (3) presents its mathematical equation.

$$\begin{aligned} Exact\; accuracy = \frac{no \;of\; accurate\; prediction}{total \;no \;of \;prediction \;made} \end{aligned}$$
(3)

1-off evaluation metric measures whether the ground-truth class label matches the predicted class label (Aydogdu and Demirci 2017). It allows for a deviation of at most one bucket from the real age range. 1-off is calculated as a ratio of the correct predictions to the total number of data points.
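The difference between exact accuracy and the 1-off criterion can be sketched as follows. The example assumes that age groups are encoded as integer indices of ordered buckets (0 for the youngest group, 1 for the next, and so on); this encoding and the sample labels are assumptions made for illustration.

```python
def exact_accuracy(predicted, true):
    """Fraction of images assigned to exactly the correct age group (Eq. 3)."""
    correct = sum(1 for p, t in zip(predicted, true) if p == t)
    return correct / len(true)


def one_off_accuracy(predicted, true):
    """Fraction of images whose predicted group is at most one bucket
    away from the true group."""
    correct = sum(1 for p, t in zip(predicted, true) if abs(p - t) <= 1)
    return correct / len(true)


predicted = [0, 2, 3, 5]
true = [0, 1, 3, 7]   # group distances: 0, 1, 0, 2
print(exact_accuracy(predicted, true))    # 0.5
print(one_off_accuracy(predicted, true))  # 0.75
```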

Normal Score (\(\epsilon \)-error) metric calculates the proportion of inaccurate predictions over the total number of instances evaluated (Zhu et al. 2015). The smaller the \(\epsilon \)-error, the better the age estimator performance.

Equation (4) gives the mathematical definition of the \(\epsilon \)-error metric:

$$\begin{aligned} \epsilon = 1 - e^{-\frac{(x-\sigma )^2}{2\mu ^2}} \end{aligned}$$
(4)

where x: the estimated age, \(\sigma :\) apparent age label provided for a given face image, \(\mu \): standard deviation of all estimated ages for the given face image.
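Eq. (4) can be evaluated per image as in the sketch below, which follows the symbols as defined in this paper (\(\sigma \) for the apparent age label and \(\mu \) for the standard deviation). Averaging the per-image values over the test set is an assumption about how a single score is obtained, and the sample values are hypothetical.

```python
import math


def epsilon_error(x, apparent_age, std):
    """Per-image error of Eq. (4): 0 for a perfect estimate, approaching 1
    as the estimate moves away from the apparent age label.
    `apparent_age` is sigma and `std` is mu in the paper's notation."""
    return 1.0 - math.exp(-((x - apparent_age) ** 2) / (2.0 * std ** 2))


def mean_epsilon_error(estimates, apparent_ages, stds):
    """Average of the per-image errors (assumed aggregation)."""
    errors = [epsilon_error(x, a, s) for x, a, s in zip(estimates, apparent_ages, stds)]
    return sum(errors) / len(errors)


print(epsilon_error(30, 30, 4))  # 0.0
print(epsilon_error(34, 30, 4))  # one standard deviation off
```

Because the error is scaled by the standard deviation, a deviation of a few years is penalized less for faces on which the annotations themselves disagree.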

Several major observations emerge from earlier research on age estimation and support some justifiable conclusions. We highlight the observations that we deem most important:

  • It was observed that most of the existing work assumes a constrained scenario for the face images, in which the face in the input image is normalized to a frontal view. Other models employ a face pre-processing step that performs face localization and alignment to prepare the images for the next steps.

  • It was also observed that the image processing methods employed for face detection, facial landmark detection, and face alignment have an impact on the performance of an age estimator.

  • It is also important to note that the performance of learning algorithms is determined by many factors, among which are the size and label distribution of the employed dataset and the degree of image variability. Deep learning algorithms perform differently on different datasets, most likely because of the peculiarities of each dataset.

  • Data augmentation improves the performance of age estimation models, especially on unevenly distributed datasets that are not very large.

  • We also observed from the literature that models pre-trained on large-scale datasets before fine-tuning on the target dataset performed better than models trained on the target dataset alone.

  • From this review, we observed that the multi-class classification algorithm has been the most widely used individual algorithm for age estimation in the literature.

  • We also observed that ranking and DLDL are the most suitable algorithms for age estimation when the label distribution of the dataset is uneven.

7 Conclusion and future directions

In this paper, we presented a comprehensive survey of various CNN architectures with their strengths and weaknesses. We also provided a thorough analysis of the existing state-of-the-art deep CNN methods used in age estimation. In addition, we studied the available facial aging benchmark datasets, their suitability, and the performance of different CNN models on them when measured with standard evaluation metrics. Table 4 summarizes the performance evaluations of CNN-based age estimation methods on different databases.

Several promising directions for future research may improve age estimation performance. Among them are:

  • Age estimation from unconstrained real-life face images is rapidly gaining popularity because of its many possible applications. Although a number of efforts have aimed at a high level of accuracy, the results are not yet good enough because of the challenges posed by real-world images with varying degrees of variation.

  • Deeper network architectures, trained with sufficient and evenly distributed face images, should be investigated in future work.

  • Also, there is a need for studies that focus more on predicting apparent age (how old does the person look?) rather than biological age (the real age of a person). This is useful in “face beauty product development”, “movie” and “theater role” casting, “plastic surgery”, “age-specific employment”, etc.

  • Huge datasets annotated with apparent age labels rather than real age labels will also help improve accuracy in apparent age estimation research.

Comparative analysis of different approaches helps in understanding the implementation of a project in a better way. Therefore, this work is expected to serve as a guide in choosing the right method and approach for facial age estimation to further improve on the existing state-of-the-art results in the field.