1 Introduction

Age estimation is the process of determining a person’s age from biometric attributes [26]. Among the age-related traits, the human face conveys valuable biological information that allows people to guess the age of others just by looking at their faces. Facial appearance changes remarkably with time due to various extrinsic and intrinsic factors such as gender, genes, environment, and lifestyle. As a result, aging progresses differently at different age stages, and it also varies from person to person. Hence, human estimation of age cannot always be accurate, especially given that people’s behaviors and preferences vary at different ages. Therefore, the need for advanced algorithms that can exceed human ability [39] has made age estimation an interesting yet difficult computer vision problem in recent years [9].

Some examples of pre-trained convolutional neural networks (CNNs) are:

  1. Alex-Net: It contains eight layers: the first five are convolutional layers, some of them followed by max-pooling layers, and the last three are fully connected layers. It uses the non-saturating ReLU activation function, which showed improved training performance over tanh and sigmoid.
  2. Google-Net: It is a 22-layer deep CNN that introduces a novel element dubbed the inception module. It reduced the number of parameters from 60 million (Alex-Net) to 4 million.
  3. VGG-Net: Its very uniform architecture makes it attractive. It has 16 weight layers (13 convolutional and 3 fully connected) and relies on 3 × 3 convolutional kernels throughout. It has a large number of parameters, more than 130 million weights.
  4. Res-Net: To make deeper networks easier to train, residual neural networks utilize skip connections, or shortcuts, that jump over some layers.
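The skip connection mentioned for Res-Net can be illustrated with a minimal, framework-free sketch. The function name `residual_block` and the toy layer `halve` are assumptions made for illustration only; they are not taken from any of the cited networks.

```python
def residual_block(x, transform):
    """Return transform(x) + x, the skip-connection pattern of residual networks."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the shortcut needs matching input and output sizes")
    # The input is added back to the transformed signal element by element.
    return [f + xi for f, xi in zip(fx, x)]


def halve(values):
    """Hypothetical stand-in for a stack of layers: scales each activation by 0.5."""
    return [0.5 * v for v in values]


print(residual_block([2.0, 4.0], halve))  # [3.0, 6.0]
```

Because the input is added back unchanged, a block whose transform outputs zeros behaves like the identity, which is one common explanation of why such blocks are easier to stack.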

Training those networks, with their millions of parameters, from scratch is very challenging. A better solution is to use transfer learning [20]. Transfer learning is a deep learning technique that adapts existing architectures to new tasks instead of building new models from scratch. It utilizes the knowledge acquired by pre-trained models to solve new, related tasks. Thus, the previously learned models can be partially modified by adding new layers or by training on new samples. It can improve the performance of deep learning schemes because the existing models learn generic attributes that also occur in other domains. It is also very useful in overcoming the over-fitting problem, especially if the dataset is not large [20].

When a CNN is retrained on new samples for another, related task, it is called fine-tuning. In fine-tuning:

  1. The last layers are removed.
  2. New layers of the same type as the removed ones are configured.
  3. The last fully connected layer in the classification module is replaced with a new layer of the same type whose number of neurons equals the number of classes in the corresponding training data.
  4. It is applied in cases in which the number of training samples is small.
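The head-replacement step listed above can be sketched without any deep learning framework. This is only an illustration under stated assumptions: layers are represented as plain dictionaries, and `make_fc`, `replace_head`, and the toy layer sizes are hypothetical names and values, not the toolbox calls used in the experiments.

```python
import random


def make_fc(n_in, n_out, seed=0):
    """Return a randomly initialised fully connected layer as a plain dict."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return {"type": "fc", "weights": weights, "bias": [0.0] * n_out}


def replace_head(layers, num_classes, seed=0):
    """Copy the layer list, swapping the final fully connected layer for a new one."""
    if not layers or layers[-1]["type"] != "fc":
        raise ValueError("expected the last layer to be fully connected")
    n_in = len(layers[-1]["weights"][0])  # input width of the removed layer
    return layers[:-1] + [make_fc(n_in, num_classes, seed)]


# Toy stand-in for a pre-trained network: one feature layer and a 1000-way classifier.
pretrained = [make_fc(8, 16, seed=1), make_fc(16, 1000, seed=2)]
finetune_ready = replace_head(pretrained, num_classes=2)

print(len(finetune_ready[-1]["weights"]))  # 2
```

The earlier layers are reused as they are, so only the new head starts from random weights.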

Alex-net and Google-net are pre-trained deep convolutional networks that have been trained on over a million images and can classify images into 1000 object categories, such as keyboard, coffee mug, pencil, and many animals. The networks have learned rich feature representations for a wide range of images, which enable them to take an arbitrary image as input and output a label for the object in the image together with the probabilities for each of the object categories.

The objective of this paper is to achieve high age estimation accuracy by:

  1. Using those powerful networks.
  2. Fine-tuning those networks with transfer learning, which is usually much faster and easier than training a network from scratch with randomly initialized weights.

To accomplish our objective, we have concentrated on two experiments:

  1. Finding the optimum number of outputs (classes) of those networks to reach high age estimation results.
  2. Finding the optimum age gap: as the age gap between two groups increases, the two groups become easier to tell apart, but more images fall in between the two groups, and the recognition of these in-between ages eventually decreases.

Those experiments led to our novel idea: building a hierarchical network that assigns each face image to the age group it belongs to. This network consists of a set of pre-trained two-class CNNs (Google-net) with optimum age gaps.

2 Related work

2.1 Deep learning for age estimation

Y. He proposed an end-to-end deep embedding network for age estimation [18]. Creating a metric embedding requires learning a function that maps an input to a feature space in which the Euclidean distance directly represents the semantic similarity of the inputs; in this case, the features were learned with a triplet loss using a CNN. On the topic of multi-task classification, [34] uses the age-related ordinal information and proposes a multiple-output CNN to perform age ordinal regression. They transform the ordinal regression into a series of binary classification sub-problems. Each binary classifier predicts whether the rank of an image is above a rank r_k, where k indexes the ranks and K is the number of ranks, i.e. the number of age ranges, which can be discrete or aggregated values. This technique was later improved by [5]. Other similar approaches were developed by [47, 50], where the multi-task classification loss is formulated by adding the gender and gender-specific age estimation losses. Multi-task learning can be seen as a form of inductive transfer, which can help improve a model by introducing inductive bias. In multi-task learning, the inductive bias is produced by the sheer existence of multiple tasks, which causes the model to prefer hypotheses that can solve more than one task. Multi-task learning usually leads to better generalization [41]. The current state-of-the-art performance on both the MORPH2 and FGNET-AD datasets is held by [52]. They extract local facial characteristics from a cropped region estimated by a Long Short-Term Memory (LSTM) unit and then combine these local features with the global image-level features to perform the final estimation. It should be noted that some of the most popular pre-trained networks are Alex-net [22] and Google-net [45], which have been applied successfully to face datasets like FGNET [8] and MORPH [40].
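The decomposition of ordinal regression into binary sub-problems can be made concrete with a small decoding sketch. It assumes K − 1 binary outputs for K ranks and a 0.5 decision threshold, which is a common convention for this kind of decomposition and not necessarily the exact setup of [34]; the function name `decode_rank` is hypothetical.

```python
def decode_rank(probabilities, threshold=0.5):
    """Map K-1 binary outputs to a rank in 1..K.

    probabilities[k] is the (hypothetical) classifier output for the question
    "is the rank of this image above rank k+1?".
    """
    # Each positive answer pushes the predicted rank up by one.
    return 1 + sum(1 for p in probabilities if p > threshold)


# Four ranks, hence three binary classifiers.
print(decode_rank([0.9, 0.8, 0.2]))  # 3
```

Counting the positive answers keeps the prediction well defined even when the individual classifiers disagree with each other.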

Nowadays, deep learning schemes show impressive results in object classification [23]. More recently, many deep learning models, especially CNNs, have achieved state-of-the-art performance in computer vision, especially for face-related tasks such as face alignment [44] and facial verification [46]. A complete assessment of deep learning for age estimation is presented in [19] and compared to traditional handcrafted visual features. In practice, fixed handcrafted fusion features are unable to reach the state-of-the-art performance of modern deep learning schemes on the age estimation problem, so these conventional models are being replaced by deep learning techniques. In [51], the authors proposed age estimation in three stages: a preliminary abstraction stage for extracting deeper features, a local feature encoding stage to model the relationship between local features, and a recall stage for combining temporary local impressions.

2.2 Facial features

One of the first attempts at age classification from facial images was reported by Kwon and Lobo [25]. The authors of [37] exploit a similar method to design an approach for age progression in subjects under 18. Statistical models of facial appearance were developed in [28, 29]. More detailed information was extracted from facial features by using Active Appearance Models (AAMs) to project faces into a lower-dimensional space [6]. After that, Geng et al. [11, 12] suggested the Aging Pattern Subspace (AGES), which relies on training the subspace on a collection of facial images that shows each person at various ages. Each set of face images is handled as a single sample and projected to a low-dimensional space [27]. On the other hand, the authors of [48, 49] treat age estimation as a regression problem. In a different line of work, the authors of [16] propose manifold learning to build common aging patterns in a low-dimensional space from multiple facial images for every age. Furthermore, many algorithms have been used successfully to extract appearance features and characterize facial images, such as Local Binary Patterns (LBP) in [15]. In [10], the authors utilize Gabor features to learn an age estimator. More advanced techniques have been introduced in [33], in which the authors propose a new model based on decreasing the noise of aging features from facial datasets. In [3], the authors convert the age estimation classification problem into smaller binary classification sub-problems based on an ordinal hyper-plane ranking scheme. An enhanced method, called label distribution, is presented in [13]. As for age-group research, the authors of [31] organize the dataset into four age groups, namely children, adolescents, moderately aged adults, and older adults. The novel scheme used in [43] achieved good performance. The authors of [24] adopted Topological Texture Features (TTF) for facial texture. The research outcomes in [32] show that the TTF method is ineffective in dealing with fast changes in facial textures. A survey of neural networks applied to age estimation is given in [36]. A hybrid deep learning CNN–ELM for age and gender classification is presented in [7]. Locally adjusted robust regression applied to age estimation is proposed in [17]. Age estimation that is robust to optical and motion blurring, based on a deep residual CNN, is presented in [21]. The authors of [35] used double-level feature fusion of face and gait images. In [38], convolutional neural networks were applied to age classification from smartphone-based ocular images. The recognition of surgically altered face images was the focus of [42].

In this paper, we have used a set of two-class CNNs (Google-nets) to tackle the age estimation problem. Different experiments were performed in order to find the optimum number of classes and the optimum age gap. This led to our new network, which is compared with other techniques on both the FGNET and MORPH aging datasets and gives a smaller Mean Absolute Error (MAE).

3 Experiment setup

3.1 Facial databases

The FG-NET database [8] contains 1002 face images of 82 individuals with ages between 0 and 69. It is highly skewed because most of its samples are under the age of 31. A typical sequence of aging face samples from this dataset is shown in Fig. 1. Fig. 2 shows the age distribution of the FGNET images.

Fig. 1 Aging faces of one person from the FG-NET dataset

Fig. 2 Age distribution of FGNET images

As can be seen from Fig. 2, FG-NET has a small number of images in the older age ranges and a larger number of images in the younger age ranges. To overcome this limitation and to increase the number of images, the MORPH dataset is also used. The MORPH aging dataset is much larger than FG-NET: it has 55,134 face images from more than 13,000 subjects, with an average of 4 images per subject. The ages of the face images range from 16 to 77 with a median age of 33.

We have combined all the images of the FG-NET and MORPH databases and randomly selected 80% of them (44,909 images) as the training data; the remaining 11,227 images are used as the testing data. Note that there are no duplicate subjects between the training and testing sets.
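One way to obtain a split with no shared subjects is to partition by subject identifier rather than by image. The sketch below is an assumption about how such a split could be implemented, not the procedure actually used here; `split_by_subject` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def split_by_subject(samples, train_fraction=0.8, seed=0):
    """Split (subject_id, image) pairs so that no subject appears in both sets."""
    by_subject = defaultdict(list)
    for subject_id, image in samples:
        by_subject[subject_id].append((subject_id, image))

    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    target = round(train_fraction * len(samples))
    train, test = [], []
    for subject in subjects:
        # Fill the training set subject by subject until the target size is reached.
        bucket = train if len(train) < target else test
        bucket.extend(by_subject[subject])
    return train, test


# Hypothetical data: 50 subjects with 4 images each.
samples = [(s, f"img_{s}_{i}.jpg") for s in range(50) for i in range(4)]
train, test = split_by_subject(samples)
print(len(train), len(test))  # 160 40
```

With real data the subjects have unequal numbers of images, so the training share only approximates the requested fraction.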

3.2 Training parameters

The FG-NET + MORPH dataset has been chosen because it covers a wide age range [0–77]. The typical steps of transfer learning involve replacing the classifier module layers with new ones suited to the new task. In practice, to retrain each of Alex-Net and Google-Net on human facial images, the fully connected layers are set to have a number of outputs equal to the number of labels under examination [20]. The weights in the earlier layers are not frozen, so they are updated during the training phase.

The training options are set as follows: a learning rate of 0.001 and a number of batches equal to the number of input samples. We have used the cross-entropy loss function and the stochastic gradient descent algorithm. Finally, the number of epochs ranges between 150 and 300.
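The combination of cross-entropy loss and gradient descent can be illustrated on a single sample. This is a toy sketch under stated assumptions: the update is applied directly to two output scores (logits) rather than to network weights, and the function names are hypothetical. Only the learning rate of 0.001 comes from the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy loss for one sample whose true class index is `target`."""
    return -math.log(softmax(logits)[target])


def sgd_step(logits, target, learning_rate=0.001):
    """One gradient-descent update applied directly to the logits.

    For softmax followed by cross-entropy, the gradient with respect to the
    logits is (probabilities - one_hot(target)).
    """
    probs = softmax(logits)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [z - learning_rate * g for z, g in zip(logits, grads)]


logits = [0.0, 0.0]
print(round(cross_entropy(logits, 0), 4))  # 0.6931
updated = sgd_step(logits, target=0)
print(cross_entropy(updated, 0) < cross_entropy(logits, 0))  # True
```

In a real network the same gradient is propagated back through every layer, which is why the earlier layers also change when they are not frozen.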

3.3 Evaluation methods

To validate the practical implementation of the suggested model, the Mean Absolute Error (MAE) is applied. MAE calculates the average of the absolute errors between the estimated values and the target age, as shown in the following equation:

$$ {\mathrm{MAE}}_{\mathrm{ABS}}=\frac{1}{\mathrm{N}}\sum \limits_{\mathrm{n}=1}^{\mathrm{N}}\left|{\mathrm{k}}_{\mathrm{n}}-{\mathrm{y}}_{\mathrm{n}}\right| $$

where k_n is the real age label of the nth image, y_n is the age predicted for that image by the proposed model, and N is the number of evaluated images.
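As a quick check of the definition, the MAE can be computed in a few lines. The function name and the sample ages below are made up for illustration; they are not results from the experiments.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average of |k_n - y_n| over all evaluated images."""
    if not true_ages or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(k - y) for k, y in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical labels and predictions: the errors are 2, 3 and 1 years.
print(mean_absolute_error([25, 40, 8], [27, 37, 9]))  # 2.0
```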

4 Experiments and Analysis

We have concentrated on two experiments: the effect of the number of classes on the accuracy of the pre-trained CNNs, and the effect of the age gap on the accuracy of transfer learning. As a result, we have built a hierarchical network that assigns each face image to the age group it belongs to.

4.1 First experiment: Investigation of the number of classes

In the first experiment, the relation between accuracy and the number of classes is investigated to measure the impact of transfer learning on the new task. We have considered different age ranges. Table 1 shows the accuracy results for 6 classes (A, B, C, D, E, and F). We have used an age range of about 5 years for classes A, B, and C and about 10 years for D, E, and F; the exact ranges are listed in the caption of Table 1. This is because “younger aging involves fast cranial changes, older aging involves slow textural changes” [43]. For 5 classes we have used an age range of 10 years per class, for 4 classes an age range of 15 years, for 3 classes an age range of 20 years, and for 2 classes an age range of 30 years. Tables 1, 2, 3, 4 and 5 show the accuracy outcomes for the different numbers of classes using VGG16, ResNet18, Google-Net, and Alex-net.

Table 1 The accuracy results for 6 classes. A:0–5 B:6–10 C:11–19 D:20–29 E:30–39 F:40–77
Table 2 The accuracy results for 5 classes. A:0–9 B:10–19 C:20–29 D:30–39 E:40–77
Table 3 The accuracy results for 4 classes. A:0–14 B:15–29 C:30–44 D:45–77
Table 4 The accuracy results for 3 classes. A:0–19 B:20–39 C:40–61
Table 5 The accuracy results for 2 classes. A:0–29 B:30–77
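The class definitions in the captions of Tables 1 to 5 can be turned into a small labelling helper. This sketch is an assumption about how ages might be mapped to labels: only the lower bound of each class is used, so the upper limits in the captions (77, or 61 in Table 4) are not enforced. The names `CLASS_LOWER_BOUNDS` and `age_to_class` are hypothetical.

```python
from bisect import bisect_right

# Lower bound of each class, read from the captions of Tables 1 to 5.
CLASS_LOWER_BOUNDS = {
    6: [0, 6, 11, 20, 30, 40],
    5: [0, 10, 20, 30, 40],
    4: [0, 15, 30, 45],
    3: [0, 20, 40],
    2: [0, 30],
}


def age_to_class(age, num_classes):
    """Return the class letter (A, B, ...) for an age under the given scheme."""
    if age < 0:
        raise ValueError("age must be non-negative")
    bounds = CLASS_LOWER_BOUNDS[num_classes]
    # bisect_right finds how many lower bounds are <= age.
    return "ABCDEF"[bisect_right(bounds, age) - 1]


print([age_to_class(17, n) for n in (6, 5, 4, 3, 2)])  # ['C', 'B', 'B', 'A', 'A']
```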

4.2 Second experiment: Investigation of the age gap

A two-class CNN is chosen to predict which age group a test image belongs to. Fig. 3 shows a typical architecture for 2 age ranges. The SoftMax activation function is used, so the output gives the probability that the input image belongs to each age range. The performance and generalization of Alex-Net and Google-Net have been compared by employing different age ranges for each label in the two-class CNN (Fig. 3). All the training and test images are from age ranges A and B. Table 6 shows the recognition results for VGG16, ResNet18, Google-net, and Alex-net.
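For the two-class case, the SoftMax output can be written in closed form. As a short worked derivation, assume that z_A and z_B denote the two scores produced by the final layer for age ranges A and B (these symbols are introduced here for illustration and do not appear in the original notation):

$$ {p}_A=\frac{e^{z_A}}{e^{z_A}+{e}^{z_B}}=\frac{1}{1+{e}^{-\left({z}_A-{z}_B\right)}},\kern2em {p}_B=1-{p}_A $$

Under this assumption the image is assigned to range A whenever z_A > z_B, which is equivalent to p_A > 0.5.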

Fig. 3 Two-class CNN

Table 6 Age gap accuracy results

Table 6 shows that the best recognition results were attained by using the age ranges A:0–5 and B:10–15. Because all the data are from the ages of 0–15, the errors came from the age range 6–10. Fig. 4 shows two two-class CNNs, which give better performance than the single network of Fig. 3. The recognition results increased from 97% to 99% using Google-net (Table 7). It should be noted that Fig. 4 gave better accuracy because most of the test errors made by the first network in Fig. 4 are in the age range from 10 to 15, which is covered by the second network.

Fig. 4 Two two-class CNNs

Table 7 Two-class CNN’s accuracy Results

Figure 3 shows a typical two-class CNN. Its disadvantage is that, as the gap between the two groups increases, the two groups become easier to tell apart, but more images fall in between the two groups and the recognition of these in-between ages eventually decreases. For example, with the two classes Class 1: 0–5 and Class 2: 10–15, the in-between ages are 6–9. A better network is shown in Fig. 4 (our contribution). Figure 4 gave better accuracy (Table 7) because most of the test errors made by the first network in Fig. 4 are in the age range from 10 to 15, which is covered by the second network. The in-between ages 6–9 are covered by the first network in Fig. 4.

Fig. 5 Age estimation network

4.3 Age estimation using age gap

Our age estimation network is shown in Fig. 5. It consists of a set of two-class CNNs (Google-nets). For example, the first and second CNNs give an accurate recognition of whether the input image is in the range 0–5 or 10–15. The second and third CNNs give an accurate recognition of whether the input image is in the range 15–20 or 20–25.

It should be mentioned that the above network also applies to targets that represent a small age width. In this case, the prediction not only identifies the correct age group of the image but also gives approximately its actual age. Predicting the actual age of an image could thus be achieved by making the age width very small.
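The idea of chaining two-class networks can be sketched as follows. The routing rule is not spelled out in the text, so this is an assumption: each stage is asked whether the image belongs to its older group, and the chain stops at the first stage that prefers its younger group. The function names, the stub classifiers, and the group boundaries are all hypothetical.

```python
def route_through_stages(image, stages):
    """Walk a chain of two-class classifiers and return the selected age group.

    Each stage is (younger_group, older_group, classifier), where
    classifier(image) returns the probability of the older group.
    Assumption: stop at the first stage that prefers its younger group.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    for younger, older, classifier in stages:
        if classifier(image) < 0.5:
            return younger
    return older  # every stage preferred its older group


def threshold_stub(boundary):
    """Stand-in for a trained two-class CNN: the 'image' is just a true age."""
    return lambda age: 1.0 if age >= boundary else 0.0


stages = [
    ((0, 5), (10, 15), threshold_stub(8)),
    ((10, 15), (20, 25), threshold_stub(18)),
]
print(route_through_stages(3, stages))   # (0, 5)
print(route_through_stages(12, stages))  # (10, 15)
print(route_through_stages(22, stages))  # (20, 25)
```

Narrowing the groups handled by each stage moves the output of such a chain from a coarse age group towards an approximate age.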

4.4 Analysis of the results

We have considered different age ranges corresponding to different classes. The two-class configuration gave the best accuracy, and accuracy decreases as the number of classes increases. This can be interpreted as follows: for the same database, as the number of classes increases, the number of training images per class decreases.
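The effect on the amount of training data per class can be made concrete with a little arithmetic. The sketch assumes perfectly balanced classes, which the combined dataset is not, so the figures are only indicative averages; the function name is hypothetical.

```python
def average_images_per_class(total_images, num_classes):
    """Average number of training images per class, assuming balanced classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    return total_images // num_classes


TRAINING_IMAGES = 44_909  # size of the training set reported in Section 3.1

for num_classes in (2, 3, 4, 5, 6):
    print(num_classes, average_images_per_class(TRAINING_IMAGES, num_classes))
```

Going from 2 to 6 classes cuts the average from 22,454 to 7,484 images per class.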

We have also considered different age gaps. Given the two groups:

  1. A: from age1 to age2.
  2. B: from age3 to age4.

The age gap is the age difference age3 − age2. As the age gap increases, the number of images that fall within this gap increases and the accuracy on these images decreases.
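The definition of the age gap, and the in-between ages it creates, can be checked with a short sketch. The helper names are hypothetical, and the example groups are the 0–5 and 10–15 ranges used earlier in this section.

```python
def age_gap(group_a, group_b):
    """Gap between A = (age1, age2) and B = (age3, age4), i.e. age3 - age2."""
    age2, age3 = group_a[1], group_b[0]
    if age3 < age2:
        raise ValueError("group B must start at or after the end of group A")
    return age3 - age2


def in_between_ages(group_a, group_b):
    """Ages that lie strictly between the two groups."""
    return list(range(group_a[1] + 1, group_b[0]))


def count_in_between(ages, group_a, group_b):
    """Number of images whose age falls strictly between the two groups."""
    return sum(1 for age in ages if group_a[1] < age < group_b[0])


a, b = (0, 5), (10, 15)
print(age_gap(a, b))                              # 5
print(in_between_ages(a, b))                      # [6, 7, 8, 9]
print(count_in_between([3, 6, 7, 9, 12], a, b))   # 3
```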

Our two-network design (Fig. 4) concentrates on increasing the accuracy on these images. A fraction of these images is covered by the first network and the remaining fraction by the second network.

This approach is generalized to cover all the age ranges (Fig. 5), where each network covers a certain age range.

4.5 Comparisons with other methods

To evaluate the accuracy of our model, the MAE [2] is used. Our network was compared (Table 8) with state-of-the-art models on the FGNET and MORPH aging datasets, using the same split described in Section 3.1: 80% of the combined FG-NET and MORPH images (44,909 images) were randomly selected as the training data, and the remaining 11,227 images were used as the testing data, with no duplicate subjects between the two sets.

Table 8 MAE results for different methods including our proposed network

5 Conclusion

In this paper, a novel network consisting of a set of pre-trained two-class CNNs with optimum age gaps is presented. This set of two-class CNNs (Google-nets) fits the whole age estimation task and generates high age estimation accuracy. The new technique was compared with other techniques on both the FGNET and MORPH aging datasets and gave a smaller Mean Absolute Error (MAE). A limitation of this study was the difficulty of finding a single dataset that contains all the age ranges. We addressed this limitation by combining all the images of the FG-NET and MORPH databases. We have applied transfer learning to pre-trained CNNs; future work can concentrate on training those networks from scratch.