1 Introduction

Human facial appearance changes markedly over a lifetime as a result of aging. According to Berry et al. [8], this progression is visible in the facial differences between age groups: infants and young children have larger pupils, children’s lips are redder and proportionately larger than those of adults, and a baby’s nose is typically small, wide, and concave. Predicting age is a difficult task for humans and even more difficult for computers, although accurate age estimation is very important for some applications. Recently, several applications that exploit the exact age or the age group have emerged. For example, a person’s age information can improve the accuracy of traditional biometric identifiers in establishing a user’s identity, which is useful in access control applications [23].

In this paper, we introduce an automatic scheme for age estimation from facial images. The proposed scheme contains three phases: face registration, descriptor extraction and age estimation. The goal of face registration, or alignment, is to detect faces in images, normalize the 3D or 2D pose of each detected face, and then produce a cropped face image. This preprocessing step is crucial because the subsequent stages rely on it, so it can influence the performance of the whole estimation process. It can also be challenging, since face images are affected by many sources of variation. The extraction stage computes a set of features from the cropped face. These features can be given by either a shallow texture descriptor or a deep neural network. In the last phase, we feed the extracted features to a regressor to estimate the age.

The main contributions of this paper can be summarized as follows:

  • We provide an extensive comparison of handcrafted-feature-based and deep-learning-based approaches.

  • We extend some handcrafted-feature-based methods that are based on the Pyramid Multi-Level face representation.

  • We study the computational cost of each method.

The remainder of the paper is organized as follows: In Section 2, we summarize the existing techniques of facial age estimation. We introduce our approach in Section 3. The experimental results are given in Section 4. In Section 5, we present the conclusion and some perspectives.

2 Background and related work

Facial age estimation is an important task in the domain of facial image analysis. It aims to predict the age of a person based on his or her facial features. The predicted age can be an exact age (years) or an age group (year range) [43]. Most of the classic age estimation methods are reviewed in [14], and both classic and deep learning methods are reviewed in [3]. In the literature, few approaches study facial age progression or facial age synthesis compared to facial age estimation studies such as [51,52,53,54,56]. From a general point of view, facial age estimation approaches can be categorized based on the face image representation and the estimation algorithm. Current methods for age estimation can be divided into two-dimensional (2-D) and three-dimensional (3-D) methods based on the dimensionality of the processed samples. We focus on 2-D images in this work. It is possible to further divide 2-D age estimation approaches into many categories. Adopting a simple categorization, age estimation approaches can be divided into three main types: anthropometry-based, handcrafted-feature-based and deep-learning-based approaches [5]. We therefore focus on these three categories.

The anthropometry-based approaches mainly depend on measurements and distances between different facial landmarks. Kwon and Lobo [28] proposed an age classification method that classifies input images into one of three age groups: babies, young adults, and senior adults. Their method is based on craniofacial development theory and skin wrinkle analysis. The main theory in the area of craniofacial research is that the appropriate mathematical model to describe the growth of a person’s head from infancy to adulthood is the revised cardioid strain transformation [2]. Hu et al. [22] take an image pair of the same person and derive the age difference using the Kullback-Leibler divergence.

Handcrafted-feature-based approaches are among the most popular approaches for facial age estimation, since a face image can be viewed as a texture pattern. These approaches have been used in many computer vision applications owing to strengths such as fast and easy implementation, suitability for real-time applications, and low computational cost. On the other hand, they are vulnerable to profile faces and in-the-wild poses, and they are considered classic approaches. Many texture features have been used, such as Local Binary Pattern (LBP), Histograms of Oriented Gradients (HOG), Biologically Inspired Features (BIf), Binarized Statistical Image Features (BSIf), and Local Phase Quantization (LPQ). LBP and its variants were used in many works, such as [4, 15, 48, 60]. BIf and its variants are widely used in age estimation works such as [18, 45]. In particular, Guo et al. [18] investigated biologically inspired features comprising a pyramid of Gabor filters at all positions in facial images and used either a Support Vector Machine (SVM) or Support Vector Regression (SVR) with Radial Basis Function (RBF) kernels for evaluation. Motivated by the fact that age labels are chronologically correlated and that age estimation is an ordinal learning problem, Liu et al. [33] propose an ordinal deep feature learning (ODFL) method to learn feature descriptors for face representation directly from raw pixels. Some researchers used multi-modal features. For instance, the work presented in [4] proposed an approach that used LBP and BSIf extracted from a Multi-Block face representation. Lanitis et al. [29] were the first to use Active Appearance Models (AAMs). Yang and Ai [59] used a real AdaBoost algorithm to train a strong classifier by composing a sequence of LBP histogram features; they conducted experiments on gender, ethnicity and age classification. Lu et al. [37] proposed a local binary feature learning method (CS-LBFL) to learn a face descriptor that is robust to local illumination. However, these methods seek simple feature filters, so they are not powerful enough to exploit the nonlinear relationships among face samples when facial images exhibit large variations in facial expression and cluttered backgrounds. In [39], the authors classify the input face image into one of several demographic classes and then estimate the age within the identified demographic class.

Deep learning approaches mainly use Convolutional Neural Networks (CNNs), a type of feed-forward artificial neural network in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. Some approaches train the networks from scratch, such as [30], and others use transfer learning, such as [44]. Deep learning approaches are generally considered the best at achieving good and stable results. Their implementation is also suited to real-time applications, which are very important nowadays. Moreover, they are more robust to facial pose than classical approaches. Their main downside is the high computational cost, which typically calls for graphics processing units (GPUs). Gurpinar et al. [20] extracted deep features with the VGGFace network, a deep CNN pretrained for face recognition, and predicted the age using kernel Extreme Learning Machines. Hu et al. [22] proposed a deep architecture with a multi-label loss function composed of three losses, which are designed to drive the CNN to understand age progressively. Liu et al. [32] propose a group-aware deep feature learning (GA-DFL) approach for age estimation. To improve performance, the same authors designed a multi-path CNN that captures age-informative appearances at different scales. Shen et al. [50] propose two Deep Differentiable Random Forest methods for age estimation, named Deep Label Distribution Learning Forest (DLDLF) and Deep Regression Forest (DRF). Both methods connect split nodes to the top layer of the CNN and deal with non-homogeneous data by jointly learning input-dependent data partitions at the split nodes and age distributions at the leaf nodes. Dornaika et al. [12] used a robust loss function to train a deep regression network in order to improve age estimation performance.

3 Proposed approach

In this section, we present the different stages of our approach, which estimates human age from facial images. Our approach takes a face image as input. A face preprocessing stage is first applied to the image in order to obtain a cropped and aligned face. In the second stage, a set of features is extracted using a texture descriptor or a pretrained CNN. Finally, these features are fed to a linear SVR in order to predict the age. Fig. 1 gives an overview of the proposed facial age estimation approaches.

Fig. 1 The general structure of the proposed facial age estimation approaches

We give the pseudocode of the facial age estimation pipelines, from the input image to the output age. Algorithm 1 summarizes the different stages of the proposed facial age estimation approach using handcrafted features. Algorithm 2 summarizes the different stages of the second approach based on deep features.

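Algorithm 1 Facial age estimation using handcrafted features (pseudocode figure)
Algorithm 2 Facial age estimation using deep features (pseudocode figure)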

3.1 Face preprocessing

First, we apply the cascade object detector that uses the Viola-Jones algorithm [57] to detect people’s faces. This face detector uses histogram of oriented gradients (HOG) features and a cascade of classifiers trained using boosting. Then, we detect the eyes of each face using dlib’s face landmark detector [26], an implementation of the method of Kazemi et al. [25], which uses an ensemble of regression trees (ERT) to estimate the face’s landmark positions directly from a sparse subset of pixel intensities. To rectify the 2D face pose in the original image, we apply a 2D transformation based on the eye-center landmarks: we align the face by rotating it clockwise by an angle θ around the image center. Unlike the work described in [6], the cropping parameters are set as follows: \(k_{side} = 0.9\), \(k_{top} = 1.3\) and \(k_{bottom} = 1.9\). These parameters are multiplied by a rescaled inter-ocular distance in order to obtain the side margins, the top margin and the bottom margin. The aligned and cropped images are resized to 224 × 224. Figure 2 illustrates the steps of the face preprocessing stage.

Fig. 2 Face preprocessing stage
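The alignment and cropping geometry described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the side margins are taken outward from each eye and the top and bottom margins from the eye line, and the raw inter-ocular distance is used in place of the rescaled one mentioned in the text. The exact conventions of [6] may differ.

```python
import math

def rotate_point(p, center, angle_rad):
    """Rotate point p about center with the standard 2D rotation matrix."""
    x, y = p[0] - center[0], p[1] - center[1]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)

def align_and_crop_box(left_eye, right_eye, image_center,
                       k_side=0.9, k_top=1.3, k_bottom=1.9):
    """Return the eye-line tilt (degrees) and the crop box after alignment.

    Eye centres and image_center are (x, y) pixel coordinates.
    The margin convention is an assumption, not taken from the paper.
    """
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    theta = math.atan2(dy, dx)          # tilt of the inter-ocular line
    d = math.hypot(dx, dy)              # inter-ocular distance
    # Rotating by -theta about the image centre makes the eye line horizontal.
    l = rotate_point(left_eye, image_center, -theta)
    r = rotate_point(right_eye, image_center, -theta)
    eye_y = (l[1] + r[1]) / 2.0
    x_min = l[0] - k_side * d           # side margin left of the left eye
    x_max = r[0] + k_side * d           # side margin right of the right eye
    y_min = eye_y - k_top * d           # top margin above the eye line
    y_max = eye_y + k_bottom * d        # bottom margin below the eye line
    return math.degrees(theta), (x_min, y_min, x_max, y_max)
```

The returned box would then be cropped from the rotated image and resized to 224 × 224.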

3.2 Feature extraction

Feature extraction has been the most studied of the stages because of its strong influence on the performance of age estimation systems. In our approach, we studied two different kinds of feature extraction methods. The first one is based on handcrafted features, or texture features, and the second one is based on deep features that are extracted using pretrained networks [11]. The code used to generate the features is available upon request.

3.2.1 Handcrafted features

Handcrafted features are attributes derived using general-purpose texture descriptors that use the information present in the image itself. In our case, we used three types of texture descriptors, LBP, LPQ and BSIf, on a Pyramid Multi-Level (PML) face representation.

Local Binary Pattern (LBP) is a very efficient method for analyzing two-dimensional textures. It labels the pixels of an image by thresholding the neighborhood of each pixel against the center value and interpreting the result as a binary number. LBP has been widely used in many image-based applications such as face recognition. The face can be seen as a composition of micro-patterns such as edges, spots and flat areas, which are well described by the LBP descriptor [1].
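As an illustration of the thresholding step, the following minimal sketch computes the basic 3 × 3 LBP code of every interior pixel and the resulting 256-bin histogram. The function names and the clockwise bit order are assumptions made for this example; the paper does not state which LBP variant or neighborhood radius was used.

```python
import numpy as np

def lbp_codes(image):
    """Basic 3x3 LBP: one 8-bit code per interior pixel."""
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Eight neighbours, visited clockwise from the top-left corner
    # (the bit order is a convention chosen for this sketch).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        mask = (neighbour >= center).astype(np.uint8)  # threshold on centre
        codes |= mask << np.uint8(bit)                 # set the matching bit
    return codes

def lbp_histogram(image):
    """Normalized 256-bin histogram of the LBP codes."""
    hist = np.bincount(lbp_codes(image).ravel(), minlength=256)
    return hist.astype(np.float64) / hist.sum()
```

On a perfectly flat region every neighbour equals the center, so all eight bits are set and the code is 255; an isolated bright spot yields code 0.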

In the following, we will present the LPQ and BSIf descriptors.

  • Local phase quantization (LPQ): LPQ was originally proposed by Ojansivu and Heikkila [40]. It is a texture descriptor based on the 2-D short-term Fourier transform (STFT), computed over a rectangular \(M \times M\) neighborhood \(N_{x}\) centered at each pixel position x of the image f(x) and defined by formula (1).

    $$ F(u,x)=\underset{y\in N_{x}}{\sum}f(x-y)e^{-j2\pi u^{T} y}={w^{T}_{u}} f_{x} $$
    (1)

    where \(w_{u}\) is the basis vector of the 2-D DFT at frequency u (a 2D vector), and \(f_{x}\) is another vector containing all \(M^{2}\) image samples from \(N_{x}\).

    In LPQ only four complex coefficients are considered, corresponding to the 2-D frequencies \(u_{1} = [a,0]^{T}\), \(u_{2} = [0,a]^{T}\), \(u_{3} = [a,a]^{T}\) and \(u_{4} = [a,-a]^{T}\), where a is a sufficiently small scalar. For each pixel the vector obtained is represented by the following formula:

    $$ F_{x}=[F(u_{1},x),F(u_{2},x),F(u_{3},x),F(u_{4},x)] $$
    (2)

    The phase information in the Fourier coefficients is recorded by observing the signs of the real and imaginary parts of each component in \(F_{x}\). This is done using a scalar quantization defined by this formula:

    $$ q_{j}= \left\{ \begin{array}{l} 1 \quad if \quad g_{j}\geq 0 \\ 0 \quad otherwise. \end{array} \right. $$
    (3)

    where \(g_{j}\) is the j-th component of the vector \(G_{x} = [\mathrm{Re}\{F_{x}\}, \mathrm{Im}\{F_{x}\}]\). The resulting eight binary coefficients \(q_{j}\) represent the binary code pattern. This code is converted to a decimal number between 0 and 255, so the LPQ histogram has 256 bins.

  • Binarized statistical image feature (BSIf): BSIf is an image texture descriptor proposed by Kannala and Rahtu [24]. The idea behind BSIf is to automatically learn a fixed set of filters from a small set of natural images instead of using hand-crafted filters. The set of filters is learned from a training set of natural image patches by maximizing the statistical independence of the filter responses.

    Given an image patch I of size L × L pixels and a linear filter \(W_{k}\) of the same size, the filter response \(S_{k}\) is obtained by:

    $$ S_{k}=\underset{i,j}{\sum} W_{k}(i,j) I(i,j)= W_{k}^{\prime T} I^{\prime} $$
    (4)

    where \(W^{\prime }_{k}\) and \(I^{\prime }\) are vectors of size L × L (vectorized forms of the 2D arrays \(W_{k}\) and I). The binarized feature \(b_{k}\) is obtained by:

    $$ b_{k}= \left\{ \begin{array}{l} 1 \quad if \quad S_{k} \geq 0 \\ 0 \quad otherwise. \end{array} \right. $$
    (5)

    The filters \(W_{k}\) are learnt using independent component analysis (ICA) by maximizing the statistical independence of the responses \(S_{k}\). The number of histogram bins \(N_{bins}\) obtained by the BSIf descriptor is calculated using this formula:

    $$ N_{bins} = 2^{N_{f}} $$
    (6)

    where \(N_{f}\) is the number of filters used by BSIf.
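The two descriptors above share the same encode-then-histogram structure, which the following sketch makes explicit. It is a simplified illustration, not the authors' implementation: the function names are hypothetical, the LPQ frequency is set to a = 1/M (one common choice for a "sufficiently small" scalar), no windowing or decorrelation is applied, and the BSIf filters are passed in as an array. Real BSIf filters are learned with ICA from natural image patches, so any random filters used to exercise this code are only stand-ins.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lpq_codes(image, M=7):
    """8-bit LPQ code per pixel following Eqs. (1)-(3); M must be odd."""
    img = np.asarray(image, dtype=np.float64)
    a = 1.0 / M                                   # assumed frequency value
    r = (M - 1) // 2
    patches = sliding_window_view(img, (M, M))    # (H-M+1, W-M+1, M, M)
    oy, ox = np.mgrid[-r:r + 1, -r:r + 1]         # sample offsets from x
    coeffs = []
    for ux, uy in [(a, 0.0), (0.0, a), (a, a), (a, -a)]:   # u1..u4
        # A patch sample at offset o equals f(x - y) with y = -o.
        w_u = np.exp(-2j * np.pi * (ux * (-ox) + uy * (-oy)))
        coeffs.append(np.einsum('ijkl,kl->ij', patches, w_u))  # F(u, x)
    g = [F.real for F in coeffs] + [F.imag for F in coeffs]   # G_x
    codes = np.zeros(g[0].shape, dtype=np.int64)
    for j, g_j in enumerate(g):
        codes += (g_j >= 0).astype(np.int64) << j             # Eq. (3)
    return codes

def lpq_histogram(image, M=7):
    """256-bin histogram of the LPQ codes."""
    return np.bincount(lpq_codes(image, M).ravel(), minlength=256)

def bsif_histogram(image, filters):
    """BSIf histogram with 2**N_f bins; filters has shape (N_f, L, L)."""
    img = np.asarray(image, dtype=np.float64)
    n_f, L, _ = filters.shape
    patches = sliding_window_view(img, (L, L))
    S = np.einsum('ijkl,fkl->fij', patches, filters)          # Eq. (4)
    codes = np.zeros(S.shape[1:], dtype=np.int64)
    for k in range(n_f):
        codes += (S[k] >= 0).astype(np.int64) << k            # Eq. (5)
    return np.bincount(codes.ravel(), minlength=2 ** n_f)     # Eq. (6)
```

With eight filters, for instance, `bsif_histogram` returns 2^8 = 256 bins, the same length as the LPQ histogram.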

3.2.2 Pyramid Multi-Level (PML) representation

The Pyramid Multi-Level (PML) representation adopts an explicit pyramid representation of the original image and precedes the extraction of the image descriptor. This pyramid represents the image at different scales. For each level or scale, a corresponding Multi-block representation is used. All PML sub-blocks have the same size, which is determined by the image size and the chosen level. In our work, we use 7 levels based on the observations reported in [6]. Figure 3 illustrates the PML face representation adopting three levels.

Fig. 3 PML face representation adopting three levels
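One possible reading of the PML construction is sketched below. It assumes a square image whose side is divisible by the number of levels: the block side is b = side / levels, and level ℓ is the image resized to ℓ·b × ℓ·b and split into ℓ × ℓ blocks. This layout, the function names and the nearest-neighbour resize are assumptions made for illustration and should be checked against [6].

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize to size x size (stand-in for interpolation)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols[None, :]]

def pml_blocks(image, levels=7):
    """Return the list of equal-sized blocks over all pyramid levels."""
    img = np.asarray(image)
    block = img.shape[0] // levels          # e.g. 224 // 7 = 32
    blocks = []
    for level in range(1, levels + 1):
        scaled = resize_nearest(img, level * block)
        for by in range(level):             # level x level grid of blocks
            for bx in range(level):
                blocks.append(scaled[by * block:(by + 1) * block,
                                     bx * block:(bx + 1) * block])
    return blocks
```

Under this reading, 7 levels on a 224 × 224 face give 1 + 4 + … + 49 = 140 blocks of 32 × 32 pixels, and concatenating one 256-bin histogram per block would yield a 35,840-dimensional descriptor.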

3.2.3 Deep features

Deep features are descriptors that are typically obtained from a CNN. These features are usually the output of the last fully connected layer. In our work, we used the VGG16 architecture [55] as well as two variants of this architecture. We extract the deep features from the FC7 layer (a fully connected layer) of this architecture, which yields 4096 features.

VGG16 [55] is a convolutional neural network trained on more than one million images from the ImageNet database. The network has an image input size of 224 × 224, is 16 layers deep and can classify images into 1000 object categories. As a result of this training, the network has learned rich feature representations for a wide range of images. It has approximately 138 million parameters, which makes it expensive to evaluate and memory-intensive. VGGFace [42] is inspired by VGG16 and was trained to classify 2,622 different identities based on faces. DEX-IMDB-WIKI [44], also based on VGG16, was fine-tuned on the IMDB-WIKI face database, which has more than 500K images. Figure 4 presents the VGG-16 CNN architecture used as a deep descriptor.

Fig. 4 VGG-16 CNN architecture
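The figure of roughly 138 million parameters can be checked with a short count over the standard VGG16 layer configuration (thirteen 3 × 3 convolutions, five 2 × 2 max-poolings, three fully connected layers). The function name is hypothetical; the layer widths are those of the published VGG16 configuration.

```python
def vgg16_parameter_count(num_classes=1000):
    """Count weights and biases of VGG16 for a 224 x 224 x 3 input."""
    # Output channels of each 3x3 convolution; 'M' marks a 2x2 max-pooling.
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']
    total, in_ch, side = 0, 3, 224
    for item in cfg:
        if item == 'M':
            side //= 2                            # pooling halves the side
        else:
            total += 3 * 3 * in_ch * item + item  # kernel weights + biases
            in_ch = item
    fc_in = in_ch * side * side                   # 512 * 7 * 7 = 25088
    for out in (4096, 4096, num_classes):         # FC6, FC7, FC8
        total += fc_in * out + out
        fc_in = out
    return total

print(vgg16_parameter_count())  # 138357544
```

Almost 90% of these parameters sit in the three fully connected layers; FC7, from which the 4096-dimensional deep features are taken, is the second of them.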

ResNet-50 [21] is a convolutional neural network that was trained on the ImageNet database. It is based on a residual learning framework in which layers within a network are reformulated to learn a residual mapping rather than the desired unknown mapping between the inputs and outputs. It is similar in architecture to the VGG16 network but with the additional identity mapping capability (see Fig. 5). VGGFace2 [9] is a variant of ResNet-50 that was trained on the VGGFace2 database for face recognition. The ResNet-50 and VGGFace2 features have 2048 dimensions and are extracted from the global average pooling layer.

Fig. 5 ResNet residual block diagram with identity mapping

In our empirical study, we use the following deep features: VGG16, VGGFace, DEX-IMDB-WIKI, ResNet-50 and VGGFace2.

3.3 Age estimation

Once the facial image features are extracted, we predict the age from them. The proposed method estimates the person’s age using a linear support vector regressor (SVR) with a ridge (L2-norm) penalty, and optimizes the objective function using dual stochastic gradient descent (SGD), which reduces the computing time. No feature selection was performed. The regressor’s hyper-parameters, which control the trade-off between a large margin and a small loss, were tuned by minimizing the k-fold cross-validation loss.
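To make the objective concrete, the sketch below fits a linear regressor with an ε-insensitive loss and a ridge penalty. It deliberately swaps in plain full-batch subgradient descent on the primal objective, which is not the dual SGD solver described above, and it omits the k-fold hyper-parameter search. The function name and all default values (`lam`, `epsilon`, `n_iter`, `lr`) are illustrative assumptions.

```python
import numpy as np

def fit_linear_svr(X, y, lam=1e-4, epsilon=0.1, n_iter=2000, lr=0.1):
    """Minimize mean(max(0, |y - Xw - b| - epsilon)) + lam/2 * ||w||^2."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    n, d = X.shape
    w = np.zeros(d)
    b = float(np.median(y))                  # start the bias near the ages
    for t in range(n_iter):
        r = y - (X @ w + b)                  # residuals
        active = np.abs(r) > epsilon         # samples outside the tube
        s = np.sign(r) * active              # subgradient of the loss
        grad_w = -(X.T @ s) / n + lam * w    # loss term + ridge penalty
        grad_b = -s.mean()
        step = lr / np.sqrt(1.0 + t)         # diminishing step size
        w -= step * grad_w
        b -= step * grad_b
    return w, b

def predict_age(X, w, b):
    """Predicted ages for the rows of X."""
    return np.asarray(X, dtype=np.float64) @ w + b
```

In practice, `lam` and `epsilon` would be chosen by cross-validation on the training folds, in the spirit of the tuning described in the text.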

4 Experiments and results

To evaluate the performance of the proposed approach, we use the FG-NET, PAL and FACES databases. The performance is measured by the Mean Absolute Error (MAE) and the Cumulative Score (CS) curve.

The MAE and the CS are two different indicators used for evaluating the performance of facial age estimation. The MAE is a global indicator in the sense that it summarizes the prediction errors over all tested images. However, it cannot quantify the number of successful predictions when a tolerance on the age prediction error is allowed. The CS, on the other hand, quantifies the number of test images whose prediction error (in years) does not exceed a given threshold; such an image is considered correctly predicted. In the ideal case, where every prediction coincides with the ground-truth age, this number equals the total number of tested images for any value of the error threshold (tolerance) T. The MAE is the average of the absolute errors between the ground-truth ages and the predicted ones. It is given by:

$$ MAE= \frac{1}{N} \sum\limits_{i=1}^{N} \left|p_{i}-g_{i}\right| $$
(7)

where N, \(p_{i}\) and \(g_{i}\) are the total number of samples, the predicted age and the ground-truth age, respectively. The CS reflects the percentage of tested cases where the age estimation error does not exceed a threshold. The CS is given by:

$$ CS(T) = \frac{N_{e\leq T}}{N} \times 100\% $$
(8)

where T, N and \(N_{e\leq T}\) are the error threshold (years), the total number of samples and the number of samples on which the age estimation has an absolute error no higher than the threshold T, respectively. Thus, the CS gives the percentage of the tested samples that are correctly predicted within the tolerance T.
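Both indicators are straightforward to compute from the predicted and ground-truth ages. The sketch below follows Eqs. (7) and (8); the function names and the toy ages in the usage note are illustrative only.

```python
import numpy as np

def mean_absolute_error(predicted, ground_truth):
    """MAE of Eq. (7), in years."""
    p = np.asarray(predicted, dtype=np.float64)
    g = np.asarray(ground_truth, dtype=np.float64)
    return float(np.mean(np.abs(p - g)))

def cumulative_score(predicted, ground_truth, threshold):
    """CS(T) of Eq. (8): percentage of samples with absolute error <= T."""
    p = np.asarray(predicted, dtype=np.float64)
    g = np.asarray(ground_truth, dtype=np.float64)
    return float(100.0 * np.mean(np.abs(p - g) <= threshold))
```

For example, predictions (20, 33, 41, 58) against ground-truth ages (22, 30, 41, 50) have absolute errors (2, 3, 0, 8), so the MAE is 3.25 years, CS(2) is 50% and CS(5) is 75%. Evaluating `cumulative_score` over a range of thresholds gives the CS curve.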

4.1 FG-NET

The FG-NET [41] aging database was released in 2004 in an attempt to support research activities related to facial aging. Since then, a number of researchers have used the database for research in various disciplines related to facial aging. This database consists of 1002 images of 82 persons. On average, each subject has 12 images. The ages vary from 0 to 69. The images in this database have large variations in aspect ratio, pose, expression, and illumination. The Leave One Person Out (LOPO) protocol is used because each individual appears at several ages in this database: each time, one person’s images form the test set whereas the other persons’ images form the training set. Figure 6 shows the cumulative score curves for the eight different features. We can observe that the PML-LPQ descriptor performs better than the deep features in terms of CS, which can be viewed as an indicator of the accuracy of the age estimators.

Fig. 6 Cumulative scores obtained by the proposed approach on the FG-NET database
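The LOPO protocol used for FG-NET can be sketched as a generator over subject identifiers. The function name and the toy identifiers in the usage note are hypothetical; the only assumption is that each image carries the identifier of the person it shows.

```python
def leave_one_person_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    for subject in sorted(set(subject_ids)):
        # All images of the held-out person form the test set.
        test_idx = [i for i, s in enumerate(subject_ids) if s == subject]
        # Images of every other person form the training set.
        train_idx = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train_idx, test_idx
```

With identifiers such as `['001', '001', '002', '003', '003', '003']` this yields three folds; on FG-NET it would yield 82 folds, one per person, so no individual ever appears in both the training and the test set of a fold.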

Table 1 reports the MAE of the proposed approach as well as that of some competing approaches. From this table, we can observe that the best deep features (DEX-IMDB-WIKI) and the best handcrafted features (PML-LPQ) outperform some of the existing approaches. Moreover, the DEX-IMDB-WIKI features give the best results, followed closely by the VGGFace2 and PML-LPQ features. Based on the CS curves of Fig. 6 and Table 1, we can see that the MAE and the CS are two different indicators that do not always highlight the same best approach. Since the CS curve evaluates prediction errors under a gradually varying tolerance, a method with a worse MAE can have a CS curve (or part of it) that is better than that of a method with a better MAE. In other words, for some large (or small) tolerances, the method with the worse MAE can be the more accurate one. This case can be seen in the FG-NET results. Fig. 6 depicts the CS curves of the PML-BSIf (MAE = 4.48) and DEX-IMDB-WIKI (MAE = 3.74) features. Although DEX-IMDB-WIKI is globally better than PML-BSIf, the latter has a better CS curve for small tolerances between one and five years.
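A small hypothetical example, not taken from the experiments, shows how this can happen. Suppose two methods A and B are tested on the same five images, with absolute errors (0, 0, 0, 0, 20) years for A and (3, 3, 3, 3, 3) years for B. Applying Eqs. (7) and (8) gives:

$$ MAE_{A}=\frac{0+0+0+0+20}{5}=4, \qquad MAE_{B}=\frac{3+3+3+3+3}{5}=3 $$

$$ CS_{A}(2)=\frac{4}{5}\times 100\%=80\%, \qquad CS_{B}(2)=\frac{0}{5}\times 100\%=0\% $$

$$ CS_{A}(5)=\frac{4}{5}\times 100\%=80\%, \qquad CS_{B}(5)=\frac{5}{5}\times 100\%=100\% $$

Method B has the better MAE, yet method A has the better CS for a tolerance of two years; the ranking reverses again once the tolerance reaches three years.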

Table 1 Comparison with existing approaches on FG-NET database

Table 2 reports the CPU time (in seconds) of the feature extraction stage and the training phase for the 1002 images of the FG-NET database. The experiments were carried out on an Alienware Aurora R8 workstation (Intel Core i9-9900K processor, 16 MB cache, 3.60 GHz, 64 GB RAM, 2 × GeForce RTX 2080 GPUs, Windows 10). The handcrafted features significantly outperform the deep features in terms of the execution time of the feature extraction stage, even though the deep features are computed using the GPU instead of the CPU. On the other hand, training the regressor takes less CPU time with the deep features than with the handcrafted features, because the deep features have fewer dimensions than the handcrafted features. It is worth noting that if dimensionality reduction or feature selection is applied to the features before the regression phase, the training cost will be the same for all feature types as long as the size of the final features is the same.

Table 2 CPU time (in seconds) of extracting features and training stages on the FG-NET database

4.2 PAL

The Productive Aging Lab Face (PAL) database from the University of Texas at Dallas [38] contains 1,046 frontal face images from different subjects (430 males and 616 females) in the age range from 18 to 93 years. The PAL database can be divided into three main ethnic groups: African-American subjects (208 images), Caucasian subjects (732 images) and other subjects (106 images). The database contains faces with different expressions. For the evaluation of the approach, we conduct 5-fold cross-validation, where the distribution of the folds is selected based on age, gender and ethnicity. Figure 7 shows the cumulative score curves for the eight different features. We can observe a change in performance compared to the FG-NET curves. It can be seen that the DEX-IMDB-WIKI features outperform the best handcrafted features, which are the PML-LPQ features.

Fig. 7 Cumulative scores obtained by the proposed approach on the PAL database
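One simple way to build folds that are balanced with respect to age, gender and ethnicity is to group the samples into strata and deal them out to the folds in turn. The sketch below is an assumption about how such a split could be made, not the procedure actually used for PAL; the function name and the 10-year age bins are illustrative choices.

```python
from collections import defaultdict

def stratified_folds(samples, n_folds=5, age_bin=10):
    """Assign a fold index to each (age, gender, ethnicity) sample."""
    strata = defaultdict(list)
    for idx, (age, gender, ethnicity) in enumerate(samples):
        # Samples sharing age bin, gender and ethnicity form one stratum.
        strata[(age // age_bin, gender, ethnicity)].append(idx)
    folds = [None] * len(samples)
    next_fold = 0
    for key in sorted(strata):
        for idx in strata[key]:
            folds[idx] = next_fold                 # deal out round-robin
            next_fold = (next_fold + 1) % n_folds
    return folds
```

Because the round-robin counter carries over from one stratum to the next, the fold sizes differ by at most one image while each stratum is spread as evenly as possible over the folds.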

Figure 8 illustrates the accuracy of the age estimation for nine different age groups of the PAL dataset using the two best deep features and the best handcrafted feature. These results were obtained using the CS of Eq. (8) with T = 5 (a tolerance of five years) on nine subsets of test images that correspond to the nine age groups.

Fig. 8 Age estimation accuracy (CS) of different age groups when the absolute error is less than five years, i.e., the threshold T is set to five

We can observe that the used features (handcrafted and deep) are fairly robust to the variance of age groups in the PAL database.

Table 3 reports the MAE of the proposed approach as well as that of some existing approaches. These results show that the DEX-IMDB-WIKI features outperform most of the existing approaches on the PAL database.

Table 3 Comparison with existing approaches on the PAL database

4.3 FACES

This database [12] consists of 2052 images from 171 subjects. The ages vary from 19 to 80 years. For each subject, there are six expressions: neutral, disgust, sadness, anger, fear, and happiness. The database exhibits large variations in facial expression, which brings an additional challenge for age prediction. Figure 9 shows the cumulative score curves for the eight different features. The results consider all images in all expressions. We can observe that the DEX-IMDB-WIKI features outperform all the other features, followed by the PML-LPQ features.

Fig. 9 Cumulative scores obtained by the proposed approach on the FACES database

Figure 10 illustrates the cumulative score curves for the different facial expressions of the FACES database when using the DEX-IMDB-WIKI features. This figure shows that, among all tested expressions, the neutral and sadness expressions lead to the most accurate age estimation.

Fig. 10 Cumulative scores obtained by the DEX-IMDB-WIKI features for different expressions on the FACES database

Table 4 presents a comparison with some existing approaches. This table confirms that the neutral expression provides the most accurate age estimation compared to the other facial expressions.

Table 4 Comparison with existing approaches on FACES database

5 Conclusion

This paper presents a study of handcrafted and deep features for facial age estimation. With a small number of images, the results showed that the handcrafted features sometimes gave better results than the deep features. This confirms that deep-learning-based approaches are not necessarily the best ones, especially for low-cost, less-accurate real-time visual analysis applications, for which handcrafted-feature-based approaches become more appealing. However, for applications requiring accurate age prediction and robustness to facial pose, the deep-learning-based approaches are far better.

As can be seen, the main strength of the handcrafted-feature-based approaches is their relatively cheap computational cost in the training and testing phases. This allows them to be easily deployed on devices with limited hardware resources. Their main limitation is their possible dependency on accurate face detection and alignment. On the other hand, despite the accurate age prediction provided by the deep features, their main limitation is the expensive computational cost of the training and testing stages.

As future work, we envision adapting and fine-tuning some recent CNN architectures on face databases and creating new CNNs from scratch. We also envision studying the effect of the feature extraction level on facial age estimation performance by comparing low-level and high-level features extracted from pretrained networks.