1 Introduction

Age-invariant face recognition (AIFR) plays a vital role in several applications, such as biometric security systems [5], forensic applications [10], and identification of missing individuals [11]. Facial characteristics change considerably over time, with a different aging pattern per person that depends on personal genes, individual health, lifestyle, etc. Extracting robust features that describe aging facial details is a challenging research problem, especially when large age gaps between face images are considered [24]. In this work, two well-known standard databases are investigated for the task of AIFR, namely the MORPH [26] and FGNET [6] databases. The challenges in these data can be summarized as follows:

  • Each subject has a small, variable number of images at different ages. This makes the learning process very hard due to the limited number of training images per person and the fact that each person has his/her own aging pattern based on genes, health, etc.

  • Pose, illumination, blurring, and distance from the camera differ among images.

  • Part of the data covers ages 0–2 years, where strong personalized facial features are not yet established. Even for the human eye, it is very hard to associate a face at this age with the same person's older faces.

  • Existence of large age gaps, especially for FGNET (ages are between 0 and 69 years).

Recently, convolutional neural network (CNN) architectures have shown remarkable success in recognizing faces in spite of their age [1, 5, 15, 25, 39]. In this paper, CNN-extracted features are analyzed to obtain a set of features that precisely describe faces invariantly of age. To achieve this goal, the proposed recognition system uses the pre-trained VGG-Face deep-learning CNN architecture to extract face features. To improve the recognition rate (RR), the proposed system makes the following contributions.

  • Analyzing the potential for AIFR of features extracted from VGG-Face model layers that were unexplored in the literature (i.e., the flatten layer).

  • Selecting features relevant to AIFR using multimodal discriminant correlation analysis (MDCA), which significantly reduces the feature space dimensions and improves the recognition rate.

  • Reducing the dimensionality of the extracted features, which in turn reduces the overall system runtime.

The proposed MDCA fusion system achieves an RR of 81.5% on FGNET and 96.5% on MORPH. The rest of this paper is organized as follows. Section 2 summarizes the related work on AIFR. Section 3 describes the proposed system. Section 4 presents the results and discussion. Finally, Sect. 5 concludes the paper.

2 Related work

AIFR techniques are classified into generative, discriminative, and deep-learning approaches, as follows:

2.1 Generative approaches

Generative approaches attempt to simulate the aging process by constructing a synthesized face image (a pseudo-photograph) from an individual's images at different ages before performing face recognition. For example, Lanitis et al. [13] built a 3D simulated aging model based on shape and intensity features, achieving an RR of 68.5% on a privately collected database. Park et al. [24] added a pose-correction stage to a simulated 3D aging model of shape and texture, achieving RRs of 37.4% and 79.8% on the FGNET and MORPH databases, respectively. Although generative approaches can produce simulated aging models, these models are limited by strong, unrealistic parametric assumptions [7].

2.2 Discriminative approaches

Discriminative approaches extract discriminative facial features that mainly describe the aging process. For example, Ling et al. [19] used a gradient orientation pyramid (GOP) to describe the aging process, with a support vector machine (SVM) for classification. Li et al. [17] used the scale-invariant feature transform (SIFT) [18] and multi-scale local binary patterns (MLBP) [23] as discriminative features to achieve age-invariant recognition. Gong et al. [8] applied a maximum entropy feature descriptor (MEFD) that encodes the microstructure of facial images into a set of discrete codes in terms of maximum entropy. Li et al. [14] analyzed discriminative features using a modified hidden factor analysis (MHFA); they considered the correlation between age and identity, instead of assuming them to be mutually independent. Recently, Zhou et al. [41] used an identity inference model for AIFR based on probabilistic linear discriminant analysis and the expectation–maximization (EM) algorithm.

2.3 Convolutional neural networks (CNN)

To take into account the local correlations in natural images, a CNN includes both fully connected hidden layers and locally connected convolutional layers, which use parameter sharing, pooling, and dropout to greatly reduce the number of parameters learned by the network [5]. Recent studies have shown that the outputs of CNN layers produce highly discriminative descriptors for AIFR. In this regard, Yan et al. [39] used a CNN architecture to extract facial features and an SVM classifier to select the corresponding age range. Additionally, Li et al. [15] employed a deep CNN model that performed both feature extraction and classification. Xu et al. [37] used a coupled auto-encoder network to extract identity features from face images for AIFR. Li et al. [16] applied an optimization model that learns features and a distance metric simultaneously for AIFR. Shakeel et al. [27] used a pre-trained CNN model to extract face features; the extracted features were further encoded using a learned codebook, followed by linear regression mapping for face matching. Recently, the pre-trained VGG-Face CNN model [25] has been widely used in face recognition applications, e.g., in [1, 5, 25]. This model is also adopted in our work to extract discriminative features from face images.

Fig. 1 Proposed framework for age-invariant face recognition, composed of four stages: preprocessing to align faces frontally, VGG-Face deep-learning feature extraction, MDCA fusion, and KNN classification

Fig. 2 Block diagram of the steps of the proposed image preprocessing

Fig. 3 The top row shows three raw samples from the FGNET dataset and three raw samples from the MORPH dataset, respectively. The bottom row shows the same samples after the proposed preprocessing steps

3 Proposed system

This section details the proposed system for age-invariant face recognition. The system is composed of four modules, namely preprocessing, feature extraction, fusion, and classification, as shown in Fig. 1. The details of each stage are described below.

3.1 Image preprocessing

The input of the preprocessing step is the raw facial image (as acquired from its database), and the output is an aligned frontal face image. The objective of this step is to register all images on the basis of eye coordinates, such that they are aligned at the same standard size, i.e., each point on any given face is aligned to the same point in all images. To do this, the face is detected in each raw image using the Viola–Jones method [36]. The Hough transform [2] is then applied to detect the center of each pupil, as discussed in [12, 29]. Afterward, the face image is aligned frontally by rotating the face such that the detected pupil centers lie on the same horizontal line. Finally, the image is contrast-enhanced, auto-cropped, and resized to the standard \(224\times 224\), which is the input image size of the VGG-Face model. The details of the preprocessing step are shown in Algorithm 1. Figure 2 illustrates the results of each step of the algorithm. Figure 3 shows sample results of the preprocessing step for different samples from the FGNET and MORPH raw data.

Algorithm 1 Image preprocessing
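As a concrete illustration of Algorithm 1, the following Python/OpenCV sketch mirrors the described pipeline. It is not the authors' MATLAB implementation: the Haar cascade files, Hough parameters, and helper structure are illustrative assumptions.

```python
import cv2
import numpy as np

# Illustrative assumption: OpenCV's bundled Haar cascades stand in for the
# Viola-Jones detectors used in the paper.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def preprocess(raw_bgr):
    gray = cv2.cvtColor(raw_bgr, cv2.COLOR_BGR2GRAY)
    # Viola-Jones face detection (assumes at least one face is found).
    x, y, w, h = face_cascade.detectMultiScale(gray, 1.1, 5)[0]
    face = gray[y:y + h, x:x + w]

    centers = []
    for ex, ey, ew, eh in eye_cascade.detectMultiScale(face)[:2]:
        roi = face[ey:ey + eh, ex:ex + ew]
        # Hough circle transform to locate the pupil center in each eye region.
        circles = cv2.HoughCircles(roi, cv2.HOUGH_GRADIENT, dp=1, minDist=ew,
                                   param1=100, param2=15,
                                   minRadius=3, maxRadius=ew // 3)
        if circles is not None:
            cx, cy, _ = circles[0][0]
            centers.append((ex + cx, ey + cy))

    if len(centers) == 2:
        # Rotate so the two pupil centers lie on the same horizontal line.
        (x1, y1), (x2, y2) = sorted(centers)
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        M = cv2.getRotationMatrix2D((face.shape[1] / 2, face.shape[0] / 2),
                                    angle, 1.0)
        face = cv2.warpAffine(face, M, face.shape[::-1])

    face = cv2.equalizeHist(face)        # contrast enhancement
    return cv2.resize(face, (224, 224))  # standard VGG-Face input size
```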

3.2 Feature extraction

The second stage of the proposed system uses the VGG-Face CNN model [25] to extract face features. This model is an application of the very deep convolutional VGG-16 architecture, trained specifically on a very large-scale face database (about 2.6 million face images of 2622 different people). The VGG-Face model is composed of 16 weight layers: 13 convolutional layers and three fully connected layers (FC6, FC7, and the output layer FC8). The vector of activations of the fully connected VGG layers is repeatedly used in the literature as a powerful generic image descriptor applicable to other face databases [1, 28, 33]. In the proposed system, transfer learning is applied to adapt the VGG-Face model to the new datasets (FGNET or MORPH) for age-invariant face recognition: the output layer of the trained VGG-Face model is removed, and the activations of the remaining fully connected layers (FC6 and FC7) are used as the extracted features. Additionally, the proposed system investigates the power of extracting features from the flattened output of the last convolutional block (commonly called the "flatten" layer). Thus, these extracted features (of length 4096 for both FC6 and FC7, and of length \(7\times 7\times 512 = 25{,}088\) for the flatten layer) are used as the input features of the proposed system.
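To make the layer choices concrete, here is a minimal PyTorch sketch of extracting the three descriptors. It uses torchvision's generic VGG-16 (same topology) as a stand-in; loading the actual VGG-Face weights, and whether activations are taken before or after the ReLU, are assumptions left to the reader.

```python
import torch
from torchvision import models

# Stand-in for VGG-Face: torchvision's VGG-16 shares the topology
# (13 conv layers + FC6/FC7/FC8); face-trained weights are assumed to be
# loaded separately. Requires a recent torchvision (>= 0.13).
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def extract_descriptors(batch):
    """batch: float tensor of shape (N, 3, 224, 224), preprocessed faces.
    Returns the flatten (25,088-d), FC6 (4096-d), and FC7 (4096-d) features."""
    x = vgg.avgpool(vgg.features(batch))        # conv stack + pooling
    flatten = torch.flatten(x, 1)               # 7 x 7 x 512 = 25,088 features
    fc6 = vgg.classifier[0](flatten)            # first fully connected layer
    fc7 = vgg.classifier[3](torch.relu(fc6))    # second fully connected layer
    return flatten, fc6, fc7
```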

3.3 Feature fusion

Fusion can be performed either at the feature level or at the decision level (combining multi-classifier outputs). The proposed system uses feature-level fusion, since it is expected to provide better recognition results than decision-level fusion: it enriches the information extracted from the input image rather than combining only the matching scores or output decisions of classifiers [21, 38]. Traditional feature fusion techniques include parallel fusion, which creates a complex vector [40], and serial fusion, which concatenates two different feature vectors directly [20]. However, these methods may cancel out the discriminative power of the individual input vectors [41]. Recently, canonical correlation analysis (CCA) has been used repeatedly in the literature for feature fusion in different pattern recognition problems [30,31,32]; it is based on maximizing the correlation of corresponding features across the two input feature sets. Discriminant correlation analysis (DCA) [9] is an extension of CCA that takes the class labels into account. The idea behind DCA is to maximize the correlation of corresponding features across the two input feature sets (as in CCA-based methods) while decorrelating features that belong to different classes within each feature set. In addition to feature-level fusion, DCA performs significant dimension reduction, which implies low computational complexity; therefore, the fused features can be employed in real-time applications.
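A minimal NumPy sketch of the DCA idea described above is given below. It follows the construction in [9] (unit between-class scatter per set, then SVD of the cross-covariance), but the variable names, rank handling, and final summation fusion are simplifying assumptions rather than the published implementation.

```python
import numpy as np

def dca_fuse(X, Y, labels):
    """Fuse two feature sets X (n x d1) and Y (n x d2) with DCA.
    Returns an (n, r) fused representation, with r bounded by the class count."""
    def between_class_whiten(F):
        # Phi holds sqrt(n_c) * (class mean - overall mean), one row per class.
        mean = F.mean(axis=0)
        Phi = np.stack([np.sqrt(np.sum(labels == c))
                        * (F[labels == c].mean(axis=0) - mean)
                        for c in np.unique(labels)])
        # Eigen-decompose the small (c x c) matrix Phi Phi^T instead of the
        # huge (d x d) between-class scatter Phi^T Phi.
        evals, evecs = np.linalg.eigh(Phi @ Phi.T)
        keep = evals > 1e-8
        # Columns of W satisfy W^T (Phi^T Phi) W = I (unit between-class scatter).
        W = (Phi.T @ evecs[:, keep]) / evals[keep]
        return (F - mean) @ W

    Xp, Yp = between_class_whiten(X), between_class_whiten(Y)
    r = min(Xp.shape[1], Yp.shape[1])
    Xp, Yp = Xp[:, -r:], Yp[:, -r:]
    # Maximize pairwise correlation across the two sets via SVD of the
    # between-set cross-covariance, as in CCA.
    U, s, Vt = np.linalg.svd(Xp.T @ Yp)
    s = np.maximum(s, 1e-12)  # guard against numerically zero correlations
    Xs = Xp @ (U / np.sqrt(s))
    Ys = Yp @ (Vt.T / np.sqrt(s))
    return Xs + Ys            # summation fusion; concatenation also works
```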

Fig. 4 MDCA fusion using a cascaded two-level DCA: level one fuses the different investigated layers of the VGG model (the extracted feature length from each layer is indicated), and level two produces the final fused output vector with its significantly reduced length (81 in the case of FGNET). The numbers in the input and output blocks indicate the number of features per block

In the proposed system, a cascaded version of DCA, called multimodal discriminant correlation analysis (MDCA) [9], is used for the fusion process. While DCA works on two sets of input variables at a time, MDCA can fuse any number of input sets. Figure 4 illustrates how MDCA is employed in the proposed system to combine the features extracted from the three VGG layers, which have different lengths. Using a cascaded two-level DCA, level one fuses the three VGG layers through two DCA blocks, and level two produces the final fused output vector of reduced length, as sketched below. Note that FC6 is used as the common input of the two DCA blocks in level one, since the results in this paper show that it is the most discriminative of the three investigated layers.
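Given the hypothetical `dca_fuse` sketch above, the two-level cascade of Fig. 4 can be expressed in a few lines (`fc6`, `fc7`, and `flatten` stand for the descriptors of Sect. 3.2; the pairing follows the figure):

```python
# Level 1: FC6 is the common input of both DCA blocks.
z1 = dca_fuse(fc6, flatten, labels)   # fuse FC6 with the flatten features
z2 = dca_fuse(fc6, fc7, labels)       # fuse FC6 with the FC7 features

# Level 2: fuse the two level-1 outputs into the final compact vector
# (81 features in the FGNET case, per the paper).
z_final = dca_fuse(z1, z2, labels)
```

Note that the 81-dimensional output on FGNET is consistent with the rank bound of DCA's between-class scatter: at most (number of classes \(-\) 1) = 82 \(-\) 1 dimensions for the 82 FGNET subjects.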

3.4 Classification methods

To recognize faces across ages, a K-nearest neighbor (KNN) [4] classifier based on the Euclidean distance is applied to the extracted deep features. In our experiments, \(K=1\) is used to select the image whose feature vector is nearest to that of the input test image. The KNN classifier is selected because it offers a very simple design and significantly fewer computations. In addition to the KNN classifier, a support vector machine (SVM) [34] classifier is investigated to check its ability to improve system accuracy; binary linear kernels are used to support multiclass SVM classification, as illustrated below.
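A scikit-learn equivalent of the two classifiers (1-NN with Euclidean distance, and a linear multiclass SVM) could look as follows; the variable names are placeholders for the fused MDCA features and identity labels.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# 1-NN matcher over the fused features (Euclidean distance, K = 1).
knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
knn.fit(train_features, train_identities)
knn_pred = knn.predict(test_features)

# Linear SVM; scikit-learn reduces the multiclass problem to binary
# linear classifiers internally (one-vs-rest).
svm = LinearSVC()
svm.fit(train_features, train_identities)
svm_pred = svm.predict(test_features)
```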

Table 1 FGNET and MORPH (album-II) Rank-1 recognition rates (RRs) for different VGG-Face CNN layers (Flatten, FC6, and FC7), their DCA fusion, their serial fusion, and the proposed MDCA fusion using LOPO scheme and KNN classifier

4 Experimental results

4.1 Datasets

To test the proposed system, two well-known face-aging databases are used for the task of age-invariant face recognition: the MORPH [26] and FGNET [6] databases. Table 2 describes these databases. As shown in Table 2, FGNET [6] is a relatively small database (1002 images of 82 subjects, with ages varying between 0 and 69 and a large age gap per subject (0–45)), making it more challenging. MORPH album-II [26], which is available for academic research purposes, contains 55,134 images of 13,000 subjects collected over 4 years, with ages between 16 and 77 years.

Table 2 Description of FGNET and MORPH databases

4.2 Experimental setup

Rank-1 recognition rate (RR) is used to evaluate the performance of the proposed system. A leave-one-image-of-a-person-out (LOPO) scheme is used for testing. LOPO is widely used to test recognition accuracy on the FGNET and MORPH datasets, as in [22], because it identifies a person's face from his/her previous face images. This setting occurs repeatedly in security applications (e.g., using forensic history to identify wanted criminals or missing individuals based on their previous images at different ages).

The reason for using LOPO is that it provides a robust evaluation, especially for small datasets [3], which is the case for FGNET: it contains 1002 images of 82 different subjects, all of which are used in our experiments. In the MORPH (album-II) dataset, each person has a limited number of face images (a maximum of five per person); a random subset of 1100 MORPH images corresponding to 372 subjects is used in our experiments. (This number of samples is chosen to nearly equal the number of FGNET samples, in consideration of the computational cost.) A minimal sketch of the LOPO protocol is given below.
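The sketch below spells out LOPO with a 1-NN matcher (NumPy, with hypothetical `features`/`ids` arrays holding one row and one label per image):

```python
import numpy as np

def lopo_rank1(features, ids):
    """Leave one image of a person out: each image is matched, in turn,
    against all remaining images; Rank-1 RR is the fraction of correct IDs."""
    correct = 0
    for i in range(len(ids)):
        gallery = np.delete(features, i, axis=0)
        gallery_ids = np.delete(ids, i)
        # Nearest neighbor in Euclidean distance over the fused features.
        nearest = np.argmin(np.linalg.norm(gallery - features[i], axis=1))
        correct += int(gallery_ids[nearest] == ids[i])
    return correct / len(ids)
```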

4.3 Experiments on FGNET dataset

The proposed system in Fig. 1 is applied to the FGNET data. In the first stage, preprocessing of the raw images produces frontally aligned images at the standard VGG input size. In the second stage, features are extracted from different layers of the VGG-Face model (Flatten, FC6, and FC7). To evaluate the efficiency of the VGG-Face model, the Rank-1 RR of each individual layer's features is evaluated using the LOPO scheme and the KNN classifier, with \(K=1\), on the whole FGNET dataset. As shown in Table 1, the recognition rates are 68.2%, 74.9%, and 74.75% for the Flatten, FC6, and FC7 features, respectively. These results confirm the capability of these features to achieve good initial results; however, further improvement is required. To enhance the RR, the third stage of the proposed system fuses the Flatten, FC6, and FC7 features. As shown in Table 1, a naive serial fusion, which simply concatenates the features, is not able to improve the RR, because it cancels out their discriminative power.

Fig. 5 Visual recognition results for (a) four samples from the FGNET dataset and (b) four samples from the MORPH (album-II) dataset, using the LOPO scheme and the KNN classifier. For each dataset, the first column is the test image. Columns 2–4 show the visual results of using the individual Flatten, FC6, and FC7 layers, respectively. The last column shows the results of the proposed MDCA fusion system. A large red rectangular frame indicates failed recognition, whereas a small green rectangular frame indicates correct recognition

Instead, the proposed system fuses the Flatten, FC6, and FC7 features using MDCA feature fusion. MDCA significantly reduces the feature space to output the most powerful features for age-invariant face recognition, i.e., only 81 discriminant features in the FGNET case, which in turn reduces the computational cost of the proposed system. The reduction amounts to around 99.76% of the input space (whose size is 25,088 (Flatten) + 4096 (FC6) + 4096 (FC7)); see Table 4. In addition, the visual results in Fig. 5 show that the proposed MDCA fusion achieves the correct match in many cases, even where none of the individual layers does (as in the last row of Fig. 5). The quantitative results confirm the advantages of the proposed MDCA fusion, which achieves an improved recognition rate of 80.4% using KNN.
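The quoted space reduction follows directly from these dimensions:

\[
1 - \frac{81}{25{,}088 + 4096 + 4096} = 1 - \frac{81}{33{,}280} \approx 0.9976,
\]

i.e., a reduction of about 99.76%.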

To investigate the potential of more sophisticated classifiers, an SVM is applied to the fused data; the results in Table 3 show that the SVM improves the RR on the FGNET dataset from 80.4% (KNN) to 81.5%.

Table 3 Rank-1 recognition results of proposed system using KNN and SVM classifiers
Table 4 Time performance evaluation of the KNN classification step for an FGNET test sample

4.4 Experiments on MORPH dataset

The same experiments as on FGNET are repeated on a random subset of 1100 MORPH images corresponding to 372 subjects, with the same experimental settings. Figure 5 and Table 1 summarize the visual and quantitative results. As with FGNET, the results show the impact of the proposed MDCA fusion in achieving a better RR. In addition, since the MORPH data are collected over only 5 years (a maximum age gap of 5 years between images) with ages ranging from 16 to 77 years, the recognition rate reaches 96.5% using the proposed system. In comparison, in FGNET the age gap between images can reach 45 years, with ages ranging from 0 to 69 years, making it more challenging.

4.5 Time performance

All experiments in this paper are implemented in MATLAB 2017a (64-bit) on a 2.5 GHz Intel Core i5-2450M CPU with 8 GB of RAM. The MatConvNet library [35] is used for GPU parallelization. To evaluate time performance, the processing time is measured over 30 different trials, and the mean and standard deviation are reported as \(\mathrm{mean}\pm \mathrm{SD}\), following the protocol sketched below.
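The timing protocol itself is straightforward; an equivalent Python sketch (with a hypothetical `run_stage` callable standing in for any pipeline stage under test) is:

```python
import time
import numpy as np

def time_stage(run_stage, sample, trials=30):
    """Run a pipeline stage `trials` times and report mean +/- SD seconds."""
    elapsed = []
    for _ in range(trials):
        t0 = time.perf_counter()
        run_stage(sample)                 # hypothetical stage under test
        elapsed.append(time.perf_counter() - t0)
    return np.mean(elapsed), np.std(elapsed)
```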

Table 5 Comparative Rank-1 recognition results of state-of-the-art methods for age-invariant face recognition
Fig. 6 Images retrieved by the proposed framework for six samples from the FGNET dataset. The first row shows the test images. The second row shows the images retrieved by our system. The last row shows the ground truth (GT), determined as the recognized person's image of the nearest age. A green rectangular frame indicates correct recognition, and a red rectangular frame indicates failed recognition

The proposed MDCA fusion successfully achieves the following:

  • reducing the feature space size to only 81 features for FGNET, as illustrated in Table 4

  • enhancing the estimated speed of the proposed system by up to 7.7\(\times \) over using all the features of the individual FC6 or FC7 layers, and by 52\(\times \) over using all serially concatenated fusion features

The elapsed times for the proposed system stages (preprocessing, feature extraction, feature fusion, and classification) are 0.784 s, 0.792 s, 0.005 s, and 0.004 s, respectively. Therefore, the overall time of the complete proposed framework is about 1.58 s.

4.6 Comparing with state-of-the-art algorithms

To verify the quality of the proposed system for age-invariant face recognition, a comparison with other age-invariant face recognition algorithms is summarized in Table 5. Note that some of these methods cannot be compared directly, since they use a different number of samples per database or different experimental settings. As shown in Table 5, the proposed MDCA fusion step improves the recognition accuracy over the compared systems.

4.7 Discussion and limitations

Table 1 indicates that the FC6 features give the highest Rank-1 RR among all investigated layers' features (Flatten, FC6, and FC7). Moreover, combining the FC6 features with each of the Flatten and FC7 features using MDCA fusion shows a significant improvement in the Rank-1 RR, even with a simple KNN classifier (\(K=1\)), in addition to the significant dimension reduction of the input feature space. Note that the FC6 features are used as the common input of each input pair of the MDCA fusion, since they achieve the highest RR.

The difference in the Rank-1 RR between the FGNET dataset (large age gap) and MORPH arises because most of the FGNET images belong to the childhood and teenage stages, where facial features are unstable, as shown by the failed retrievals in Fig. 6. On the contrary, most of the MORPH images (where the maximum age gap between images of the same person is 4 years) lie in the older age groups, where facial features tend to be more stable; this explains the higher RR achieved on the MORPH dataset. In the future, more features will be investigated to improve the accuracy of the system, especially when large age gaps are considered.

5 Conclusion

In this paper, a system is proposed for age-invariant face recognition. The proposed system applies an image preprocessing step, followed by feature extraction using a pre-trained VGG-Face CNN to obtain highly discriminative descriptors (rather than hand-crafted ones). Then, dimension reduction using an efficient MDCA fusion significantly reduces the input feature space and improves the recognition rate of the system. The proposed system achieves 81.5% accuracy on the challenging FGNET dataset (with large age gaps (0–45) between face images) and 96.5% on the MORPH (album-II) dataset. In addition, it outperforms the state-of-the-art methods on the same data. These results show the promise of the proposed system for age-invariant face recognition.