1 Introduction

Facial age estimation is defined as the automatic estimation of either the biological age or the age group of a person (child, adult, elderly, etc.), while facial gender estimation is defined as the classification of a person’s gender as male or female based on a facial model. Age and gender estimation has vital importance in applications such as consumer profile estimation, social media advertising, demographic profiling, and customized advertising systems [1]. In addition, age and gender estimation can be useful in many areas such as identifying people to prevent fraud or theft, controlling access to a system, and human–machine interaction [2]. It also plays an important role in smart applications such as healthcare and marketing intelligence [3].

Automatic age and gender estimation, which has attracted growing interest in recent years, is a challenging research topic [4]. Age estimation from a human face image becomes particularly difficult when race and gender differences are taken into account. The performance of a learning model used for age and gender estimation depends largely on the data in the dataset. Human face images encode many attributes such as age, gender, race, emotional expression, and health status. Studies on age estimation from facial images date back to 1994, and many different studies have been carried out since then. Despite this extensive work, the results obtained still do not reach the accuracy and reliability needed to meet real-life demands [5]. Age and gender estimation therefore remains an active research topic, both because the problems are difficult and because existing solutions do not yet fully meet these demands.

Image processing and computer vision have become some of the most popular fields with real-world applications [6] and have been used in many different areas with supervised and unsupervised deep learning approaches in recent years [7,8,9,10,11,12,13,14]. With the advent of computer vision and machine learning, automated age and gender estimation systems have become more popular [15]. Deep learning has since changed the study of facial analysis systems considerably. As deep learning has entered many fields, techniques based on deep convolutional neural networks (DCNNs) have also become a research focus in facial age and gender estimation [16]. Estimating age and gender from human facial images is becoming an exciting area of computer vision. Compared to traditional handcrafted methodologies, convolutional neural networks (CNNs) and DCNNs, which have been widely used for classification tasks in recent years, perform much better in age and gender estimation [17,18,19,20]. Significant advances in DCNN architectures, such as parallel and hierarchical feature extraction, multitasking and transfer learning, data analysis and predictability, and computational efficiency, have been effective in increasing performance.

Many applications of AI methods for estimating age and gender from human face images can be found in the literature. Lee et al. proposed a deep residual learning network model consisting of three deep neural networks for age and gender estimation. Model training was done with the IMDB-WIKI database, using images (more than 14,000) collected from the internet. In the proposed model, the faces in the dataset images were first detected, and the age and gender of the detected faces were then estimated. With the proposed model, an accuracy of 52.2% for age and 88.5% for gender was achieved on the augmented IMDB-WIKI dataset [21]. Terhörst et al. proposed a new neural network reliability measure that determines the reliability of model predictions for age and gender estimation from face images, using the Adience dataset. In their experiments on this dataset of more than 26.5k images, with age groups consisting of eight classes (0–2, 4–6, 8–13, 15–20, 25–32, 38–43, 48–53, and 60+) and gender groups consisting of two classes (female and male), they found that the proposed method was successful in measuring estimation reliability. On the Adience images, they achieved an accuracy of 64.3% in estimating age groups and 89.8% in estimating gender [22].

Ozbulak et al. proposed DCNN-based generic AlexNet-like and domain-specific VGG-Face CNN models for age and gender classification in the wild. The models were fine-tuned with the Adience dataset, which was prepared for age and gender classification in uncontrolled environments. As a result of the analysis, 57.9% accuracy in age classification and 92.0% accuracy in gender classification were obtained [23]. In another study using the Adience dataset, a new model was proposed by fine-tuning pre-trained CNNs. With the proposed model, 62.26% age classification accuracy was achieved [24]. Duan et al. proposed a hybrid structure for age and gender classification that integrates two classifiers, a CNN and an extreme learning machine (ELM). In the analyses they performed using the MORPH-II and Adience benchmark datasets, they achieved a success rate of 52.3% in age classification and 88.2% in gender classification [25]. Another study using the MORPH dataset proposed an age-invariant face recognition method. With the proposed method, 92.8% success was obtained [26].

Rwigema et al. proposed a new hybrid algorithm consisting of conventional artificial neural networks (C-ANNs) and a CNN for age and gender classification using a decision fusion technique. They combined the decisions obtained by the two neural networks with probabilistic decision fusion techniques such as majority voting decision fusion, Naive–Bayes combination decision fusion, and sum rule decision fusion. They divided 2000 different images of individuals obtained from the internet into age groups (1–24, 25–49, and 50 years and over) and gender groups. Using the created dataset, an 86.1% accuracy rate in age classification and a 98.4% accuracy rate in gender classification were achieved [27]. Sharma et al. proposed an improved CNN approach for face-based age and gender estimation. The UTKFace, IMDB-WIKI, FG-NET, and CACD datasets were used for model evaluations. The proposed model trained on a large-scale dataset, UTKFace (aligned and cropped faces), showed 94.01% accuracy for age estimation and 99.86% accuracy for gender estimation [2]. Additionally, many other applications have been implemented for age and gender estimation from human face images using different datasets [28,29,30,31,32,33,46].

In addition, many applications of AI methods can be found in the literature for age estimation from human face images in which training was done with the FG-NET aging database used in this study. Nithyashri and Kulanthaivel, in their study using the FG-NET aging database, developed a system based on wavelet transformation (WT) to extract facial features and an artificial neural network (ANN) to classify age groups. In their analysis, which divided the subjects into four age groups, child (0–12 years), adolescent (13–18 years), adult (19–59 years), and senior adult (60 years and above), they achieved a 94.28% correct classification rate by using the distance between the eye center and the mouth (FPD3) as the feature point distance together with the Coif wavelet [34]. Choobeh divided the FG-NET aging database into two groups (child and adult), created a one-dimensional feature space using the active appearance model (AAM) followed by linear discriminant analysis (LDA), and applied a minimum distance classifier on this space. With this image-based method for separating children from adults, children and adults were recognized with accuracy rates of up to 89% and 90%, respectively [35]. In another study that divided the FG-NET aging database into two groups (child and adult), statistical modeling of the face was used to separate children from adults according to their facial images. Useful features were extracted by applying LDA to the face parameters, and the Euclidean distance function was used. With the proposed algorithm, an 85% accuracy rate was achieved [36]. In another study, Razalli et al. presented a two-stage image-based method to distinguish children from adults. Based on the face shape elliptical ratio and the angle distribution of facial features, the analyses performed using SVM and multi-DVM classification achieved a success rate of 92% [37].

Many AI methods in the literature have also been trained on other datasets and then evaluated with the FG-NET aging database as a test dataset. In one such study, age progression/regression was modeled with a conditional adversarial autoencoder (CAAE). The CAAE was found to show superior performance with age groups divided as 0–5, 6–10, 11–15, 16–20, 21–30, 31–40, 41–50, 51–60, 61–70, and 71–80 [38]. In another study, Tyagi and Sood used the FG-NET dataset as a test dataset with machine learning techniques and examined age groups. An algorithm supporting adaptive features based on local binary patterns and vector machine classification was proposed. With the proposed model, a classification accuracy of 56% was achieved [39]. Kumar et al. proposed a method (ADMM + Gabor + SVM) based on a Seg-Net-based architecture and a support vector machine for age and gender classification from various face images. In their analyses, which divided the age groups into eight classes (0–4, 5–9, 10–14, 15–19, 20–24, 25–44, 45–55, and 56–70), an accuracy of 92.48% was achieved in age classification with the FG-NET dataset [40].

Age and gender estimation contributes to many real-world applications such as controlling access to a system through human–machine interaction, identification and activity recognition in case of a crime, consumer profile detection in the marketing process, and the development of customized advertising systems. The problem of age and gender estimation from human face images has still not been fully solved. Since DCNN models perform better than traditional CNN models, using the DCNN approach for age and gender estimation served as the motivation for this study. In this study, several approaches are presented for age and gender estimation from human face images in an imbalanced dataset. First, a new model called FINet was developed. Then, seven different Keras models established in the literature were considered, and these models were trained starting from weights pre-trained on ImageNet. Among the Xception, InceptionV3, InceptionResNetV2, MobileNet, MobileNetV2, NASNetMobile, and NASNetLarge models considered in this study, the number of parameters was reduced by applying a layer reduction approach to the InceptionV3 and NASNetLarge model structures, and a new model called INFINet was developed by concatenating these two models with the FINet model. To summarize, the FINet model was developed with simplicity in mind so that its structure can be modified parametrically. The INFINet model was developed to obtain diversity in the feature space by producing different features from state-of-the-art DCNNs in a cost-efficient way, without the need for very large data, and to prevent overfitting through controls during the learning process. The FINet and INFINet models developed for age and gender estimation on an imbalanced dataset were compared with state-of-the-art DCNN models in the field of AI.

When the studies in the literature were examined, it was seen that many different methods were used to estimate age and gender. In this study, two different models are proposed in addition to the studies in the literature. The novelty, contribution, and importance of this study can be summarized as follows:

  • Development of the fast identify network (FINet) model for age and gender estimation.

  • Performing the triple concatenation of two fine-tuned AI models with the FINet model.

  • Proposing the inception Nasnet fast identify network (INFINet) model.

  • Evaluation of age and gender estimation performance of DCNN models on human face images.

  • Training and testing the two proposed models and seven different models in the literature using two different datasets in the same environment and obtaining more successful results with the proposed model.

The rest of the study is organized as follows: materials and methods are presented in Section 2, results and discussion in Section 3, and conclusions in Section 4.

2 Materials and methods

2.1 Dataset

In this study, the face and gesture recognition research network (FG-NET) aging database was used for age and gender estimation from human face images [41]. The data were obtained from different sources, as the original FG-NET website no longer provides this publicly available dataset [42, 43]. FG-NET consists of a total of 1002 images of varying pixel sizes, which are used for age estimation and face recognition across ages, and covers 82 people aged 0–69. The age distribution in the dataset is not balanced. Digital images were available mainly for individuals at their more recent ages; for most individuals, images were collected by scanning photographs from personal collections [44]. In terms of gender distribution, the dataset contains 431 images of 34 women and 571 images of 48 men [45]. In Fig. 1, sample images of a person at different ages in the FG-NET aging database are given.

Fig. 1 Sample images of different ages of a person in the FG-NET aging database

Each image in the dataset is named to provide information about the image. For example, the meaning of image “048A00.JPG” is the image of person number 48 at age 0, and the meaning of image “048A54.JPG” is the image of person number 48 (the same person) at age 54.
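The naming convention above can be decoded programmatically. The following is a minimal sketch, assuming the pattern implied by the two examples (a three-digit person number, the letter "A", and a two-digit age); the function name `parse_fgnet_name` is hypothetical, and the optional single-letter suffix tolerated by the pattern is an assumption rather than something stated in the text.

```python
import re

# Pattern assumed from the two examples in the text: three-digit person
# number, the letter "A", then a two-digit age. An optional single-letter
# suffix before the extension is tolerated (assumption) but not interpreted.
_NAME = re.compile(r"^(\d{3})A(\d{2})[a-z]?\.jpg$", re.IGNORECASE)


def parse_fgnet_name(filename):
    """Return (person_id, age) parsed from an FG-NET style file name."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))
```

For the two examples given, this yields person 48 at ages 0 and 54.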

2.2 Image pre-processing

To make the training process more successful, image processing steps such as face detection and resizing were applied to the images in the dataset. First, face detection was performed on all images using the Haar cascade classifier. The detected face regions were then resized to 224 × 224 × 3 pixels. In Fig. 2, sample images obtained as a result of pre-processing are given.

Fig. 2 Sample images obtained as a result of pre-processing in the FG-NET aging database

The problem can be approached as a classification task by dividing the age ranges into classes [46]. Therefore, the images were collected in different folders according to certain age ranges, and age groups were created and labeled (for example, people aged 0, 1, and 2 were collected in the folder belonging to the 0–2 age group, and people aged 4, 5, and 6 were collected in the folder belonging to the 4–6 age group). To allow a comparison with the literature, age groups are divided into eight classes: 0–2, 4–6, 8–12, 15–20, 25–32, 38–43, 48–53, and 60+. Additionally, gender groups are divided into two classes: female and male. Accordingly, the facial region images obtained were grouped separately in terms of age and gender. In Table 1, the number of images in each class according to age and gender groups is given.

Table 1 Number of images in each class according to age and gender groups in the FG-NET dataset
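The grouping rule can be illustrated with a short sketch. The eight class boundaries are taken from the text; the names `AGE_GROUPS` and `age_to_group` are hypothetical, and returning `None` for ages that fall between classes (such as 3 or 7) is an assumption, since the text does not say how such images were handled.

```python
# Eight age classes as listed in the text; None marks an open upper bound.
AGE_GROUPS = [
    ("0-2", 0, 2), ("4-6", 4, 6), ("8-12", 8, 12), ("15-20", 15, 20),
    ("25-32", 25, 32), ("38-43", 38, 43), ("48-53", 48, 53), ("60+", 60, None),
]


def age_to_group(age):
    """Map an age in years to its class label, or None if it falls in a gap."""
    for label, low, high in AGE_GROUPS:
        if age >= low and (high is None or age <= high):
            return label
    return None  # assumption: ages between classes are left unassigned
```

Because the classes are not contiguous, a label of `None` signals an age that belongs to none of the eight groups.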

2.3 Work environment and training parameters

The Google Colaboratory Pro [47] environment was used to train and evaluate the models for age and gender estimation on the imbalanced dataset. All operations were coded in the Python programming language and run in the Colab environment on an NVIDIA Tesla K80 graphics processor.

In this study, state-of-the-art models were chosen to obtain successful classification results in age and gender estimation from the imbalanced dataset of human face images. These models are Xception [48], InceptionV3 [49], InceptionResNetV2 [50], MobileNet [51], MobileNetV2 [52], NASNetMobile, and NASNetLarge [53]. To train these models, the dataset was divided into 80% training and 20% test sets, separately for age and for gender. The models were trained with the training set, and model performance was evaluated with the test set. The resulting numbers of training and test images for each age and gender group are given in Table 2.

Table 2 Number of images of the train-test dataset obtained for each age and gender group in the FG-NET dataset
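An 80/20 split of this kind can be sketched as follows. This is an illustration only: the text does not state whether the split was made per class or which random seed was used, so the per-class splitting, the seed, and the function name `split_per_class` are all assumptions.

```python
import random
from collections import defaultdict


def split_per_class(samples, test_fraction=0.2, seed=0):
    """Split (item, label) pairs into train and test lists, class by class.

    Splitting within each class (an assumption) keeps the class proportions
    of an imbalanced dataset similar in both subsets.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test += [(item, label) for item in items[:n_test]]
        train += [(item, label) for item in items[n_test:]]
    return train, test
```

Calling the function once with age-group labels and once with gender labels would mirror the two separate splits described above.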

Many hyperparameters in neural network models affect performance. Many trials were carried out to determine the hyperparameters used in this study: one parameter was varied while the others were kept constant, the model performance was observed, and the best value was selected. The training and testing of each model were then carried out using the hyperparameters given in Table 3. The number of epochs was set to 30, the mini-batch size to 8, the optimization algorithm to SGD, and the learning rate to 0.001. Since the age groups form eight classes, softmax was used as the activation function and categorical cross-entropy as the loss function in age estimation. Since the gender groups form two classes, sigmoid was used as the activation function and binary cross-entropy as the loss function in gender estimation. All models in this study were trained in the same environment with the same training parameters so that their results could be compared.

Table 3 Hyperparameters used for train and test of each model
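The two output configurations can be made concrete with a small numerical sketch. It uses only the standard definitions of softmax, sigmoid, and the two cross-entropy losses for a single sample; the function names are hypothetical, and this is not the training code of the study.

```python
import math


def softmax(logits):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(probs, true_index):
    """Loss for one sample of a multi-class problem (age: eight classes)."""
    return -math.log(probs[true_index])


def sigmoid(logit):
    """Turn a single score into the probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_crossentropy(prob, target):
    """Loss for one sample of a two-class problem (gender); target is 0 or 1."""
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))
```

As a point of reference, a model that outputs equal probabilities for all eight age classes has a categorical cross-entropy of ln 8 ≈ 2.08, and one that outputs 0.5 for gender has a binary cross-entropy of ln 2 ≈ 0.69, so loss values below these levels indicate predictions better than uniform guessing.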

2.4 Model structures

In this study, a new model named FINet was developed, and a concatenated model named INFINet was created to perform age and gender estimation on the imbalanced dataset.

2.4.1 FINet model

In this study, a new DCNN model named fast identify network (FINet), whose schematic diagram is given in Fig. 3, is developed for age and gender estimation on an imbalanced dataset.

Fig. 3 Schematic diagram of the FINet model

In the FINet model given in Fig. 3, the initial block applies a convolution operation with filter, kernel, stride, and padding values of 32, 3, 2, and “valid”, respectively, to the input images, followed by the ReLU activation function. In the next block, a convolution operation with filter, kernel, stride, and padding values of 64, 3, 1, and “same”, respectively, is applied; after the convolution, batch normalization and the ReLU activation function are applied twice. These blocks are followed by a max-pooling layer with a pool size of 2 and a dropout layer with a drop rate of 0.15. These operations are repeated in the next blocks with different parameter values. The number of filters is increased by 32 in each convolution layer. The other parameters are increased or decreased in each layer following a consistent rule. In this way, a new model with a parametrically modifiable structure has been developed.
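The effect of the first layers on the feature-map size can be checked with a short calculation. The sketch below uses the usual output-size rules for "valid" and "same" padding and assumes that max pooling uses a stride equal to its pool size without padding; it traces one 32-filter convolution, one 64-filter convolution, and the pooling layer, and is not a full reconstruction of FINet. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "valid":
        return (size - kernel) // stride + 1
    if padding == "same":
        return -(-size // stride)  # ceiling division
    raise ValueError(f"unknown padding: {padding!r}")


def pool_out(size, pool):
    """Spatial output size of pooling with stride equal to the pool size."""
    return size // pool


def trace_first_blocks(size=224):
    """Shapes after the first layers described in the text (sketch only)."""
    shapes = [(size, size, 3)]                      # pre-processed input
    size = conv_out(size, kernel=3, stride=2, padding="valid")
    shapes.append((size, size, 32))                 # initial block
    size = conv_out(size, kernel=3, stride=1, padding="same")
    shapes.append((size, size, 64))                 # next block
    size = pool_out(size, pool=2)
    shapes.append((size, size, 64))                 # after max pooling
    return shapes
```

Under these assumptions, a 224 × 224 × 3 input becomes 111 × 111 × 32 after the initial block, 111 × 111 × 64 after the next convolution, and 55 × 55 × 64 after max pooling; batch normalization, ReLU, and dropout leave the shape unchanged.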

Each layer used in the FINet model can be summarized as follows:

  • Convolution: It was used to obtain an output with new features by shifting a filter on the input data.

  • Batch normalization: It was used to make the training of the network faster and more stable by rescaling the input values coming to the neuron in the neural network.

  • ReLU (Rectified linear unit): It was used to determine an output value in response to the input value coming to a neuron in the neural network.

  • MaxPool: It was used to take the maximum value in the area covered by the filter to reduce the number of parameters of the neural network by reducing the size of the feature map created by the convolution layer.

  • Dropout: It was used to prevent overfitting of the neural network by eliminating random nodes.

  • Flatten: It was used to transform the two-dimensional arrays obtained from the feature map into a single long vector.

  • Dense: It was used to change the size of the vector through a layer that is fully connected to the previous layer.

  • Softmax: It was used in the output layer to perform classification in the neural network.

2.4.2 INFINet model

DCNN models that combine features have been reported to perform better than models that do not [54]. For this reason, block cutting was applied to the InceptionV3 and NASNetLarge model structures in this study, with the aim of creating a new and effective model for age and gender estimation on an imbalanced dataset by concatenating state-of-the-art DCNN models. A series of ablation studies was carried out to evaluate the effect of block cutting on model performance and to show the functional importance of the cut section. Across many trials, the best performance was achieved with the InceptionV3 and NASNetLarge models, so block cutting was applied to these two models. Figure 4 shows the InceptionV3 (left) and NASNetLarge (right) model structures obtained as a result of block cutting. The following sections of the study report the effect of the cut section on model performance, the parameter counts of the models formed by block cutting, and a comparison with the parameter counts of the other models.

Fig. 4 InceptionV3 (left) and NASNetLarge (right) model structures as a result of block cutting

After block cutting from the points shown in Fig. 4, the cut InceptionV3 and NASNetLarge models and the FINet model were concatenated. Thus, a new DCNN model named inception Nasnet fast identify network (INFINet) was created. The schematic diagram of the INFINet model is given in Fig. 5.

Fig. 5 Schematic diagram of the INFINet model

In the INFINet model given in Fig. 5, layer freezing is first applied to the InceptionV3 and NASNetLarge models, which are loaded with pre-trained weights to process the input images. Then, blocks are cut at the points indicated in Fig. 4 in the InceptionV3 and NASNetLarge models, and at the classification layer in the FINet model. As a result of block cutting, outputs of shape 12 × 12 × 768 in the InceptionV3 model, 14 × 14 × 2016 in the NASNetLarge model, and 7 × 7 × 192 in the FINet model are obtained. By adding new layers with different parameter values, consisting of convolution, batch normalization, ReLU activation, average pooling, and dropout layers, to all three models, a common output shape of 4 × 4 × 192 is obtained. Once all three models produce outputs of shape 4 × 4 × 192, the models are concatenated. After concatenation, convolution, batch normalization, ReLU activation, average pooling, global average pooling, dropout, and softmax layers are added to the end of the model, in that order. In this way, a new model has been created from these three models, a combination that has not been tried in previous studies.
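The need for a common 4 × 4 × 192 shape before concatenation can be illustrated with a small shape check. The sketch assumes that the branches are joined along the channel axis, which the text does not state explicitly; the function name `concat_channels` is hypothetical.

```python
def concat_channels(shapes):
    """Shape after joining (height, width, channels) maps along the channels.

    Concatenation along the channel axis (an assumption here) requires all
    branches to share the same height and width.
    """
    spatial = {(h, w) for h, w, _ in shapes}
    if len(spatial) != 1:
        raise ValueError(f"spatial sizes differ: {sorted(spatial)}")
    height, width = spatial.pop()
    return (height, width, sum(c for _, _, c in shapes))
```

With the raw branch outputs (12 × 12 × 768, 14 × 14 × 2016, and 7 × 7 × 192) the check fails because the spatial sizes differ, whereas three 4 × 4 × 192 branches give a 4 × 4 × 576 feature map under this assumption.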

Each layer used in the INFINet model can be summarized as follows (the layers defined in the FINet model are not given again, only the layers that are different from the FINet model are defined):

  • AvgPool: It was used to average the values in the area covered by the filter to reduce the number of parameters of the neural network by reducing the size of the feature map created by the convolution layer.

  • Global average pool: It was used to reduce all spatial dimensions by averaging each feature map.
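The two pooling operations can be illustrated on a single small feature map. This is a plain-Python sketch with hypothetical function names, assuming non-overlapping windows (stride equal to the pool size); it is meant to show the arithmetic, not the layers used in the study.

```python
def avg_pool2d(fmap, pool):
    """Average non-overlapping pool x pool windows of a 2D feature map."""
    rows, cols = len(fmap) // pool, len(fmap[0]) // pool
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [fmap[r * pool + i][c * pool + j]
                      for i in range(pool) for j in range(pool)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def global_avg_pool(fmap):
    """Reduce a whole 2D feature map to its single mean value."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)
```

Average pooling with a pool size of 2 turns a 4 × 4 map into a 2 × 2 map, while global average pooling reduces each feature map to one number, so a 4 × 4 × C tensor becomes a vector of length C.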

2.5 Evaluation metrics

After the model is created, various evaluation metrics are needed to measure its performance [55]. Evaluation metrics are mostly derived from the confusion matrix [56]. The four most common evaluation metrics (accuracy, precision, recall, and F1-score) were used for age and gender estimation on an imbalanced dataset of human face images. The expressions given in Eqs. 1, 2, 3, and 4 were used to determine the accuracy, precision, recall, and F1-score of each model discussed in this study.

$$\text{Accuracy}= \frac{\text{True Positive }+\text{ True Negative}}{\text{True Positive }+\text{ False Positive }+\text{False Negative }+\text{True Negative}}$$
(1)
$$\text{Precision}= \frac{\text{True Positive}}{\text{True Positive }+\text{False Positive}}$$
(2)
$$\text{Recall}= \frac{\text{True Positive}}{\text{True Positive }+\text{False Negative}}$$
(3)
$$\text{F1-Score}= 2\times \frac{\text{Precision }\times \text{ Recall}}{\text{Precision }+\text{ Recall}}$$
(4)

The overall performance of the proposed models on the datasets was evaluated with the accuracy metric. Precision, which matters when false positives are costly, was used to check that the proposed models correctly recognized a particular class, and recall, which matters when false negatives are costly, was used to measure the ability of the proposed models not to miss a class. The F1-score was used to measure the overall performance of the proposed models in a balanced way.
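Eqs. 1 to 4 can be applied to a multi-class confusion matrix by treating each class in turn as the positive class. The following sketch shows this computation; the function name `per_class_metrics` is hypothetical, and how the per-class values were averaged in the study (for example macro or weighted averaging) is not stated in the text, so no averaging is performed here.

```python
def per_class_metrics(cm):
    """Accuracy plus (precision, recall, F1) per class from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total

    report = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class differs
        fn = sum(cm[k]) - tp                        # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append((precision, recall, f1))
    return accuracy, report
```

For a made-up two-class matrix `[[8, 2], [1, 9]]` (illustrative values, not results from this study), the accuracy is 17/20 = 0.85, and the first class has a precision of 8/9 and a recall of 8/10.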

3 Results and discussion

In this study, training processes have been carried out with both the newly developed FINet model and the INFINet model, which was created as a result of the concatenation of the InceptionV3, NASNetLarge, and FINet models, for age and gender estimation using the imbalanced dataset. To compare the results obtained with state-of-the-art DCNN models in the field of AI, the same training was conducted with the Xception, InceptionV3, InceptionResNetV2, MobileNet, MobileNetV2, NASNetMobile, and NASNetLarge models. The training of the models was carried out separately according to the train-test dataset, which was first separated according to the eight-class age groups and then according to the two-class gender groups.

3.1 Results for age estimation

The accuracy and loss graphs obtained from the model training for age estimation are given in Fig. 6. In addition, the numerical results obtained from the models for age estimation are given in Table 4, and the confusion matrix plots are given in Fig. 7.

Fig. 6 Training plots of the models for age estimation in the FG-NET dataset: (a) accuracy and (b) loss

Table 4 Numerical results obtained from model training for age estimation in the FG-NET dataset

Fig. 7 Confusion matrix plots obtained from model training for age estimation in the FG-NET dataset: (a) Xception, (b) InceptionV3, (c) InceptionResNetV2, (d) MobileNet, (e) MobileNetV2, (f) NASNetMobile, (g) NASNetLarge, (h) FINet, and (i) INFINet

When the accuracy and loss graphs given in Fig. 6, the numerical results given in Table 4, and the confusion matrix results given in Fig. 7 are examined, it is seen that age estimation was performed most successfully by the INFINet model, with 61.22% accuracy and a loss of 1.0234. In Table 4, the accuracy of the InceptionV3 model, which was 51.70%, increased to 54.42% with the InceptionV3_CutLayer model obtained after block cutting. Similarly, the accuracy of the NASNetLarge model, which was 50.34%, increased to 52.38% with the NASNetLarge_CutLayer model obtained after block cutting. The block cutting applied to the InceptionV3 and NASNetLarge models therefore contributed to the successful results obtained with the INFINet model in age estimation. The FINet model, with an accuracy of 53.06%, was the best model after the INFINet model. In terms of age groups, both the INFINet and FINet models made the most misclassifications in the 48–53 and 60+ age groups. Both models performed better than the other models across age groups.

3.2 Results for gender estimation

The accuracy and loss graphs obtained from the model training for gender estimation are given in Fig. 8. In addition, the numerical results obtained from the models for gender estimation are given in Table 5, and the confusion matrix plots are given in Fig. 9.

Fig. 8 Training plots of the models for gender estimation in the FG-NET dataset: (a) accuracy and (b) loss

Table 5 Numerical results obtained from model training for gender estimation in the FG-NET dataset

Fig. 9 Confusion matrix plots obtained from model training for gender estimation in the FG-NET dataset: (a) Xception, (b) InceptionV3, (c) InceptionResNetV2, (d) MobileNet, (e) MobileNetV2, (f) NASNetMobile, (g) NASNetLarge, (h) FINet, and (i) INFINet

When the results given in Figs. 8 and 9 and Table 5 are examined, it is seen that INFINet was the most successful model in gender estimation, with 80.95% accuracy and a loss of 0.4050. In Table 5, the accuracy of the InceptionV3 model, which was 74.15%, increased to 76.87% with the InceptionV3_CutLayer model obtained after block cutting. Similarly, the accuracy of the NASNetLarge model, which was 76.19%, increased to 77.55% with the NASNetLarge_CutLayer model obtained after block cutting. The block cutting applied to the InceptionV3 and NASNetLarge models therefore contributed to the successful results obtained with the INFINet model in gender estimation. The FINet model, with an accuracy of 78.91%, was the best model after the INFINet model. As in age estimation, the two best models in gender estimation were INFINet and FINet.

3.3 Comparative analysis

To compare all the models discussed in this study fairly, the models were trained separately on the datasets created for age and for gender estimation under the same conditions. The comparative accuracy and loss graphs, the other numerical results, and the confusion matrices obtained from the training show that the proposed models are better than or close to state-of-the-art DCNN models in both age and gender estimation. Beyond these results, the models should also be compared in terms of the number of parameters. To this end, the total parameter counts of the models discussed in this study are given in Fig. 10.

Fig. 10 Total number of parameters of the models

According to the total number of parameters given in Fig. 10, the newly developed FINet model has the fewest parameters. The second-best accuracy in both age and gender estimation was achieved with the FINet model, which has the lowest complexity. For this reason, the FINet model is expected to be competitive with state-of-the-art models in age and gender estimation. The INFINet model, created by concatenating the InceptionV3, NASNetLarge, and FINet models, has 12.06M more parameters than the InceptionV3 model but 51.07M fewer than the NASNetLarge model. The most successful age and gender estimation was achieved with the INFINet model. Although its 33.88M parameters exceed those of FINet (1.80M), MobileNetV2 (2.27M), MobileNet (3.24M), NASNetMobile (4.28M), Xception (20.88M), and InceptionV3 (21.82M), the performance of the INFINet model relative to these models makes the trade-off reasonable. In support of this claim, the smaller models achieved an overall accuracy of less than 54% for age estimation and less than 79% for gender estimation, whereas the proposed INFINet model reached 61.22% for age estimation and 80.95% for gender estimation. In addition, the proposed INFINet model outperforms larger models such as InceptionResNetV2 (age: 49.66%, gender: 73.43%) and NASNetLarge (age: 50.34%, gender: 76.19%).
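The parameter comparison above can be re-derived with a few lines of arithmetic. The counts below are the values quoted in the text, in millions (the unit of the MobileNet figure is assumed to be millions as well); the NASNetLarge total is not quoted directly and is derived here from the stated 51.07M difference. The names `PARAMS_M` and `extra_params` are hypothetical.

```python
# Parameter counts in millions, as quoted in the text.
PARAMS_M = {
    "FINet": 1.80, "MobileNetV2": 2.27, "MobileNet": 3.24,
    "NASNetMobile": 4.28, "Xception": 20.88, "InceptionV3": 21.82,
    "INFINet": 33.88,
}


def extra_params(model, reference, table=PARAMS_M):
    """How many million parameters `model` has beyond `reference`."""
    return round(table[model] - table[reference], 2)


# Derived, not quoted: INFINet is stated to have 51.07M fewer parameters
# than NASNetLarge, which implies the NASNetLarge total below.
nasnet_large_m = round(PARAMS_M["INFINet"] + 51.07, 2)
```

The difference between INFINet and InceptionV3 reproduces the stated 12.06M, and the implied NASNetLarge size is 84.95M.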

It is quite difficult to make high-success age and gender estimations in imbalanced datasets. Although the success rates are not very high, it is seen that the developed FINet and concatenated INFINet models are more successful than other models when considering the number of parameters in both age and gender estimation. This shows that the developed models are better than the compared models.

3.4 Testing models using different datasets

To test the models discussed in this study on different data, model training was carried out using the UTKFace dataset. UTKFace is a large-scale dataset containing more than 20k cropped face images annotated with age, gender, and ethnicity. This dataset can be used in various tasks such as age and gender estimation, face detection, and age progression/regression [61, 62]. As described in the previous section, the age groups are divided into eight classes (0–2, 4–6, 8–12, 15–20, 25–32, 38–43, 48–53, and 60+), and the gender groups are divided into two classes (female and male). The number of images in each class according to age and gender groups is given in Table 6.
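As a minimal sketch of the age grouping described above, the following maps an integer age to one of the eight class indices. It takes the sixth group as 38–43, as in the discussion of the Table 7 results. How ages that fall between groups (for example 3 or 7) were handled is not stated in the text, so this sketch simply returns `None` for them; that choice, and the names `AGE_GROUPS`, `age_to_class`, and `group_label`, are assumptions made for illustration.

```python
from typing import Optional

# Eight age groups as (low, high) inclusive bounds; None means open-ended.
AGE_GROUPS = [
    (0, 2), (4, 6), (8, 12), (15, 20),
    (25, 32), (38, 43), (48, 53), (60, None),
]


def age_to_class(age: int) -> Optional[int]:
    """Return the class index for an age, or None if it falls in a gap."""
    for idx, (low, high) in enumerate(AGE_GROUPS):
        if age >= low and (high is None or age <= high):
            return idx
    # Assumption: ages between groups are left unassigned.
    return None


def group_label(idx: int) -> str:
    """Return a readable label such as '25-32' or '60+' for a class index."""
    low, high = AGE_GROUPS[idx]
    return f"{low}+" if high is None else f"{low}-{high}"


if __name__ == "__main__":
    for age in (1, 5, 40, 75, 3):
        cls = age_to_class(age)
        label = group_label(cls) if cls is not None else "unassigned"
        print(age, "->", label)
```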

Table 6 Number of images of the train-test dataset obtained for each age and gender group in the UTKFace dataset

After applying the same pre-processing described in the previous sections of the study to this dataset, all model training was carried out. Numerical results obtained from the models as a result of extensive experiments using the UTKFace dataset are given in detail in Table 7. In addition, confusion matrix plots obtained from models for age and gender estimation are given in Figs. 11 and 12, respectively.

Table 7 Numerical results obtained from models using the UTKFace dataset

Fig. 11 Confusion matrix plots obtained from models for age estimation using the UTKFace dataset: a Xception, b InceptionV3, c InceptionResNetV2, d MobileNet, e MobileNetV2, f NASNetMobile, g NASNetLarge, h FINet, and i INFINet

Fig. 12 Confusion matrix plots obtained from models for gender estimation using the UTKFace dataset: a Xception, b InceptionV3, c InceptionResNetV2, d MobileNet, e MobileNetV2, f NASNetMobile, g NASNetLarge, h FINet, and i INFINet

When the numerical results given in Table 7 and Figs. 11 and 12 are examined, the proposed INFINet model is the most successful of the compared models in both age and gender estimation, even when a different dataset is used. On the UTKFace dataset, the INFINet model achieved 72.00% accuracy with a loss of 0.7408 in age estimation, and 90.50% accuracy with a loss of 0.2274 in gender estimation. The proposed INFINet model is therefore successful not only on a single dataset but also on a different dataset. In terms of age groups, the INFINet model made the most misclassifications in the 38–43 and 48–53 age groups. The proposed FINet model achieved the best success rate after the INFINet model in both age and gender estimation, which shows that it is competitive with the other models. The FINet and INFINet models are therefore introduced to the literature as new models.
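The relationship between a confusion matrix and the accuracy and per-class figures discussed above can be sketched as follows. This assumes rows are true classes and columns are predicted classes, which may differ from the convention used in Figs. 11 and 12. The matrix `toy_cm` is a made-up two-class example and does not reproduce any result from this study.

```python
from typing import List

Matrix = List[List[int]]


def overall_accuracy(cm: Matrix) -> float:
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total if total else 0.0


def per_class_recall(cm: Matrix) -> List[float]:
    """Recall per true class: diagonal entry divided by its row sum."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(cm)
    ]


def weakest_class(cm: Matrix) -> int:
    """Index of the true class with the lowest recall."""
    recalls = per_class_recall(cm)
    return recalls.index(min(recalls))


# Hypothetical toy matrix (rows: true class, columns: predicted class).
toy_cm = [
    [45, 5],
    [10, 40],
]

if __name__ == "__main__":
    print("accuracy:", overall_accuracy(toy_cm))
    print("recall per class:", per_class_recall(toy_cm))
    print("weakest class index:", weakest_class(toy_cm))
```

Applied to an eight-class age matrix, `weakest_class` would point to the group with the most misclassifications relative to its size, which is the kind of per-group observation made above.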

3.5 Discussion

In the studies on the FG-NET dataset covered by the literature review in the introduction, the dataset was used for training and classified according to different age groups such as 2–4, and successful results were achieved [34,35,36,37]. High success rates have also been reported in studies where training is performed with different datasets and only testing is done with the FG-NET dataset [38,39,40]. Additionally, there are studies that use datasets other than FG-NET [28,29,30, 33, 46]. In our study, the FG-NET dataset was divided into eight groups and was used for both training and testing rather than for testing alone. Accordingly, recent works that performed training and testing with the FG-NET dataset and used age groups similar to ours were examined, and the results are presented comparatively in Table 8.

Table 8 Comparison of the proposed work with recent works

When Table 8 is examined, the proposed INFINet model performs much better than the other methods, and the proposed FINet model exhibits competitive performance with them.

4 Conclusions

In this study, an imbalanced dataset of human face images was used for age and gender estimation, and two new DCNN models were developed for this task. The first, the FINet model, is a new model design with a parametrically modifiable structure. The second, the INFINet model, is created by concatenating the InceptionV3, NASNetLarge, and FINet models after improvements. Both models were designed for the first time and have unique structures.

The FINet and INFINet models developed for age and gender estimation have been compared with many models that have shown significant success in the field of AI in recent years. FINet, INFINet, Xception, InceptionV3, InceptionResNetV2, MobileNet, MobileNetV2, NASNetMobile, and NASNetLarge models were trained under the same conditions and compared. As a result of the comparisons, the highest accuracy (age: 61.22%, gender: 80.95% in the FG-NET dataset; age: 72.00%, gender: 90.50% in the UTKFace dataset) and lowest loss (age: 1.0234, gender: 0.4050 in the FG-NET dataset; age: 0.7408, gender: 0.2274 in the UTKFace dataset) values were achieved with the INFINet model in both age and gender estimation. One of the important achievements of the study is that the INFINet model, whose parameter count lies between those of the InceptionV3 and NASNetLarge models it combines, achieved higher success than both of these models and the other models compared. Another notable achievement is that the FINet model, which has far fewer parameters than all the other models discussed in this study, achieved higher accuracy than every other model except INFINet.

It has been concluded that the FINet and INFINet models developed for age and gender estimation are competitive with other models in cases where imbalanced datasets make it difficult to obtain successful results. Planned future work includes the following objectives:

  • Application of the developed FINet model with different parameters

  • Developing the FINet model and producing its second version

  • Creating new and better models by combining the FINet model with different AI technologies

  • Estimating age on a gender-specific dataset after first determining the gender

  • Estimating age based on specific ages

  • Creating models for object recognition using the Transformer approach

  • Testing the developed models on many different balanced/imbalanced datasets

  • Evaluating model performance by applying balancing techniques and augmentation methods

  • Application of the developed models to real-world problems