Introduction

Beauty is in the eye of the beholder. Because of humans’ unusually well-developed ability to interpret, identify, and extract information from other people’s features, the human face has attracted the interest of psychologists and other scientists in recent years. Our publications and television screens are not simply loaded with faces; they are filled with attractive faces, and both men and women care about a possible partner’s appearance. Humans value their physical appearance, and some characteristics appear to be desirable across people and cultures [1, 2]. Cunningham et al.’s multidimensional fitness model of physical beauty proposes that the perception of high physical attractiveness incorporates a number of desirable traits and personal attributes [3]. Such characteristics might be assessed using a biologically inspired face analysis system. To assess a face without undertaking a blind spatial search, concepts such as saliency or gist may be employed. The saliency model consists of highly parallel low-level computations in feature domains including intensity, orientation, and hue. It is used as a starting point to draw attention to a set of prominent locations in an image. When used together with the saliency model, the gist model may offer predicted holistic image attributes [4]. Empirical evidence suggests that there is an optimal arrangement of facial characteristics (ideal ratios) that can improve the perceived beauty of a face [5]. Computational prediction of face attractiveness has gained significant scientific attention, with several applications in multimedia. Bio-inspired, deep learning-based discriminative representations for face aesthetic prediction can aid in identifying the spatial regions of interest that human subjects rely on during facial aesthetic evaluations [6]. This motivates our research, which addresses the main difficulty of such approaches: extracting discriminative and perception-aware elements that can be used to describe facial beauty.

The perception of human beauty is inherently subjective, but when assessing the beauty and attractiveness of other people, observers often agree on whether the person being evaluated meets certain beauty standards or clearly does not. This agreement on who is and is not beautiful is so widespread that competitions to assess human beauty are held [7, 8]. One of the best-known examples where beauty is the key criterion is the Miss Universe competition [9]. The long history and popularity of such competitions imply the existence of beauty and attractiveness criteria that most people agree on.

Facial beauty analysis [10,11,12] is used for a variety of purposes, including face enhancement programs (MeiTu, FaceTune) and plastic surgery. Some diseases, such as facial palsy, can be diagnosed by analyzing facial characteristics such as symmetry [13]. To perform this analysis, methods for assessing the beauty and attractiveness of the human face have already been developed, including facial proportions [14], the golden ratio [15], ideal dimensions [16], and geometric features [17]. Although facial beauty prediction (FBP) has achieved high accuracy in photos taken in a controlled environment, it remains a challenging problem in real-world face photographs [18]. Assessing human beauty is a particularly difficult task, as it is affected by numerous variables such as photo resolution, human face angle, and lighting [19]. Artificial intelligence methods are frequently used to solve such problems [20,21,22].

Human face images are now commonly analyzed using machine learning and computer vision techniques. The human facial image conveys information such as age, gender, identity, emotion, race, and attractiveness to both humans and computers [23]. Several studies on facial attractiveness have been conducted. The perception of facial attractiveness is highly subjective and can be influenced by sociological or cultural factors as well as personal preferences. Iyer et al. [24], for example, extracted facial landmarks to create facial ratios based on Golden and Symmetry Ratios. Texture, shape, and color features were extracted in the Hue, Saturation, and Value (HSV) space as the Gray Level Co-occurrence Matrix (GLCM), Hu’s Moments, and Color Histograms, respectively. An ablation study was then performed to determine which feature, when combined with facial landmarks, works best. In experiments, combining key facial traits with facial landmarks increased the facial beauty prediction score. Some facial features are objectively more appealing than others [25]. Objects whose proportions follow the golden ratio are considered harmonious and beautiful [26]. Face studies show that symmetrical faces are more appealing [27]. In the case of identical twins, one study found that the twin with more symmetric proportions was considered more attractive [28]. The averageness of a face describes how closely it resembles the majority of faces in the population, whereas distinctive facial features set a face apart from that majority. Some research suggests that average faces are more attractive [29]. Average faces are also more likely to be symmetrical, and symmetrical faces, as noted above, are considered more attractive. Baby-like characteristics are linked to sympathy and to people’s inclination to care for and protect. Baby-like features include a large, rounded forehead, a low position of the eyes and mouth, large, round eyes, and a low chin. Studies of youthful faces also show that facial attractiveness is positively related to youthfulness [29].

According to research [30], there is a link between facial health and facial attractiveness. One indicator of good health is healthy facial skin. Furthermore, studies show that people tend to associate skin redness with good health. Human attractiveness is directly related to human wellness [27]. According to these studies, certain facial proportions are objectively more appealing to the majority of people [19]. A popular way to measure the accuracy of a beauty assessment method is to compute the Pearson correlation coefficient [31]. It quantifies how strongly the ratings produced by the method correlate with the ratings given by human experts. Ideally, the method should reach a Pearson correlation of 1.
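To illustrate this metric, the following minimal Python sketch computes the Pearson correlation coefficient between predicted and expert ratings. The function name and the sample ratings are hypothetical and serve only as an illustration; they are not taken from the experiments reported in this paper.

```python
import math


def pearson_correlation(predicted, expert):
    """Pearson correlation between predicted scores and expert ratings."""
    if len(predicted) != len(expert) or len(predicted) < 2:
        raise ValueError("need two equally sized sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_e = sum(expert) / len(expert)
    # Covariance and variances (unnormalized; the normalization cancels out)
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, expert))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in expert)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_e)


# Hypothetical ratings on a 1 to 5 scale, for illustration only
expert_scores = [2.0, 3.0, 3.5, 4.0, 1.5]
model_scores = [2.2, 2.9, 3.3, 4.1, 1.9]
r = pearson_correlation(model_scores, expert_scores)
print(f"Pearson correlation: {r:.3f}")
```

A value close to 1 indicates that the method ranks and scales faces almost exactly as the experts do, while a value close to 0 indicates no linear relationship.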

Another, more modern and increasingly popular way to determine the beauty and attractiveness of the face is based on deep learning [32,33,34,35]. Convolutional neural networks (CNNs) and other deep learning models can automatically learn the facial features that determine the beauty and attractiveness of the face. For example, ResNet50, one of the more advanced convolutional network architectures, was used in the study [34]. The researchers report a Pearson correlation coefficient of 0.87 after training the network on their dataset.

Vahdati et al. [35] employ a multi-task learning strategy to identify the best shared features for three related tasks (i.e., facial beauty assessment, gender recognition, and ethnicity identification). To improve attractiveness calculation accuracy, specific parts of face images (e.g., the left eye, nose, and mouth) as well as the entire face are fed into multi-stream CNNs. Each two-stream network accepts both a portion of the face and the entire face as input. Beauty3DFaceNet, the first deep learning network for evaluating attractiveness in 3D faces, is proposed by Xiao et al. [36]. It combines facial geometry, texture, and history to produce a more realistic 3D facial attractiveness score, similar to that of human raters. They also provide 3DFacePointNet++, a novel network based on facial landmark priors that improves the performance of Beauty3DFaceNet by simulating human eye perceptual sensitivity.

Lin et al. [21] define facial beauty prediction as a special regression problem driven by ranking data. They present R3CNN, a general CNN architecture that incorporates the relative ranking of faces in terms of aesthetics, to improve the performance of facial beauty prediction. Bougourzi et al. [37] propose a two-branch architecture (REX-INCEP) that merges the architectures of two pretrained networks to deal with the difficult high-level features associated with the facial beauty prediction problem. They present an ensemble regression method based on CNNs and employ both networks in this ensemble (REX-INCEP). Recently, generative adversarial networks (GANs) [38], a type of deep generative neural architecture, have enabled unprecedented realism in the generation of synthetic human faces [39], landscapes and buildings [40], and medical images [41]. This paper aims to improve the prediction of human beauty and attractiveness by using GANs.

This paper’s contribution is an innovative GAN-based methodology for predicting human beauty and attractiveness. Our bio-inspired approach allows using a generated face instead of the real face to enhance the accuracy of determining facial attractiveness. The remainder of the paper is structured as follows: the “Method” section focuses on the technique and algorithms developed, while the “Experiments and Results” section discusses experimental assessment and findings. The “Discussion and Conclusions” section concludes the article and discusses our future research plans.

Method

Outline of the Methodology

Figure 1 depicts an outline of the methodology. The suggested approach has the following stages described further in the article: (1) detection of faces in photos; (2) generation of an artificial face for each detected face; (3) extraction of facial characteristics from a generated face; (4) evaluation of each face using a Multilayer Perceptron (MLP) model trained on generated faces; (5) evaluation of each face using a CNN model trained on generated faces. The findings are then compared to the results of face evaluation using MLP and CNN but without the use of artificial face creation.

Fig. 1 Outline of the methodology

Face Detection in Group Photos

The first step of the analysis was detecting people’s faces in group photographs. Histogram of oriented gradients (HOG) features were used to describe faces, and a support vector machine (SVM) was trained for face detection using features extracted from 3000 photos from the Labeled Faces in the Wild dataset [42].

Extracting Facial Characteristics

Each detected face then underwent the transformations required to prepare it for attractiveness analysis. These transformations considerably increased the accuracy of assessing facial beauty. The 81 facial points were extracted using the “dlib” package and a custom-trained model [43]. Once the landmark mesh was in place, the underlying face was rotated, cropped, and resized to a 1024 × 1024 pixel image.
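The rotation step can be illustrated with a short Python sketch that levels a face using two eye positions. This is only one plausible way to derive the rotation angle; the exact alignment procedure is not specified here, and the eye coordinates and function names below are hypothetical.

```python
import math


def roll_angle_degrees(left_eye, right_eye):
    """Angle of the line through both eyes relative to the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, center, angle_degrees):
    """Rotate a 2D point around a center by the given angle."""
    theta = math.radians(angle_degrees)
    x = point[0] - center[0]
    y = point[1] - center[1]
    return (
        center[0] + x * math.cos(theta) - y * math.sin(theta),
        center[1] + x * math.sin(theta) + y * math.cos(theta),
    )


# Hypothetical eye centers in pixel coordinates (not from the paper)
left_eye = (100.0, 120.0)
right_eye = (200.0, 140.0)

angle = roll_angle_degrees(left_eye, right_eye)
# Rotating by the negative angle places both eyes on the same horizontal line
leveled_right_eye = rotate_point(right_eye, left_eye, -angle)
print(f"roll angle: {angle:.2f} degrees, leveled eye: {leveled_right_eye}")
```

In a full pipeline the same rotation would be applied to the whole image and to all landmark points before cropping and resizing.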

Facial transformations were performed before the facial characteristics were extracted. The required facial points were then extracted from each transformed face or, when attractiveness was determined using the generated facial copy, from the generated face. The resulting characteristics are summarized in Table 1, which uses the facial measurements shown in Fig. 2.

Table 1 Facial characteristics used to assess facial attractiveness
Fig. 2 Measurements of facial characteristics: a – face length, b – face width, c – distance from eye to nose, d – distance from the top to the eye, e – distance from nose to lips, f – eye height, g – distance between eyes, h – distance between outer corners of eyes, i – distance between inner corners of eyes
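As a rough illustration of how such measurements could be derived from landmark coordinates, the sketch below computes a few of the distances listed in Fig. 2 and one example ratio. The landmark names, coordinates, and the chosen ratios are hypothetical assumptions; the actual 13 characteristics are those defined in Table 1.

```python
import math


def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Hypothetical landmark positions on a 1024 x 1024 aligned face image
landmarks = {
    "forehead_top": (512.0, 100.0),
    "chin": (512.0, 900.0),
    "left_cheek": (212.0, 500.0),
    "right_cheek": (812.0, 500.0),
    "left_eye_outer": (352.0, 420.0),
    "left_eye_inner": (452.0, 420.0),
    "right_eye_inner": (572.0, 420.0),
    "right_eye_outer": (672.0, 420.0),
}


def basic_measurements(lm):
    """Distances a, b, h, and i from Fig. 2, computed from landmark points."""
    return {
        "a_face_length": distance(lm["forehead_top"], lm["chin"]),
        "b_face_width": distance(lm["left_cheek"], lm["right_cheek"]),
        "h_outer_eye_distance": distance(lm["left_eye_outer"], lm["right_eye_outer"]),
        "i_inner_eye_distance": distance(lm["left_eye_inner"], lm["right_eye_inner"]),
    }


def example_ratios(m):
    """Illustrative scale-invariant ratios (an assumption, not Table 1)."""
    return {
        "length_to_width": m["a_face_length"] / m["b_face_width"],
        "inner_to_outer_eye": m["i_inner_eye_distance"] / m["h_outer_eye_distance"],
    }


measurements = basic_measurements(landmarks)
ratios = example_ratios(measurements)
print(measurements)
print(ratios)
```

Ratios are used in this sketch rather than raw distances because they do not depend on the image scale.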

Multilayer Perceptron Model

An MLP model was used to determine the attractiveness of the face from facial characteristics. The network input consisted of 13 facial characteristics, while the network output gave a probability for each possible estimate (Fig. 3). Each output corresponds to one possible attractiveness rating.
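The conversion from per-rating probabilities to a single score can be sketched as follows. We assume here, for illustration only, that the final score is the probability-weighted mean of the ratings 1 to 5; the softmax output layer and the function names are likewise assumptions of this sketch.

```python
import math


def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def expected_score(probabilities, ratings=(1, 2, 3, 4, 5)):
    """Probability-weighted mean rating (assumed aggregation rule)."""
    return sum(p * r for p, r in zip(probabilities, ratings))


# Hypothetical raw outputs of the last layer for one face
logits = [0.1, 1.2, 2.5, 1.0, -0.5]
probabilities = softmax(logits)
score = expected_score(probabilities)
print([round(p, 3) for p in probabilities], round(score, 3))
```

With this rule, a network that is undecided between ratings 3 and 4 yields a score between the two instead of being forced to pick one of them.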

Fig. 3 MLP network architecture

CNN Model

A modified ResNet50 model, customized to compute the probability of each estimate in the same way as the MLP model above, was used to measure face attractiveness with a convolutional network. Instead of facial characteristics, the photo itself was fed as input. The model was trained to detect the facial traits that define the beauty of the face and to score the attractiveness of the photo based on these traits. Figure 4 depicts the architecture of the CNN model.

Fig. 4 CNN model architecture

Face Generation Using GAN Model

A generative adversarial network (GAN) model consists of two parts: a generator and a discriminator. The interplay of the generator and the discriminator can be expressed by the cost (value) function \(V(D, G)\) given in Eq. (1):

$$\underset{G}{\min}\;\underset{D}{\max}\;V\left(D,G\right)=E_{x\sim P(x)}\left[\log D\left(x\right)\right]+E_{z\sim P\left(z\right)}\left[\log\left(1-D\left(G\left(z\right)\right)\right)\right] \tag{1}$$

Here, \(G\) is the generative model (generator), \(D\) is the discriminator, \(z\) is noise, \(P\) is the corresponding distribution function, and \(x\) is a real input sample.
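A small numerical sketch may help to read Eq. (1). It estimates \(V(D, G)\) from discriminator outputs on a batch of real samples and a batch of generated samples. The discriminator outputs below are made-up numbers, and the function name is hypothetical.

```python
import math


def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# Discriminator that cannot tell real from generated: D = 0.5 everywhere
undecided = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# Discriminator that separates both groups well (hypothetical outputs)
confident = gan_value([0.9, 0.9, 0.9], [0.1, 0.1, 0.1])

print(f"undecided discriminator: {undecided:.4f}")
print(f"confident discriminator: {confident:.4f}")
```

The discriminator tries to push this value up, which the confident case shows, while the generator tries to push it down by producing samples for which \(D(G(z))\) is high. When the discriminator outputs 0.5 for every sample, the value equals \(-\log 4\).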

Face generation uses the StyleGAN2-ADA [44] network. To generate a copy of the real face with the GAN, our approach projects the real face from the photo into the latent space.

Face generation in latent space is shown in Fig. 5. The encoder reduces the amount of image data by mapping the image to a compact latent representation. The decoder takes this representation from the latent space and restores the image to its original form before encoding.

Fig. 5 Encoding places an image in the GAN latent space, and decoding restores the image from the latent space

We used a VGG16 model to extract face characteristics and represent them in a latent space. The generator then returns the encoded face from the latent space to its original form. During this process, a vector in the latent space is sought that, when decoded, restores the target face.

Face generation in latent space is performed as follows:

  1. The generator generates a face from a vector in the latent space

  2. The face generated by the generator is placed in a latent space using the VGG16 network

  3. The real face is placed in a latent space using the VGG16 network

  4. The error between the generated face and the real face is calculated in the latent space

  5. The vector is optimized to reduce the error

The procedure is summarized as an algorithm in Fig. 6.

Fig. 6 Algorithm of synthetic face generation
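The optimization loop behind this procedure can be sketched with toy stand-ins. In the sketch below, a small linear map replaces the StyleGAN2-ADA generator, an identity function replaces the VGG16 feature extractor, and a finite-difference gradient replaces backpropagation. All of these are simplifying assumptions made so that the example stays self-contained; it is not the implementation used in this work.

```python
def toy_generator(w):
    """Hypothetical stand-in for the generator: 2D latent vector -> 3-value 'image'."""
    return [2.0 * w[0] + w[1], w[0] - w[1], 0.5 * w[1]]


def toy_features(image):
    """Hypothetical stand-in for the VGG16 feature extractor (identity here)."""
    return list(image)


def feature_loss(w, target_features):
    """Squared error between features of the generated and the real face."""
    generated = toy_features(toy_generator(w))
    return sum((g - t) ** 2 for g, t in zip(generated, target_features))


def numerical_gradient(w, target_features, eps=1e-5):
    """Central finite differences; a real system would use backpropagation."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        diff = feature_loss(up, target_features) - feature_loss(down, target_features)
        grad.append(diff / (2.0 * eps))
    return grad


def project_to_latent(real_image, steps=300, learning_rate=0.05):
    """Search for the latent vector whose generated face matches the real one."""
    target_features = toy_features(real_image)  # features of the real face
    w = [0.0, 0.0]  # initial latent vector
    for _ in range(steps):
        # Generate a face, extract its features, and measure the error
        grad = numerical_gradient(w, target_features)
        # Update the latent vector to reduce the error
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    return w, feature_loss(w, target_features)


# The 'real face' is produced from a known latent vector so recovery can be checked
real_face = toy_generator([0.7, -0.3])
latent, final_loss = project_to_latent(real_face)
print(latent, final_loss)
```

Because the toy problem is quadratic, plain gradient descent recovers the latent vector almost exactly; with a real generator and perceptual features the loss surface is non-convex, and the result is only an approximation of the input face.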

Experiments and Results

This section describes several experiments carried out to increase the accuracy of face beauty rating under diverse conditions. In each experiment, faces were examined under non-standard conditions, such as poor resolution. The goal was to reduce the mean absolute error (MAE). The beauty estimate (mean opinion score, MOS) is also presented to give a better understanding of how specific cases affect the beauty rating [45].
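For reference, the MAE between predicted and expert ratings can be computed as in the following minimal sketch. The function name and the sample values are hypothetical and do not correspond to results reported in this section.

```python
def mean_absolute_error(predicted, expert):
    """Average absolute difference between predicted and expert ratings."""
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(predicted)


# Hypothetical ratings on a 1 to 5 scale
mae = mean_absolute_error([3.0, 2.5, 4.0], [3.5, 2.5, 3.0])
print(mae)
```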

Training and Evaluation of the MLP Network

Before training the network, we generated a duplicate of each face in the SCUT-FBP5500 dataset using the StyleGAN2-ADA network. After detecting the faces and completing the required transformations, the characteristics of each face were extracted and passed to the MLP network. The network outputs the probability of each estimate, which is used to calculate the MOS (actual attractiveness rating). Figure 7 depicts network training using stochastic gradient descent. Following the generation of face copies, training was carried out using the generated images together with the attractiveness ratings of the original faces. After training, the network achieved a Pearson correlation of 0.741. The training outcomes are shown in Table 2. Stochastic gradient descent was applied with the following parameters: learning rate 0.001, decay \(1 \times 10^{-6}\), momentum 0.9. The training was carried out for 100 epochs.
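The role of the three optimizer parameters can be illustrated on a one-dimensional toy problem. The sketch assumes a time-based learning rate decay of the form lr / (1 + decay * step), which is one common convention and may differ from the schedule actually used; the objective function is hypothetical.

```python
def sgd_momentum_minimize(gradient, x0, steps,
                          learning_rate=0.001, decay=1e-6, momentum=0.9):
    """Gradient descent with momentum and time-based learning rate decay."""
    x = x0
    velocity = 0.0
    for step in range(steps):
        # Assumed decay convention: the learning rate shrinks slowly over time
        lr_t = learning_rate / (1.0 + decay * step)
        # Momentum keeps a running direction of previous updates
        velocity = momentum * velocity - lr_t * gradient(x)
        x = x + velocity
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3)
x_min = sgd_momentum_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0, steps=2000)
print(x_min)
```

With a decay as small as \(1 \times 10^{-6}\), the learning rate stays almost constant over 2000 steps, so the momentum term dominates the speed of convergence in this toy setting.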

Fig. 7 MLP network training using generated faces

Table 2 MLP network training results using generated faces

Figure 8 depicts the ROC curves of the MLP model when 250 faces from the SCUT-FBP5500 dataset were used for testing, with the goal of comparing the accuracy of beauty evaluation across a wide range of facial attractiveness. For each attractiveness rating from 1 to 5, 50 faces were chosen from the dataset. Figure 9 indicates that the error of the facial attractiveness assessment is similar for all possible attractiveness ratings, since the curves lie close to each other. By using face generation, the MLP model increases the area under the average ROC curve from 0.76 to 0.78. A larger area under the ROC curve indicates that the attractiveness of the face is determined more accurately.
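The area under a ROC curve for one attractiveness rating can be computed in a one-vs-rest manner, as sketched below. The sketch uses the rank-based definition of the AUC (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one). The function names and sample values are hypothetical, and the one-vs-rest setup is an assumption about how the per-rating curves are obtained.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(probabilities, true_ratings, rating):
    """AUC for a single rating (1 to 5) against all other ratings."""
    scores = [row[rating - 1] for row in probabilities]
    labels = [1 if r == rating else 0 for r in true_ratings]
    return roc_auc(scores, labels)


# Hypothetical predicted probabilities for ratings 1..5 and true ratings
probs = [
    [0.1, 0.6, 0.2, 0.1, 0.0],
    [0.0, 0.2, 0.6, 0.2, 0.0],
    [0.0, 0.1, 0.3, 0.5, 0.1],
    [0.0, 0.3, 0.5, 0.2, 0.0],
]
truth = [2, 3, 4, 3]
auc_rating_3 = one_vs_rest_auc(probs, truth, rating=3)
print(auc_rating_3)
```

Averaging the per-rating values gives one possible definition of the average area under the ROC curve.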

Fig. 8 MLP network ROC curves using generated faces

Fig. 9 MLP network ROC curves using generated faces with a sample in which the face attractiveness estimate (MOS) is evenly distributed

Training and the Evaluation of the CNN-Based Model

Continuous transfer learning was applied in this approach. First, the dataset was used to train the modified ResNet50 model (hereafter, the CNN model), whose layers were modified so that the model learns to calculate the probability of each attractiveness estimate. The CNN model produces the same type of output as the MLP model: both networks produce five probabilities, representing the likelihood of each face beauty estimate (1 to 5). However, the inputs and learning processes of the MLP and CNN networks differ. The CNN model uses a 224 × 224 pixel image as its input (a human face cropped from the photo). Each image is accompanied by an estimate of the person’s attractiveness. As with the MLP model, these estimates and the original face images were taken from the SCUT-FBP5500 dataset. Unlike the MLP model, however, the CNN model was trained with data augmentation, which increases the quantity and diversity of the data available for learning by applying various data transformations.

The training process employed the following augmentation transformations: twisting, zooming out, zooming in, and rotating. Figure 10 depicts the results of model training. A Pearson correlation coefficient of 0.865 was obtained after training this model. Table 3 summarizes the training results. The Adam [46] optimizer was used to train the model with the following parameters: learning rate 0.001, \(\beta_1 = 0.9\), \(\beta_2 = 0.999\). Early stopping was applied to help avoid overfitting.
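For readers unfamiliar with these parameters, the sketch below shows a single Adam update for one scalar parameter, using the learning rate and the two decay rates reported above. The small constant epsilon (set to 1e-8 here) and the toy objective are assumptions of this sketch, not values reported in this paper.

```python
import math


def adam_step(theta, grad, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for m
    v_hat = v / (1.0 - beta2 ** t)                # bias correction for v
    theta = theta - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta, m, v


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3)
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)
```

Because the update is normalized by the running gradient magnitude, each of the first steps moves the parameter by roughly the learning rate, regardless of how large the raw gradient is.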

Fig. 10 CNN network training

Table 3 Results of the CNN network’s training

Figure 11 shows a confusion matrix used to measure the accuracy of the face attractiveness assessment of 1095 faces from the SCUT-FBP5500 dataset. Figure 12 shows the ROC curve. Because the faces analyzed were picked at random, the expert (true) attractiveness ratings followed their natural distribution: most faces were rated by experts as being of average attractiveness. As a result, most of the faces tested were rated between 2 and 4. The confusion matrix and ROC curves demonstrate the accuracy of the beauty estimate (MOS) for each possible rating (1 to 5). According to the confusion matrix, the most common attractiveness score (assigned to 600 faces) is 3. The CNN network mostly assigns this rating correctly, although somewhat more frequently than it should.
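A confusion matrix of this kind can be built as in the following minimal sketch, in which rows correspond to expert ratings and columns to predicted ratings. The row and column convention, the function names, and the sample ratings are assumptions made for illustration.

```python
def confusion_matrix(true_ratings, predicted_ratings, ratings=(1, 2, 3, 4, 5)):
    """Count matrix with rows = expert rating and columns = predicted rating."""
    index = {rating: i for i, rating in enumerate(ratings)}
    matrix = [[0] * len(ratings) for _ in ratings]
    for true, predicted in zip(true_ratings, predicted_ratings):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def accuracy(matrix):
    """Share of faces on the diagonal, i.e., rated exactly as by the experts."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical expert and predicted ratings for six faces
expert = [3, 3, 4, 2, 3, 5]
predicted = [3, 4, 4, 2, 3, 4]
cm = confusion_matrix(expert, predicted)
for row in cm:
    print(row)
print(accuracy(cm))
```

Off-diagonal entries next to the diagonal correspond to faces whose rating is missed by one level, which is the most common type of error discussed in this section.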

Fig. 11 CNN network confusion matrix

Fig. 12 CNN network ROC curves

Figure 13 compares the accuracy of beauty assessment across a wide distribution of facial attractiveness. It shows the ROC curves of the network when 250 faces from the SCUT-FBP5500 dataset were used for testing. For each attractiveness rating (1, 2, 3, 4, and 5), 50 faces were chosen from the dataset. The ROC curves of the CNN network reveal that the area under the average ROC curve improves from 0.76 to 0.95 when compared to the MLP network. The larger the area under the ROC curve, the more accurately the attractiveness of the face is determined.

Fig. 13 CNN network’s ROC curves with a sample in which the face attractiveness estimate (MOS) is equally distributed

Evaluation of the Results Using Generated Faces

After extracting and preprocessing faces from the photos, a “copy” of each face was generated and fed into the CNN network instead of the original photo. The network’s training result is shown in Fig. 14. The Pearson correlation coefficient was measured as 0.882. The results of the training are given in Table 4. The Adam optimizer was used to train the network with the following parameters: learning rate 0.001, \(\beta_1 = 0.9\), \(\beta_2 = 0.999\). Early stopping was applied to help avoid overfitting.

Fig. 14 CNN network training using generated faces

Table 4 CNN network training results using generated faces

A confusion matrix is presented in Fig. 15 and the ROC plot in Fig. 16. Faces were chosen randomly to keep the expert (real) beauty estimate naturally distributed. Most faces were rated by experts as being of average attractiveness, so most of the faces tested were rated between 2 and 4. The confusion matrix and ROC curves demonstrate the accuracy of the beauty estimate (MOS) for each possible rating (1 to 5). The CNN network correctly assigned a beauty estimate of 4 to 170 faces, while 40 faces were incorrectly assigned this rating. The CNN network also correctly classified 104 faces with a beauty score of 2, while 85 faces were misjudged by a small margin. Among the least attractive faces, with a rating of 1, two faces were correctly recognized and two were incorrectly identified. The most attractive faces could not be identified correctly in any of 15 attempts.

Fig. 15 CNN network confusion matrix using generated faces

Fig. 16 CNN network’s ROC curves using generated faces

Figure 17 compares the accuracy of beauty assessment across the range of face attractiveness. It displays the CNN network’s ROC curves when 250 faces from the SCUT-FBP5500 dataset were examined. For each attractiveness rating (1, 2, 3, 4, and 5), 50 faces were chosen from the dataset. The use of face synthesis with the CNN network decreases the area under the average ROC curve from 0.95 to 0.91. These findings imply that the CNN network that employs generated faces still needs improvement when evaluating faces with beauty ratings of 4 or 2.

Fig. 17 CNN network’s ROC curves using generated faces with a sample in which the face attractiveness estimate (MOS) is evenly distributed

MLP and CNN Network Training to Generate Face Photos

All 5500 images from the SCUT-FBP5500 dataset were projected into the latent space so that the MLP and CNN networks could be trained to judge facial attractiveness from generated human faces. Figure 18 shows an example of a face generated in latent space. The original image is shown on the left side of the figure (always the same). The right side shows how the GAN may vary its output during the generation process (so each new face is different).

Fig. 18 Faces generated by GAN in the latent space

Facial attractiveness evaluation without the use of face copies is illustrated in Fig. 19 and facial attractiveness assessment for generated face reproductions is shown in Fig. 20. Both figures show an MLP model beauty estimate ("MLP MOS") and a CNN model beauty estimate ("CNN MOS") for each face.

Fig. 19 Facial attractiveness rating without generating copies of faces

Fig. 20 Facial attractiveness rating for generated face copies

Summary of Experiments

We reduced the margin of error in the experiments by employing our CNN-based model. To test robustness, face images with varying brightness and contrast, as well as low-resolution images, were studied. The results show that the approach based on the CNN model makes fewer errors than the more basic MLP model. This is because the MLP model responds only to geometric facial proportions, whereas the CNN model additionally responds to changes in face color. A summary of the experimental results is given in Table 5. This table shows how different image parameters affect the overall evaluation. While a change in resolution has only a minor effect on the evaluation values, a change in brightness or contrast can have a significant influence on the outcome in many circumstances.
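Perturbations of this kind can be produced with simple pixel operations, as in the sketch below. The linear brightness and contrast formula, the block-averaging downscaling, and the tiny grayscale test image are all assumptions made for illustration; the exact perturbations used in the experiments are not specified here.

```python
def adjust_image(image, brightness=0.0, contrast=1.0):
    """Apply a linear contrast change around mid-gray plus a brightness offset.

    image: grayscale image as a list of rows with values in 0..255
    """
    def adjust_pixel(value):
        adjusted = contrast * (value - 128.0) + 128.0 + brightness
        return int(min(255.0, max(0.0, round(adjusted))))  # clamp to 0..255

    return [[adjust_pixel(value) for value in row] for row in image]


def downscale(image, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(round(sum(block) / len(block)))
        result.append(row)
    return result


# Hypothetical 2 x 4 grayscale image
image = [[0, 2, 128, 132], [4, 6, 124, 128]]
brighter = adjust_image(image, brightness=10)
flatter = adjust_image(image, contrast=0.5)
smaller = downscale(image, 2)
print(brighter, flatter, smaller)
```

Each perturbed image would then be passed through the same evaluation pipeline, and the change in the predicted rating would be compared with the rating of the unmodified image.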

Table 5 Summary of experimental results

Discussion and Conclusions

A comparison with state-of-the-art methods on the SCUT-FBP5500 dataset is given in Table 6. The comparison indicates that our methodology (MAE score of 0.205) outperforms most state-of-the-art algorithms and achieves accuracy comparable to the regression ensemble-based CNN (MAE score of 0.201), which integrates the ResNeXt-50 and Inception-v3 architectures through FC layers but lacks an internal GAN-based evaluator like ours, and thus theoretically has a potentially lower internal complexity when approaching unseen faces.

Table 6 MAE comparison with state-of-the-art approaches

We proposed a novel approach for determining the attractiveness of a face by generating artificial faces; using a generated face instead of the real face enhances the accuracy of determining facial attractiveness. The model’s hidden layers can potentially learn useful face traits that are congruent with human visual perception. Regarding threats to internal validity, the network depth is critical. There are two dimensions to external validity threats. On the one hand, training labels were assigned by students from a specific community, which may not reflect a general perspective. The training photos, on the other hand, are drawn from Asian and Caucasian populations (depending on the benchmark dataset used), which may lead to a potential data bias in diversity. The capacity to evaluate face beauty also opens the door to hidden facial beauty alterations. The incorporation of the GAN as a component naturally adds some computing cost to the process when compared to pure CNN-based models, which do not use extra or augmented information and rely only on vast, precisely annotated databases for training. We feel that refining pruning approaches and minimizing training time is an area that requires attention, as is the adaptation of GAN structures, which might benefit from enhanced sparsity and selectiveness.

Unfortunately, the presented solution may have a race-based bias due to the composition of the dataset used for training. A more diverse dataset representing more racial types of human faces may be needed to avoid the own-race bias problem [50] and achieve fairer results.

Future study is needed to enhance face normalization in the latent space to increase the accuracy of determining facial attractiveness. Finally, despite improvements shown in recent studies dealing with attractiveness accuracy score due to the non-linearity of deep feature representations, no model is yet sufficiently robust for face beauty evaluation in unconstrained environments, with photos taken from different angles, such as the side, top, or bottom.