1 Introduction

Enabling a machine to evaluate facial beauty is a challenging task that has recently attracted growing interest from both the research and industry communities. Facial beauty prediction (FBP) has application potential in aesthetic surgery [59], face beautification [25] and content-based image retrieval [34]. Although generic or universal FBP, where the mean of the scores provided by different raters is treated as the ground-truth, has been extensively researched and has achieved impressive results [27], the investigation of personal facial beauty preferences remains quite limited.

Fig. 1 Face image samples with scores given by 5 different raters and their average scores (A). The figure illustrates how scores of a face image may vary considerably from one individual to another

Beauty perception is well known to be highly subjective, as individuals may have very different visual preferences depending on their personality traits and their social and cultural backgrounds [2]. Moreover, the generic or average beauty score of a face image averages out individual preferences, so the beauty perception of an individual cannot be effectively inferred from generic facial beauty. As an example, Fig. 1 demonstrates the variety of scores given to the images by 5 raters; the average scores are also shown for comparison. The individual nature of facial beauty makes the study of its personalized aspect important and also extends the range of FBP applications. Recommender systems in social media, personalized makeup recommendation [30] and online dating [22] are some examples of applications of personalized facial beauty assessment.

At the same time, the high subjectivity of personalized FBP poses a challenge that generic FBP does not face. Since annotating images with beauty scores is time-consuming and tedious for an individual in real-world applications, an effective personalized method is required to reach reasonable accuracy with a small number of training image samples. Another issue in this research direction is the lack of data for studying personal preferences. Most publicly available facial beauty datasets contain scores collected from only a few dozen raters of one age, gender or ethnic group [44, 50]. Therefore, these data are not sufficient for research on personalization.

Since only a few works have addressed personalized FBP, exploiting geometric features with shallow predictors [3, 47] or visual regularization for matrix factorization [39] to predict individual preferences, personalized methods from the related research area of image aesthetics are also investigated. There are personalized aesthetic assessment methods that adapt a pretrained generic aesthetic model in order to learn individual scores [26, 37]. However, generic models that are trained to predict the average score of an image suppress the differences between personal aesthetic perceptions. Thus, these methods are not able to effectively capture preferences shared among individuals. Moreover, these methods strongly average the prediction results of individual preferences and are not capable of quickly adapting to a new individual. Recently, meta-learning, which aims to train a model that can be quickly adapted to a new task, has been successfully applied to reduce cold-start problems in recommender systems [24, 32], where the number of items annotated by an individual is also strongly limited.

In this work, a novel personalized facial beauty assessment method based on meta-learning is introduced. First, beauty preferences shared by individuals are learnt during meta-training. Then, the model is adapted to a new individual with a few labeled image samples in the meta-testing phase. The experiments are conducted on a facial beauty dataset with a high diversity in age, gender and ethnicity, where images are rated by hundreds of volunteers. The results demonstrate that the proposed method is capable of effectively learning personal beauty preferences from a limited number of annotated images.

The main contributions of the work are the following:

  1. The personalized aspect of facial beauty assessment is studied. The importance of this issue is statistically demonstrated, including the relation and deviation between personal and generic facial beauty perceptions.

  2. A meta-learning-based approach that is able to learn an individual’s beauty preferences from a few annotated face images is introduced. The preferences shared by an extensive number of individuals are captured during the meta-training phase, while fine-tuning to an individual’s taste is conducted in the meta-testing phase.

  3. The proposed method, as well as various personalized frameworks from the facial beauty and aesthetics research areas, is evaluated for the prediction of personal beauty preferences on different numbers of face images.

  4. The extensive experiments conducted on an in-the-wild facial beauty dataset, which includes faces of various ethnic, gender and age groups rated by hundreds of volunteers with different social and cultural backgrounds, demonstrate that the proposed meta-learning method outperforms the FBP state-of-the-art in quantitative comparisons.

The rest of this work is organized as follows. Generic FBP methods, as well as personalized FBP and aesthetic studies, are outlined in Sect. 2. The comprehensive analysis of personal preferences with regard to generic beauty scores is presented in Sect. 3. The proposed meta-learning-based personalized facial beauty assessment method is described in Sect. 4. Experimental results and comparisons are given in Sect. 5.

2 Background

In this section, generic FBP frameworks are first outlined. Then, existing personalized beauty prediction methods, as well as related studies from the image aesthetics research area, are presented.

2.1 Generic facial beauty prediction

The earliest FBP approaches are based on shallow predictors, such as support vector machines [33], linear regression [14], tree-based classifiers [19] and k-nearest neighbors [1], that are trained on geometric features, such as facial ratios and landmark distances. In some works, these geometric features must be manually labeled or adjusted [21, 41], while other studies present automatic feature extraction methods that avoid costly manual annotation of facial landmarks [6, 60]. Computer-generated female faces without diversity in facial expression, pose and hairstyle are used to investigate the relationship between facial beauty and facial proportions [15]. Textural features, such as Gabor decompositions [7], Eigenface projections [47], Gist, HOG [3] and SIFT descriptors [56], are also exploited to train shallow machine learning algorithms to predict facial attractiveness.

As in other computer vision tasks, deep learning has brought great achievements to facial beauty assessment [16, 18, 50]. A range of neural network structures are specifically designed for the task [51, 52]. Different loss functions [54] are also presented in order to enhance CNNs and achieve better accuracy. A residual-in-residual network structure that builds a better pathway for information transmission and improves feature representation is presented in [4]. Transfer learning-based approaches form a large part of FBP research. Various CNN structures, such as VGG16 [39, 55], ResNeXt-50 [27] and FaceNet [23], pretrained for object recognition and face verification, are exploited as feature extractors and have demonstrated their superiority over previous works. A stacking ensemble model combined with features extracted by a network pretrained on VGG Face 2 is introduced in [45]. A light deep convolutional neural network built from both the Inception module of GoogLeNet and the Max-Feature-Map activation layer is also introduced for facial beauty prediction. An attribute-aware network that takes full advantage of facial attributes as prior knowledge is presented in [29]. Facial parsing masks for learning an accurate representation of facial composition, combined with a co-attention learning mechanism, are also exploited for facial beauty prediction [40].

Multi-task networks that simultaneously predict facial beauty, gender and ethnicity from a face image have recently been presented [46, 53]. Unlike single-task networks, these multi-task networks combine task-assisted information from other datasets related to facial beauty assessment to improve performance [16]. A deep cascaded forest combined with features extracted from a face image by multi-grained scanning is also introduced for facial beauty prediction [57, 58]. A deep manifold-learning approach with the supervised Local Discriminant Embedding algorithm is proposed in [12]. All of the works above adopt fully supervised schemes. Lately, semi-supervised learning has been introduced for face beauty scoring. A nonlinear flexible manifold embedding for score propagation is presented in [10]. Graph-based semi-supervised learning methods are applied in [11, 13].

Since facial shape, symmetry and proportions play key roles in beauty, 3D facial attractiveness has recently been explored. A range of 3D beauty enhancement methods have been introduced [28, 62]. In [48], dense 3D face information is utilized with deep learning to automatically reshape a portrait to be more shapely and better proportioned. The 3D facial attractiveness prediction issue is addressed in [49], where a CNN that effectively learns beauty scores on 3D faces with geometry and texture information is presented. A 3D FBP dataset with 6000 3D faces is also introduced in that work.

2.2 Personalized assessment

The first attempts to learn an individual’s facial beauty preferences exploit support vector regression that maps low-level image features, e.g., eigenface projections, Gabor filters, edge orientation histograms and geometric features, to facial beauty scores [47]. A personalized relative beauty ranking system that also exploits geometric and textural features is presented in [3]. Visual regularization for matrix factorization, which regresses a new image query to a latent space in order to predict a user’s beauty ratings, is introduced in [39]. Logistic regression trained on FaceNet embeddings is leveraged for personalized facial beauty assessment in [20]. The CNN features that play the most significant role for an individual are selected to train a random forest in [22]. A Siamese multi-task deep learning architecture that predicts the similarity and match between the personal interests, preferences and attitudes of two individuals based on their face images is presented in [17].

Table 1 Evaluation of publicly available facial beauty datasets from the ability to be applied to study personal preferences point of view
Table 2 The ethnic, gender and age composition of face images in MEbeauty

Since personalized image aesthetics assessment also studies an individual’s visual preferences, its methods are investigated as well. A feature-based collaborative filtering approach that transforms the features of an item to latent vectors in order to predict an individual’s image aesthetics preferences is introduced in [36]. A residual adapters model, which keeps a reduced number of user-specific parameters and is therefore scalable to a large number of images, is proposed in [38]. A residual-based model adaptation scheme that learns an offset to compensate for the generic aesthetics score is presented in [37]. Another personalized method that is also based on a pretrained generic aesthetics assessment model is introduced in [26]. An image quality assessment method with social-sensed aesthetic preferences, for which a deep neural network is developed, is introduced in [8].

3 Personal preferences analysis

In order to explore how individual preferences relate to, depend on and differ from generic beauty perception, a statistical analysis is conducted. The correlation between an individual’s scores and the ground-truth, as well as the standard deviation, is investigated.

Fig. 2 The statistical information about raters in the MEbeauty dataset. The ethnic composition of female annotators is presented in a, while the age structure is shown in c. The ethnic and age statistics of male raters are shown in b and d, respectively

3.1 Dataset

In contrast to previous FBP works, where the employed datasets have restrictions on image or rater age, gender and ethnicity (Table 1), the dataset exploited for the analysis and experiments in this work includes 2550 faces of various ethnic, gender and age groups (Footnote 1). Table 2 contains detailed statistical information on the composition of the dataset. First, face images are grouped by ethnicity, including Black, Indian, Hispanic, Mideastern, Asian and Caucasian. Second, images are separated into male and female genders. Third, the exact numbers of face images belonging to various age groups, including faces younger than 14 years old, from 14 to 20, 20 to 35, 35 to 55 and older than 55, are presented. The total number of images representing each ethnic, gender and age group is also given.

The scores are collected by using Amazon SageMaker (Footnote 2) on a scale of 1–10, where 1 denotes a not attractive face and 10 a very attractive face. The images are rated by more than 300 volunteers of different cultural and social backgrounds in order to reduce biases in the ground-truth score of an image. Moreover, this provides the opportunity to study the personalized aspect of facial attractiveness. Figure 2 demonstrates the proportions of the ethnic and age groups among female and male raters in the MEbeauty dataset.

3.2 Correlation of individual scores to the ground-truth

The correlation between the scores gathered from different individuals and the ground-truth is demonstrated in Fig. 3, where the red bar is assigned to female images and the blue one to male faces. The Pearson correlation (PC) coefficient, described later in Sect. 5.3, is calculated between each rater’s scores and the corresponding ground-truth, and then the mean over raters is computed. This statistical measure is reported not only for all the scores and face images, but also for the separate ethnic groups in the dataset.

The analysis confirms the expected subjectivity of personal scores and yields a PC of 0.62 for scores assigned to female images and 0.54 for male images. Scores assigned to male faces tend to be less correlated with the ground-truth and have a more personal character than scores of female images in most ethnic groups presented in the dataset. Regarding the ethnic division, the scores assigned to Hispanic and Black faces are also more subjective than those assigned to Asian and Caucasian faces in the dataset.

Fig. 3 Correlation between individual scores and the ground-truth of face images. The red bar is assigned to female faces, while the blue one is for male images. The correlations are demonstrated for all images, as well as for separated ethnic groups presented in the dataset

3.3 Standard deviation of personal preferences

While the PC in the previous subsection shows how closely individuals’ scores follow the ground-truth, the standard deviation helps to analyze how far the scores gathered from raters differ from it. Figure 4 illustrates the distributions of the standard deviation of the individual scores given to an image from its ground-truth. On the graphs, the red curve is assigned to scores given to female faces, while the blue curve is for male images. On the scale of 1–10 used for scoring in the dataset, the average deviation is about two. The deviation of scores assigned to male images from their ground-truth is higher than that of scores given to female images.
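The per-image deviation discussed above can be illustrated with a minimal sketch. It assumes that the ground-truth of an image is the mean of its raters' scores and that the population form of the standard deviation (division by the number of raters) is used; the function name and the example scores are hypothetical and are not taken from the dataset.

```python
import math


def deviation_from_ground_truth(scores):
    """Return (ground_truth, std) for the scores raters gave to one image.

    Assumption: the ground truth is the mean score, and the deviation is
    the population standard deviation around that mean.
    """
    ground_truth = sum(scores) / len(scores)
    variance = sum((s - ground_truth) ** 2 for s in scores) / len(scores)
    return ground_truth, math.sqrt(variance)


# Hypothetical scores of five raters for one image on the 1-10 scale.
ground_truth, std = deviation_from_ground_truth([3, 5, 7, 8, 7])
```

Collecting the second returned value over all images would give one way to obtain a distribution of the kind plotted in Fig. 4.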

Fig. 4 Distribution of standard deviation. The red curve is assigned to scores given to female faces, while the blue curve is for male images

Fig. 5 The proposed personalized facial beauty assessment method. There are two main phases: meta-training that is aimed to learn the beauty preferences shared among individuals and meta-testing for fine-tuning the model to a new user

4 Methodology

Figure 5 illustrates the proposed personalized meta-learning-based facial beauty assessment method. The approach includes two main steps: meta-training, which aims to learn preferences shared by an extensive number of individuals, and meta-testing, which fine-tunes the model trained in the first step to an individual’s preferences. The details of the proposed method are discussed in this section.

4.1 Image preprocessing

Since deep learning is exploited in the approach, image preprocessing with face cropping, alignment and data augmentation is necessary. The Multi-task Cascaded Convolutional Network [61] is used to detect a face in an image and conduct its alignment. Image augmentation, including rotation, shifting and flipping, is also performed on training data. Figure 6 demonstrates an example of the image preprocessing exploited in this work.
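Two of the augmentation operations named above, flipping and shifting, can be sketched on a toy image stored as a list of pixel rows. This is only an illustration under stated assumptions: the helper names are hypothetical, rotation is omitted because it requires interpolation, and the shift range and padding value used in the experiments are not given in the text.

```python
def flip_horizontal(image):
    """Mirror a 2-D image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left side with `fill`."""
    if pixels <= 0:
        return [row[:] for row in image]
    shifted = []
    for row in image:
        pad = min(pixels, len(row))
        shifted.append([fill] * pad + row[:len(row) - pad])
    return shifted


# A hypothetical 2 x 3 grayscale image.
toy_image = [[1, 2, 3],
             [4, 5, 6]]
augmented = [flip_horizontal(toy_image), shift_right(toy_image, 1)]
```

In practice, such operations would be applied to the cropped and aligned face images of the training data only.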

Fig. 6 The pipeline of the image preprocessing applied for the method. Face cropping, face alignment and image augmentation are included in the pipeline

4.2 Meta-training

First, facial beauty preferences shared by an extensive number of individuals are learnt in the meta-training phase. The goal of this phase is to obtain a model that captures the shared rules by which different individuals judge facial beauty. To train this model, a large number of personalized facial beauty assessment (PFBA) tasks are used. The PFBA task of an individual refers to the face images and the scores given to these images by the individual. Each task in the meta-training phase is divided into a support set and a query set, where the former is exploited to update the model parameters, while the latter is used for model validation. If \(n_i\) is the number of images rated by the i-th individual that participates in the meta-training phase, then the meta-training set of the individual is defined as

$$\begin{aligned} {{D_S}^i} \cup {{D_Q}^i} = {\{({x_j},{y_j})\}}^{n_i}_{j=1} \end{aligned}$$
(1)

where \(x_j\) is the image and \(y_j\) is the score given to this image by the individual. This set is divided into the support set \({D_S}^i\) and the query set \({D_Q}^i\) for each individual.

To predict the score \(\hat{y_j}\), the image \(x_j\) is fed into the model

$$\begin{aligned} \hat{y_j} = {f_\theta }(x_j; \theta ) \end{aligned}$$
(2)

where f is the model (a CNN) and \(\theta \) denotes the random initial parameters of the model. Since the goal of the model training is to minimize the difference between the real score \(y_j\) and the predicted score \(\hat{y_j}\) of the image \(x_j\), the loss function is calculated by

$$\begin{aligned} {\mathcal {L}} = {\Vert {f_\theta }(x;\theta ) - y \Vert }_2^2 \end{aligned}$$
(3)

Then, the model parameter \(\theta \) is updated according to the loss \({\mathcal {L}}_i\) computed on the support set of the i-th individual

$$\begin{aligned} \theta ^* \leftarrow \theta - \alpha \nabla _\theta {\mathcal {L}}_i(f_\theta ) \end{aligned}$$
(4)

where \(\alpha \) denotes the inner learning rate. This is the local update, which can be referred to as the personalization procedure for an individual. To find the parameters \(\theta \) that can be quickly adapted to a new PFBA task, the loss \({\mathcal {L}}_Q(f_{\theta ^*})\) on the query set is used to calculate

$$\begin{aligned} \theta \leftarrow \theta - \beta \nabla _\theta \sum _{i=1}^{n} {\mathcal {L}}_Q(f_{\theta ^*}) \end{aligned}$$
(5)

where \(\beta \) is the outer learning rate and the sum runs over the n PFBA tasks used for meta-training. After training on these tasks, the model with shared facial beauty preferences can be adapted to a new individual’s preferences.
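The interplay of the inner update in Eq. (4) and the outer update in Eq. (5) can be illustrated with a minimal sketch on a toy problem. It is not the implementation used in this work: the CNN is replaced by a hypothetical scalar model f(x) = theta * x, the loss of Eq. (3) is averaged over the samples, plain gradient descent stands in for Adam, and all function names, tasks and learning rates are assumptions chosen for illustration.

```python
def loss(theta, data):
    """Mean squared error of the toy model f(x) = theta * x (cf. Eq. 3)."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """Derivative of `loss` with respect to theta."""
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)


def inner_update(theta, support, alpha):
    """Eq. (4): one gradient step on the support set of one task."""
    return theta - alpha * grad(theta, support)


def meta_step(theta, tasks, alpha, beta):
    """Eq. (5): update theta using the query losses of the adapted models."""
    meta_grad = 0.0
    for support, query in tasks:
        theta_star = inner_update(theta, support, alpha)
        # Chain rule through Eq. (4): d(theta_star) / d(theta) for this model.
        d_star = 1 - alpha * sum(2 * x * x for x, _ in support) / len(support)
        meta_grad += grad(theta_star, query) * d_star
    return theta - beta * meta_grad


# Two hypothetical raters whose scores follow y = 1 * x and y = 3 * x.
tasks = [
    ([(1.0, 1.0), (2.0, 2.0)], [(3.0, 3.0)]),
    ([(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]),
]
theta = 0.0
for _ in range(200):
    theta = meta_step(theta, tasks, alpha=0.05, beta=0.01)

# Adapting the shared parameter to the second rater with one inner step.
adapted = inner_update(theta, tasks[1][0], alpha=0.05)
```

In this toy setting, the shared parameter settles between the two raters' slopes, and a single inner step on a rater's support set lowers that rater's query loss, which mirrors the intended roles of meta-training and adaptation.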

In the proposed method, the Inception-ResNet [43] architecture is adopted as the backbone. Since the approach is designed to work with face images, the network is pretrained on the VGG Face 2 dataset [5]. The CNN is fine-tuned, and two fully connected layers are added. In order to avoid overfitting, dropout and batch normalization are also added to the network. Moreover, since the number of images available for meta-training and meta-testing is relatively low, the weights of all layers, except the last two convolutional layers and the added fully connected layers, are frozen. The loss function in Eq. (3) is used to train the network, and the Adam optimizer minimizes it over the training images.

4.3 Meta-testing

Once the meta-learner is trained, it can be adapted to an individual’s taste with a few annotated images. In other words, the model can be fine-tuned to a new PFBA task with different images rated by a different individual. To this end, the images rated by this individual are divided into the support set and the query set for meta-testing. The parameters of the meta-learner are updated during the training process on the support set. When the training is done, the query set is used to validate the final user-specific model.

5 Experimental results

5.1 Settings

The proposed method, as well as the other personalized methods presented in this work, is evaluated on the dataset described in Sect. 3.1. In order to organize meta-training and meta-testing on different images rated by different individuals, the dataset is divided into two parts. The first part contains 1900 images labeled by 300 raters and is used in the meta-training phase, while the remaining 650 images, evaluated by 60 individuals, are employed for meta-testing. The number of images annotated by one individual in both the training and testing sets varies from 50 to 1000.

All experiments in this work are conducted on Amazon Elastic Compute Cloud (Footnote 3). A GPU-based deep learning virtual machine with Ubuntu 18.04 is selected to design and run the experiments.

5.2 Implementation details

5.2.1 Meta-training

The maximum size of the support set is set to 80 images because the number of image samples rated by each individual in the meta-training phase differs, and the remaining images annotated by an individual go to the query set. If the number of images rated by an individual is less than 80, the samples are split into 80% for the support set and 20% for the query set. The meta-learner is trained for 50 epochs. The learning rate is set to 0.001 and 0.0001 for the CNN and the two added fully connected layers, respectively. A rate of 0.5 is chosen for the dropout layer.
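The splitting rule for one meta-training task can be sketched as follows. The function name is hypothetical, the samples are assumed to be already shuffled, and treating exactly 80 samples with the 80%/20% rule (so that the query set is never empty) is an assumption, since the text only covers the cases above and below 80.

```python
def split_support_query(samples, max_support=80, support_fraction=0.8):
    """Split one rater's samples into (support, query) for meta-training.

    Raters with more than `max_support` samples contribute exactly
    `max_support` support samples; the rest go to the query set.
    Smaller tasks are split by `support_fraction` instead.
    """
    if len(samples) > max_support:
        cut = max_support
    else:
        cut = int(len(samples) * support_fraction)
    return samples[:cut], samples[cut:]


# Hypothetical tasks with 200 and 50 rated images (image indices only).
support_large, query_large = split_support_query(list(range(200)))
support_small, query_small = split_support_query(list(range(50)))
```

With these inputs, the larger task yields 80 support and 120 query samples, and the smaller one 40 and 10.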

5.2.2 Meta-testing

Since the number of images in the support set, and potentially in the query set, of meta-testing is low, the performance strongly depends on a few images. To obtain a more objective evaluation, the annotated images of each individual are randomly divided five times into support and query sets with a particular number of training images, and the average accuracy over all five splits is taken as the result for this individual. A learning rate of 0.0001 is experimentally selected for fine-tuning in the meta-testing phase, and the number of epochs is set to 25 for training on each individual.
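The repeated random splitting described above can be sketched as a small evaluation loop. The function name, the fixed seed and the `train_and_score` callback, which stands for fine-tuning on the support set and scoring on the query set, are hypothetical placeholders rather than parts of the original implementation.

```python
import random


def evaluate_over_splits(samples, n_train, train_and_score, n_splits=5, seed=0):
    """Average a score over `n_splits` random support/query splits.

    `train_and_score(support, query)` is a caller-supplied function that
    adapts the model on `support` and returns a metric measured on `query`.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_splits):
        shuffled = samples[:]          # keep the caller's list untouched
        rng.shuffle(shuffled)
        support, query = shuffled[:n_train], shuffled[n_train:]
        results.append(train_and_score(support, query))
    return sum(results) / len(results)


# Usage with a dummy callback that only reports the query-set size.
mean_query_size = evaluate_over_splits(list(range(30)), 10, lambda s, q: len(q))
```

Averaging over several splits reduces the influence of any single small support set on the reported result for an individual.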

Table 3 Performance of the proposed personalized facial beauty assessment method on different numbers of training image samples

5.3 Evaluation metrics

The performance of the proposed and other methods used in this work is reported in terms of Pearson Correlation (PC), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).

PC is a measure of linear correlation and is calculated by

$$\begin{aligned} PC = \frac{\sum _{i=1}^{n}({{a_i}}-\bar{a})({b_i}-\bar{b})}{\sqrt{\sum _{i=1}^{n}(a_i-\bar{a})^2}\sqrt{\sum _{i=1}^{n}(b_i-\bar{b})^2}} \end{aligned}$$
(6)

where n is the number of image samples, \({a_i}\), \({b_i}\) are sample points, and

$$\begin{aligned} \bar{a} = \frac{1}{n}\sum _{i=1}^{n}{a_i}, \quad \bar{b} = \frac{1}{n}\sum _{i=1}^{n}{b_i} \end{aligned}$$
(7)

PC ranges from −1 to 1, where 1 indicates total positive linear correlation, 0 indicates no linear correlation, and −1 means total negative linear correlation.
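Equations (6) and (7) translate directly into a few lines of code. The sketch below also includes the per-rater averaging used in Sect. 3.2 under the simplifying assumption that every rater scored every image, which does not hold for the actual dataset; the function names and inputs are hypothetical, and constant score lists, for which the denominator of Eq. (6) vanishes, are not handled.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long score lists (Eq. 6)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n          # Eq. (7)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def mean_rater_correlation(ratings):
    """Average PC between each rater's scores and the per-image mean score.

    `ratings[r][j]` is the score of rater r for image j; all raters are
    assumed to have scored the same images.
    """
    n_images = len(ratings[0])
    ground_truth = [sum(r[j] for r in ratings) / len(ratings)
                    for j in range(n_images)]
    return sum(pearson(r, ground_truth) for r in ratings) / len(ratings)


# Hypothetical scores of three raters for four images.
example = mean_rater_correlation([[2, 4, 6, 8], [3, 4, 7, 7], [5, 3, 6, 9]])
```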

MAE and RMSE also measure the quality of a machine learning model, and the values close to zero indicate better performances.

$$\begin{aligned} MAE = \frac{1}{n}\sum _{i=1}^{n}|f({x_i}) -{y_i}| \end{aligned}$$
(8)

$$\begin{aligned} RMSE = \sqrt{\frac{1}{n}\sum _{i=1}^{n}{({f({x_i})-{y_i}})^2}} \end{aligned}$$
(9)

where n is the number of image samples, \({x_i}\) is the input feature vector of face image i, \(f(\cdot )\) is the learning algorithm, and \({y_i}\) is the ground-truth score of face image i.
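As a small worked example of Eqs. (8) and (9), the sketch below computes both errors for hypothetical predictions and scores; the function names and the numbers are illustrative only.

```python
import math


def mae(predicted, target):
    """Mean absolute error between predictions and ground-truth scores (Eq. 8)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def rmse(predicted, target):
    """Root mean square error between predictions and ground-truth scores (Eq. 9)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))


# Hypothetical predictions and ground-truth scores for three images.
predicted = [5.0, 7.0, 4.0]
target = [6.0, 7.0, 2.0]
errors = (mae(predicted, target), rmse(predicted, target))
```

Here the absolute errors are 1, 0 and 2, so the MAE is 1 and the RMSE is the square root of 5/3; the RMSE is larger because it penalizes the single large error more strongly.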

5.4 Performance of the personalized meta-learning-based approach

Since the goal of this work is to propose a method that is able to effectively learn an individual’s facial beauty preferences from a small number of training images, the presented personalized approach is evaluated on different numbers of training samples. In the case of the meta-learning approach, the training images form the support set in meta-testing. Table 3 demonstrates the accuracy of the proposed method in terms of PC, MAE and RMSE on 10, 50, 100, 250 and 500 image samples. Among the presented quantities, the best trade-off for the number of images to be annotated by an individual is 100, since 10 and 50 samples produce much lower accuracy, while 250 images do not improve on 100 images enough to justify asking an individual to rate more than twice as many images.

Fig. 7 The performance comparison of the model with shared beauty preferences obtained in the meta-training phase (green bar) and the model fine-tuned to an individual (red bar)

Table 4 Performance of the personalized meta-learning-based method with different backbones on 100 face images with and without preprocessing
Table 5 The performance evaluation of the shared preferences model obtained in the meta-training phase and the personalized model fine-tuned in the meta-testing phase with different backbones and with/without preprocessing on 100 face images

With regard to the separate evaluation of the meta-training and meta-testing performances, Fig. 7 illustrates the prediction accuracy in terms of PC obtained directly with the model with shared beauty preferences from the meta-training phase and with this model fine-tuned on images rated by an individual in the meta-testing phase. To demonstrate the effectiveness of these two models, 20 individuals are randomly selected among the 60 raters whose scores participate in the meta-testing phase. The green bar is assigned to the model with shared preferences, while the red one shows the accuracy of the personalized model. Depending on the individual, the performance improvement achieved by fine-tuning reaches 20% (0.2 in PC). At the same time, for some individuals the personalized models demonstrate similar or even slightly lower accuracy than the model with shared beauty preferences only.

5.5 Ablation study

Ablation studies are performed to analyze the contributions of the backbone CNN model used in both the meta-training and meta-testing phases, as well as of image preprocessing. As stated in Sect. 5.4, the optimal number of face images needed to train the proposed approach is 100. This number of images is employed in the ablation study.

In order to evaluate the performance of the network exploited in the proposed approach, the originally used Inception-ResNet pretrained on the VGG Face 2 dataset is replaced with ResNet18 pretrained on ImageNet [9]. By doing so, not only the network architecture but also its weights are evaluated for the proposed personalized method. This is especially important when the number of images is relatively low. Moreover, a network pretrained for face recognition might be beneficial for the proposed personalized facial beauty assessment task.

The contribution of the image preprocessing described in Sect. 4.1 is also evaluated. To this end, the proposed approach is run without any preprocessing on the dataset images, which have a high diversity in face pose and in face size relative to the image, while keeping the original backbone, its configuration and the same splitting of the dataset into meta-training and meta-testing parts and, in turn, into support and query sets.

Table 4 demonstrates the performance of ResNet18 pretrained on ImageNet and Inception-ResNet pretrained on VGG Face 2 in the proposed personalized method on face images with and without image preprocessing. The number of images used to fine-tune the model with shared preferences to an individual is 100. As expected, the performance with ResNet18 pretrained on ImageNet is lower than with Inception-ResNet pretrained on VGG Face 2. It can be assumed that a higher number of images in an FBP dataset could improve the results of the network pretrained on ImageNet. Since the dataset used for the experiments includes in-the-wild face images and has no restrictions on facial expression, pose or the area occupied by a face in the image, face cropping significantly improves the performance of the proposed personalized facial beauty assessment method.

The performance of the model with shared preferences and that of the personalized model are evaluated separately for all scenarios described above, i.e., different backbones and image preprocessing. Table 5 demonstrates that, although the best results are achieved by the combination of Inception-ResNet pretrained on VGG Face 2 and image preprocessing, the fine-tuning conducted in the meta-testing phase contributes more on face images without preprocessing.

Table 6 Comparison to the state-of-the-art

5.6 Comparison to the state-of-the-art

In this section, the performance of the proposed meta-learning-based personalized facial beauty assessment method is compared to other personalized FBP approaches. Moreover, since the number of studies that address the prediction of an individual’s facial beauty preferences is quite limited, some generic FBP methods and personalized approaches from the related research area of image aesthetics assessment are also considered for the task. In contrast to the evaluation presented in the previous section, the range of training image samples is changed from 10–500 to 50–500 because most existing personalized and generic FBP methods are not able to learn individual preferences from 10 image samples.

For the state-of-the-art methods presented in this section, the following train–validation split approach is applied. All experiments are conducted on each user’s scores in a cross-validation scenario. The data are randomly divided into a training set with the corresponding number of face images, and the rest of the data goes to the testing set. For each user, the performance is computed as the average over all five splits. The mean over all users’ performances is defined as the result.

Fig. 8 The pipeline of the residual-based personalized aesthetics assessment method [37] adapted to FBP. The method is employed as one of the state-of-the-art approaches to demonstrate the advantages of the proposed meta-learning method

First, the method based on low-level image features and SVR [47] is evaluated. Since eigenfaces achieve the best result among all features exploited in that work, they are selected for the comparison. The number of eigenface components is set to 100, and to 40 for 50 training images. The SVR configuration defined in the work is also applied. The study reports results only on 800 training images and only in terms of PC. In our work, the performance of this state-of-the-art method is instead presented in terms of PC, MAE and RMSE on 50, 100, 250 and 500 training samples.

As mentioned earlier in this section, the number of studies specifically addressing personalized facial beauty assessment is limited in the literature. Therefore, various frameworks that were originally proposed for generic FBP are adapted to personalized scenarios and evaluated on different numbers of training images. The framework presented in [55] looks attractive for the personal beauty preference prediction task as it includes a shallow predictor trained on deep features transferred from the face recognition task. More concretely, VGG16 pretrained on the VGG Face 2 dataset is exploited as a feature extractor, while Bayesian Ridge is used as a predictor. The approach presented in [45] also leverages a CNN pretrained on VGG Face 2, but contains a more complex learning mechanism that combines the predictions of several base machine learning models. The transfer learning-based method [27] that exploits the AlexNet network pretrained on ImageNet is also compared with the proposed approach.

Since the FaceNet network, which was originally designed and pretrained for the face recognition task, produces face embeddings of relatively low dimensionality, it can potentially perform well with a small number of annotated face images and may therefore be effective for personalized FBP. FaceNet embeddings are exploited for gender prediction in [42]; however, the authors of that work treat the task as classification and employ various traditional machine learning models. In our work, SVR is trained on FaceNet embeddings to predict an individual’s beauty preferences, and these features are evaluated for personalized FBP with various numbers of training images.

Personalized approaches originally proposed for a related task, image aesthetics assessment, are also considered for facial beauty prediction. In particular, the residual-based method [37], which adapts a generic model to an individual’s preferences, is evaluated. The main idea of this approach is to use the average score as the basis for prediction and to learn the difference between the average and the individual score. In the original method, SVR is trained on aesthetic and content attributes of images. To adapt this approach to the personalized facial beauty assessment task, 40 facial attributes [31] replace the features exploited in the original method. Figure 8 illustrates the pipeline of the residual-based method adapted to personalized facial beauty assessment.
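The residual idea can be written compactly: the personal prediction is the generic score plus a learned offset. The following is a minimal sketch under stated assumptions, not the implementation of [37]. The class name `ResidualPersonalizer`, the default SVR settings and the synthetic binary attributes are hypothetical, and a generic score is assumed to be available for every image (for example, the mean over raters or the output of a generic model).

```python
import numpy as np
from sklearn.svm import SVR


class ResidualPersonalizer:
    """Predicts personal score = generic score + learned residual."""

    def __init__(self):
        # Default SVR settings are an assumption, not those of [37].
        self.residual_model = SVR()

    def fit(self, attributes, generic_scores, personal_scores):
        # Learn the offset of the individual's score from the generic score.
        self.residual_model.fit(attributes, personal_scores - generic_scores)
        return self

    def predict(self, attributes, generic_scores):
        return generic_scores + self.residual_model.predict(attributes)


# Synthetic stand-in data: 120 images described by 40 binary facial attributes.
rng = np.random.default_rng(1)
attributes = rng.integers(0, 2, size=(120, 40)).astype(float)
generic = rng.uniform(1.0, 5.0, size=120)  # hypothetical average scores
personal = generic + 0.8 * attributes[:, 0] - 0.4 + rng.normal(scale=0.1, size=120)

personalizer = ResidualPersonalizer().fit(attributes[:50], generic[:50], personal[:50])
pred = personalizer.predict(attributes[50:], generic[50:])
print(np.abs(pred - personal[50:]).mean())
```

Because the generic score is added back at prediction time, the model has a sensible fallback even when very few personal annotations are available for learning the residual.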

Fig. 9

Prediction accuracy in terms of PC for various numbers of training images achieved by the state-of-the-art methods and the presented method (red line)

Table 6 reports the performance of the methods described above and of the proposed personalized approach in terms of PC, MAE and RMSE on 50, 100, 250 and 500 training samples. Additionally, to show the advantages of the presented meta-learning approach more clearly, the PC achieved by all the mentioned approaches for the different numbers of training images is shown in Fig. 9.

As can be seen, the method based on geometric features and a shallow predictor shows the lowest accuracy in all cases. Moreover, since this approach is applied here to a multi-ethnic dataset of in-the-wild face images, it achieves lower results than in its original work [47]. The generic methods adapted to personal preference prediction [27, 42] are quite effective on 500 samples, but remain ineffective on 250, 100 and especially 50 training images. The frameworks based on VGG face embeddings [45, 55] perform much better in the personalized scenario, and VGG features combined with the ensemble method give relatively good results on 250 samples. Since the prediction of the residual approach [37] includes the generic score of an image, it is competitive even on 50 images; however, the method is only able to improve its “personal” part on 500 images. In contrast, the proposed meta-learning method is not only more effective on 50 images, but also significantly improves its accuracy on 100 training samples, which makes it the most suitable method for the personalized scenario. Overall, the proposed personalized meta-learning-based facial beauty assessment method outperforms all the compared approaches for every number of training samples.

6 Conclusion and future work

This work studies the personalized aspect of facial beauty assessment. First, the importance of this issue is investigated through a statistical analysis of the relations between generic and personal beauty perceptions. Second, a personalized meta-learning-based facial beauty assessment approach is proposed. The method consists of a meta-training phase, where beauty preferences shared among a large number of individuals are learnt, and a meta-testing phase, which adapts the model obtained in the previous phase to a new individual. Experiments conducted on an in-the-wild facial beauty dataset with rich diversity in age, ethnicity, gender, face pose and expression demonstrate that the proposed method effectively learns an individual’s preferences from a small number of annotated images and outperforms the facial beauty prediction state of the art in personalized scenarios.

Other machine learning concepts will be explored for personalized facial beauty assessment in order to improve the prediction accuracy with as few annotated images as possible. Another way to enhance FBP and, at the same time, bring our research closer to real-life conditions is to study and apply 3D facial attractiveness prediction. Active learning, which improves performance on a limited number of training samples by selecting the images that are potentially most informative, will also be investigated for the task. Combining the proposed method with recommendation approaches in order to enhance recommender systems and reduce cold-start problems is another direction of our future work.