1 Introduction

Facial attributes represent intuitive semantic features that describe human-understandable visual properties of face images, such as smiling, eyeglasses, and mustache. Therefore, as vital information of faces, facial attributes have contributed to numerous real-world applications, e.g., face verification (Kumar et al. 2009; Berg and Belhumeur 2013; Song et al. 2014; Zhang et al. 2018c; Chen et al. 2018), face recognition (He et al. 2018c; Shi et al. 2015; He et al. 2018b; Song et al. 2018b; Rao et al. 2018), face retrieval (Li et al. 2015; Nguyen et al. 2018; Fang and Yuan 2018; Toderici et al. 2010), and face image synthesis (Huang et al. 2018a, b; Cao et al. 2019b; Song et al. 2019; Egger et al. 2018). Facial attribute analysis, aiming to build a bridge between human-understandable visual descriptions and abstract feature representations required by real-world computer vision tasks, has attracted increasing attention and has become a hot research topic. Recently, the development of deep learning techniques has made excellent progress in learning abstract feature representations, leading to significant performance improvements of the current algorithms in the field of deep facial attribute analysis.

Deep facial attribute analysis mainly consists of two sub-issues: facial attribute estimation (FAE) and facial attribute manipulation (FAM). Given a face image, FAE trains attribute classifiers to recognize whether a specific facial attribute is present, and FAM modifies face images to synthesize or remove desired attributes by constructing generative models. We provide concise illustrations of these two sub-issues in Fig. 1.

Deep FAE methods can generally be categorized into two groups: part-based methods and holistic methods. Part-based FAE methods first locate the positions of facial attributes and then extract features according to the obtained localization cues for the subsequent attribute prediction. According to the different schemes for locating facial attributes, part-based methods can be further classified into two subcategories: separate auxiliary localization based methods and end-to-end localization based methods. Specifically, separate auxiliary localization based FAE methods seek help from existing part detectors or auxiliary localization algorithms, e.g., facial key point detection (Mahbub et al. 2018; Wu and Ji 2017) and semantic segmentation (Kalayeh et al. 2017; Gonzalez-Garcia et al. 2018). Then, corresponding features from different positions can be extracted for further estimation. Note that the localization and the estimation are performed in a separate and independent manner. On the contrary, end-to-end localization based methods exploit the locations of attributes and predict their presence simultaneously in end-to-end frameworks. In contrast to part-based methods, holistic methods focus more on learning attribute relationships and estimating facial attributes in a unified framework without any additional localization modules. By assigning shared and specific attribute learning to different layers of networks, holistic methods model correlations and distinctions among facial attributes to explore the complementary information. During this process, holistic FAE algorithms resort to additional prior or auxiliary information, such as attribute grouping or identity information (Cao et al. 2018), to customize their network architectures.

Fig. 1 Illustrations of the two sub-issues in deep facial attribute analysis, i.e., FAE and FAM (a comes from the CelebA dataset (Liu et al. 2015), and b comes from Xiao et al. (2018))

Fig. 2 Tree diagram for diverse categories of deep facial attribute analysis algorithms

Deep FAM methods are mainly constructed based on generative models, of which generative adversarial networks (GANs) (Goodfellow et al. 2014; Mirza and Osindero 2014; Chen et al. 2016) and variational autoencoders (VAEs) (Kingma and Welling 2013; Huang et al. 2018a, b) serve as the backbones. Furthermore, deep FAM algorithms can be divided into two groups: model-based methods and extra condition-based methods, where the main difference between them is whether extra conditions are introduced. Model-based methods construct a model without any extra conditional inputs and learn a set of model parameters that only correspond to one attribute during a single training process. Thus, when editing another attribute, another training process needs to be executed in the same way. In this case, multiple attribute manipulations correspond to multiple training processes, resulting in expensive computation costs. In contrast, extra condition-based methods take extra attribute vectors or reference images as input conditions, and they can alter multiple attributes simultaneously by changing the corresponding values of attribute vectors or taking multiple exemplars with distinct attributes as references. Specifically, given an original image, an extra conditional attribute vector, such as a one-hot vector indicating the presence of the attribute, is concatenated with the latent codes of the original image. By comparison, extra conditional reference exemplars exchange specific attributes with the original image in the framework of image-to-image translation. Note that these reference images do not need to have the same identity as the original image. Hence, rather than merely altering the values of attribute vectors to edit facial attributes, attribute transfer based on reference images can discover more specific details of references and yield more faithful facial attribute images (Zhou et al. 2017; Xiao et al. 2018; Ma et al. 2018). Owing to the more abundant facial details and more photorealistic quality of the generated images, this type of method has attracted much attention from researchers.

Fig. 3 The evolution of deep FAE methods (best viewed by zooming in the electronic version)

In summary, we create a taxonomy of contemporary deep facial attribute analysis algorithms in a tree diagram in Fig. 2. Furthermore, aiming to summarize the progress in deep facial attribute analysis, milestones of both deep FAE and FAM methods are listed in Figs. 3 and 4, respectively.

As shown in Fig. 3, part-based FAE methods and holistic FAE methods follow two parallel routes. The study of deep FAE can be traced back to the earliest part-based work of Zhang et al. (2014), who take whole person images as inputs.

Then, LNet+ANet (Liu et al. 2015) pushes deep FAE into an independent research branch, where only face images are taken as inputs for estimating face-related attributes. In addition, two large-scale face datasets, i.e., CelebA and LFWA, with 40 labeled attributes, are released to provide data support for deep FAE methods. Since then, part-based and holistic methods have developed jointly and successfully but along distinct directions: part-based methods strongly emphasize facial details for discovering localization cues (Kalayeh et al. 2017; Mahbub et al. 2018), whereas holistic methods tend to employ attribute relationships to customize networks for learning more discriminative features (Rudd et al. 2016; Hand et al. 2017; Cao et al. 2018).

We outline the development of deep FAM methods in Fig. 4. Note that model-based methods and the two types of extra condition-based methods have their own evolutionary processes, but all follow the advances in GANs or VAEs. The earliest deep FAM work, DIAT (Li et al. 2016), a model-based method, first attempts to utilize simple GANs to generate facial attributes. Meanwhile, the conditional GAN (Perarnau et al. 2016) and VAE (Yan et al. 2016) dominate extra condition-based FAM methods by taking attribute vectors as conditions. Although extra attribute vector based methods have the remarkable advantage of changing multiple attributes simultaneously, they cannot guarantee that the details irrelevant to the manipulated attributes remain unchanged. Model-based methods can overcome this problem, but they cannot manipulate multiple attributes in a single training process. In light of these issues, methods conditioned on reference exemplars have come to researchers' attention. They balance the modification of multiple attributes of interest with the preservation of other irrelevant attributes, while generating more photorealistic facial attribute images. Hence, exemplar-guided FAM methods are becoming a popular research trend.

Fig. 4 The evolution of deep FAM methods (best viewed by zooming in the electronic version)

Although a large number of methods achieve appealing performance in deep FAE and FAM methods, there are still several severe challenges for future deep facial attribute analysis. Therefore, we summarize these urgent challenges and analyze possible opportunities in terms of data, algorithms, and applications. The corresponding overview is described in Fig. 5.

First, from the perspective of data, contemporary deep FAE methods suffer from insufficient training data. The two most commonly used datasets are drawn from celebrities or news (Liu et al. 2015), where attribute types, illumination, views, and poses all differ significantly from real-world data (Hand et al. 2018a). Therefore, future deep FAE models have high demands for diverse data sources and excellent data quality [e.g., video data (Wang et al. 2016; Hand et al. 2018b)]. Future facial attribute images need to cover more real-world scenarios and a wider range of attribute types. In this way, models can better capture representative features that conform to real-world data distributions. In addition, the imbalanced distribution of facial attribute images manifests in two aspects: the attribute category imbalance within a single dataset and the domain gaps between training and testing datasets. The former, known as the class-imbalance issue, biases FAE models towards the majority samples and away from the minority ones, degrading recognition performance on minority samples. The latter, known as the domain adaptation issue, has not yet been fully explored in current deep FAE algorithms and concerns the generalization of models, especially when testing over unseen data.

Fig. 5 Summary of challenges and opportunities in deep facial attribute analysis

Regarding the data challenges and opportunities in deep FAM, the rapid development of multimedia in the era of big data has given rise to rich video data. However, tracking and annotating facial attributes in videos is difficult because of spatial and temporal dynamics (Saito et al. 2017). Hence, video attribute manipulation is still a task to be addressed due to the lack of available training data. In addition, a large proportion of current algorithms evaluate the quality of their generated facial attribute images based on the visual fidelity (Li et al. 2016; Perarnau et al. 2016; Zhang et al. 2017a; Xiao et al. 2018). Because of the lack of established protocols and standards, such measurements might have adverse effects on the performance evaluation of deep FAM methods. Therefore, it would be a thorny problem to seek unified and standard data metric schemes that achieve both qualitative and quantitative analyses.

Second, from the perspective of algorithms, deep part-based FAE methods mainly focus on two aspects. The first is to integrate multiple face-related tasks, such as attribute estimation and face recognition, into a unified framework. In this way, the complementary information among different tasks could be fully exploited to improve all of them. For the second aspect, future part-based FAE algorithms are expected to discover more relationships among different attribute locations to handle in-the-wild data with complex environmental variations. For deep holistic FAE algorithms, current algorithms discover attribute relationships with the help of the prior information, e.g., human-made facial attribute groups. Such artificial partitions would limit the generalization ability of models. Hence, the critical challenge that holistic FAE methods face is to design adaptive attribute partition schemes for automatically exploring attribute relationships during the training processes.

With regard to the algorithm challenges and opportunities in deep FAM, model-based methods have a severe drawback: they cannot keep other attribute-irrelevant details unchanged as supervised information only comes from the target images with desired attributes. In terms of extra condition-based FAM methods, on the one hand, attribute vector based algorithms need to work harder to manipulate attributes continuously, where interpolation schemes might be a solution worth considering. On the other hand, future reference exemplar-based algorithms are expected to generate more diverse attribute styles in more faithful and photorealistic face images.

Finally, from the perspective of applications in deep FAE, face images of the same person taken from different viewpoints might show different attributes. For instance, an attribute visible on the frontal face may not be apparent in the profile. This is called the attribute inconsistency issue. By filtering more confident images to make the prediction (Lu et al. 2018a), existing methods might neglect rich information in multi-view face images. Therefore, how to keep attributes of the same identity consistent while taking full advantage of multi-view information for feature learning is an important question for the future. Second, biometric verification (Hadid et al. 2007; Günther et al. 2013; Fathy et al. 2015; Samangouei et al. 2017; Trokielewicz et al. 2019) is a developing application for digital mobile devices to resist various attacks in the real world. Compared with full-face based biometric verification (Fathy et al. 2015; Günther et al. 2013), facial attributes contain more detailed characteristics and can better facilitate active authentication. The main obstacles lie in two aspects: the first is to introduce facial attributes into the task of active authentication appropriately and efficiently (Samangouei et al. 2017), and the second is to explore deep features and classifiers that trade off verification accuracy against mobile performance.

Regarding the application challenges and opportunities in deep FAM, facial makeup (Li et al. 2018c; Chang et al. 2018; Cao et al. 2019a) and aging (Suo et al. 2010; Nhan Duong et al. 2019; Liu et al. 2019) have become hot topics in computer vision. These two tasks focus more on subtle facial details related to makeup and age attributes. Due to their promising performance in mobile device entertainment and identity-relevant verification, they have become crucial study branches independent of general deep FAM methods and have shown significant potential to facilitate more practical applications (Hu et al. 2018; Lu et al. 2018b; Song et al. 2018a). In addition, contemporary deep FAM research only works well within a limited range of resolutions and under laboratory conditions. On the one hand, such a limitation makes the manipulation of high-resolution or low-quality face images in real-world applications more difficult; on the other hand, it provides an opportunity to integrate face super-resolution into attribute manipulation (Lu et al. 2018a; Dorta et al. 2018) in future research.

In addition, the relationships between deep FAE and FAM might contribute to improving both tasks. On the one hand, FAM is a vital scheme of data augmentation for FAE, where generated facial attribute images can significantly increase the amount of data to further relieve the overfitting issue. On the other hand, FAE can be a significant quantitative performance evaluation criterion for FAM, where the accuracy gap between real images and generated images can reflect the performance of deep FAM algorithms.

In this paper, we conduct an in-depth survey of facial attribute analysis based on deep learning, including FAE and FAM. The primary goal is to provide an overview of the two issues and to highlight their respective strengths and weaknesses, offering newcomers the necessary background. The remainder of this paper is organized as follows. In Sect. 2, we summarize a general two-stage pipeline that deep facial attribute analysis follows, including data preprocessing and model construction. The corresponding preliminary theories are also introduced for both FAE and FAM. In Sect. 3, we list commonly used publicly available facial attribute datasets and metrics. Sections 4 and 5 provide detailed overviews of state-of-the-art deep FAE and FAM methods, as well as their advantages and disadvantages, respectively. Additional related issues, as well as challenges and opportunities, are discussed in Sects. 6 and 7, respectively. Finally, we conclude this paper in Sect. 8.

Fig. 6 Two-stage pipeline of deep facial attribute analysis (face images above come from Li et al. (2018b), Günther et al. (2017), He et al. (2019), Liu et al. (2015))

2 Facial Attribute Analysis Preliminaries

Deep facial attribute analysis follows a general pipeline consisting of two stages: data preprocessing and model construction, as shown in Fig. 6.

In this section, we first introduce two commonly used data preprocessing strategies for both FAE and FAM, including face detection and alignment, as well as data augmentation. Second, we introduce the general processes of model construction for deep FAE and FAM, respectively. Specifically, we provide the basics about feature extraction and attribute classification, which are two crucial steps when designing deep FAE models. For deep FAM methods, we review the underlying theories of backbone networks, i.e., VAEs and GANs, as well as their corresponding conditional versions.

2.1 Data Preprocessing

2.1.1 Face Detection and Alignment

Before databases with richer facial attribute annotations were released, most attribute prediction methods (Zhang et al. 2014; Kumar et al. 2008; Gkioxari et al. 2015) took whole human images (faces and torsos) as inputs. Only a few well-marked facial attributes could be estimated, e.g., smiling, gender, and glasses. However, torso regions contain considerable face-irrelevant information, resulting in redundant computations. Hence, face detection and alignment become crucial steps to locate face areas and reduce the adverse effects of facial attribute-irrelevant areas.

For face detection, Ranjan et al. (2017) first recognize the gender attribute with a HyperFace detector that locates faces and landmarks, and Günther et al. (2017) further extend this approach to predict 40 facial attributes simultaneously with the same HyperFace detector. In contrast, Kumar et al. (2008) use a poselet part detector (Bourdev and Malik 2009) to detect different parts corresponding to different poses, where the face is an important part of the whole person image. Compared with the poselet detector operated over conventional features, Gkioxari et al. (2015) propose a ‘deep’ version of the poselet, which trains a sliding window detector operated on deep feature pyramids. Specifically, the deep poselet detector divides the human body into three parts (head, torso, and legs) and clusters fiducial key points of each part into many different poselets. However, because existing face detectors only find rough facial regions, facial attributes in more subtle areas, such as eyebrows, cannot be well predicted.

For facial alignment, well-aligned face databases with fiducial key points can alleviate the adverse effects of misalignment errors on both FAE and FAM, since more specific facial regions of attributes can be located through these key points. The All-in-One Face algorithm (Ranjan et al. 2017) can be utilized to obtain fiducial key points and full faces. Based on this algorithm, Mahbub et al. (2018) divide a face into 14 segments related to different facial regions and address attribute prediction in partial face images. Kumar et al. (2008) artificially divide a face into 10 functional parts including hair, forehead, eyebrows, eyes, nose, cheeks, upper lip, mouth, and chin. These facial areas are wide and robust enough to address discrepancies among individual faces, and the geometric characteristics shared by different faces can be well exploited.

Recently, researchers have tended to integrate face detection and alignment into the training process of facial attribute analysis. He et al. (2017) take face detection as a special case of general semi-rigid object detection and design joint network architectures to ensure the performance improvement in both face detection and attribute estimation. More importantly, this approach can handle in-the-wild input images with complex illumination and occlusions, and no extra cropping and aligning operations are needed. Ding et al. (2018) propose a cascade network to locate face regions according to different attributes and perform FAE simultaneously with no need to align faces (Günther et al. 2017). Li et al. (2018b) design an AFFAIR network for learning a hierarchy of spatial transformations and predicting facial attributes without landmarks. In summary, integrating face detection and alignment into the network training process is becoming a beneficial research trend.

2.1.2 Data Augmentation

For most face processing tasks, data augmentation is a vital strategy for solving the problems of insufficient training data and overfitting in deep learning. Face attribute analysis is not an exception. By imposing perturbations and distortions on the input images, data can be extended to improve deep learning models.

Günther et al. (2017) propose an alignment-free facial attribute classification technique (AFFACT) with data augmentation. More specifically, AFFACT leverages shifts, rotations, and scales of images to make facial attribute feature extraction more reliable in both the training and testing stages. In the training stage, face images are first rotated, scaled, and cropped with defined coordinates, and horizontally flipped with 50% probability. Then, a Gaussian filter is applied to emulate smaller image resolutions and yield blurred upscaled images. In the testing stage, AFFACT first rescales the test images and then transforms these images into 10 crops, including a center crop, the four corners of the original images, and their horizontally flipped versions. Finally, AFFACT averages the per-attribute scores from the deep network over the ten crops to make the final prediction. In addition to taking crops, AFFACT also uses all combinations of shifts, scales, and angles, as well as their mirrored versions. All these data augmentation schemes contribute to the progressive performance of deep FAE models.
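To make the ten-crop averaging step concrete, the following sketch shows one way to implement it with torchvision; the model, crop size, and preprocessing are illustrative assumptions rather than AFFACT's exact settings.

```python
import torch
import torchvision.transforms as T

def ten_crop_scores(model, image, crop_size=224):
    """Average per-attribute scores over ten crops (center, four corners, and their
    horizontal mirrors), in the spirit of AFFACT-style test-time augmentation."""
    transform = T.Compose([
        T.Resize(256),
        T.TenCrop(crop_size),   # returns a tuple of 10 crops, mirrored versions included
        T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
    ])
    batch = transform(image)    # shape: (10, 3, crop_size, crop_size)

    model.eval()
    with torch.no_grad():
        scores = model(batch)   # shape: (10, num_attributes)
    return scores.mean(dim=0)   # averaged score per attribute
```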

2.2 Basis of FAE Model Construction

2.2.1 Feature Extraction

Deep convolutional neural networks (CNNs) play significant roles in learning discriminative representations and have achieved attractive performance in deep FAE. In general, arbitrary classical CNN networks, such as VGG (Parkhi et al. 2015) and ResNet (He et al. 2016b), can be used to extract deep facial attribute features. For example, Zhong et al. (2016a) directly apply FaceNet and VGG-16 networks to capture attribute features of face images.

Considering that features at different levels of the network might have different effects on the performance of deep FAE methods, Zhong et al. (2016b) take mid-level CNN features as an alternative to high-level features. The experiments demonstrate that even early convolutional layers achieve performance comparable to that of state-of-the-art methods on most facial attributes, and mid-level representations can yield improved results over high-level abstract features. The reason for this superiority is that mid-level features are not bound by the fixed connections between convolutional and fully connected (FC) layers; consequently, the CNN model can accept arbitrary receptive sizes for capturing rich information from face images.
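As a rough illustration of using mid-level activations, the snippet below truncates a pretrained torchvision VGG-16 at an intermediate convolutional stage; the truncation index and the weights enum are illustrative assumptions and presume a recent torchvision release.

```python
import torch
import torchvision.models as models

# Truncate a pretrained VGG-16 after the third pooling stage to obtain mid-level features
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
mid_extractor = backbone[:17]             # conv1_1 ... pool3 (index is illustrative)

images = torch.randn(8, 3, 224, 224)      # a dummy batch of face crops
with torch.no_grad():
    mid_features = mid_extractor(images)  # mid-level activations, here (8, 256, 28, 28)
```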

In addition to using or combining classical deep networks, several methods design customized network architectures for learning discriminative features. Lu et al. (2017) design an automatically constructed compact multi-task architecture, which starts with a thin multi-layer network and dynamically widens in a greedy manner. Belghazi et al. (2018) build a hierarchical generative model and a corresponding inference model through the adversarial learning paradigm.

2.2.2 Attribute Classification

Early methods learn feature representations with deep networks but make predictions with traditional classifiers, such as support vector machines (SVMs) (Cortes and Vapnik 1995; Bourdev et al. 2011), decision trees (Luo et al. 2013), and the k-nearest neighbor (kNN) classifier (Huang et al. 2016, 2019). For example, Kumar et al. (2009) train multiple SVMs (Cortes and Vapnik 1995) with radial basis function (RBF) kernels to predict multiple attributes, where each SVM corresponds to one facial attribute. Bourdev et al. (2011) present a feedforward classification system with linear SVMs and classify attributes at the image patch level, the whole image level, and the semantic relationship level. Luo et al. (2013) construct a sum-product decision tree network to yield facial attribute region locations and classification results simultaneously. Huang et al. (2016, 2019) adopt the kNN algorithm to solve the class-imbalance attribute estimation problem.

In terms of the classifiers based on deep learning, several convolutional layers followed by FC layers constitute a deep attribute classifier, which can be attached to the end of deep feature extraction networks to make the prediction. Then, the specific loss function is used to measure the discrepancy between the outputs of FC layers and the ground truths for reducing classification errors. Below, we introduce two commonly used loss functions for deep FAE models.

The most prevalent loss function is the sigmoid cross-entropy loss, which performs a binary classification for each attribute (Hand et al. 2017). For example, Hand et al. (2017) adopt the sigmoid cross-entropy loss to evaluate their network outputs and calculate the scores of all facial attributes. Besides, Rudd et al. (2016) treat multiple facial attribute classification as a regression problem and minimize the mean squared error (MSE) loss, i.e., the Euclidean loss, by mixing the errors of all attributes. In this way, multiple attribute labels can be obtained simultaneously via a single deep convolutional neural network (DCNN). In contrast, Rozsa et al. (2016) also adopt the Euclidean loss but train a set of DCNNs, where each network predicts one facial attribute. Although this set of DCNNs achieves higher prediction accuracy for facial attributes, it suffers from high computation and memory costs.

To explore the effects of different loss functions on deep facial attribute classifiers, Günther et al. (2017) test and compare the Euclidean loss and the sigmoid cross-entropy loss. The experiments over the same network but different loss functions demonstrate that the two loss functions are capable of achieving comparable performance for attribute estimation. Therefore, future researchers can choose either of these loss functions according to their tasks with little performance change.
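The two loss formulations can be written side by side for a 40-attribute classifier; the sketch below is a minimal PyTorch illustration of the general idea rather than the exact formulation of any particular paper, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

batch, num_attrs = 32, 40
outputs = torch.randn(batch, num_attrs)                    # raw network outputs, one per attribute
targets = torch.randint(0, 2, (batch, num_attrs)).float()  # binary ground-truth attribute labels

# Sigmoid cross-entropy: each attribute treated as an independent binary classification
bce_loss = F.binary_cross_entropy_with_logits(outputs, targets)

# Euclidean (MSE) loss: attribute estimation treated as regression towards the label values
mse_loss = F.mse_loss(outputs, targets)
```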

2.3 Basis of FAM Model Construction

2.3.1 Variational Autoencoder

In general, a variational autoencoder (VAE) has two components: the generator, which samples the variables x parameterized by \(\theta \) given the latent variables z, i.e., \({p_\theta }(x|z)\); and the encoder, which maps the variables x to the latent variables z that approximate a prior p(z), i.e., \({q_\phi }(z|x)\) parameterized by \(\phi \). The key to training a VAE is maximizing the variational lower bound \(\mathcal{L}_{VAE}\) (Huang et al. 2018a):

$$\begin{aligned} \begin{aligned} \mathcal{L}_{VAE} = {\mathbb {E}_{z \sim {q_\phi }(z|x)}}\log {p_\theta }\left( {x|z} \right) -{D_{KL}}\left( {{q_\phi }\left( {z|x} \right) ||p\left( z \right) } \right) , \end{aligned} \end{aligned}$$
(1)

where \({D_{KL}}\) denotes Kullback–Leibler divergence.
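A minimal sketch of how the lower bound in Eq. (1) is typically estimated in practice, assuming a Gaussian encoder and a Bernoulli decoder with a standard normal prior; this illustrates the general objective rather than any specific model's implementation.

```python
import torch
import torch.nn.functional as F

def vae_lower_bound(x, x_recon_logits, mu, logvar):
    """Single-sample estimate of Eq. (1): reconstruction log-likelihood minus the KL term,
    assuming q(z|x) = N(mu, diag(exp(logvar))) and p(z) = N(0, I); x is expected in [0, 1]."""
    recon = -F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # D_KL(q(z|x) || N(0, I))
    return recon - kl    # the variational lower bound to be maximized during training
```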

For the conditional version of VAE, given the attribute vector y and latent representation z, it aims to build a model \({p_\theta }(x|y,z)\) for generating images x that contain desired attributes, taking y and z as conditional variables. This image generation task follows a two-step process: the first step is randomly sampling the latent variables z from the prior distribution p(z), and the second step is generating an image according to the given conditional variables. Hence, the variational lower bound of conditional VAE can be written as (Yan et al. 2016)

$$\begin{aligned} \begin{aligned} {\mathcal{L}_{CVAE}}&=\mathbb {E}_{z \sim {q_\phi }(z|x,y)}\log {p_\theta }\left( {x|y,z} \right) \\&\quad - {D_{KL}}\left( {{q_\phi }\left( {z|x,y} \right) ||p\left( z \right) } \right) , \end{aligned} \end{aligned}$$
(2)

where \({q_\phi }(z|x,y)\) is the true posterior from the encoder.

Table 1 An overview of facial attribute datasets

2.3.2 Generative Adversarial Network

A generative adversarial network (GAN) consists of two parts: the generator G and the discriminator D, where G attempts to synthesize data from a random vector z obeying a prior noise distribution \(z \sim p\left( z \right) \), and D attempts to discriminate whether data is from the realistic data distribution or from G. Given data \(x \sim {p_{data}}(x)\), G and D are trained in an adversarial manner with a min-max game as (Goodfellow et al. 2014)

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _G \;\mathop {\max }\limits _D \;{\mathcal{L}_{GAN}}&= {\mathbb {E}_{x \sim {p_{data}}\left( x \right) }}\log \left( {D\left( x \right) } \right) \\&\quad +\, {\mathbb {E}_{z \sim p\left( z \right) }}\log \left( {1 - D\left( {G\left( z \right) } \right) } \right) . \end{aligned} \end{aligned}$$
(3)

The conditional version of GAN is more frequently used, feeding the attribute vector y into both G and D in different ways. Specifically, the attribute vector y is concatenated with the input noise z (drawn from the prior p(z)) in the generator. Meanwhile, it is taken as an input along with x into the discriminative function. Therefore, the min-max game of the conditional GAN is denoted as (Mirza and Osindero 2014)

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _G \;\mathop {\max }\limits _D \;{\mathcal{L}_{CGAN}}&= {\mathbb {E}_{x \sim {p_{data}}\left( x \right) }}\log \left( {D\left( {x|y} \right) } \right) \\ {}&\quad +\, {\mathbb {E}_{z \sim p\left( z \right) }}\log \left( {1 - D\left( {G\left( {z|y} \right) } \right) } \right) . \end{aligned} \end{aligned}$$
(4)
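The following sketch illustrates one common way to optimize the objective in Eq. (4) by alternating discriminator and generator updates; the D and G interfaces and the non-saturating generator loss are illustrative assumptions, not a specific published implementation.

```python
import torch
import torch.nn.functional as F

def cgan_losses(D, G, x_real, y, z):
    """One step of the min-max game in Eq. (4); D and G are assumed to take the
    attribute vector y as an extra input."""
    x_fake = G(z, y)

    # Discriminator: maximize log D(x|y) + log(1 - D(G(z|y)|y))
    real_logits = D(x_real, y)
    fake_logits = D(x_fake.detach(), y)
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

    # Generator: commonly used non-saturating variant, maximize log D(G(z|y)|y)
    gen_logits = D(x_fake, y)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```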

3 Facial Attribute Analysis Datasets and Metrics

3.1 Facial Attribute Analysis Datasets

We present an overview of publicly available facial attribute analysis datasets for both FAE and FAM, including data sources, sample sizes, and test protocols. More details of these datasets are listed in Table 1.

FaceTracer dataset is an extensive collection of real-world face images collected from the internet. There are 15,000 faces with fiducial key points and 10 groups of attributes, where 7 groups of facial attributes are composed of 19 attribute values, and the remaining 3 groups describe the quality of images and the environment. This dataset provides the URL of each image to address privacy and copyright issues. In addition, FaceTracer takes 80% of the labeled data as training data and the remaining 20% as testing data with 5-fold cross-validation.

The Labeled Faces in the Wild (LFW) dataset consists of 13,233 images of cropped, centered frontal faces derived from Miller et al. (2007). This dataset is collected from 5749 people using online news sources, and 1680 people have two or more images. Kumar et al. (2009) first collect 65 attribute labels through Amazon Mechanical Turk (AMT) and then expand them to 73 attributes (Kumar et al. 2011). We denote these as LFW-65 and LFW-73 in Table 2. Liu et al. (2015) extract 40 attribute labels automatically by binarizing the corresponding label values in the LFW dataset, instead of labeling them manually. Moreover, they annotate 5 fiducial key points, leading to the LFWA dataset, half of which (6263 images) is used for training and the remainder for testing.

PubFig dataset is a large, real-world face dataset containing 58,797 images of 200 people collected from the internet under uncontrolled situations. Thus, this dataset covers considerable variations in poses, lights, expressions, and scenes. PubFig labels 73 facial attributes, as many as LFW-73, but includes fewer individuals. Besides, this dataset is divided into a development set and an evaluation set, containing 60 and 140 identities, respectively.

Celeb-Faces Attributes (CelebA) dataset is constructed by labeling images selected from Celeb-Faces (Sun et al. 2014), and it is a large-scale face attribute dataset covering large pose variations and background clutter. There are 10,177 identities and 202,599 face images with 5 landmark locations and 40 binary attribute annotations per image. In experiments, CelebA is partitioned into three parts: images of the first 8000 identities (160,000 images) for training, images of another 1000 identities (20,000 images) for validation, and the remainder for testing.

Berkeley Human Attributes dataset is collected from H3D (Bourdev and Malik 2009) dataset and PASCAL VOC 2010 (Wang et al. 2016) training and validation datasets, containing 8053 images centered on full bodies of persons. There are wide variations in poses, viewpoints, and occlusions. Thus, many existing methods that work on front faces do not perform well on this dataset. AMT is also used to provide labels for all 9 attributes by 5 independent annotators. The dataset partitions 2003 images for training, 2010 for validation and 4022 for testing.

Attribute 25K dataset is collected from Facebook, which contains 24,963 people split into 8737 training, 8737 validation and 7489 test examples. Since the images have large variations in viewpoints, poses and occlusions, not every attribute can be inferred from every image. For instance, we cannot label the wearing hat attribute when the head of the person is not visible.

Ego-Humans dataset draws images from videos that track casual walkers in New York City over two months, using the OpenCV frontal face detector and facial landmark tracking. What makes it different from other datasets is that it covers location and weather information by clustering GPS coordinates. Moreover, nearly five million face pairs, along with their same/not-same labels, are extracted under the constraints of temporal information and geolocations. Wang et al. (2016) manually annotate 2714 images, randomly selected from these five million images, with 17 facial attributes. For the testing protocol, 80% of the images are selected randomly for training and the remainder for testing.

University of Maryland Attribute Evaluation Dataset (UMD-AED) is collected from image searches taking 40 attributes as search terms, with HyperFace as the face detector (Ranjan et al. 2017). UMD-AED serves as an evaluation dataset and contributes to class-imbalance learning in deep facial attribute estimation. It is composed of 2800 face images labeled with a subset of the 40 attributes from CelebA and LFWA. Each attribute has 50 positive and 50 negative samples, which means that not every attribute is tagged in each image. In addition, compared with CelebA, which contains mostly frontal, high-quality, and posed images, UMD-AED comprises a large number of variations, e.g., distinct image quality, varying lights and poses, wide age ranges, and different skin tones. UMD-AED offers a much more unbiased metric for real-world data, and it can be used to evaluate whether attribute estimation models have learned discriminative feature representations.

Table 2 An overview of facial attributes

YouTube Faces Dataset (with attribute labels) The original YouTube Faces Dataset contains 3245 videos of 1595 celebrities with 620,000 frame images (Wolf et al. 2011) for face verification. Hand et al. (2018b) further extend it for video-based facial attribute prediction. They label the 40 CelebA attributes in the first of four frames taken from every video; the remaining three unlabeled frames are taken at one-third, two-thirds, and the end of each video, respectively. As a result, this dataset makes it possible to explore deep FAE methods with only weak labels. Ten-fold cross-validation is adopted as the protocol, and all testing experiments are conducted on the labeled frames of the testing splits, averaged over all 10 splits.

To provide a more comprehensive overview of all existing attribute labels, we list all the labels of the LFW-73 dataset, which has the maximum number of attributes, in Table 2. Different facial attribute datasets contain different subsets of these attribute annotations for deep FAE and FAM. Note that in Table 2, there are 40 facial attributes in the two commonly used facial attribute analysis datasets, i.e., CelebA and LFWA. The remaining 33 attributes are also labeled and used in other datasets, e.g., LFW with 65 attributes mentioned in Table 1.

3.2 Facial Attribute Analysis Metrics

3.2.1 Facial Attribute Estimation Metrics

Below, we list the frequently used metrics for FAE algorithms and provide detailed descriptions of these metrics in terms of definitions and formulas.

  • Accuracy and Error Rate (Acc and ER)

The classification accuracy and the error rate are the most commonly used measures for evaluating classification tasks. Facial attribute estimation is not an exception, and its accuracy can be defined as (Rudd et al. 2016)

$$\begin{aligned} Accuracy = \left( {\left( {{t_p} + {t_n}} \right) /\left( {{N_p} + {N_n}} \right) } \right) . \end{aligned}$$
(5)

where \(N_p\) and \(N_n\) denote the numbers of positive and negative samples, respectively, and \(t_p\) and \(t_n\) denote the numbers of true positives and true negatives (Huang et al. 2016). Meanwhile, the error rate can be defined as

$$\begin{aligned} Error\;rate = 1 - Accuracy. \end{aligned}$$
(6)
  • Balanced Accuracy and Error Rate (BAcc and BER)

When dealing with class-imbalanced data, the traditional classification accuracy is not appropriate due to the bias towards the majority class. Hence, a balanced classification accuracy is defined as (Rudd et al. 2016)

$$\begin{aligned} Balanced\;Accuracy = \frac{1}{2}\left( {{t_p}/{N_p} + {t_n}/{N_n}} \right) . \end{aligned}$$
(7)

Similarly, the balanced error rate can be defined as Balanced Error Rate \(=\)1 − Balanced Accuracy. When addressing the imbalance issue from the perspective of source and target distributions (Rudd et al. 2016), the balanced error rate is defined as

$$\begin{aligned} Balanced\;Error\;Rat{e^*} = \left( {{T^ +}\left( {{t_p}/{N_p}} \right) + {T^ -}\left( {{t_n}/{N_n}} \right) } \right) , \end{aligned}$$
(8)

where \(T^+\) and \(T^-\) denote the target domain distributions of positive and negative examples, respectively. The superscript \(*\) is used to indicate the balanced version of error rate. Besides, more details of the class-imbalance issue are introduced in Sect. 6.1.
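A small sketch of how Eqs. (5)–(7) can be computed for a single attribute from binary predictions; the array names and shapes are illustrative.

```python
import numpy as np

def attribute_accuracy(pred, label):
    """Accuracy and balanced accuracy for one attribute, following Eqs. (5)-(7);
    pred and label are {0, 1} arrays of predictions and ground truths."""
    t_p = np.sum((pred == 1) & (label == 1))
    t_n = np.sum((pred == 0) & (label == 0))
    n_p, n_n = np.sum(label == 1), np.sum(label == 0)

    accuracy = (t_p + t_n) / (n_p + n_n)
    balanced_accuracy = 0.5 * (t_p / n_p + t_n / n_n)
    # error rates are simply 1 - accuracy and 1 - balanced accuracy
    return accuracy, balanced_accuracy
```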

  • mean Average Precision (mAP)

As there is more than one label in multi-label image classification, the mean Average Precision (mAP) has become a prevalent metric (Yue et al. 2007; Philbin et al. 2007). AP combines recall and precision for a single class, computed as the average precision over the precision–recall curve from recall 0 to recall 1, and mAP is the mean of AP over a set of categories.
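For instance, mAP over 40 attributes can be obtained by averaging per-attribute AP values, here using scikit-learn's average_precision_score as one common implementation; the score and label arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import average_precision_score

scores = np.random.rand(1000, 40)             # placeholder prediction scores per attribute
labels = np.random.randint(0, 2, (1000, 40))  # placeholder binary ground-truth labels

ap_per_attribute = [average_precision_score(labels[:, k], scores[:, k])
                    for k in range(labels.shape[1])]
mAP = float(np.mean(ap_per_attribute))        # mean of per-attribute AP values
```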

3.2.2 Facial Attribute Manipulation Metrics

There are two types of measurements in deep FAM: qualitative metrics and quantitative metrics, where the former evaluates the performance of generated images through statistical surveys, and the latter measures the preservation degree of the face detail related information after attribute manipulation. We provide more detailed descriptions of these two types of metrics below.

  • Qualitative Metrics

Statistical survey is the most intuitive way to qualitatively evaluate the quality of generated images in most generative tasks. By establishing specific rules in advance, subjects vote for generated images with appealing visual fidelity, and then researchers draw conclusions according to the statistical analysis of the votes. For example, Choi et al. (2018) evaluate the perceptual quality of generated images in a survey format via AMT (see footnote 1). Given an input image, the workers are required to select the best generated image according to instructions based on perceptual realism, quality of attribute manipulation, and preservation of the original identity. Each worker is asked a fixed number of validation questions to verify human effort.

Zhang et al. (2017b) conduct a statistical survey that asks volunteers to choose the better result from their proposed CAAE or existing works. Sun et al. (2018c) instruct volunteers to rank several deep FAM approaches based on perceptual realism, quality of transferred attributes, and preservation of personal features. Then, they calculate the average rank (between 1 and 7) of each approach. Lample et al. (2017) perform a quantitative evaluation on two different aspects: the naturalness measuring the quality of generated images, and the accuracy measuring the degree of swapping an attribute reflected in the generation.

  • Quantitative Metrics

Distribution difference measure calculates the differences between real images and generated face images. Xiao et al. (2018) achieve this goal by the Fréchet inception distance (Heusel et al. 2017) (FID) with the means and covariance matrices of two distributions before and after editing facial attributes. Wang et al. (2018) compute the peak signal to noise ratio (PSNR) to measure the pixel-level differences. They also calculate the structure similarity index (SSIM) and its multi-scale version MS-SSIM (Wang et al. 2004) to estimate the structure distortion and the identity distance. All these measurements contribute to evaluating the high-level similarity of two face images. In addition, He et al. (2019) use an Inception-ResNet (Szegedy et al. 2017) to train a face recognizer for measuring the identity preservation ability with rank-1 recognition accuracy. Therefore, face identity preservation is becoming a promising metric because it can indicate whether models have excellent performance in preserving facial details outside of manipulated attributes.
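As an example of a pixel-level metric, PSNR between an original and a manipulated image can be computed directly from the mean squared error; the sketch below assumes 8-bit images and is only meant to illustrate the definition. SSIM, MS-SSIM, and FID are usually computed with existing library implementations rather than reimplemented from scratch.

```python
import numpy as np

def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio between two images of the same shape (8-bit assumed)."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```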

Table 3 State-of-the-art deep facial attribute estimation approaches
Fig. 7 The illustration of deep part-based FAE methods (images are from Ding et al. (2018))

Facial landmark detection gain uses the accuracy gain of landmark detection before and after attribute editing to evaluate the quality of synthesized images. For example, He et al. (2016a) adopt an ERT method (Kazemi and Sullivan 2014), a landmark detection algorithm trained on the 300-W dataset (Sagonas et al. 2013). During testing, they divide the test sets into three components: the first containing images with positive attribute labels, the second containing images with negative labels, and the last containing the manipulated images from the first part. Then, the average normalized distance error is computed to evaluate the discrepancy of landmarks between the generated images and the ground truths.
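A minimal sketch of a normalized landmark distance error of the kind used in this evaluation, assuming 68-point annotations and inter-ocular normalization; the eye-corner indices are illustrative assumptions rather than the exact protocol of He et al. (2016a).

```python
import numpy as np

def normalized_landmark_error(pred_pts, gt_pts, left_idx=36, right_idx=45):
    """Mean landmark distance normalized by the inter-ocular distance; pred_pts and
    gt_pts are (num_landmarks, 2) arrays, and the eye-corner indices are illustrative."""
    per_point = np.linalg.norm(pred_pts - gt_pts, axis=1)              # per-landmark errors
    inter_ocular = np.linalg.norm(gt_pts[left_idx] - gt_pts[right_idx])
    return per_point.mean() / inter_ocular
```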

Facial attribute estimation constructs additional attribute prediction networks to measure the performance of FAM according to the classification accuracy. Perarnau et al. (2016) first design an Anet to predict facial attributes of the manipulated face images. If the outputs of the Anet are close to the desired attribute labels, the generator can be considered to have satisfactory generation performance. Larsen et al. (2016) train a regressor attribute prediction network to calculate the attribute similarity between the conditional attributes and the generated attributes. Note that the FAE models used for this evaluation are independent of FAM's training processes, which means that they have to be well trained in advance and achieve reliable accuracy over all facial attributes.

4 State-of-the-Art Facial Attribute Estimation Methods

Generally, state-of-the-art deep FAE methods can be divided into two main categories: part-based methods and holistic methods. In this section, we provide detailed introductions to these two types of methods in terms of algorithms, performance, as well as their respective advantages and disadvantages. The overview is provided in Table 3.

4.1 Part-Based Deep FAE Methods

As shown in Fig. 7, part-based deep FAE methods first locate the areas where facial attributes exist through localization mechanisms. Then, features corresponding to different attributes on each highlighted position can be extracted and further predicted with multiple attribute classifiers. Hence, the key of part-based methods lies in the localization mechanism. In light of this point, part-based deep FAE methods can be further divided into two subgroups: separate auxiliary localization based methods and end-to-end localization based methods. Corresponding details are provided as follows.

4.1.1 Separate Auxiliary Localization Based Methods

Since facial attributes describe subtle details of face representations based on human vision, locating the positions of facial attributes encourages subsequent feature extractors and attribute classifiers to focus more on attribute-relevant regions. The most intuitive approach is to take existing face part detectors as auxiliaries.

Poselet (Bourdev and Malik 2009; Bourdev et al. 2011) is an effective part detector that describes a part of the human pose under a given viewpoint. Because these parts include evidence from different areas of the body at different scales, complementary information can be learned to benefit attribute prediction. Typically, given a whole person image, the poselet detector (Zhang et al. 2014) is first used to decompose the image into several image patches, named poselets, under various viewpoints and poses. A PANDA network is then proposed to train a set of CNNs for each poselet and the whole image, and the features from all these poselets are concatenated to yield the final feature representations. Finally, PANDA branches out into multiple binary classifiers, each of which recognizes one attribute. Based on PANDA, Gkioxari et al. (2015) introduce a deep version of the poselet detector and build a feature pyramid, where each level computes a prediction score for the corresponding attribute.

However, the poselet detector only discovers coarse body parts and cannot explore subtle local details of face images. Considering that the probability of an attribute appearing in a face image is not uniform in the spatial domain, Kalayeh et al. (2017) propose employing semantic segmentation as a separate auxiliary localization scheme. They exploit the location cues obtained by semantic segmentation to guide the attention of attribute prediction to the naturally occurring areas of attributes. Specifically, a semantic segmentation network is first designed in an encoder-decoder paradigm and trained on the Helen face dataset (Le et al. 2012). During this process, semantic face parsing (Smith et al. 2013; Lu et al. 2018b) is performed as an additional task to learn detailed pixel-level location information. After discovering the location cues, the semantic segmentation based pooling (SSP) and gating (SSG) mechanisms are presented to integrate the location information into attribute estimation. SSP decomposes the activations of the last convolutional layer into different semantic regions and then aggregates only those activations that reside in the same region. Meanwhile, SSG gates the output activations between the convolutional layers and the batch normalization (BN) operation to control the activations of neurons from different semantic regions.
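The core idea of semantic segmentation based pooling can be sketched as masking feature activations with soft region masks and aggregating within each region; this is a simplified illustration of the mechanism, not the authors' implementation.

```python
import torch

def semantic_region_pooling(features, region_masks):
    """Aggregate activations separately within each semantic region.
    features: (B, C, H, W) convolutional activations;
    region_masks: (B, R, H, W) soft segmentation masks for R face regions.
    Returns (B, R, C) pooled features, one vector per region."""
    masked = features.unsqueeze(1) * region_masks.unsqueeze(2)   # (B, R, C, H, W)
    area = region_masks.sum(dim=(2, 3)).clamp(min=1e-6)          # (B, R) region sizes
    return masked.sum(dim=(3, 4)) / area.unsqueeze(-1)           # (B, R, C)
```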

In contrast, Mahbub et al. (2018) utilize key points to segment faces into several image patches, which is a more straightforward way compared with semantic segmentation. Then, these segments are fed into a set of facial segment networks to extract corresponding feature representations and learn prediction scores, where the whole face image is fed into a full-face network. A global predictor network fuses the features from these segments, and two committee machines merge their scores for the final prediction.

Compared with the above methods that search for location clues of attributes directly, He et al. (2018a) resort to synthesized facial abstraction images that contain local facial parts and texture information to achieve the same goal indirectly. A designed GAN is used to generate facial abstraction images before inputting them into a dual-path facial attribute recognition network, into which the real original images are also fed. The dual-path network propagates the feature maps from the abstraction sub-network to the real original image sub-network and concatenates the two types of features for the final prediction. Despite the abundant location and texture information contained in the generated facial abstraction images, their quality may have a significant impact on performance, especially when some attribute-related information is lost in the image abstraction.

Note that all the separate auxiliary localization based deep FAE methods share a common drawback: they rely too much on accurate facial landmark localization, face detection, facial semantic segmentation, face parsing, or facial partition schemes. If these localization strategies are imprecise or landmark annotations are unavailable, the performance of the subsequent attribute estimation task would be significantly affected.

4.1.2 End-to-End Localization Based Methods

Compared with the separate auxiliary localization based methods that locate attribute regions and make the attribute prediction separately and independently, end-to-end localization based methods jointly exploit location cues where facial attributes appear and predict their presence in a unified framework.

Fig. 8 The illustration of deep holistic FAE methods (face image comes from Ding et al. (2018))

Liu et al. (2015) first propose a cascaded deep learning framework for joint face localization and attribute prediction. Specifically, the cascaded CNN is made up of an LNet and an ANet, where the LNet locates the entire face region and the ANet extracts the high-level face representation from the located area. LNet is first pretrained by classifying massive general object categories to ensure excellent generalization capability, and then it is fine-tuned using the image-level attribute tags of training images to learn features for face localization in a weakly supervised manner. Note that the main difference between LNet and separate auxiliary localization based methods is that LNet does not require face bounding boxes or landmark annotations. Meanwhile, ANet is first pretrained by classifying massive face identities to handle the complex variations in unconstrained face images, and then it is fine-tuned to extract discriminative facial attribute representations. Furthermore, rather than extracting features patch-by-patch, ANet introduces an interweaved operation with locally shared filters to extract multiple feature vectors in a one-pass feed-forward process. Finally, SVMs are trained over these features to estimate attribute values per attribute, and the final prediction is made by averaging all these values to address small misalignments of face localization. The cascaded LNet and ANet framework shows the benefit of pretraining with massive object categories and massive identities for enhancing feature representation learning. With such customized pretraining schemes and a cascaded network architecture, this method exhibits outstanding robustness to backgrounds and face variations.

However, coarse entire face regions discovered by LNet cannot be used to explore more local attribute details. Hence, Ding et al. (2018) propose a cascade network to jointly locate facial attribute-relevant regions and perform attribute classification. Specifically, they first design a face region localization network (FRL) that builds a branch for each attribute to automatically detect a corresponding relevant region. Then, the following parts and whole (PaW) attribute classification network selectively leverages information from all the attribute-relevant regions for the final estimation. Moreover, in terms of the attribute classification, Ding et al. define two FC layers: the region switch layer (RSL) and the attribute relation layer (ARL). The former selects the relevant prediction sub-network and the latter models attribute relationships. In summary, the cascaded FRL and PaW model not only discovers semantic attribute regions but also explores rich relationships among facial attributes. Besides, since this model automatically detects face regions, it can achieve outstanding performance on unaligned datasets without any pre-alignment step.

Note that the FRL-PaW method learns a location for each attribute, which makes the training process redundant and time-consuming because several facial attributes generally exist in the same area. However, to the best of our knowledge, there is currently no specific solution for this issue. We expect future research to reduce computation costs while still predicting attributes from their locations as accurately as possible.

In summary, part-based deep FAE methods first locate the positions where facial attributes appear. Two strategies can be adopted: separate auxiliary localization and end-to-end localization. The former leverages existing part detectors or auxiliary localization-related algorithms, and the latter jointly exploits the locations where facial attributes exist and predicts their presence. Compared with the separate auxiliary localization based methods, which operate separately and independently, end-to-end localization based methods locate and predict in a unified framework. After obtaining the location clues, features corresponding to certain attribute areas can be extracted and fed into attribute classifiers to make the estimation. Recently, researchers have been more inclined to shift their focus to holistic FAE algorithms, since the part-based counterparts are easily distracted and affected by attribute localization mechanisms.

4.2 Holistic Deep FAE Methods

In contrast to part-based FAE approaches that detect and utilize facial components, holistic deep FAE methods focus more on exploring the attribute relationships and extracting features from entire face images rather than facial parts. A schematic diagram of holistic FAE models is provided in Fig. 8.

As shown in Fig. 8, the key to modeling attribute relationships is learning common features at low-level shared layers and capturing attribute-specific features at high-level separated layers. Each separated layer corresponds to an attribute group. In general, these attribute groups are obtained manually according to semantics or attribute locations. By assigning different shared layers and attribute-specific layers, complementary information among multiple attributes can be discovered such that more discriminative features can be learned for the following attribute classifiers.

In general, there are two crucial issues that holistic deep FAE methods need to address when designing network architectures: (1) how to properly assign shared information and attribute-specific information at different layers of networks, and (2) how to explore relationships among facial attributes for learning more discriminative features. Taking these two problems as the main focus, we provide a brief review of holistic FAE methods in the following parts.

To the best of our knowledge, MOON (Rudd et al. 2016) is one of the earliest holistic FAE methods with the multi-task framework. It has a mixed objective optimization network that learns multiple attribute labels simultaneously via a single DCNN. MOON takes deep FAE as a regression problem for the first time and adopts a 16-layer VGG network as the backbone network, in which abstract high-level features are shared before the last FC layer. Multiple prediction scores are calculated with the MSE loss to reduce the regression error. Similarly, Zhong et al. (2016b) replace the high-level CNN features in MOON with mid-level features to identify the best representation for each attribute.

In contrast to splitting networks at the last FC layer, the multi-task deep CNN (MCNN) (Hand et al. 2017) branches out to multiple groups at the mid-level convolutional layers for modeling the attribute correlations. Specifically, based on the assumption that many attributes are strongly correlated, MCNN divides all 40 attributes into 9 groups according to semantics, i.e., gender, nose, mouth, eyes, face, around head, facial hair, cheeks, and fat. For example, big nose and pointy nose are grouped into the ‘nose’ category, and big lips, lipstick, mouth slightly open and smiling are clustered into the ‘mouth’ category. Therefore, each group consists of similar attributes and learns high-level features independently. At the first two convolutional layers of MCNN, features are shared by all attributes. Then, MCNN branches out several forks corresponding to different attribute groups. That means each attribute group occupies a fork. At the end of the network, an FC layer is added to create a two-layer auxiliary network (AUX) to facilitate attribute relationships. AUX receives the scores from the trained MCNN and yields the final prediction results. Hence, MCNN-AUX models facial attribute relationships in three ways: (1) sharing the lowest layers for all attributes, (2) assigning the higher layers for spatially related attributes, and (3) discovering score-level relationships with the AUX network.
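
The branching idea can be sketched as follows; the group names, group sizes, and layer dimensions below are hypothetical simplifications rather than the original MCNN configuration:

```python
# A simplified sketch of shared low-level layers plus per-group branches,
# with a small score-level layer in the spirit of the AUX network.
import torch
import torch.nn as nn

ATTRIBUTE_GROUPS = {"eyes": 5, "mouth": 4, "nose": 2}    # hypothetical group sizes

class GroupedAttributeNet(nn.Module):
    def __init__(self, groups=ATTRIBUTE_GROUPS):
        super().__init__()
        self.shared = nn.Sequential(                      # low-level layers shared by all attributes
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(8))
        self.branches = nn.ModuleDict({                   # one fork per attribute group
            name: nn.Sequential(nn.Flatten(), nn.Linear(64 * 8 * 8, 256),
                                nn.ReLU(), nn.Linear(256, n))
            for name, n in groups.items()})
        total = sum(groups.values())
        self.aux = nn.Linear(total, total)                # score-level relation layer

    def forward(self, x):
        feats = self.shared(x)
        scores = torch.cat([b(feats) for b in self.branches.values()], dim=1)
        return self.aux(scores)                           # final per-attribute predictions
```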

However, MCNN has a significant limitation: shared information at low-level layers may vanish after network splitting. One solution to overcome this limitation is jointly learning shared and attribute-specific features at the same level rather than sequentially.

Therefore, Cao et al. (2018) design a partially shared structure based on MCNN, i.e., PS-MCNN. It divides all 40 attributes into 4 groups according to attribute positions, i.e., the upper group, middle group, lower group, and whole image group. Note that the entire partition process is performed by hand, and this artificial grouping strategy can be regarded as prior information based on human knowledge. The partially shared structure connects four attribute-specific networks (TSNets) corresponding to the four different groups of attributes and one shared network (SNet) sharing features among all the attributes. Specifically, each TSNet learns features for a specific group of attributes. Meanwhile, SNet shares informative features with each task. In terms of the connection mode between SNet and the TSNets, each layer of SNet receives additional inputs from the previous layers of the TSNets. Then, features from SNet are fed into the next layers of the shared and attribute-specific networks. At a certain level of PS-MCNN, both task-specific features and shared features are captured in different branches. In addition, shared features at a specific layer are closely related to the features of all of its previous layers. This connection mechanism contributes to informative shared feature representations.

Apart from attribute correlations, Han et al. (2017) introduce the concept of attribute heterogeneity. They note that individual attributes could be heterogeneous concerning data type and scale, as well as semantic meaning. In terms of data type and scale, attributes can be grouped into ordinal versus nominal attributes. For instance, attributes such as age and hair length are ordinal, whereas gender and race are nominal. Note that the main difference between ordinal and nominal attributes is that ordinal attributes have an explicit ordering of their variables, whereas nominal attributes generally have two or more classes without any intrinsic ordering among the categories. In terms of semantic meaning, attributes such as age, gender, and race describe the characteristics of the whole face, whereas pointy nose and big lips mainly describe the local characteristics of facial components. Hence, these two categories of attributes are heterogeneous and can be grouped into holistic versus local attributes for the prediction of different parts of a face image. Taking both attribute correlation and heterogeneity into consideration, Han et al. design a deep multi-task learning (DMTL) CNN to learn shared features of all attributes and category-specific features of heterogeneous attributes. The shared feature learning naturally exploits the relationships among attributes to yield discriminative feature representations, whereas the category-specific feature learning aims to fine-tune the shared features towards the optimal estimation of each heterogeneous attribute category.

Note that existing multi-task learning methods make no distinction between low-level and mid-level features for different attributes. This is unreasonable because features at different levels of the network may have different relationships. Besides, the above methods share features across tasks and split layers that encode attribute-specific features by hand-designed network architectures. Such a manual exploration of possible multi-task deep architectures is tedious and error-prone because the design space might be combinatorially large.

In light of this issue, Lu et al. (2017) present the automatic design of compact multi-task deep learning architectures, with no need to artificially discover possible multi-task architectures. The proposed network learns shared features in a fully adaptive way, where the core idea is incrementally widening the current design in a layer-wise manner. During the training process, the adaptive network starts with a thin multi-layer network (VGG16) and dynamically widens via a top-down layer-wise model widening strategy (Tropp et al. 2006). It decides with whom each task shares features in each layer, yielding corresponding branches in this layer. Finally, the number of branches at the last layer of the model is equal to that of the attribute categories to be predicted. Consequently, this training scheme considers both task correlations and the complexity of the model for facilitating task grouping decisions at each layer of the network. Therefore, the fully-adaptive network allows us to estimate multiple facial attributes in a dynamic branching procedure through its self-constructed architecture and feature sharing strategy.

To summarize, holistic methods take the entire face images as inputs and mainly work on exploring attribute relationships. Many methods design various network architectures to model the correlations among different attributes. The key to this idea is learning shared features at low-level layers and attribute-specific features at high-level layers. Thus, holistic FAE methods need to address two main problems: one is assigning different layers for learning corresponding features with different characteristics, and the other is learning more discriminative features through discovering attribute relationships under customized networks. What can be observed from contemporary research is that attribute grouping by hand has become a prevalent scheme in holistic FAE. We expect that automatic attribute grouping strategies will attract more attention in future work; such strategies should adaptively learn proper group partition criteria and adjust them according to the model's performance during training.

5 State-of-the-Art Facial Attribute Manipulation Methods

In this section, we provide an overview of model-based FAM methods and extra condition-based FAM methods in terms of algorithms, network architectures, advantages and disadvantages. The summary of this overview is provided in Table 4.

5.1 Model-Based Deep FAM Methods

Model-based methods map an image in the source domain to the target domain and then distinguish the generated target distribution from the real target distribution under the constraint of an adversarial loss. Therefore, model-based methods are highly task-specific and have excellent performance in yielding photorealistic facial attribute images.

Table 4 State-of-the-art facial attribute manipulation approaches

Li et al. (2016) first propose a DIAT model following the standard paradigm of model-based methods. DIAT takes unedited images as inputs to generate target facial images with an adversarial loss and an identity loss. The first loss ensures that the desired attributes are obtained, and the second encourages the generated images to have the same or a similar identity as the input images. Zhu et al. (2017) add an inverse mapping from the target domain to the source domain based on DIAT and propose CycleGAN, where the two mappings are coupled with a cycle consistency loss. This design is based on the intuition that if we translate from one domain to the other and back again, we should arrive back where we started. Based on CycleGAN, Liu et al. (2017) propose a UNIT model that maps the pair of corresponding images in the source and the target domains to the same latent representation in a shared latent space. Each branch from one of the domains to the latent space performs an analogous CycleGAN operation.
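
For reference, a minimal sketch of the cycle-consistency term is given below, assuming two mapping networks G_st (source to target) and G_ts (target to source) already exist:

```python
# A minimal sketch of cycle consistency: translating to the other domain and
# back should reproduce the input image in both directions.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(x_src, x_tgt, G_st, G_ts):
    # forward cycle: source -> target -> source
    rec_src = G_ts(G_st(x_src))
    # backward cycle: target -> source -> target
    rec_tgt = G_st(G_ts(x_tgt))
    return F.l1_loss(rec_src, x_src) + F.l1_loss(rec_tgt, x_tgt)
```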

However, all of the above methods directly operate on the entire face image. That means when a certain attribute is edited, the other relevant attributes may also be changed uncontrollably.

Therefore, to modify attribute-specific face areas and keep the other parts unchanged, Shen and Liu (2017) present learning residual images, which are defined as the difference between images before and after attribute manipulation. In this way, face attributes can be efficiently manipulated with modest pixel modification over the attribute-specific regions. They design a ResGAN consisting of two image transformation networks and a discriminative network to learn residual representations of desired attributes. Specifically, the two image transformation networks, denoted as \(G_0\) and \(G_1\), first take two images with opposite attributes as inputs in turn and then perform the inverse attribute manipulation operations to output residual images. Subsequently, the obtained residual images are added to the original input images, yielding the final outputs with manipulated attributes. In the end, all these images, i.e., the two original input images and the two images from the transformation networks, are fed into the discriminative network, which classifies these images into three categories: images generated from the two transformation networks, original images with positive attribute labels, and original images with negative attribute labels. Note that \(G_0\) and \(G_1\) constitute a dual learning cycle. Given an image with a negative attribute label, \(G_0\) synthesizes the desired attribute, and \(G_1\) removes the corresponding attribute that was generated by \(G_0\). Then, \(G_1\)'s output is expected to have the same attribute label as the original given image. The experiments demonstrate that such a dual learning process is beneficial for the generation of high-quality images, and that residual images enforce the attribute manipulation process to focus on the local areas where attributes show up. Therefore, ResGAN is able to generate attractive images, especially for local facial attributes.
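
The residual formulation can be summarized by a short sketch; the transformation-network interface and the sparsity weight are assumptions for illustration:

```python
# A sketch of residual-image manipulation: the transformation network predicts
# only a residual, which is added to the input so that edits stay local to
# attribute-relevant regions; a sparsity penalty keeps the residual small.
import torch

def manipulate_with_residual(transform_net, x, sparsity_weight=1e-4):
    residual = transform_net(x)                  # same spatial size as the input image
    edited = x + residual                        # final image with the edited attribute
    sparsity_penalty = sparsity_weight * residual.abs().mean()   # encourage local edits
    return edited, sparsity_penalty
```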

However, model-based methods can only edit one attribute per training process, with a dedicated set of model parameters, and the whole manipulation is only supervised by discriminating real from generated images with the adversarial loss. That means when multiple attributes need to be changed, multiple training processes are inevitable, resulting in significant time consumption and computation costs.

In contrast, manipulating facial attributes with extra conditions is a more prevalent approach since multiple attributes can be edited through a single training process. Hence, extra condition-based methods attract more attention from researchers, where extra attribute vectors and reference exemplars are taken as input conditions. Specifically, attribute vectors can be concatenated with the latent image codes to control facial attributes, whereas reference exemplars exchange specific attributes with the to-be-manipulated images in the image-to-image translation framework. More details about the extra condition-based deep FAM methods are introduced below.

5.2 Extra Condition-Based Deep FAM Methods

Deep FAM methods conditioned on extra attribute vectors alter desired attributes with given conditional attribute vectors, such as one-hot vectors indicating the presence of corresponding facial attributes. During the training process, the conditional vectors are concatenated with the to-be-manipulated images in latent encoding spaces. Moreover, conditional generative frameworks dominate the model construction of deep FAM. Various efforts have been made to edit facial attributes based on autoencoders (AEs), VAEs, and GANs.

Zhang et al. (2017b) propose a conditional adversarial autoencoder (CAAE) for age progression and regression. CAAE first maps a face image to a latent vector through an encoder. Then, the obtained latent vector, concatenated with an age label vector, is fed into a generator for learning a face manifold. The age label condition controls the age alteration, while the latent vector ensures that the personalized face features are preserved. Yan et al. (2016) introduce a conditional variational autoencoder (CVAE) to generate images from visual attributes. CVAE disentangles an image into foreground and background parts, where each part is combined with the defined attribute vector. Consequently, the quality of generated complex images can be significantly improved as the foreground areas attract more attention. Perarnau et al. (2016) propose an invertible conditional GAN (IcGAN) to edit multiple facial attributes by determining specific representations of generated images. Given an input image, IcGAN first learns a representation consisting of a latent variable and a conditional vector via an encoder. Then, IcGAN modifies the latent variable and conditional vector to regenerate the original input image through the conditional GAN (Mirza and Osindero 2014). In this way, by changing the encoded conditional vector, IcGAN can achieve arbitrary attribute manipulation.
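
A pattern shared by these conditional models can be sketched as follows; the encoder and decoder interfaces are assumptions, and the concrete architectures differ across CAAE, CVAE, and IcGAN:

```python
# A sketch of attribute-vector conditioning: the latent code from the encoder
# is concatenated with a (possibly modified) attribute vector before decoding.
import torch

def edit_attributes(encoder, decoder, image, new_attributes):
    z = encoder(image)                               # latent code, shape (B, latent_dim)
    cond = torch.cat([z, new_attributes], dim=1)     # attach the target attribute vector
    return decoder(cond)                             # image regenerated with the new attributes

# Usage idea: flip one entry of the predicted attribute vector (e.g., the
# hypothetical "eyeglasses" index) and decode to obtain the edited face.
```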

Apart from autoencoders, VAEs, GANs, and their variants, Larsen et al. (2016) combine the VAE and the GAN into a unified generative model, VAE/GAN. In this model, the GAN discriminator learns feature representations taken as the basis of the VAE reconstruction objective, which means that the VAE decoder and the GAN generator are collapsed into one by sharing parameters and joint training. Hence, this model consists of three parts: the encoder, the decoder, and the discriminator. By concatenating attribute vectors with features from these three components, VAE/GAN performs better than either plain VAEs or GANs.

Recently, taking the multiple attribute manipulation as a domain transfer task, Choi et al. (2018) propose a StarGAN to learn mappings among multiple domains with only a single generator and a discriminator trained from all domains. Each domain corresponds to an attribute and the domain information can be denoted by one-hot vectors. Specifically, the discriminator first distinguishes the real and the fake images and classifies the real images to their corresponding domains. Then, the generator is trained to translate an input image into an output image conditioned on a target domain label vector, which is generated randomly. As a result, the generator is capable of translating the input image flexibly. In summary, StarGAN takes the domain labels as extra supervision conditions. This operation makes it possible to incorporate multiple datasets containing different types of labels simultaneously.
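
One common way to realize such domain-label conditioning, sketched below under assumed tensor shapes, is to replicate the one-hot label spatially and concatenate it with the image as extra input channels:

```python
# A sketch of conditioning a single generator on a target domain label:
# the one-hot label is broadcast over the spatial dimensions and concatenated
# with the image before being fed to the shared generator.
import torch

def condition_on_domain(image, target_label):
    # image: (B, 3, H, W); target_label: (B, num_domains) one-hot vector
    b, _, h, w = image.shape
    label_map = target_label.view(b, -1, 1, 1).expand(-1, -1, h, w)
    return torch.cat([image, label_map], dim=1)      # (B, 3 + num_domains, H, W)
```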

However, all the above methods edit multiple facial attributes simultaneously by discretely changing multiple values of attribute vectors. None of them can alter facial attributes continuously.

In light of this, Lample et al. (2017) present a Fader network using continuous attribute values to modify attributes through sliding knobs, like faders on a mixing console. For example, one can gradually change the values of gender to control the transition process from man to woman. Fader network is composed of three components: an encoder, a decoder, and a discriminator. With an image-attribute pair as the input, Fader network first maps the image to the latent representation by its encoder and predicts the attribute vector by its discriminator. Then, the decoder reconstructs the image through the learned latent representation and the attribute vector. During testing, the discriminator is discarded, and different images with various attributes can be generated with different attribute values.
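
Test-time attribute sliding can be sketched as follows; the encoder and decoder interfaces and the attribute dimensionality are assumptions:

```python
# A sketch of continuous attribute control at test time with a Fader-style
# model (the discriminator is discarded): the same latent code is decoded with
# the chosen attribute value swept continuously from 0 to 1.
import torch

def slide_attribute(encoder, decoder, image, attr_index, steps=5, num_attrs=40):
    z = encoder(image)
    outputs = []
    for value in torch.linspace(0.0, 1.0, steps):
        attrs = torch.zeros(image.size(0), num_attrs, device=image.device)
        attrs[:, attr_index] = value                  # continuous "fader" position
        outputs.append(decoder(z, attrs))
    return outputs
```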

Note that all the above methods edit attributes over the whole face images. Hence, attribute-irrelevant details might also be changed. To address this issue, Zhang et al. (2018a) introduce the spatial attention mechanism into GANs to locate attribute-relevant areas and propose a SaGAN for manipulating facial attributes more precisely. SaGAN follows the standard adversarial learning paradigm, where a generator and a discriminator play a min-max game. To keep attribute-irrelevant regions unchanged, SaGAN’s generator consists of an attribute manipulation network (AMN) and a spatial attention network (SAN). Given a face image, SAN learns a spatial attention mask where attribute-relevant regions have non-zero attention values. In this way, the region where the desired attribute appears can be located. Then, AMN takes the face image and the attribute vector as inputs, yielding an image with the desired attribute in the specific region located by SAN.
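
The attention-based blending can be sketched as follows, with the AMN and SAN interfaces treated as assumptions:

```python
# A sketch of spatial-attention blending: edits from the manipulation network
# are applied only where the attention mask is active, so attribute-irrelevant
# pixels are copied from the input.
import torch

def attend_and_edit(amn, san, image, attribute_vector):
    edited = amn(image, attribute_vector)        # attribute manipulation network output
    mask = san(image)                            # values in [0, 1], non-zero on relevant regions
    return mask * edited + (1.0 - mask) * image  # untouched regions come from the input
```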

Rather than taking attribute vectors as extra conditions, deep FAM methods conditioned on reference exemplars consider exchanging specific attributes with the to-be-manipulated images in the image-to-image translation framework. Note that these reference images do not need to have the same identity as the original to-be-manipulated images, and all the generated attributes are present in the real world. In this way, more specific details that appear in the reference images can be explored to generate more realistic images.

Zhou et al. (2017) first design a GeneGAN to achieve basic reference exemplar-based facial attribute manipulation. Given an image, it is encoded into two complementary codes: attribute-specific codes and attribute-irrelevant codes. By exchanging the attribute-specific codes and preserving the attribute-irrelevant codes, desired attributes can be transferred from the reference exemplar image to the to-be-manipulated image.

Considering that GeneGAN only transfers one attribute in a single manipulation process, Xiao et al. (2018) construct an ELEGANT model that exchanges latent encodings to transfer multiple facial attributes by exemplars. Specifically, since all the attributes are encoded in the latent space in a disentangled manner, one can exchange a specific part of the encodings and manipulate several attributes simultaneously. Besides, residual image learning and multi-scale discriminators for adversarial training enable the proposed model to generate high-quality images with more delicate details and fewer artifacts. At the beginning of training, ELEGANT receives two sets of training images as inputs, i.e., a positive set and a negative set, which do not need to be paired. Then, an encoder is utilized to obtain the latent encodings of both the positive and negative images. If the i-th attribute is required to be transferred, the only step is to exchange the i-th element in the latent encodings of the positive and negative images. Once the encoding step is finished, ELEGANT constructs an image generator that consists of a decoder and the encoder from the previous step to decode the recombined latent encodings into images. Finally, two discriminators with identical network structures work at different scales to obtain manipulated attribute images.
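
The encoding exchange at the core of this scheme can be sketched as follows; the per-attribute layout of the latent code is an assumption for illustration:

```python
# A sketch of exchanging the i-th disentangled latent element between a
# positive and a negative image before decoding the recombined encodings.
import torch

def swap_attribute_codes(z_pos, z_neg, i):
    # z_pos, z_neg: (B, num_attrs, code_dim) latent encodings split per attribute
    z_pos_new, z_neg_new = z_pos.clone(), z_neg.clone()
    z_pos_new[:, i], z_neg_new[:, i] = z_neg[:, i], z_pos[:, i]   # swap only attribute i
    return z_pos_new, z_neg_new
```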

6 Additional Related Issues

6.1 Imbalance Learning in Facial Attribute Analysis

Face attribute data exhibits an imbalanced distribution in terms of different categories. It is normally called the class-imbalance issue, which means in a dataset, some of the facial attribute classes have a much higher number of samples than others, corresponding to the majority class and minority class (Haixiang et al. 2017), respectively. For example, the largest imbalance ratio between the minority and majority attributes in CelebA dataset is 1:43. Learning from such imbalanced facial attribute labels can lead to biased classifiers, which tend to favor the majority and fail to discriminate the features learned from the minority. Even in the extreme case, the learned classifiers can hardly identify the minority samples.

One typical scheme to solve this problem is using an assumed balanced target distribution to guide the imbalanced source distribution by weighting objective functions. MOON (Rudd et al. 2016) weights the back-propagation error in a cost-sensitive way. A probability is assigned to each class by counting the relative numbers of positive and negative samples for both source and target domains. Then, these probabilities could be used as weights to incorporate the distribution discrepancy into the loss function.
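
A simplified sketch of such distribution-guided weighting is given below; it is not the exact cost-sensitive scheme of MOON, and the weighting rule shown here is only one plausible instantiation:

```python
# A sketch of per-attribute loss weighting: weights for positive and negative
# samples are derived from the positive-sample fractions of the source
# (training) distribution and an assumed balanced target distribution.
import torch
import torch.nn.functional as F

def balanced_bce_loss(logits, labels, pos_frac_source, pos_frac_target=0.5):
    # logits, labels: (B, num_attrs); pos_frac_source: (num_attrs,) positive fractions
    w_pos = pos_frac_target / pos_frac_source.clamp(min=1e-6)
    w_neg = (1 - pos_frac_target) / (1 - pos_frac_source).clamp(min=1e-6)
    weights = labels * w_pos + (1 - labels) * w_neg          # per-sample, per-attribute weights
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)
```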

However, MOON overlooks the label imbalance over each batch, which means that the batch-wise training scheme of deep networks is not fully utilized. In light of this, AttCNN (Hand et al. 2018a) proposes a selective learning algorithm to address the distribution discrepancy at the batch level. If the original batch in the source domain has more positive samples and fewer negative samples than the target distribution, the selective learning algorithm resamples a random subset from the positive instances. Meanwhile, it proportionally weights the negative counterparts to match the target distribution. By aligning the distributions between the source and target domains in each batch, AttCNN yields the state-of-the-art class-imbalance attribute prediction performance.

In addition, another more frequently used scheme for class-imbalance learning is data resampling for deep FAE methods. Huang et al. (2016) adopt the resampling strategy, namely large margin local embedding (LMLE), and formulate a quintuple sampling term associated with the triple-header loss. LMLE enforces the preservation of locality across clusters and the discrimination between classes. Then, a fast cluster-wise kNN algorithm is executed, followed by a local large margin decision. In this way, LMLE learns embedded features that are discriminative enough without any possible local class imbalance. On this basis, Huang et al. (2019) further propose a rectified version of LMLE, i.e., cluster-based large margin local embedding (CLMLE). CLMLE designs a loss to preserve the inter-cluster margins both within and between classes. In contrast to LMLE enforcing the Euclidean distance on a hypersphere manifold, CLMLE adopts angular margins enforced between the involved cluster distributions and uses spherical k-means for obtaining K clusters with the same size, which contributes to better performance.

On the other hand, Dong et al. (2017) take an online regularization strategy to address the facial attribute based class-imbalance issue. In detail, they exploit batch-wise incremental hard mining on minority attribute classes and formulate a class rectification loss (CRL) based on the mined minority examples. For the hard mining strategy, they first define the profiles of hard positives and hard negatives for the minority class. Then, according to the predefined profiles and the current model, they select the K hard positives (or hard negatives) as the bottom-K (or top-K) scores on the minority class for a specific attribute. This process is executed at the batch level and incrementally over subsequent batches. Such batch-wise incremental hard mining gives CRL a strong class-imbalance learning ability and satisfactory attribute estimation performance.
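
The batch-wise hard mining step can be sketched as follows, with the profile definitions and scoring of CRL simplified:

```python
# A sketch of selecting hard examples within one batch for a minority
# attribute class: hard positives are the lowest-scored positives, hard
# negatives the highest-scored negatives.
import torch

def mine_hard_examples(scores, labels, k=5):
    # scores: (B,) predicted scores for one attribute; labels: (B,) in {0, 1}
    pos_scores = scores[labels == 1]
    neg_scores = scores[labels == 0]
    k_pos = min(k, pos_scores.numel())
    k_neg = min(k, neg_scores.numel())
    hard_pos = torch.topk(pos_scores, k_pos, largest=False).values   # bottom-K positives
    hard_neg = torch.topk(neg_scores, k_neg, largest=True).values    # top-K negatives
    return hard_pos, hard_neg
```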

6.2 Relative Attribute Ranking in Facial Attribute Analysis

Relative attribute learning aims to formulate functions to rank the relative strength of attributes (Chen et al. 2014), which can be widely applied in object detection (Fan et al. 2013), fine-grained visual comparison (Shi and Tao 2018), and facial attribute estimation (Li et al. 2018b). The general insight in this line of work is learning global image representations in a unified framework (Lampert et al. 2009; Parikh and Grauman 2011) or capturing part-based representations via pretrained part detectors (Bourdev et al. 2011; Sandeep et al. 2014; Zhang et al. 2014). However, the former ignores the localization of attributes, and the latter ignores the correlations among attributes. Consequently, both might degrade the performance of relative attribute ranking.

Xiao and Jae Lee (2015) first propose automatically discovering the spatial extent of relevant attributes by establishing a set of visual chains indicating the local and transitive connections. In this way, the locations of attributes can be learned automatically in an end-to-end way. Although no pretrained detectors are used, the optimization pipeline still contains several independent modules, resulting in a suboptimal solution.

To tackle this issue, Singh and Lee (2016) construct an end-to-end deep CNN for simultaneously learning features, localizations, and ranks of facial attributes with weakly supervised pair-wise images. Specifically, given pairs of training images ordered according to the relative strength of an attribute, two Siamese networks receive these images, where each takes one image of a pair as input and builds a single branch. Each branch contains two components: the spatial transformer network (STN), which generates image transformation parameters for localizing the most relevant regions, and the ranker network (RN), which outputs the predicted attribute scores. The qualitative experimental results on the LFW-10 dataset show excellent performance in attribute region localization and ranking accuracy.
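
The weakly supervised pairwise objective can be sketched with a standard margin ranking loss; the STN and ranker internals are omitted, and the margin value is an assumption:

```python
# A sketch of the pairwise ranking objective: the ranker's score for the image
# with the stronger attribute should exceed the score of the weaker one by a margin.
import torch

def pairwise_ranking_loss(score_strong, score_weak, margin=1.0):
    # scores are the ranker outputs of the two Siamese branches
    return torch.clamp(margin - (score_strong - score_weak), min=0.0).mean()
```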

To model the pair-wise relationships between images for multiple attributes, Meng et al. (2018) construct a graph model, where each node represents an image and edges indicate the relationships between images and attributes, as well as between images and images. The overall framework consists of two components: the CNN for extracting primary features of the node images, and the graph neural network (GNN) for learning the features of edges and the following updates. Thus, the relationships among all the images are modeled by a fully-connected graph over the learned CNN features. Then, a gated recurrent unit (GRU) takes the node and its corresponding information as inputs and outputs the updated node. As a result, the correlations among attributes can be learned by using information from the neighbors of each node, as well as by updating its state based on the previous state.

6.3 Adversarial Robustness in Facial Attribute Analysis

Adversarial images, which are generated by adding slight artificial perturbations derived from the network topology, training process, and hyperparameter variation, can be used as inputs of deep facial attribute analysis models. By classifying the original inputs correctly and misclassifying the adversarial inputs, the robustness of models can be improved. Szegedy et al. (2014) first show that neural networks can be induced to misclassify an image by carefully chosen perturbations that are imperceptible to humans. Following this work, the study of adversarial images has attracted increasing attention from researchers.

Rozsa et al. (2017) induce small artificial perturbations on existing misclassified inputs to correct the results of attribute classification. Specifically, the adversarial images are generated over a random subset of the CelebA dataset via the fast flipping attribute (FFA) technique. The FFA algorithm leverages the back-propagation of the Euclidean loss to generate adversarial images. During this process, it flips the binary decision of the deep network without ground-truth labels. Through the robustness analysis, FFA generates more adversarial examples than the existing fast gradient sign (FGS) method (Goodfellow et al. 2015) on the designed separate attribute networks (Rozsa et al. 2016). Moreover, the FFA algorithm is extended to an iterative version, namely iterative FFA, to enable its use on multi-objective networks, e.g., MOON (Rudd et al. 2016). The experiments demonstrate that the quality of adversarial examples from iterative FFA is more satisfactory than that of its base version, and iterative FFA can flip attribute prediction results more frequently. Despite the promising performance of these two types of FFA, several attributes still cannot be flipped on separately trained deep models.
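
For reference, the FGS baseline mentioned above can be sketched as follows; FFA itself flips a chosen binary attribute decision via the Euclidean loss and differs in detail:

```python
# A sketch of a fast gradient sign perturbation: a small step in the sign of
# the input gradient of the loss produces an adversarial version of the image.
import torch

def fgs_perturb(model, loss_fn, image, label, epsilon=0.01):
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), label)
    loss.backward()
    return (image + epsilon * image.grad.sign()).detach()   # perturbed input
```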

In addition, attribute anonymity, which conceals specific facial attributes that an individual does not want to share, is another adversarial robustness related task. When hiding corresponding attributes, the remaining attributes should be maintained, and the visual quality of images should not be damaged. Chhabra et al. (2018) achieve this basic target by adding adversarial perturbations to an attribute preservation set and an attribute suppression set. Consequently, the prediction of a specific attribute from the true category can be classified into a different target category.

In summary, the study of adversarial robustness contributes to improving the representational stability of current deep FAE algorithms. Additionally, driven by the attacks of adversarial examples, the robustness of deep facial attribute analysis models is moving in a promising direction.

7 Challenges and Opportunities

Despite the promising performance of many algorithms in deep facial attribute analysis, there are still several challenging issues that deserve more attention. On the other hand, these challenges also bring hopeful opportunities for the development of this field. Therefore, in this section, we discuss challenges and future opportunities for both deep FAE and FAM, from the perspectives of databases, algorithms, and real-world applications.

7.1 Discussion of Facial Attribute Estimation

7.1.1 Data

The development of deep neural networks makes FAE a data-driven task. That means large numbers of samples are required for training deep models to capture attribute-relevant facial details. However, contemporary studies suffer from insufficient training data. In this case, deep neural networks easily overfit the characteristics contained in a small number of images and show degraded performance. In the following, taking two commonly used datasets as examples (i.e., CelebA and LFWA), we analyze the data challenges that exist in current facial attribute databases from the perspectives of data sources, data quality, and imbalanced data, respectively.

First, from the perspective of data sources, CelebA collects face data and attribute labels from celebrities, and the samples of LFWA come from online news. There is no doubt that these databases are inherently biased and do not match the general data distributions in the real world. For example, the bald attribute corresponds to a small number of samples in CelebA, but in the real world, it is a common attribute among ordinary people. Hence, more complementary facial attribute datasets that cover more real-world scenarios and a wider range of facial attributes need to be constructed in the future. An earlier work (Wang et al. 2016) made an attempt to extract images from real-world outdoor videos, i.e., the Ego-Humans dataset. However, it contains mostly pedestrian attributes, and only several facial attributes are predicted. Nevertheless, we believe that this dataset provides an inspiring idea for collecting more facial attribute-relevant images from videos in real-world scenes (Wiles et al. 2018).

Furthermore, Hand et al. (2018b) have made the first attempt to estimate facial attributes in videos. They use weakly labeled data in the YouTube Faces Dataset (with attribute labels) to keep attribute prediction consistent and accurate in videos, by imposing a temporal coherence constraint and a motion-attention mechanism. The temporal coherence constraint ensures response invariability between video frames by transferring responses from labeled frames to unlabeled ones. Meanwhile, the motion-attention mechanism enforces their model to focus on face parts by exploring the motion relationship between labeled and unlabeled frames. On the one hand, this research highlights the importance of temporal and motion factors when designing video-based deep FAE models. On the other hand, it also expresses the expectation that new video datasets will be labeled with facial attributes in future studies.

Second, from the perspective of data quality, most faces in CelebA and LFWA are frontal and aligned images with high quality (Hand et al. 2018a). However, real-world data often contain low-quality, partially visible images with various illumination conditions and poses. Thus, attribute prediction models trained on these images can hardly learn representative features of real-world data. Therefore, we expect that more adequate real-world training data will become available to strengthen the estimation abilities of future attribute classifiers.

Finally, for CelebA, LFWA, or real-world face images, imbalanced data would induce attribute estimation models to pay more attention to learning the features of majority samples. Consequently, learned biased attribute classifiers could not identify the minorities in some extreme cases. Although many efforts have been made to solve this class-imbalance learning issue from the perspective of algorithms, as mentioned in Sect. 6.1, data support is still an urgent need.

Besides, the test datasets (i.e., target domains) may have different distributions from the training datasets (i.e., source domains). This is generally called the domain adaptation issue, which can be regarded as a distribution imbalance. That means once the source data have a particular property, the given target domain does not always follow the same pattern. Therefore, such a discrepancy between data distributions negatively impacts the generalization ability over unseen test data and leads to significant performance deterioration.

Therefore, on the one hand, we anticipate that more facial attribute images will be released so that discriminative features of majority and minority samples can be captured equally well despite class-imbalanced data. On the other hand, more algorithms are expected to be developed to solve the domain adaptation issue in attribute estimation.

7.1.2 Algorithms

As mentioned before, part-based deep FAE methods and holistic deep FAE methods develop in parallel. The former pays more attention to locating attributes, and the latter concentrates more on modeling attribute relationships. Below, we provide the main challenges from the perspective of algorithms and analyze the future trends for both types of methods.

For the part-based methods, earlier methods draw support from existing part detectors to discover facial components. However, these detected parts of faces are coarse and attribute-independent. They only distinguish the whole face from the other face-irrelevant parts, such as the background in an image. Considering that existing detectors are not customized for deep FAE, some researchers begin to seek help from other face-related auxiliary tasks, which focus more on facial details rather than the whole face. There are also some studies that utilize labeled key points to partition facial regions. However, well-labeled facial images are not always available in real-world applications, and the performance of auxiliary tasks would limit the accuracy of the downstream classification task.

We believe that an end-to-end strategy will dominate future part-based deep FAE algorithms, where the attribute-relevant regions and the corresponding predictions can be yielded in a unified framework (Fukui et al. 2019). Ding et al. (2018) have attempted to tackle this issue, but learning a region for each attribute is cumbersome and computationally expensive, because several attributes might appear in the same region of a face.

In addition, part-based methods show great superiority when dealing with data under in-the-wild environmental conditions, such as illumination variations, occlusions, and non-frontal faces. By learning the locations of different attributes, part-based methods integrate the information from non-occluded areas to predict attributes in occluded areas. Mahbub et al. (2018) address this issue by partitioning facial parts manually according to key points. However, such annotations are not always available, and integrating these non-occluded areas adaptively is becoming a future trend. Besides, Mahbub et al. (2018) test their model's attribute estimation performance on partial faces by adding occlusions artificially to the original databases, but this operation does not follow a standard test protocol. Therefore, the lack of data under in-the-wild conditions is still a challenge for training deep FAE networks in the wild.

For holistic methods, state-of-the-art approaches design networks with different architectures for sharing common features and learning attribute-specific features at different layers. However, these methods define attribute relationships by grouping attributes manually to design their networks, which can be taken as extra prior information. Since different individuals might give different attribute partitions according to locations or semantics, it is difficult to determine which facial attribute groups are suitable and optimal. Therefore, how to discover attribute relationships adaptively during training, without artificially given prior information, should be the focus of future work.

In addition, facial attributes have been taken as auxiliary and complementary information for many face-related tasks, such as face recognition (Kumar et al. 2009; Rudd et al. 2016; Taherkhani et al. 2018), face detection (Ranjan et al. 2017), and facial landmark localization (Zhuang et al. 2018). Kumar et al. (2009) first introduce the concept of ‘attribute’ to facilitate face verification by compact visual descriptions and low-level attribute features. In contrast, Rudd et al. (2016) utilize the mixed objective optimization network with the Euclidean loss to learn deep attribute features for promoting facial verification. Experiments illustrate that despite only 40 attributes being used, the work of Rudd et al. (2016) still performs better than that of Kumar et al. (2009), which extracts features of 73 facial attributes.

Apart from employing features learned by attribute prediction to assist face recognition, joint and incorporative learning of facial attribute relevant tasks can further enhance their respective robustness and performance by discovering complementary information. For example, considering the inherent dependencies of face-related tasks, Zhuang et al. (2018) design a cascaded CNN for simultaneously learning face detection, facial landmark localization, and facial attribute estimation under a multi-task framework to improve the performance of each task. They further attempt to perform joint face recognition and facial attribute estimation when taking the relationship between identities and attributes into account. Therefore, it is reasonable to believe that the combination of different face-related tasks is becoming a promising research direction due to the complementary relationships among them.

7.1.3 Applications

Various viewpoints of the same person pose a difficult challenge for maintaining identity-attribute consistency in deep FAE methods. On the one hand, such viewpoint diversification helps to learn richer features of the same person. On the other hand, images from different viewpoints might differ in attributes even for the same identity. For example, side face images might yield different prediction results from front face images for the high cheekbones attribute, as side face images do not emphasize this attribute.

Therefore, attribute inconsistency becomes a severe problem across various viewpoints of the same identity. Lu et al. (2018a) propose a probabilistic confidence criterion to address this inconsistency issue. Specifically, this criterion first extracts the most confident face image for each subject, and then it chooses the result corresponding to the highest confidence as the final prediction of each attribute for each subject. However, filtering the most confident image via such a criterion might not be the optimal strategy, because the features of images from all the different views are not fully exploited when making the estimation.

Nowadays, digital mobile devices contain considerable amounts of valuable personal information, such as bank accounts and private emails (Samangouei et al. 2017). These personal details make these devices the targets of various attacks. Hence, biological characteristics, such as fingerprints and irises (Trokielewicz et al. 2019), have been widely used as device passwords for further protecting the privacy information of users. This technique is called biometric verification. Recently, an increasing number of biometric verification based algorithms have emerged as a solution for continuous authentication on mobile devices. Many researchers have committed to designing active authentication algorithms based on face biometrics. For example, studies in Fathy et al. (2015), Günther et al. (2013), Hadid et al. (2007) detect faces through camera sensor images and further extract low-level features for the authentication of smartphone users.

Considering that facial attributes contain more detailed characteristics than the full face, we believe that facial attributes would bring new opportunities for biometric identification in real-world applications. Samangouei et al. (2017) have attempted the active authentication of mobile devices by facial attributes. A set of binary attribute classifiers are trained to estimate whether attributes are present in images of the current user in a mobile device. Consequently, the authentication can be implemented by comparing the recognized attributes with the originally enrolled attributes.

However, Samangouei et al. (2017) extract traditional features, such as the LBP feature, which are not task-specific for attribute estimation and are less discriminative than deep features. To some extent, these traditional features and SVM classifiers balance verification accuracy and mobile performance, whereas other methods with satisfactory performance might have tremendous computation or memory costs.

Therefore, future challenges mainly lie in two aspects. The first is to better apply facial attributes to mobile device authentication. The second is to explore more discriminative deep features and classifiers under the constraint of the trade-off between verification accuracy and mobile performance. Nevertheless, we expect that facial attributes will contribute to further advancing the progress of biometric verification on digital mobile devices.

7.2 Discussion of Facial Attribute Manipulation

7.2.1 Data

In this section, we start with the problems of current FAM databases and analyze the challenges and the opportunities related to data sources. Then, we express an expectation for the video data type, as we have done in the discussion of facial attribute prediction. Finally, taking the performance metrics into account, we believe that future deep FAM methods need to establish a unified standard for evaluating their experiment results.

First, in terms of data sources, note that almost all deep FAM algorithms are trained on the CelebA database, while very few of them also use the LFW dataset. The data sources are extremely inadequate, and the facial attributes that can be manipulated are considerably limited. Among the 40 annotated attributes, only several notable attributes [e.g., hair colors (Li et al. 2018a), glasses (Chen et al. 2016), and smiling (Xiao et al. 2018)] can be manipulated with satisfactory performance. Such limitations could cause a degradation in performance when manipulating various attribute types. Therefore, we expect that more high-quality facial attribute databases will be released and that more kinds of facial attributes will be manipulable in the future.

Second, from the perspective of the data type, FAM on video data has still not been studied. Manipulating video facial attributes requires models to yield lifelike details: as faces change across the frames of a video, models must still locate the to-be-manipulated areas precisely and keep the attribute manipulation consistent for the same identity. Nevertheless, this task is valuable in many real-world entertainment situations, such as beauty makeup videos, where the hair colors in the videos might be varied according to users’ preferences. However, to date, there is no large-scale video dataset available for training video-based attribute manipulation models. The possible reasons might be that it is difficult to track and annotate facial attributes in large-scale videos due to spatial and temporal dynamics (Saito et al. 2017), and that the quality of video data could have significant effects on such a synthesis task. We expect that the focus will shift to collecting and annotating video data with facial attributes to further promote the video-based deep FAM task.

Finally, from the perspective of performance metrics, as mentioned in Sect. 3, contemporary research either evaluates generated images by statistical surveys or seeks help from other face-related tasks, such as attribute estimation and landmark detection. Unified and standard metric systems have not yet been established in terms of qualitative and quantitative analyses. We expect that the metrics of deep FAM methods will be further developed and that a relatively unified evaluation protocol will be established in the future.

7.2.2 Algorithms

State-of-the-art deep FAM methods can be grouped into two categories: model-based methods and extra condition-based methods. Model-based methods tackle an attribute domain transfer issue and use the adversarial loss to supervise the process of image generation. Extra condition-based methods alter desired attributes with given conditional attributes concatenated with to-be-manipulated images in encoding spaces. The main difference between the two types of methods is whether extra conditions are required.

Model-based methods take no extra conditions as inputs, and one trained model only changes one corresponding attribute. This strategy is task-specific and helps to generate more photorealistic images, but it is difficult to guarantee that attribute-irrelevant details remain unchanged because these methods operate directly on the whole image. Few methods focus on this issue, except for the ResGAN proposed by Shen and Liu (2017). However, ResGAN generates residual images for locating attribute-relevant regions under a sparsity constraint, and such a constraint relies heavily on control parameters rather than on the attributes themselves. Hence, how to design networks that synthesize the desired photorealistic attributes while keeping other attribute-irrelevant details unchanged remains a significant challenge for the future. In addition, as multi-domain transfer has become a hot research topic (Liu et al. 2018; Zhang 2018), we expect that these novel domain transfer algorithms will migrate to deep FAM methods and yield more appealing performance.

Extra condition-based methods take attribute vectors or reference exemplars as conditions. These algorithms edit facial attributes by changing the values of attribute vectors or the latent codes of reference exemplars. One advantage of this strategy is that multiple attributes can be manipulated simultaneously by altering multiple corresponding condition values. However, the concomitant disadvantage is also inevitable: these methods cannot change attributes continuously since the values of attribute vectors are edited discretely. We believe that this shortcoming can be addressed by interpolation schemes (Berthelot et al. 2019) or semantic component decomposition (Chen et al. 2019) in the future. In addition, as mentioned before, reference exemplar based algorithms are becoming a promising research direction, since more specific details that appear in reference images can be explored to generate more photorealistic images than merely altering attribute vectors manually.

7.2.3 Applications

Face makeup (Li et al. 2018c; Chang et al. 2018; Cao et al. 2019a) and face aging (Suo et al. 2010; Nhan Duong et al. 2019; Liu et al. 2019) are two hot topics in deep FAM related applications. They have played important roles in mobile device entertainment (e.g., beauty cameras) and identity-relevant face verification. Compared with general FAM, they focus on more subtle face attribute details. Face makeup concentrates on makeup-related attributes, such as the types of eyeshadows and the colors of lipsticks. The focus of these studies lies in facial makeup transfer and removal (Chang et al. 2018; Cao et al. 2019a), where makeup transfer aims to map one makeup style to another to generate different makeup styles (Li et al. 2018c), and makeup removal performs the opposite process, which cleans off the existing makeup and provides support to makeup-invariant face verification (Cao et al. 2019a). Face aging renders face images over a wide range of ages while keeping the identity information insusceptible. Hence, this task can not only be applied to digital entertainment but also provide support to social safety, such as fugitive searches and cross-age identity verification. The most crucial issue in face aging is that there are not sufficient paired images of the same person at different ages (Liu et al. 2019). Recently, the development of deep learning has led face makeup and face aging to promising results, and they have become important research branches independent of general deep FAM methods. We expect that the development of these two branches will bring a hopeful prospect for future real-world applications.

Besides, resolution limitation is another tough challenge in real-world facial manipulation. Existing methods only work well with a limited range of resolutions and under lab conditions. This limitation encourages combining face super-resolution with deep FAM algorithms. For example, Lu et al. (2018a) propose a conditional version of CycleGAN (Zhu et al. 2017) to generate face images under the guidance of attributes for face super-resolution. Specifically, conditional CycleGAN takes a pair of low/high-resolution faces and an attribute vector extracted from the high-resolution one as inputs. Conditioned on attributes of the original high-resolution image, this model learns to generate a high-resolution version of the original low-resolution image. Moreover, Dorta et al. (2018) apply smooth warp fields to GANs for manipulating face images with very high resolutions through a deep network at a lower resolution. All these schemes inspire researchers to integrate state-of-the-art face super-resolution methods into attribute manipulation for achieving a win-win situation.

7.3 Relationships Between FAE and FAM

In this section, we introduce the relationships between deep FAE and FAM. We believe the discussion about how the two tasks assist each other would guide future research to improve both algorithms.

For deep FAE, deep FAM can be taken as a vital scheme of data augmentation, where generated facial attribute images can significantly increase the amount of data used for training deep neural networks. Sufficient training data can reduce the risk of overfitting and further improve the prediction accuracy. Future work should focus on improving the quality of generated images and synthesizing as many facial attribute details as possible. In this way, generated images would better support the training of deep FAE models.

For deep FAM, the result of attribute estimation can be a significant quantitative performance evaluation criterion. The deep FAE network used for evaluation has to be well trained on real images in advance and has to provide an accuracy baseline for all real facial attributes. Then, it works on the generated facial attribute images and yields another prediction accuracy over manipulated attributes. As a result, the accuracy gap between real images and generated images can reflect the performance of deep FAM algorithms.

Although this mutual assistance builds a bridge between deep FAE and deep FAM methods, there are still some issues that need to be addressed for the two tasks. First, generated facial attribute images may not contain sufficiently delicate facial information. In other words, there is still a gap between real and generated images, which might damage the performance of attribute estimation. Hence, how to close this gap is an essential future research direction for data augmentation in deep facial attribute analysis. Second, the performance of attribute estimation directly affects the evaluation results of facial attribute manipulation. Therefore, how to balance the metric with the prediction performance is another challenge. We expect that deep FAE methods and deep FAM methods can strengthen their cooperation to significantly improve each other's performance in the future.

8 Conclusion

As one type of important semantic features describing the visual properties of face images, facial attributes have received considerable attention in the field of computer vision. The analyses targeting facial attributes, including facial attribute estimation (FAE) and facial attribute manipulation (FAM), have improved the performance of many real-world applications. This paper provides a comprehensive review of recent advances in both deep learning based FAE and FAM. The commonly used databases and metrics are summarized, and taxonomies of state-of-the-art methods for both issues have been created, together with their advantages and disadvantages. In addition, future challenges and opportunities are highlighted in terms of data, algorithms, and applications, respectively. We look forward to further studies that address these challenges and seize these opportunities to promote the development of deep facial attribute analysis.