1 Introduction

Face images provide crucial clues for human observation as well as computer analysis [20, 45]. However, the performance of most face image tasks, such as face recognition and facial emotion detection [11, 32], degrades dramatically when the resolution of a facial image is relatively low. Consequently, face super-resolution, also known as face hallucination, was introduced to restore a high-resolution face image from its low-resolution counterpart.

Fig. 1.

Visual comparison with state-of-the-art face hallucination methods (\(\times \)8). (a) 16 \(\times \) 16 LR input. (b) 128 \(\times \) 128 HR ground-truth. (c) Super-Resolution Convolutional Neural Network (SRCNN) [7]. (d) SRCNN incorporating our 3D facial priors. (e) Very Deep Super-Resolution Network (VDSR) [17]. (f) VDSR incorporating our 3D facial priors. (g) Very Deep Residual Channel Attention Network (RCAN) [42]. (h) Residual Dense Network (RDN) [43]. (i) Wavelet-based CNN for Multi-scale Face Super-Resolution (Wavelet-SRNet) [14]. (j) Progressive Face Super-Resolution using the facial landmark (PSR-FAN) [16]. (k) End-to-End Learning Face Super-Resolution with Facial Priors (FSRNet) [4]. (l) Our proposed method by embedding the 3D facial priors into the Spatial Attention Module (SAM3D).

Although a great number of deep learning methods [3, 5, 9, 24, 36,37,38,39, 44, 46, 47] have been successfully applied to face Super-Resolution (SR), super-resolving arbitrary facial images, especially at high magnification factors, remains an open and challenging problem due to the ill-posed nature of SR and the difficulty of learning and integrating strong priors into a face hallucination model. Several studies [4, 10, 16, 28, 35, 41] have exploited face priors to help neural networks capture more facial details. A face hallucination model incorporating identity priors was presented in [10]. However, the identity prior was extracted only from the multi-scale up-sampling results during training and therefore cannot provide extra priors to guide the network. Yu et al. [35] employed facial component heatmaps to encourage the upsampling stream to generate super-resolved faces with higher-quality details, especially for large pose variations. Kim et al. [16] proposed a face alignment network (FAN) for landmark heatmap extraction to boost the performance of face SR. Chen et al. [4] utilized heatmaps and parsing maps for face SR. Although these 2D priors provide global component regions, such methods cannot learn 3D reconstruction-based priors of detailed edges, illumination, and expression. In addition, all of these face SR approaches ignore facial structure and identity recovery.

In contrast to the aforementioned approaches, we propose a novel face super-resolution method that exploits 3D facial priors to grasp sharp face structures and identity knowledge. First, a deep 3D face reconstruction branch is set up to explicitly obtain 3D rendered face priors, which facilitate the face super-resolution branch. Specifically, the 3D facial priors contain rich hierarchical features, such as low-level (e.g., sharp edges and illumination) and perception-level (e.g., identity) information. Then, a spatial attention module adaptively integrates the 3D facial priors into the network, using a spatial feature transform (SFT) [34] to generate affine transformation parameters for spatial feature modulation. Adding this attention module encourages the network to learn the spatial inter-dependencies between the features of the 3D facial priors and those of the input images. As shown in Fig. 1, by embedding the 3D rendered face priors, our algorithm generates clearer and sharper facial structures without ghosting artifacts compared with other 2D prior-based methods.

The main contributions of this paper are:

  • A novel face SR model is proposed by explicitly exploiting facial structure in the form of facial prior estimation. The estimated 3D facial prior provides not only spatial information of facial components but also their 3D visibility information, which is ignored by the pixel-level content and 2D priors (e.g., landmark heatmaps and parsing maps).

  • To well adapt to the 3D reconstruction of low-resolution face images, we present a new skin-aware loss function projecting the constructed 3D coefficients onto the rendered images. In addition, we use a feature fusion-based network to better extract and integrate the face rendered priors by employing a spatial attention module.

  • Our proposed 3D facial prior is highly flexible because its modular structure allows it to be easily plugged into any SR method (e.g., SRCNN and VDSR). We qualitatively and quantitatively evaluate the proposed algorithm on multi-scale face super-resolution, especially at very low input resolutions. The proposed network achieves better SR metrics and superior visual quality compared to state-of-the-art face SR methods.

2 Related Work

Face hallucination relates closely to the natural image super-resolution problem. In this section, we discuss recent research on super-resolution and face hallucination to illustrate the necessary context for our work.

Super-Resolution Neural Networks. Recently, neural networks have demonstrated a remarkable capability to improve SR results. Since the pioneering network [7] demonstrated the effectiveness of CNNs in learning the mapping between LR and HR pairs, many CNN architectures have been proposed for SR [8, 12, 18, 19, 30, 31]. Most existing high-performance SR networks use residual blocks [17] to build deeper architectures and achieve better performance. EDSR [22] improved performance by removing unnecessary batch normalization layers in residual blocks. A residual dense network (RDN) [43] was proposed to exploit hierarchical features from all convolutional layers. Zhang et al. [42] proposed very deep residual channel attention networks (RCAN) to discard abundant low-frequency information that hinders the representational ability of CNNs. Wang et al. [34] used a spatial feature transform layer to introduce a semantic prior as an additional input to the SR network. Huang et al. [14] presented a wavelet-based CNN approach that can ultra-resolve very low-resolution face images in a unified framework. Lian et al. [21] proposed a Feature-Guided Super-Resolution Generative Adversarial Network (FG-SRGAN) for unpaired image super-resolution. However, these networks require long training times for their massive numbers of parameters to obtain good results. In our work, we largely decrease the number of training parameters while still achieving superior performance in SR metrics (SSIM and PSNR) and visual quality.

Facial Prior Knowledge. Exploiting facial priors in face hallucination, such as the spatial configuration of facial components [29], is the key factor that differentiates it from generic super-resolution tasks. Several face SR methods use facial prior knowledge to super-resolve LR faces. Wang and Tang [33] learned subspaces from LR and HR face images, and then reconstructed an HR output from the PCA coefficients of the LR input. Liu et al. [23] set up a Markov Random Field (MRF) to reduce ghosting artifacts caused by misalignments in LR images. However, these methods are prone to generating severe artifacts, especially with large pose variations and misalignments in LR images. Yu and Porikli [38] interweaved multiple spatial transformer networks [15] with deconvolutional layers to handle unaligned LR faces. Dahl et al. [5] leveraged the framework of PixelCNN [26] to super-resolve very low-resolution faces. Zhu et al. [47] presented a cascade bi-network, dubbed CBN, which first localizes LR facial components and then upsamples them; however, CBN may produce ghosting faces when localization errors occur. Recently, Yu et al. [35] used a multi-task convolutional neural network (CNN) to incorporate structural information of faces. Grm et al. [10] built a face recognition model that acts as an identity prior for the super-resolution network during training. Chen et al. [4] constructed an end-to-end SR network incorporating facial landmark heatmaps and parsing maps. Kim et al. [16] proposed a compressed version of the face alignment network (FAN) to obtain landmark heatmaps for the SR network in a progressive manner. However, existing face SR algorithms only employ 2D priors without considering higher-dimensional (3D) information. In this paper, we exploit a 3D face reconstruction branch to extract the 3D facial structure, detailed edges, illumination, and identity priors to guide face image super-resolution.

3D Face Reconstruction. The 3D shape of a face can be restored from unconstrained 2D images via 3D face reconstruction. In this paper, we employ the 3D Morphable Model (3DMM) [1, 2, 6], based on a fusion of parametric descriptions of face attributes (e.g., gender, identity, and distinctiveness), to reconstruct the 3D facial priors. The reconstructed 3D face inherits the facial features and presents clear and sharp facial components.

Fig. 2.

The proposed face super-resolution architecture. Our model consists of two branches: the top block is a ResNet-50 network that extracts the 3D facial coefficients and restores a sharp rendered face structure. The bottom block is dedicated to face super-resolution guided by the facial coefficients and rendered sharp face structures, which are concatenated and fed into the Spatial Feature Transform (SFT) layer.

Closest to ours is the work of Ren et al. [28], which utilizes 3D priors in the task of face video deblurring. Our method differs in several important ways. First, instead of simple concatenation of priors, we employ the Spatial Feature Transform block to incorporate the 3D priors in the intermediate layers by adaptively adjusting the modulation parameter pair. Specifically, the outputs of the SFT layer are adaptively controlled by the modulation parameter pair through an affine transformation applied spatially to each intermediate feature map. Second, an attention mechanism is embedded into the network as a guide to bias the allocation towards the most informative components and to model the interdependency between the 3D priors and the input.

3 The Proposed Method

The proposed face super-resolution framework presented in Fig. 2 consists of two branches: the 3D rendering network to extract the facial prior and the spatial attention module aiming to exploit the prior for the face super-resolution problem. Given a low-resolution face image, we first use the 3D rendering branch to extract the 3D face coefficients. Then a high-resolution rendered image is generated using the 3D coefficients and regarded as the high-resolution facial prior which facilitates the face super-resolving process in the spatial attention module.
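
To make the data flow concrete, the following is a minimal sketch of the two-branch forward pass under stated assumptions: the module names (render_branch, sr_branch) and their interfaces are hypothetical and only illustrate how the rendered prior and coefficients produced by the first branch feed the second.

```python
import torch

def super_resolve(lr_face, render_branch, sr_branch):
    """Hypothetical forward pass of the two-branch framework (names are illustrative).

    lr_face: low-resolution face batch, e.g. shape (N, 3, 16, 16) for the x8 setting.
    """
    # Branch 1: regress the 239-D 3DMM coefficient vector and render a sharp face prior.
    coeffs = render_branch.regress(lr_face)    # (N, 239)
    rendered = render_branch.render(coeffs)    # (N, 3, H, W) rendered facial prior

    # Branch 2: super-resolve the input, modulated by the 3D priors via SFT + attention.
    return sr_branch(lr_face, rendered, coeffs)
```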

3.1 Motivations and Advantages of 3D Facial Priors

Existing face SR algorithms employ only 2D priors, without considering higher-dimensional (3D) information. The 3D morphable facial priors are the main novelty of this work and are completely different from recent 2D prior works (e.g., the parsing maps and facial landmark heatmaps of FSRNet [4] and the landmark heatmap extraction of FAN [16]). The 3D coefficients contain abundant hierarchical knowledge, such as identity, facial expression, texture, illumination, and face pose. Furthermore, in contrast with 2D landmark-based priors, whose attention lies only at the distinct points of facial landmarks and may therefore cause facial distortions and artifacts, our 3D priors are explicit and visible, and can generate realistic and robust HR results, greatly reducing artifacts even for large pose variations and partial occlusions.

Fig. 3.

The rendered priors from our method. (a) and (d) low-resolution inputs. (b) and (e) our rendered face structures. (c) and (f) ground-truths. As shown, the reconstructed facial structures provide clear spatial locations and sharp visualization of facial components even for large pose variations (e.g., left and right facial pose positions) and partial occlusions.

Given low-resolution face images, the generated 3D rendered reconstructions are shown in Fig. 3. The rendered face predictions contain clear spatial knowledge and sharp visual quality of facial components, close to the ground-truth, even for images with large pose variations, as shown in the second row of Fig. 3. Therefore, we concatenate the reconstructed face image as an additional feature in the super-resolution network. The identity, facial expression, texture, and the element-concatenation of illumination and face pose are transformed into four feature maps and fed into the spatial feature transform block of the super-resolution network.

For real-world applications of the 3D face morphable model, there are typical problems to overcome, including large pose variations and partial occlusions. As shown in the supplementary material, the morphable model can generate realistic reconstructions under large pose variations, with faithful visual quality of the facial components. The 3D model is also robust to partial occlusions and accurately restores rendered faces occluded by glasses, hair, etc. In comparison with other SR algorithms, which are blind to unknown degradation types, our 3D model can robustly generate the 3D morphable priors that guide the SR branch to grasp clear spatial knowledge and facial components, even in complicated real-world applications. Furthermore, our 3D priors can be plugged into any network and largely improve the performance of existing SR networks (e.g., SRCNN and VDSR, as demonstrated in Sect. 5).

3.2 Formulation of 3D Facial Priors

It is still challenging for state-of-the-art edge prediction methods to acquire very sharp facial structures from low-resolution images. Therefore, a 3DMM-based model is proposed to localize the precise facial structure by generating 3D facial images constructed from a 3D coefficient vector. In addition, face images exhibit large pose variations, such as in-plane and out-of-plane rotations, and a large amount of data would be needed to learn representative features across these poses. To address this problem, we draw on the fact that the 3DMM coefficients can analytically model pose variations with a simple mathematical derivation [2, 6] and do not require a large training set. We therefore utilize a face rendering network based on ResNet-50 to regress the face coefficient vector. The output of the ResNet-50 is the representative feature vector \(\textit{\textbf{x}}=(\varvec{\alpha },\varvec{\beta },\varvec{\delta },\varvec{\gamma },\varvec{\rho })\in \mathbb {R}^{239}\), where \(\varvec{\alpha }\in \mathbb {R}^{80}\), \(\varvec{\beta }\in \mathbb {R}^{64}\), \(\varvec{\delta }\in \mathbb {R}^{80}\), \(\varvec{\gamma }\in \mathbb {R}^{9}\), and \(\varvec{\rho }\in \mathbb {R}^{6}\) represent the identity, facial expression, texture, illumination, and face pose coefficients [6], respectively.
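
As a concrete illustration, a regressed 239-dimensional vector can be split into its five coefficient groups as sketched below; the contiguous slicing order is an assumption that simply follows the dimensions listed above.

```python
import torch

def split_coefficients(x):
    """Split a regressed coefficient vector x of shape (N, 239) into its five groups.

    The group sizes (80, 64, 80, 9, 6) follow the dimensions stated in the text;
    the contiguous ordering is assumed for illustration.
    """
    alpha = x[:, 0:80]     # identity
    beta  = x[:, 80:144]   # facial expression
    delta = x[:, 144:224]  # texture
    gamma = x[:, 224:233]  # Spherical Harmonics illumination
    rho   = x[:, 233:239]  # pose: rotation (3) and translation (3)
    return alpha, beta, delta, gamma, rho
```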

According to the Morphable model [1], we transform the face coefficients to a 3D shape S and texture T of the face image as

$$\begin{aligned} \mathbf{S} = \mathbf{S}(\varvec{\alpha },\varvec{\beta }) = \overline{\mathbf{S}} + \mathbf{B}_{id}\varvec{\alpha } + \mathbf{B}_{exp}\varvec{\beta }, \end{aligned}$$
(1)

and

$$\begin{aligned} \mathbf{T} = \mathbf{T}(\varvec{\delta }) = \overline{\mathbf{T}} + \mathbf{B}_{t}\varvec{\delta }, \end{aligned}$$
(2)

where \(\overline{\mathbf{S}}\) and \(\overline{\mathbf{T}}\) denote the average face shape and texture, respectively. \(\mathbf{B}_{t}\), \(\mathbf{B}_{id}\), and \(\mathbf{B}_{exp}\) denote the base vectors of texture, identity, and expression calculated by the PCA method. We set up the illumination model by assuming a Lambertian surface for faces, and estimate the scene illumination with Spherical Harmonics (SH) [27] to derive the illumination coefficient \(\varvec{\gamma }\in \mathbb {R}^{9}\). The 3D face pose \(\varvec{\rho }\in \mathbb {R}^{6}\) is represented by rotation \(\mathbf{R}\in \mathrm{SO}(3)\) and translation \(\mathbf{t}\in \mathbb {R}^{3}\).
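
A minimal sketch of Eqs. (1) and (2) is given below, assuming the mean shape/texture and the PCA bases are available as dense arrays; the variable names and the flattened (3V,) layout are illustrative assumptions.

```python
import numpy as np

def reconstruct_shape_texture(alpha, beta, delta,
                              S_mean, B_id, B_exp, T_mean, B_t):
    """3DMM reconstruction following Eqs. (1) and (2).

    S_mean, T_mean: mean face shape / texture, flattened to shape (3V,).
    B_id (3V, 80), B_exp (3V, 64), B_t (3V, 80): PCA bases of identity,
    expression, and texture; alpha, beta, delta are the regressed coefficients.
    """
    S = S_mean + B_id @ alpha + B_exp @ beta  # Eq. (1): face shape
    T = T_mean + B_t @ delta                  # Eq. (2): face texture
    return S, T
```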

To stabilize the rendered faces, a modified \(L_2\) loss function for the 3D face reconstruction is presented based on a paired training set

$$\begin{aligned} \ell _{r}=\frac{1}{L}\sum _{j=1}^{L}\frac{\sum _{i\in {M}}{A}^i\left\| I^{i}_{j}-R^{i}_{j}(B(\textit{\textbf{x}})) \right\| _{2}}{\sum _{i\in {M}}{A}^i}, \end{aligned}$$
(3)

where j is the index of the paired images, L is the total number of training pairs, i and M denote the pixel index and the face region, respectively, I represents the sharp image, and A is a skin-color-based attention mask obtained by training a Bayes classifier with Gaussian Mixture Models [6]. In addition, x represents the LR input image, B(x) denotes the coefficients regressed by the ResNet-50 from x, and R denotes the image rendered with the 3D coefficients B(x). Rendering is the process of projecting the constructed 3D face onto the 2D image plane with the regressed pose and illumination. We use a ResNet-50 network to regress these coefficients by modifying its last fully-connected layer to 239 neurons (the same number as the coefficient parameters).
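
The skin-aware loss of Eq. (3) can be sketched as below; the differentiable renderer is assumed to be external, the per-pixel norm is taken over the RGB channels, and the tensor layout is an assumption made only for illustration.

```python
import torch

def skin_aware_render_loss(I, R, A, face_mask, eps=1e-8):
    """Sketch of the skin-aware rendering loss in Eq. (3).

    I:         sharp target images, shape (L, 3, H, W)
    R:         images rendered from the regressed coefficients B(x), shape (L, 3, H, W)
    A:         skin-color attention mask from the Bayes/GMM classifier, shape (L, 1, H, W)
    face_mask: binary face-region mask M, shape (L, 1, H, W)
    """
    per_pixel = torch.norm(I - R, p=2, dim=1, keepdim=True)  # ||I_j^i - R_j^i||_2 over RGB
    w = A * face_mask
    num = (w * per_pixel).sum(dim=(1, 2, 3))                  # sum_i A^i ||.||_2
    den = w.sum(dim=(1, 2, 3)) + eps                          # sum_i A^i
    return (num / den).mean()                                 # average over the L pairs
```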

Fig. 4.

The structure of the SFT layer. The rendered faces and feature vectors are regarded as the guidance for face super-resolution.

Coefficient Feature Transformation. Our 3D face priors consist of two parts: one directly from the rendered face region (i.e., the RGB input), and the other from the feature transformation of the coefficient parameters. The coefficient parameters \(\varvec{\alpha }, \varvec{\beta }, \varvec{\delta }, \varvec{\gamma }, \varvec{\rho }\) represent the identity, facial expression, texture, illumination, and face pose priors, respectively. The coefficient feature transformation procedure is as follows: first, the coefficients of identity, expression, texture, and the element-concatenation of illumination and face pose (\(\varvec{\gamma }+\varvec{\rho }\)) are reshaped into four matrices, with the extra elements set to zero. Afterwards, these four matrices are expanded to the same size as the LR input (16 \(\times \) 16 or 32 \(\times \) 32) by zero-padding and then scaled to the interval [0, 1]. Finally, the coefficient features are concatenated with the priors of the rendered face images.
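
A sketch of this coefficient-to-feature-map transformation is given below. The exact reshaping scheme is not specified above, so writing each group row-wise into a zero-initialized map of the LR resolution and min-max scaling it to [0, 1] are assumptions made purely for illustration.

```python
import torch

def coeff_to_feature_maps(alpha, beta, delta, gamma, rho, size=16):
    """Turn 3DMM coefficients into four prior feature maps (a sketch).

    Each coefficient group is written row-wise into a zero-initialized
    size x size map and then min-max scaled to [0, 1]; the text only states
    zero-padding to the LR resolution and scaling to [0, 1].
    """
    groups = [alpha, beta, delta, torch.cat([gamma, rho], dim=1)]  # identity, expression, texture, illum+pose
    maps = []
    for g in groups:                       # g: (N, C_g) with C_g <= size * size
        n, c = g.shape
        flat = torch.zeros(n, size * size, device=g.device)
        flat[:, :c] = g                    # zero-pad the remaining entries
        m = flat.view(n, 1, size, size)
        m_min = m.amin(dim=(2, 3), keepdim=True)
        m_max = m.amax(dim=(2, 3), keepdim=True)
        maps.append((m - m_min) / (m_max - m_min + 1e-8))  # scale to [0, 1]
    return torch.cat(maps, dim=1)          # (N, 4, size, size), later concatenated with the rendered face
```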

3.3 Spatial Attention Module

To exploit the 3D face rendered priors, we propose a Spatial Attention Module (SAM) to grasp the precise locations of face components and the facial identity. The proposed SAM consists of three parts: a spatial feature transform block, a residual channel attention block, and an upscale block.

Spatial Feature Transform Block. The 3D face priors (rendered faces and coefficient features) are imported into the spatial feature transform block [34] after a convolutional layer. The structure of the spatial feature transform layer is shown in Fig. 4. The SFT layer learns a mapping function \(\varTheta \) that provides a modulation parameter pair \((\mu ,\nu )\) according to the priors \(\psi \), such as segmentation probabilities; here, the 3D face priors are taken as the input. The outputs of the SFT layer are adaptively controlled by the modulation parameter pair through an affine transformation applied spatially to each intermediate feature map. Specifically, the intermediate transformation parameters \((\mu ,\nu )\) are derived from the priors \(\psi \) by the mapping function:

$$\begin{aligned} (\mu ,\nu )=\varTheta (\psi ). \end{aligned}$$
(4)

The intermediate feature maps are modified by scaling and shifting feature maps according to the transformation parameters:

$$\begin{aligned} \textit{\textbf{SFT}}(\textit{\textbf{F}}|\varvec{\mu },\varvec{\nu })=\varvec{\mu }\otimes \textit{\textbf{F}}+\varvec{\nu }, \end{aligned}$$
(5)

where \(\textit{\textbf{F}}\) denotes the feature maps, and \(\otimes \) indicates element-wise multiplication. At this step, the SFT layer implements the spatial-wise transformation.
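
A minimal PyTorch-style sketch of the SFT layer in Eqs. (4) and (5) follows; the small convolutional trunk and the two heads that realize \(\varTheta \) are assumptions, since the exact layer widths are not given here.

```python
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial feature transform (Eqs. 4-5): (mu, nu) = Theta(psi); F' = mu * F + nu."""

    def __init__(self, prior_channels, feature_channels, hidden=32):
        super().__init__()
        # Theta: shared trunk plus two heads predicting the modulation pair (mu, nu).
        self.shared = nn.Sequential(
            nn.Conv2d(prior_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_mu = nn.Conv2d(hidden, feature_channels, 3, padding=1)
        self.to_nu = nn.Conv2d(hidden, feature_channels, 3, padding=1)

    def forward(self, features, priors):
        h = self.shared(priors)            # priors psi: rendered face + coefficient maps
        mu, nu = self.to_mu(h), self.to_nu(h)
        return mu * features + nu          # element-wise spatial affine modulation (Eq. 5)
```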

Residual Channel Attention Block. An attention mechanism can be viewed as a guide that biases the allocation of available processing resources towards the most informative components of the input [13]. Consequently, a channel attention mechanism is employed to explore the most informative components and the inter-dependencies between channels. Inspired by the residual channel attention network [42], the attention module is composed of a series of residual channel attention blocks (RCAB), shown in Fig. 2. For the b-th block, the output \(\varvec{F_b}\) of the RCAB is obtained by:

$$\begin{aligned} \varvec{F_b}=\varvec{F_{b-1}}+C_b(\varvec{X_{b}})\cdot \varvec{X_{b}}, \end{aligned}$$
(6)

where \(C_b\) denotes the channel attention function, \(\varvec{F_{b-1}}\) is the block's input, and \(\varvec{X_{b}}\) is computed by two stacked convolutional layers. The upscale block consists of progressive deconvolutional layers (also known as transposed convolutions).
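
The residual channel attention block of Eq. (6) can be sketched as follows; the layer widths and the channel-reduction ratio are assumptions in the spirit of RCAN [42].

```python
import torch.nn as nn

class RCAB(nn.Module):
    """Residual channel attention block: F_b = F_{b-1} + C_b(X_b) * X_b (Eq. 6)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # X_b: two stacked convolutional layers applied to the block input.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        # C_b: channel attention computed from globally pooled channel statistics.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, f_prev):
        x_b = self.body(f_prev)
        return f_prev + self.attention(x_b) * x_b
```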

4 Experimental Results

To evaluate the performance of the proposed face super-resolution network, we qualitatively and quantitatively compare our algorithm against nine state-of-the-art super-resolution and face hallucination methods: the Very Deep Super-Resolution Network (VDSR) [17], the Very Deep Residual Channel Attention Network (RCAN) [42], the Residual Dense Network (RDN) [43], the Super-Resolution Convolutional Neural Network (SRCNN) [7], the Transformative Discriminative Autoencoder (TDAE) [38], the Wavelet-based CNN for Multi-scale Face Super-Resolution (Wavelet-SRNet) [14], the deep end-to-end trainable face SR network (FSRNet) [4], the face SR generative adversarial network (FSRGAN) [4] incorporating 2D facial landmark heatmaps and parsing maps, and the progressive face super-resolution network via face alignment network (PSR-FAN) [16] using 2D landmark heatmap priors. We use the open-source implementations from the authors and train all the networks on the same dataset for a fair comparison. For simplicity, we refer to the proposed network as the Spatial Attention Module guided by 3D priors, or SAM3D. In addition, to demonstrate the plug-in characteristic of the proposed 3D facial priors, we propose two models, SRCNN+3D and VDSR+3D, by embedding the 3D facial prior as an extra input channel to the basic backbones of SRCNN [7] and VDSR [17]. The implementation code will be made available to the public. More analyses and results can be found in the supplementary material.

4.1 Datasets and Implementation Details

CelebA [25] and Menpo [40] datasets are used to verify the performance of the algorithm. The training phase uses 162,080 images from the CelebA dataset. In the testing phase, 40,519 images from the CelebA test set are used, along with the large-pose-variation test set from the Menpo dataset. Each facial-pose test set of Menpo (left, right, and semi-frontal) contains 1,000 images. We follow the protocols of existing face SR methods (e.g., [4, 16, 35, 36]) and generate the LR inputs by bicubic downsampling. The HR ground-truth images are obtained by center-cropping the facial images and then resizing them to 128 \(\times \) 128 pixels. The LR face images are generated by downsampling the HR ground-truths to 32 \(\times \) 32 pixels (\(\times \)4 scale) and 16 \(\times \) 16 pixels (\(\times \)8 scale). In our network, the ADAM optimizer is used with a batch size of 64 for training, and the input images are center-cropped RGB images. The initial learning rate is 0.0002 and is halved every 50 epochs. The whole training process takes 2 days on an NVIDIA Titan X GPU.
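
A small sketch of this degradation and training setup is shown below, assuming PyTorch tensors with values in [0, 1]; the helper names are hypothetical and only mirror the protocol described above.

```python
import torch
import torch.nn.functional as F

def make_lr(hr, scale=8):
    """Bicubic downsampling used to synthesize LR inputs from 128 x 128 HR crops.

    hr: HR ground-truth batch, shape (N, 3, 128, 128), values in [0, 1].
    scale: 4 gives 32 x 32 inputs, 8 gives 16 x 16 inputs.
    """
    lr = F.interpolate(hr, scale_factor=1.0 / scale, mode="bicubic", align_corners=False)
    return lr.clamp(0.0, 1.0)

def training_setup(model):
    """ADAM optimizer with initial lr 0.0002, halved every 50 epochs (batch size 64)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
    return optimizer, scheduler
```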

Fig. 5.

Comparison with state-of-the-art methods: magnification factor \(\times \)4 and input resolution 32 \(\times \) 32. Our algorithm exploits the regularity present in face regions better than other methods. Best viewed by zooming in on the screen.

Fig. 6.

Comparison with state-of-the-art methods: magnification factor \(\times \)8 and input resolution 16 \(\times \) 16. Best viewed by zooming in on the screen.

Table 1. Quantitative results on the CelebA test dataset. The best results are highlighted in bold.
Table 2. Quantitative results of different large facial pose variations (e.g., left, right, and semifrontal) on the Menpo test dataset. The best results are highlighted in bold.

4.2 Quantitative Results

Quantitative evaluation of the network using PSNR and the structural similarity (SSIM) scores for the CelebA test set is listed in Table 1. Furthermore, to analyze the performance and stability of the proposed method with respect to large face pose variations, three cases corresponding to different face poses (left, right, and semifrontal) of the Menpo test data are listed in Table 2.
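
For reference, the two reported metrics can be computed as sketched below; the use of scikit-image for SSIM is an assumption about the evaluation toolchain, which is not specified in the text.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(hr, sr, max_val=255.0):
    """Peak signal-to-noise ratio between the HR ground truth and the SR result."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(hr, sr):
    """Structural similarity on uint8 RGB (H, W, 3) images; toolchain assumed."""
    return structural_similarity(hr, sr, channel_axis=2, data_range=255)
```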

CelebA Test: As shown in Table 1, VDSR+3D (the basic VDSR model [17] guided by the proposed 3D facial priors) achieves significantly better results (1 dB higher than the next best method and 2 dB higher than the basic VDSR method in \(\times \)8 SR), even compared with methods with large numbers of parameters, such as RDN and RCAN. It is worth noting that VDSR+3D still performs slightly worse than the proposed SAM3D. These results demonstrate that the proposed 3D priors make a significant contribution to the performance improvement (1.6 dB on average) in face super-resolution. In comparison with methods based on 2D priors (e.g., FSRNet and PSR-FAN), our algorithm performs much better (2.73 dB higher than PSR-FAN and 2.78 dB higher than FSRNet).

Fig. 7.

Visual comparison with state-of-the-art methods (\(\times \)8). The results by the proposed method have fewer artifacts on face components (e.g., eyes, mouth, and nose).

Fig. 8.

Ablation study results: Comparisons between our proposed model with different configurations, with PSNR and SSIM relative to the ground truth. (a) and (e) are the inputs. (b) and (f) are the SR results without using the rendered priors. (c) and (g) are the SR results without the Spatial Attention Module. (d) and (h) are our SR results.

Menpo Test: To verify the effectiveness and stability of the proposed network under face pose variations, the quantitative results on the dataset with large pose variations are reported in Table 2. While ours (SAM3D) performs best, VDSR+3D also achieves a 1.8 dB improvement over the basic VDSR method at the \(\times \)4 magnification factor. Our 3D facial priors based method remains the most effective approach to boosting SR performance compared with 2D heatmap and parsing-map priors.

4.3 Qualitative Evaluation

Fig. 9.

Qualitative evaluation with different ablation configurations: SRCNN+3D and VDSR+3D denote the basic method (SRCNN and VDSR) incorporating the 3D facial priors; Ours (SAM3D) means the Spatial Attention Module incorporating the 3D facial priors. Our 3D priors enable the basic methods to avoid some artifacts around the key facial components and to generate sharper edges.

The qualitative results of our methods at different magnifications (\(\times \)4 and \(\times \)8) are shown in Figs. 5 and 6, respectively. It can be observed that our proposed method recovers clearer faces with finer component details (e.g., noses, eyes, and mouths). The outputs of most methods (e.g., PSR-FAN, RCAN, RDN, and Wavelet-SRNet) contain artifacts around facial components such as the eyes and nose, as shown in Figs. 1 and 7, especially when facial images are partially occluded. After adding the rendered face priors, our results show clearer and sharper facial structures without ghosting artifacts. This illustrates that the proposed 3D priors help the network understand the spatial locations and the entire structure of the face, and largely avoid the artifacts and significant distortions in facial attributes that are common with facial landmark priors, whose attention is applied merely to the distinct points of facial landmarks.

5 Analyses and Discussions

Ablation Study: In this section, we conduct an ablation study to demonstrate the effectiveness of each module. We compare the proposed network with and without using the rendered 3D face priors and the Spatial Attention Module (SAM) in terms of PSNR and SSIM on the \(\times \)8 scale test data. As shown in Fig. 8(b) and (f), the baseline method without using the rendered faces and SAM tends to generate blurry faces that cannot capture sharp structures. Figure 8(c) and (g) show clearer and sharper facial structures after adding the 3D rendered priors. By using both SAM and 3D priors, the visual quality is further improved in Fig. 8(d) and (h). The quantitative comparisons between (VDSR, our VDSR+3D, and our SAM3D) in Tables 1 and 2 also illustrate the effectiveness of the proposed rendered priors and the spatial attention module.

To verify the advantage of 3D facial structure priors in terms of convergence and accuracy, three different configurations are designed: the basic methods (i.e., SRCNN [7] and VDSR [17]); the basic methods incorporating 3D facial priors (i.e., SRCNN+3D and VDSR+3D); and the proposed method using the Spatial Attention Module and 3D priors (SAM3D). The validation accuracy curve of each configuration is plotted over the epochs to show the effectiveness of each block. The priors are easy to insert into any network; they only marginally increase the number of parameters, but significantly improve the accuracy and convergence of the algorithms, as shown in Supplementary Fig. 3. The basic SRCNN and VDSR methods incorporating the facial rendered priors tend to avoid some artifacts around key facial components and generate sharper edges compared to the baselines without the facial priors. Adding the Spatial Attention Module further helps the network exploit the priors and generate sharper facial structures, as shown in Fig. 9.

Results on Real-World Images: For real-world LR images, we provide quantitative and qualitative analyses on 500 LR faces from the WiderFace dataset (\(\times \)4) in Supplementary Table 1 and Fig. 1.

Model Size and Running Time: We evaluate the proposed method and state-of-the-art SR methods on the same server with an Intel Xeon W-2123 CPU and an NVIDIA TITAN X GPU. Our proposed SAM3D, embedded with the 3D priors, is more lightweight and less time-consuming, as shown in Supplementary Fig. 2.

6 Conclusions

In this paper, we proposed a face super-resolution network that incorporates novel 3D facial priors of rendered faces and multi-dimensional knowledge. In the 3D rendering branch, we presented a face rendering loss to encourage a high-quality guidance image providing clear spatial locations of facial components and other hierarchical information (i.e., expression, illumination, and face pose). Compared with existing 2D facial priors, whose attention is focused on the distinct points of landmarks and may thus cause face distortions, our 3D priors are explicit, visible, and highly realistic, and can largely decrease the occurrence of face artifacts. To fully exploit the 3D priors and account for the channel correlation between priors and inputs, we employed the Spatial Feature Transform and attention blocks. Comprehensive experimental results demonstrate that the proposed method achieves superior performance and largely decreases artifacts in comparison with state-of-the-art methods.