
1 Introduction

Face hallucination, a representative low-level vision task, is the process of reconstructing a clean face image from a degraded observation. It is not only a fundamental problem in face analysis but can also serve as a preprocessing step for tasks such as face recognition [1] and face alignment [2]. In practical applications, however, face images captured by surveillance cameras are generally of poor quality and difficult to use directly.

Over the past few decades, many conventional methods [3,4,5,6] have been proposed to solve this problem. They assume a correlation between degraded face images and clean ones, and focus on learning a mapping from degraded images to clean images. However, most of these methods must solve a large number of optimization problems during the reconstruction phase, which makes high-performance applications difficult to implement. Moreover, because the degradation process is complex, the assumed consistency between degraded and clean face images often does not hold well. As a result, the reconstructions are often unsatisfactory.

Recently, with the development of convolutional neural networks, many deep learning methods [7,8,9,10] have been applied to image reconstruction. Face prior knowledge and spatial structure information are often used as additional cues for face hallucination. Despite their high reconstruction quality, most of these methods suffer from two major drawbacks. First, the structure differs greatly across face regions, making it difficult to learn a single mapping that suits all regions. Second, in a noisy environment, the prior information and spatial structure of the face are corrupted, which makes it difficult to generate satisfactory results.

Reconstructing a high-resolution (HR) face image in a noisy environment is therefore a difficult problem. Inspired by the recent success of aggregation networks in computer vision tasks [11, 17,18,19], we propose the Adaptive Aggregation Network for noisy face hallucination. Our network contains two branches: an aggregation branch and a generator branch. The aggregation branch clusters the face image into two robust regions in a data-driven manner, according to the similarity of the regression function. The generator branch then performs face hallucination specifically for each selected region.

The main contribution of this paper is an effective model for face hallucination in noisy environments. For noisy inputs, it is often difficult to reconstruct a satisfactory result because the degradation process is complex and the face prior structure is destroyed. Compared with other methods, our method not only provides robust face structure information under noisy conditions but also turns a complex face hallucination problem into two relatively simple sub-problems. The empirical results show that our network surpasses state-of-the-art methods in terms of effectiveness and efficiency.

2 Related Work

2.1 Face Hallucination and Image Super-Resolution

Face hallucination is a special case of image super-resolution that introduces face prior structure information to reconstruct face images. Early techniques assumed that the face was in a controlled setting with small variations. Ma et al. [3] utilized face prior information to reconstruct HR faces by solving a constrained least squares problem, known as Least Square Representation (LSR). Yang et al. [4] assumed that low-resolution (LR) and high-resolution faces share similar sparse priors and reconstructed HR faces through low-dimensional projections. Later, on the basis of locality and sparseness, Jiang et al. [5, 6] proposed the Local Constraint Representation (LcR) method to obtain better reconstructed faces. However, these methods require facial landmarks to be detected beforehand, which often fails when the images are seriously degraded.

In recent years, many deep learning methods have been applied to image hallucination and have achieved great progress. In particular, Dong et al. [7] first proposed a super-resolution convolutional neural network (SRCNN), which performs image reconstruction in a way that is equivalent to sparse coding. Kim et al. [8] proposed a deep convolutional network that achieves better reconstruction performance by using skip connections and learning the residual between HR and LR images. Zhou et al. [10] proposed a bi-channel convolutional neural network for face hallucination. They point out the importance of the input image and use fully connected layers to restore the HR face. Tuzel et al. [9] added global face information to the network and reconstructed the face image by considering both global and local constraints.

2.2 Adaptive Aggregation Network

Adaptive aggregation networks have been developed in recent work and have benefited various tasks [11,12,13], such as object classification [11] and human pose estimation [12, 13]. Since contextual information is important for computer vision problems, most of these works attempt to acquire dense features adaptively by focusing on top-level information. The recently proposed Residual Attention Network [11] achieves state-of-the-art results on image classification. It uses a deep network module that captures top-level information as an adaptive aggregation module. The aggregation module is applied to the input image to select important regions, which are then fed to another deep network module for classification. Chen et al. [12] used a stacked hourglass network structure to fuse information from multiple contexts to predict human pose, benefiting from both global and local information.

3 Proposed Method

3.1 Problem Formulation

We denote the input noisy face image and the corresponding clean image as X and Y, respectively. The process of obtaining the noisy LR image from the HR image can be modeled as

$$\begin{aligned} X=DBY+N \end{aligned}$$
(1)

where D, B and N denote the downsampling operator, the blur operator and the additive noise, respectively.
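The degradation model in Eq. (1) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' data pipeline: it assumes a single-channel image with intensities in [0, 255], a Gaussian blur whose standard deviation is `sigma_blur`, plain decimation as the downsampling operator, and zero-mean Gaussian noise. The function names `blur` and `degrade` are hypothetical.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1-D Gaussian kernel with radius ceil(3 * sigma)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    """B: separable Gaussian blur with edge padding; img has shape (H, W)."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def degrade(hr, sigma_blur=1.0, factor=4, sigma_noise=15.0, rng=None):
    """X = D B Y + N under the assumptions stated above."""
    rng = np.random.default_rng() if rng is None else rng
    blurred = blur(hr.astype(np.float64), sigma_blur)   # B Y
    lr = blurred[::factor, ::factor]                    # D (decimation, assumed)
    noise = rng.normal(0.0, sigma_noise, lr.shape)      # N (Gaussian, assumed)
    return lr + noise
```

With a 128×128 input and `factor=4`, `degrade` returns a 32×32 array.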

Fig. 1. Examples of experimental results. (a) Target image. (b) LR face image with noise. (c) Result of directly training [8]. (d) Result of our method.

For a given LR image, the face hallucination network F is expected to predict a hallucinated face that is as similar as possible to the ground-truth HR image by minimizing the mean square error (MSE):

$$\begin{aligned} L=\frac{1}{N_{I}}||F(X)-Y||_{F}^{2} \end{aligned}$$
(2)

However, we found that the result obtained by directly training the network of [8] in the image domain (the direct network) is not satisfactory. Fig. 1 shows an example of a hallucinated face image that is used in the training process. In Fig. 1(c), the image hallucinated by direct training is severely over-smoothed in some details. In general, the learned regression function is applied to the entire image, which means that it must account for a wide variety of situations. However, the optimal regression function differs from region to region. In other words, a single regression function must handle all regions, which makes it hard to learn well. As a result, the reconstruction in some areas is relatively smooth.

3.2 The Network Architecture

To solve this problem, we propose an effective model for noisy face hallucination. The detailed structure of the network is shown in Fig. 2. It is divided into two branches: an aggregation branch and a generator branch. According to the similarity of the regression parameters, the aggregation branch adaptively aggregates face regions into two categories. The generator branch is then trained to recover HR images specifically for the selected regions.

We denote the network input as X. The training objective of the network can be summarized as

$$\begin{aligned} L=\frac{1}{N_{I}}||( G_{1}(X,\xi _{1})-Y) M (X,\varPhi )+ (G_{2}(X,\xi _{2})-Y) (1-M (X,\varPhi ))|| _{F}^{2} \end{aligned}$$
(3)

where M and G denote the outputs of the aggregation branch and the generator branch, respectively, and \(\xi \) and \(\varPhi \) denote the parameters to be learned. The aggregation branch aggregates the face regions into two categories, \(M (X,\varPhi )\) and \(1-M (X,\varPhi )\). Each generator branch is then targeted at recovering HR images for its selected regions. Finally, the mask-weighted outputs of the two generator branches are added to produce the final output.
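As a hedged illustration of Eq. (3), the sketch below evaluates the loss for given branch outputs. The paper does not define \(N_{I}\); here it is assumed to be the number of elements in Y. The function name `aggregated_loss` and the array arguments are hypothetical stand-ins for the outputs \(G_{1}\), \(G_{2}\) and M.

```python
import numpy as np

def aggregated_loss(g1, g2, mask, y):
    """Eq. (3): squared Frobenius norm of the mask-weighted residuals.

    g1, g2 : outputs of the two generator branches
    mask   : aggregation-branch output with values in [0, 1]
    y      : ground-truth HR image
    N_I is assumed to be the number of elements in y.
    """
    residual = (g1 - y) * mask + (g2 - y) * (1.0 - mask)
    return np.sum(residual ** 2) / y.size

# Because mask + (1 - mask) = 1, the same value is obtained from the
# blended output g1 * mask + g2 * (1 - mask) compared against y.
rng = np.random.default_rng(0)
y, g1, g2 = rng.random((3, 8, 8))
mask = rng.random((8, 8))
blended = g1 * mask + g2 * (1.0 - mask)
assert np.isclose(aggregated_loss(g1, g2, mask, y), np.mean((blended - y) ** 2))
```

The final assertion checks that the mask-weighted residual form and the blended-output form of the loss agree.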

Fig. 2. The detailed structure of the network.

The aggregation branch not only serves as a mask selector during forward inference but also guides the gradient update of the generator branch during backpropagation. For the generator branch, the gradient of the masked residual with respect to the generator parameters \(\xi _{1}\) is:

$$\begin{aligned} \frac{\mathrm {d} ( G_{1}(X,\xi _{1})-Y) M (X,\varPhi )}{\mathrm {d}\xi _{1}}=M (X,\varPhi )\frac{\mathrm {d} G_{1}(X,\xi _{1})}{\mathrm {d}\xi _{1}} \end{aligned}$$
(4)

This property allows the generator branch to better reconstruct the selected regions, because the aggregation branch prevents unrelated data from updating the parameters of the generator branch.
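The gating effect of Eq. (4) can be checked numerically on a toy problem. The sketch below assumes a deliberately simple generator, \(G_{1}(X,\xi _{1})=\xi _{1}X\) with a scalar parameter, which is not the residual network used in the paper. It only illustrates that pixels where the mask is zero contribute nothing to the parameter gradient.

```python
import numpy as np

def masked_loss(xi, x, y, mask):
    """Squared masked residual for the toy generator G1(X, xi) = xi * X."""
    return np.sum(((xi * x - y) * mask) ** 2)

def masked_grad(xi, x, y, mask):
    """Analytic d(loss)/d(xi), using Eq. (4) with dG1/dxi = X."""
    residual = (xi * x - y) * mask             # (G1 - Y) M
    return np.sum(2.0 * residual * mask * x)   # chain rule: 2 r * M * dG1/dxi

rng = np.random.default_rng(0)
x, y = rng.random((2, 4, 4))
mask = np.zeros((4, 4))
mask[:2] = 1.0                                 # only the top half is selected

grad = masked_grad(0.5, x, y, mask)

# Changing the target where the mask is zero leaves the gradient unchanged.
y_changed = y.copy()
y_changed[2:] += 10.0
assert np.isclose(masked_grad(0.5, x, y_changed, mask), grad)
```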

In turn, good recovery by the generator branches causes the aggregation branch to cluster more similar structures into each region. Assuming that the parameters of the generator branches are fixed, the objective of our network becomes:

$$\begin{aligned} \arg \min _{\varPhi } ||G_{1}(X,\xi _{1}) M (X,\varPhi )+ G_{2}(X,\xi _{2}) (1-M (X,\varPhi ))-Y|| _{F}^{2} \end{aligned}$$
(5)

To minimize this loss, the aggregation branch assigns greater weight to the generator that reconstructs a region better, which means that the aggregation branch clusters more similar structures into each category. Through a process similar to alternating minimization, our network is continually optimized to generate better results.
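To make the mask update in Eq. (5) concrete, the sketch below solves it in an idealized setting where every mask value is a free variable in [0, 1] rather than the output of a network. Under that assumption the per-pixel objective is a one-dimensional quadratic whose constrained minimizer is a clipped ratio. The function name `ideal_mask` is hypothetical and the paper's aggregation branch only approximates this behaviour through learning.

```python
import numpy as np

def ideal_mask(g1, g2, y, eps=1e-12):
    """Per-pixel minimizer of (g1 * m + g2 * (1 - m) - y)^2 over m in [0, 1].

    The unconstrained optimum is m = (y - g2) / (g1 - g2); clipping to
    [0, 1] gives the constrained optimum because the objective is convex
    in m. Where g1 == g2 every m is optimal, so 0.5 is returned.
    """
    diff = g1 - g2
    distinct = np.abs(diff) > eps
    safe_diff = np.where(distinct, diff, 1.0)   # avoid division by zero
    m = np.where(distinct, (y - g2) / safe_diff, 0.5)
    return np.clip(m, 0.0, 1.0)

g1 = np.array([1.0, 0.0, 0.2])
g2 = np.array([0.0, 1.0, 0.8])
y = np.array([1.0, 1.0, 0.5])
mask = ideal_mask(g1, g2, y)
# Pixel 0: G1 is exact, so the mask is 1. Pixel 1: G2 is exact, so the
# mask is 0. Pixel 2: the target lies midway, so the mask is 0.5.
assert np.allclose(mask, [1.0, 0.0, 0.5])
```

The mask moves toward 1 where \(G_{1}\) is closer to the target and toward 0 where \(G_{2}\) is closer.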

Fig. 3. The detailed structure of the aggregation branch.

Aggregation Branch. Our aggregation branch adopts a structure similar to the hourglass network, which has been used for human pose estimation [12, 13], to cluster face regions. The detailed structure of the aggregation network is shown in Fig. 3. The network consists of multiple feature encoding blocks and feature decoding blocks. Each pair of feature encoding and feature decoding blocks brings the feature representation to a new spatial scale, so that the whole network can process information at different scales. To effectively consolidate and preserve spatial information at different scales, the hourglass block uses skip connections between symmetrical layers. Specifically, the feature information of the input is quickly collected by multiple feature encoding blocks, and the feature decoding blocks then upsample the features back to the same scale as the input. Finally, a sigmoid layer normalizes the output to the range [0, 1] to produce a mask. Compared with a plain CNN, the hourglass structure can cover a wider range of input information at a lower computational cost.
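The data flow of such an hourglass can be sketched without any learned weights. In the illustration below, 2×2 average pooling stands in for a feature encoding block, nearest-neighbour upsampling stands in for a feature decoding block, and the skip connections are additive. All three choices are assumptions made for the sketch. The real blocks contain learned convolutions whose details are given only in Fig. 3.

```python
import numpy as np

def encode(feat):
    """Stand-in encoding block: 2x2 average pooling halves each dimension."""
    h, w = feat.shape
    return feat.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def decode(feat):
    """Stand-in decoding block: nearest-neighbour upsampling doubles each dimension."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def hourglass_mask(x, depth=4):
    """Encode `depth` times, decode with additive skips, then apply a sigmoid.

    x must have height and width divisible by 2 ** depth.
    """
    skips = []
    feat = x
    for _ in range(depth):
        skips.append(feat)          # remember the feature map at this scale
        feat = encode(feat)
    for skip in reversed(skips):
        feat = decode(feat) + skip  # skip connection between symmetric levels
    return 1.0 / (1.0 + np.exp(-feat))  # sigmoid maps the output into (0, 1)
```

For a 32×32 input and `depth=4`, the feature map shrinks to 2×2 at the bottleneck and the returned mask is again 32×32.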

Generator Branch. The generator branch generates face images and can be adapted to any state-of-the-art network structure. Considering the success of residual networks [14] in computer vision tasks, we choose the residual block as the basic unit of our network for hallucinating face images. The generator branch consists of a cascade of multiple residual blocks. Each residual block contains two convolutional layers, and the block input is passed through a skip connection and summed element-wise with the output of the last convolutional layer.
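A residual block of this form can be sketched in NumPy as follows. The text does not state which activation is used or whether biases and normalization are present, so the ReLU between the two convolutions and the absence of biases are assumptions. The function names `conv3x3` and `residual_block` are hypothetical.

```python
import numpy as np

def conv3x3(x, w):
    """Zero-padded 'same' 3x3 convolution (cross-correlation form).

    x : feature map of shape (H, W, C_in)
    w : kernel of shape (3, 3, C_in, C_out)
    """
    h, wd, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(3):
        for j in range(3):
            # Each kernel tap is a channel-mixing matrix applied to a shifted view.
            out += padded[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def residual_block(x, w1, w2):
    """Two 3x3 convolutions plus an element-wise skip connection."""
    hidden = np.maximum(conv3x3(x, w1), 0.0)   # ReLU (assumed activation)
    return x + conv3x3(hidden, w2)             # skip connection: input + conv output
```

With all-zero kernels the block reduces to the identity, which is the property that makes deep cascades of such blocks easy to optimize.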

4 Experiments

4.1 Dataset

We conducted extensive experiments on the Celebrity Face Attributes (CelebA) dataset [1]. The CelebA dataset contains 202,599 face images of 10,177 celebrity identities and is widely used for face-related training. In our experiments, we first aligned the images with the MTCNN method [15] and cropped the center image patches of size \(128\times 128\) as the HR face images. We then generated LR images by applying blur, down-sampling and additive noise to the HR images. We used a fixed Gaussian blur kernel with b = 1.0 and a down-sampling factor of 4, and considered three noise levels, \(\sigma = 5\), 15 and 25. We selected 22k faces from the dataset, of which 20k were used for training and the rest for testing.

4.2 Parameter Settings

In our aggregation network, the number of feature encoding blocks and of feature decoding blocks is 4, and each residual block consists of 2 convolution layers with a kernel size of \(3\times 3\). The input layer and each branch’s output layer have 3 feature maps, and all other layers have 64. We implemented and trained our model on the TensorFlow platform. The model is trained using the Adam optimization algorithm with an initial learning rate of 1e−3. We train for 50 epochs in total, using a learning rate of 1e−4 for the last 20 epochs. Training our network on the CelebA dataset takes about 6 h on one Titan X GPU.
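The learning-rate schedule described above can be written as a small helper. This sketch assumes a piecewise-constant schedule with 0-indexed epochs and a single step at the start of the last 20 epochs. The function name `learning_rate` is hypothetical and is not part of any framework API.

```python
def learning_rate(epoch, total_epochs=50, low_lr_epochs=20,
                  base_lr=1e-3, low_lr=1e-4):
    """Return the learning rate for a 0-indexed epoch.

    The first total_epochs - low_lr_epochs epochs use base_lr and the
    remaining low_lr_epochs epochs use low_lr.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    if epoch < total_epochs - low_lr_epochs:
        return base_lr
    return low_lr
```

With the default arguments, epochs 0 to 29 use 1e-3 and epochs 30 to 49 use 1e-4.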

Fig. 4. Comparison of the hallucinated HR images with noise level 25.

Table 1. Quantitative comparison under Gaussian noise.

4.3 Comparisons

We compare our approach with two types of methods: general image super-resolution (SR) methods and face hallucination approaches. For general image SR methods, we compare with SRCNN [7] and VDSR [8]. For face hallucination methods, we choose GLN [9] and BCCNN [10] as comparison methods. We evaluate the reconstructed faces with the widely used PSNR (peak signal-to-noise ratio), SSIM (structural similarity) and FSIM (feature similarity) [16] metrics.
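Of the three metrics, PSNR is the simplest to state. It is defined as \(10\log _{10}(\mathrm {peak}^{2}/\mathrm {MSE})\). The sketch below assumes 8-bit images with a peak value of 255. The paper does not say whether PSNR is computed on RGB or on a luminance channel, so this function simply averages over all array elements.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 25.5 intensity levels gives an MSE of 650.25, so the ratio \(255^{2}/650.25\) equals 100 and the PSNR is 20 dB.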

4.4 Results

We use the test set to generate several types of LR faces with different noise levels. Figure 4 shows the results of our model in comparison with other methods. Zhou’s BCCNN [10] cannot eliminate the noise. Part of the final BCCNN result comes directly from the noisy input face image, so the network does not converge well enough to obtain better results. Tuzel’s GLN [9], like BCCNN, introduces structural information of the face. However, in a noisy environment, the prior structural information of the face is not stable enough for the network to converge to satisfactory results. Dong’s SRCNN [7] also cannot remove the noise, because it has only 3 convolutional layers and too few parameters to produce satisfactory results. Kim’s VDSR [8] produces a clean face with more facial detail than SRCNN. However, this method uses the same regression function for all regions, resulting in poor reconstruction of some facial regions. Our method produces better results: it not only removes the noise but also preserves more facial feature information.

Table 1 compares our model with other state-of-the-art methods at different noise levels. In terms of the PSNR, SSIM and FSIM metrics, our model outperforms all comparison methods.

5 Conclusion

In this paper, we present a novel face hallucination framework that uses an adaptive aggregation network to guide noisy face hallucination. Our network contains two branches: an aggregation branch and a generator branch. Specifically, the aggregation branch explores the mapping relationships from LR to HR images in different regions and aggregates the regions by the similarity of these mappings. The generator branch then performs face hallucination specifically for the selected regions to obtain better reconstruction results. The experimental results show that our model achieves state-of-the-art performance in noisy face hallucination.