1 Introduction

Face Super-Resolution (SR) is an important preprocessing step for high-level vision tasks like facial detection and recognition. Robustness to real degradations like noise, blur, compression artifacts, etc. is one of the key strengths of the human visual system and hence highly desirable in machine vision applications as well. Incorporating this robustness into the super-resolution stage itself would ease all the downstream tasks. Unfortunately, most face SR methods are trained with a fixed degradation model (downsampling with a known kernel and adding noise) that is unable to capture the complexity and diversity of real degradations, and hence they perform poorly when applied to real degraded face images. This problem becomes more pronounced when the image is extremely small: since most of the useful information is degraded, the ambiguity in the reconstruction process increases further. Previous methods such as [3, 32, 34] use facial heatmaps and facial landmarks as priors to reduce ambiguity, [30, 33] leverage autoencoders to build networks which are robust to synthetic noise, and [18] leverages the wavelet transform to train a network which is robust to Gaussian noise. However, none of the above methods has been proven to be robust to real degradation except [3]. In [4], a Generative Adversarial Network (GAN) was trained to generate realistically degraded Low-Resolution (LR) versions of clean High-Resolution (HR) face images, and another GAN was trained to super-resolve the synthetically degraded images to their corresponding clean HR counterparts. To the best of our knowledge, this is the only previous work which super-resolves real degraded faces without the aid of any facial priors. However, we observed that [4] produces visually different outputs for different degradations. This can be attributed to the fact that the network sees every degraded image independently and there is no explicit constraint to extract the same features from different degraded versions of the same image.

In this paper, we focus on incorporating robustness to degradations into the task of tiny face super-resolution, without the need for a face-specific prior and without a dataset of degraded LR-clean HR image pairs. Premised upon the observation that humans are remarkably adept at registering different degraded versions of the same image as visually similar, we prepend a smooth feature extractor module to our Super-Resolution (SR) module. Since our feature extractor is smooth with respect to real degradations, its output does not vary wildly when we move from clean images to degraded images. The SR module, which produces clean HR images from features extracted by the smooth feature extractor, thus produces similar images regardless of the degradation. Features which remain smooth under degradations are also features that are common between clean and degraded LR. So, our network, in essence, learns to look at features which are similar between clean and degraded LR.

Following [4], we train a GAN to convert clean LR images to corresponding degraded LR images. One training iteration of our network involves two backpropagations. During the first backpropagation, we update the parameters of both modules of our network to learn a super-resolution mapping from an interpolated LR image (a combination of the clean and degraded LR) to its corresponding clean HR image. The interpolation is carried out to avoid having the network overfit one of the two LR domains (clean and degraded). During the second backpropagation, we minimize the Entropy Regularized Wasserstein Distance between the features extracted from the clean and degraded LR images and those extracted from the interpolated LR image. The interpolation also helps in ensuring smoothness of the feature extractor.

At test time, we put an image (clean or degraded) through the feature extractor module first and then feed the extracted features to the SR module to get the corresponding super-resolved image. Since the extracted features do not change significantly between clean and degraded images, the super-resolution output for a degraded image does not change significantly from that of a clean image. We perform tests to visualise the robustness of our network as well as the smoothness of the features extracted by our feature extractor.

The main contributions of our work are as follows:

  • We propose a new approach for unpaired face SR where the SR network relies on features that are common between corresponding clean and degraded images.

  • To the best of our knowledge, ours is the first work that handles robustness separately from the task of super-resolution. This enables us to explicitly enforce robustness constraints on the network.

2 Related Works

Single Image Super-Resolution (SISR) is a highly ill-posed inverse problem. Traditional methods mostly impose handcrafted constraints as priors to restrict the space of solutions. With the availability of large-scale image datasets and the consistent success of Convolutional Neural Networks (CNNs), learning (rather than handcrafting) a prior from a set of natural images became possible. Many such approaches have been explored subsequently.

2.1 Deep Single Image Super-Resolution

We classify all deep Single Image Super-Resolution (SISR) methods into two broad categories: (i) deep paired SISR and (ii) deep unpaired SISR. In paired SISR, corresponding pairs of LR and HR images are available and the network is evaluated on its ability to estimate an HR image given its LR counterpart. Most of the available deep paired SISR networks are trained under a setting where LR images are generated by downsampling HR images (from datasets such as Set5, Set14, DIV2K [1], BSD100 [2], etc.) with a known kernel (often bicubic). These networks are trained using either a pixel-wise Mean Squared Error (MSE) loss e.g. [13, 21, 26], an \(L_1\) loss e.g. [36], a Charbonnier loss e.g. [22], or a combination of pixel-wise \(L_1\) loss, perceptual loss [20] and adversarial loss [16] e.g. [10, 23, 29]. Even though these networks perform well in terms of PSNR and SSIM, and the GAN-based ones produce images that are highly realistic, they often fail when applied to real images with unseen degradations such as realistic noise and blur. To address this, the RealSR dataset [6] was introduced in the NTIRE 2019 Challenge [5], containing images captured at two different focal lengths of a camera. Networks like [14, 15, 19] were trained on this dataset and are therefore robust to real degradations.

On the other hand, in unpaired SISR, only the LR images are available in the dataset. In [35], a CycleGAN [37] was trained to denoise the input image and another one to finetune a pretrained super-resolution network. In [27], a CycleGAN was trained to generate degraded versions of clean images and a super-resolution network was then trained using pairs of synthetically degraded LR and clean HR images.

However, all these networks are meant for natural scenes and not faces in particular. Humans are highly sensitive to even the subtlest changes when it comes to human faces, making the task of perceptually super-resolving human faces a challenging and interesting one.

2.2 Deep Face SISR

General SR networks, such as the ones mentioned above, often produce undesired artifacts when applied to faces. Hence, paired face SR networks often rely on face-specific prior information to subdue the artifacts and make the network focus on important features.

Networks like [3, 10, 32, 34] rely on facial landmarks and heatmaps to impose additional constraints on the output, whereas [12] leverages HR exemplars to produce high-quality HR outputs. On the other hand, networks like [9, 18] rely on pairs of LR and HR face images to perceptually super-resolve faces. Even though the above methods are somewhat robust to noise and occlusion, they are not well equipped to handle noise that is as complex and diverse as that in real images. [30, 33] leverage capsule networks and transformative autoencoders to class-specifically super-resolve noisy faces, but only for synthetic noise. As yet, there seems to be no available dataset with paired examples of degraded LR and clean HR face images. As a result, in recent years, there has been a shift in face SISR methods from paired to unpaired. Recently, with the release of the Widerface [31] dataset of real low-resolution faces and the wide availability of high-resolution face recognition datasets such as AFLW [28], VGGFace2 [7] and CelebAMask-HQ [24], Bulat et al. [4] proposed a training strategy where a High-to-Low GAN is trained to convert clean HR face images to corresponding degraded LR images and a Low-to-High GAN is then trained using synthetically degraded LR images and their clean HR counterparts. This method is highly effective since it does not require facial landmarks or heatmaps (which are not available for real face images captured in the wild).

However, despite producing sharp outputs, it is not very robust, as different outputs are obtained for different degradations in the LR images. In order to explicitly impose robustness, we introduce a smooth feature extractor module to extract similar features from a degraded LR image and its clean LR counterpart. This enables us to obtain features that are more representative of the actual face in the image and significantly less affected by the degradations in the input.

2.3 Robust Feature Learning

Our work builds on existing methods in robust feature learning. Haoliang et al. [25] extract robust features from multiple datasets of similar semantic content by minimizing the Maximum Mean Discrepancy (MMD) between features extracted from these datasets. Cemgil et al. [8] achieve robustness by forcing the Entropy Regularized Wasserstein Distance between features extracted from clean images and their noisy counterparts to be low. None of these works handles super-resolution, where aggressive compression by an autoencoder may hurt reconstruction quality. We propose a method for incorporating robust feature learning into super-resolution without requiring any face-specific prior information.

3 Proposed Method

3.1 Motivation

Super-resolution networks which are meant to be used on real facial images need to satisfy two criteria: (i) they need to be robust under real degradations, and (ii) they should preserve the identity and pose of a face. Deep state-of-the-art super-resolution networks usually derive the LR images by bicubically downsampling HR images. Hence, an SR network trained on such LR-HR pairs fails to meet the first criterion. On the other hand, SR networks trained with real degradations fail to satisfy the second criterion. Noting that human face recognition ability does not change significantly even under considerable image degradation, it should be possible to find features that remain invariant under significant degradation and train a super-resolution network that relies only on these features. Now, features which are robust to degradations would also be smooth under the said degradations. So, by enforcing explicit smoothness constraints on the extracted features, we can ensure robustness.

3.2 Overall Pipeline

We have a clean High-Resolution dataset \(Y_c\) and a degraded Low-Resolution dataset \(X_d\). We obtain the clean Low-Resolution dataset \(X_c\), corresponding to \(Y_c\), by downsampling every image in \(Y_c\) with a bicubic downsampling kernel. So every \(x_c\) in \(X_c\) is a downsampled version of some \(y_c\) in \(Y_c\), obtained using the equation

$$\begin{aligned} x_c = (y_c *k)_{\downarrow s} \end{aligned}$$
(1)

where k is the bicubic downsampling kernel and s is the scale factor. Following [4], we train a Degradation GAN \(G_d\) to convert clean samples from \(X_c\) to look as if they have been drawn from the degraded LR dataset \(X_d\). We call this synthetically degraded LR dataset \(\widehat{X_d}\) and its samples \(\widehat{x_d}\). So,

$$\begin{aligned} \widehat{x_d} = G_d(x_c,z) \in \widehat{X_d} \quad \forall \quad x_c \in X_c \end{aligned}$$
(2)

where \(z \in Z\) is an additional vector input which is sampled from a distribution Z to capture the one-to-many relation between HR and degraded LR images.

Our network comprises two modules: (i) a Feature Extractor Module (f) and (ii) a Super-Resolution Module (g). During training, we first sample an \(x_c\) from \(X_c\) and generate one of its degraded counterparts \(\widehat{x_d} = G_d(x_c,z)\) using \(G_d\). We then combine these two LR images with a mixing coefficient \(\alpha \)

$$\begin{aligned} x_{in} = \alpha x_c + (1-\alpha ) \widehat{x_d} \end{aligned}$$
(3)

where \(0<\alpha <1\). We then put \(x_{in}\) through the convolutional feature extractor f and the SR module g to estimate the corresponding clean HR output \(\widehat{y_c}\), and perform a backpropagation.

$$\begin{aligned} h_{in} = f(x_{in}), \quad \widehat{y_c} = g(h_{in}) \end{aligned}$$
(4)

To ensure smoothness of f under real degradations, we extract features \(h_c\) and \(h_d\) from \(x_c\) and \(\widehat{x_d}\)

$$\begin{aligned} h_c = f(x_c) \quad h_d = f(\widehat{x_d}) \end{aligned}$$
(5)

and minimize the Entropy Regularized Wasserstein Distance (Sinkhorn distance) between \((h_c,h_{in})\) and \((h_d,h_{in})\) through another backpropagation. We recalculate \(h_{in}\) during this operation as well. Figure 1 shows a schematic diagram of our approach.

Fig. 1. The proposed approach.

If we set \(\alpha =0\), the entire network would, during the first backpropagation, be trained only on pairs of synthetically degraded LR and clean HR samples, and it may end up learning a mapping that fails to preserve the identity of a face. On the other hand, if we set \(\alpha =1\), the network may develop a preference for the domain of clean LR images. So, we need an input LR image which is not as sharp as \(x_c\) but not as degraded as \(\widehat{x_d}\) either. Since the edges in \(x_c\) are much sharper than those in \(\widehat{x_d}\), \(x_{in}\) continues to appear reasonably clean even when \(\alpha < 0.5\). This is why we do not sample \(\alpha \) from a distribution (which might give one domain an advantage over the other) and instead keep it fixed at 0.3, since \(\alpha =0.3\) visually appears to strike the right balance between the two LR domains.

Also, using \(0<\alpha <1\) enables us to apply the smoothness constraint between \((h_c, h_{in})\) and \((h_d, h_{in})\), which is a better way to ensure smoothness than imposing the constraint directly on pairs \((h_c, h_d)\).
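To make the two-backpropagation scheme concrete, below is a minimal PyTorch-style sketch of one training iteration. The handles f, g, G_d, the sinkhorn helper, the sr_criterion wrapper (folding the pixel, perceptual and adversarial terms of Sect. 3.4 together) and the two optimizers are hypothetical names introduced only for illustration; the Sinkhorn weights follow the \(\lambda _c, \lambda _d\) values reported in Sect. 4.1.

```python
import torch

ALPHA = 0.3  # fixed mixing coefficient between clean and degraded LR (Sect. 3.2)

def train_step(x_c, y_c, f, g, G_d, sinkhorn, sr_criterion, opt_fg, opt_f):
    # Synthetically degrade the clean LR image with the Degradation GAN
    z = torch.randn(x_c.size(0), 16, device=x_c.device)
    with torch.no_grad():
        x_d = G_d(x_c, z)

    # First backpropagation: super-resolve the interpolated LR image
    x_in = ALPHA * x_c + (1 - ALPHA) * x_d       # Eq. (3)
    y_hat = g(f(x_in))                           # Eq. (4)
    loss_sr = sr_criterion(y_hat, y_c)           # pixel + perceptual + adversarial terms
    opt_fg.zero_grad()
    loss_sr.backward()
    opt_fg.step()                                # updates both f and g

    # Second backpropagation: enforce smoothness of f under degradations
    h_c, h_d = f(x_c), f(x_d)                    # Eq. (5)
    h_in = f(x_in)                               # recomputed, as noted above
    loss_robust = 0.3 * sinkhorn(h_c, h_in) + 0.7 * sinkhorn(h_d, h_in)
    opt_f.zero_grad()
    loss_robust.backward()
    opt_f.step()                                 # updates only f
    return loss_sr.item(), loss_robust.item()
```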

3.3 Modeling Degradations with Degradation GAN

Owing to the complex and diverse nature of real degradations, it is extremely difficult to mathematically model them by hand. So, following previous works [4, 27], we train a GAN (termed Degradation GAN) to model real degradations.

Generator. Our Degradation GAN Generator \(G_d\), shown in Fig. 3, has 3 downsampling blocks, each consisting of a ResNet block followed by a \(3 \times 3\) convolutional layer with \(stride=2\), and 3 upsampling blocks, each comprising a ResNet block followed by a Nearest Neighbour Upsampling layer and a \(3 \times 3\) convolutional block with \(stride=1\). The downsampling and upsampling paths are connected through skip connections. All the ResNet blocks used in the Generator follow the structure described in Fig. 2. Our Generator takes a bicubically downsampled image \(x_c\) and an n-dimensional random vector z sampled from a normal distribution. We expand each of the n dimensions of the random vector into a channel of size \(H \times W\) (filled with a single value), where H and W are the height and width of every image. We concatenate the expanded volume with the image and feed it to the generator.
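As a small illustration of this input construction, the snippet below expands the n-dimensional noise vector into constant channels and concatenates it with the LR image; the function name and tensor shapes are assumptions made here for clarity.

```python
import torch

def expand_and_concat(x_c: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # x_c: (B, 3, H, W) clean LR image, z: (B, n) noise vector
    b, _, h, w = x_c.shape
    z_maps = z.view(b, -1, 1, 1).expand(b, z.size(1), h, w)  # each z_i becomes an H x W constant channel
    return torch.cat([x_c, z_maps], dim=1)                   # (B, 3 + n, H, W) generator input

# Example: a (4, 3, 16, 16) LR batch and z ~ N(0, I) of size (4, 16)
# yield a (4, 19, 16, 16) input volume for G_d.
```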

Fig. 2. ResNet block used in the Degradation GAN Generator.

Fig. 3. The overall architecture of the Degradation GAN Generator \(G_d\).

Critic. We use the same discriminator as in [23]. Since we train the Degradation GAN as a Wasserstein GAN [17], we replace the Batch Normalization layers with Group Normalization and remove the last Sigmoid layer. Following the Wasserstein GAN nomenclature, we call it a critic instead of a discriminator.

Loss Functions. We train the degradation GAN as a Wasserstein GAN with Gradient Penalty (WGAN-GP) [17]. So, the critic is trained by minimizing the following loss function:

$$\begin{aligned} L_{D} = \mathbb {E}_{(x_c,z) \sim (X_c,Z)}[D(G_d(x_c,z))] - \mathbb {E}_{x_d \sim X_d}[D(x_d)] + \lambda _{GP} \, \mathbb {E}_{\widetilde{x} \sim \mathbb {P}_{\widetilde{x}}}\big [(\Vert \nabla _{\widetilde{x}} D(\widetilde{x})\Vert _2 - 1)^2\big ] \end{aligned}$$
(6)

where D denotes the critic and, as in [17], the first two terms form the original critic loss while the last term is the gradient penalty with weight \(\lambda _{GP}\), computed on samples \(\widetilde{x} \sim \mathbb {P}_{\widetilde{x}}\) interpolated between real and generated degraded images.

To maintain the correspondence between the inputs and outputs of the generator, we add a Mean Squared Error (MSE) loss term to the WGAN loss in the generator objective \(L_G\):

$$\begin{aligned} L_{G} = \lambda _{WGAN} L_{WGAN} + \lambda _{MSE} L_{MSE} \end{aligned}$$
(7)

where,

$$\begin{aligned} L_{WGAN} = -\mathbb {E}_{(x_c,z) \sim (X_c,Z)}[D(G_d(x_c,z))] \end{aligned}$$
(8)

and

$$\begin{aligned} L_{MSE} = \Vert x_c - G_d(x_c,z)\Vert ^2 \end{aligned}$$
(9)
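Putting Eqs. (7)-(9) together, a hedged sketch of the generator objective might look as follows; D stands for the critic, and the default weights are the ones reported in Sect. 4.1.

```python
import torch.nn.functional as F

def degradation_generator_loss(G_d, D, x_c, z, lambda_wgan=0.05, lambda_mse=1.0):
    x_fake = G_d(x_c, z)
    l_wgan = -D(x_fake).mean()        # Eq. (8): adversarial (WGAN) term
    l_mse = F.mse_loss(x_fake, x_c)   # Eq. (9): keeps input-output correspondence
    return lambda_wgan * l_wgan + lambda_mse * l_mse   # Eq. (7)
```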

3.4 Super-Resolution Using Smooth Features

The main objective of our work is to design a robust SR network whose performance does not deteriorate under real degradations. Our network has two modules: (a) a fully-convolutional feature extractor f and (b) a fully-convolutional SR module g. We achieve robustness by making the feature extractor smooth under degradations and making the SR module g rely solely on the features extracted by f. In [8], Cemgil et al. proposed a method to enforce robustness on the representations learnt by Variational Autoencoders (VAEs). They trained a VAE to reconstruct clean images and minimized the Entropy Regularized Wasserstein Distance between representations derived from a clean image and its noisy version.

There were three challenges in applying this method to Super-Resolution:

  1. Autoencoders compress an input down to its most important components and ignore information like occlusion, background objects, etc. For accurate reconstruction of an HR image from its LR counterpart, it is important to preserve this information. Hence, we cannot perform a rigorous dimensionality reduction. On the other hand, if we decide to keep the dimensionality intact, it will make it harder to achieve robustness since there are too many distractors. So it is important to choose a reduction factor that will achieve the best trade-off between reconstruction and robustness.

  2. They train their network for synthetic noise. However, real degradation involves signal-dependent noise, blur and a variety of other artifacts. So, we need a mechanism to realistically degrade images.

  3. As we show in the supplementary material, despite the smoothness constraint and despite the network being reasonably robust, naively applying their method to SR still leaves a gap between its performance on clean and degraded images. So, we need a better training strategy.

To address (1), we try a number of different dimensionality reduction factors \((1\times , 4\times , 16\times )\) for the features extracted by f and observe that \(4\times \) dimensionality reduction attains the best trade-off. To address (2), we train a Degradation GAN to realistically degrade clean images. To address (3), we interpolate between a clean image \((x_c)\) and one of its synthetically degraded counterparts \((G_d(x_c,z))\) using a mixing coefficient \(\alpha \) as shown in Eq. 3. We call this \(x_{in}\).

Feature Extractor f: Our feature extractor consists of 4 Residual Channel Attention (RCA) downsampling and 2 upsampling blocks. As shown in Fig. 4, there are 2 skip connections. It is a fully convolutional module which takes an LR image of dimension \(3 \times 16 \times 16\) at the input and produces a feature volume of dimension \(64 \times 4 \times 4\). In Fig. 4, ‘RCA, n64’ denotes an RCA block with 64 output channels and ‘Conv3x3, s2 p1 n64’ denotes a \(3 \times 3\) convolutional layer with \(stride=2\), \(padding=1\) and 64 output channels.

Fig. 4. Feature extractor f.
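The RCA blocks follow the residual channel attention design popularized by RCAN; the sketch below shows one plausible implementation, with the kernel sizes and the channel-reduction ratio being assumptions rather than the exact configuration used in f.

```python
import torch.nn as nn

class RCABlock(nn.Module):
    """Residual block with channel attention (a sketch; the exact layout may differ)."""
    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Channel attention: global pooling followed by a bottleneck gating branch
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        res = self.body(x)
        return x + res * self.attention(res)   # residual connection with attended features
```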

Super-Resolution Module g: Our Super-Resolution module consists of 6 upsampling blocks and 2 DenseBlocks, as shown in Fig. 5. The upsampling blocks comprise a Pixel-Shuffle layer, a convolution layer, a Batch-Normalization layer and a PReLU layer. The DenseBlocks contain a number of Residual Channel Attention (RCA) blocks and Residual Channel Attention Back-Projection (RCABP) blocks connected in a dense fashion as in [19]. In Fig. 5, ‘Pixel Shuffle (2)’ denotes a \(2\times \) pixel-shuffle upsampling layer and ‘RCABP, n64’ stands for an RCABP block with 64 feature maps at the output.

Fig. 5. Architecture of SR module g and DenseBlock.

During one forward pass, we pass a minibatch of \(x_{in}\) through our feature extractor f to produce the feature volume \(h_{in}\). We put \(h_{in}\) through our Super-Resolution module g to produce a high-resolution estimate \(\widehat{y_c}\) and perform a backpropagation through both g and f. This ensures that the features are useful for SR. Since \(x_{in}\) is neither as clean as \(x_c\) nor as severely degraded as \(\widehat{x_d}\), the possibility of our SR network being biased towards either of the domains is eliminated.

After the first backpropagation, we put one minibatch each of \(x_c, \widehat{x_d}\) and \(x_{in}\) (again) through f, as shown in Eq. 5, and calculate the Sinkhorn Distance [11] (which computes the Entropy Regularized Wasserstein Distance) between \((h_c, h_{in})\) and \((h_d, h_{in})\),

$$\begin{aligned} L_c = Sinkhorn(h_c, h_{in}), \quad L_d = Sinkhorn(h_d, h_{in}) \end{aligned}$$
(10)

Using a combination of \(L_c\) and \(L_d\) as a loss function, we backpropagate through f one more time to enforce smoothness under degradations.
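For reference, a self-contained sketch of the Sinkhorn distance between two minibatches of feature volumes is given below. It treats each sample's flattened features as a point with uniform mass and runs plain Sinkhorn-Knopp iterations; the regularization strength and iteration count are assumptions, and in practice an off-the-shelf implementation of [11] can be used instead.

```python
import torch

def sinkhorn_distance(h_a, h_b, eps=0.1, n_iters=50):
    a = h_a.flatten(1)                          # (B, D) feature vectors
    b = h_b.flatten(1)
    cost = torch.cdist(a, b, p=2) ** 2          # (B, B) pairwise squared distances
    n = a.size(0)
    mu = torch.full((n,), 1.0 / n, device=a.device)   # uniform marginals
    nu = torch.full((n,), 1.0 / n, device=a.device)
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                    # Sinkhorn-Knopp scaling iterations
        u = mu / (K @ (nu / (K.t() @ u)))
    v = nu / (K.t() @ u)
    transport = u.unsqueeze(1) * K * v.unsqueeze(0)   # entropic optimal transport plan
    return (transport * cost).sum()             # regularized transport cost
```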

Like our Degradation GAN, we train our robust super-resolution network (during the first backpropagation) as a Wasserstein GAN. So, the objective function here is a combination of an adversarial loss \((L_{adv})\), a pixel-level \(L_1\) loss \((L_p)\) and a perceptual loss [20] \((L_f)\) computed between features extracted from the estimated \((\widehat{y_c})\) and ground-truth \((y_c)\) HR images through a subset of the VGG16 network. Hence, the overall objective function optimized during the first backpropagation is

$$\begin{aligned} L_{sr} = \lambda _p L_p + \lambda _f L_f + \lambda _{adv} L_{adv} \end{aligned}$$
(11)

where,

$$\begin{aligned} L_p&= \Vert y_c - \widehat{y_c}\Vert _1 \end{aligned}$$
(12)
$$\begin{aligned} L_f&= \Vert f_{vgg}(y_c) - f_{vgg}(\widehat{y_c})\Vert _1 \end{aligned}$$
(13)
$$\begin{aligned} L_{adv}&= -\mathbb {E}_{x_{in} \sim \widehat{\mathbb {P}_x}}[D_{sr}(g(f(x_{in})))] \end{aligned}$$
(14)

with \(f_{vgg}\) being a subset of the VGG16 network, \(\widehat{\mathbb {P}_x}\) being the distribution described by \(x_{in}\) and \(D_{sr}\) being the critic comparing the generated HR images with the ground-truth HR images. The architecture of \(D_{sr}\) is the same as that of the Degradation GAN critic, and it is trained with the following loss function:

$$\begin{aligned} L_{D_{sr}} = \mathbb {E}_{\widehat{y_c} \sim \mathbb {P}_y}[D_{sr}(\widehat{y_c})] - \mathbb {E}_{y_c \sim Y_c}[D_{sr}(y_c)] + \lambda _{GP} \, \mathbb {E}_{\widetilde{y} \sim \mathbb {P}_{\widehat{y}}}\big [(\Vert \nabla _{\widetilde{y}} D_{sr}(\widetilde{y})\Vert _2 - 1)^2\big ] \end{aligned}$$
(15)

where \(\mathbb {P}_y\) is the distribution of the outputs of our network and \(\mathbb {P}_{\widehat{y}}\) is the distribution of samples interpolated between \(\widehat{y_c}\) and \(y_c\), on which the gradient penalty is computed.
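A condensed sketch of the generator-side objective in Eqs. (11)-(14) is shown below. The VGG16 layer cut-off, the use of mean-reduced \(L_1\) norms and the default weights (taken from Sect. 4.1) are assumptions made for illustration; d_sr stands for the critic \(D_{sr}\).

```python
import torchvision

# Frozen VGG16 feature extractor for the perceptual loss (layer cut-off assumed)
vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad = False

def sr_objective(y_hat, y_c, d_sr, lp=1.0, lf=0.5, ladv=0.05):
    l_pixel = (y_c - y_hat).abs().mean()            # Eq. (12): pixel-wise L1 loss
    l_feat = (vgg(y_c) - vgg(y_hat)).abs().mean()   # Eq. (13): perceptual loss
    l_adv = -d_sr(y_hat).mean()                     # Eq. (14): WGAN adversarial term
    return lp * l_pixel + lf * l_feat + ladv * l_adv  # Eq. (11)
```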

For the second back propagation, we optimize a combination of the Sinkhorn Distances mentioned earlier

$$\begin{aligned} L_{robust} = \lambda _c L_c + \lambda _d L_d \end{aligned}$$
(16)

Since the second backpropagation is only through f, it does not directly affect the mapping learnt by g and only makes f smooth under degradations.

4 Experiments

4.1 Training Details

We use the two time-scale update rule for both our Degradation GAN and our robust super-resolution network. For both D and \(D_{sr}\), we start with a learning rate of \(4 \times 10^{-4}\) and decrease it by a factor of 0.5 after every 10000 iterations. For all the other networks \((G_d, f, g)\), we set the initial learning rate to \(10^{-4}\) and decay it by a factor of 0.5 after every 10000 iterations.

For all networks, we use the Adam optimizer with \(\beta _1 = 0.0\) and \(\beta _2=0.9\). For every 5 updates of the critics, we update the corresponding generator networks once. We tried a number of different values of \(\lambda \), and the ones that worked best for us are \([\lambda _{WGAN} = 0.05 , \lambda _{MSE} = 1 , \lambda _p = 1 , \lambda _f = 0.5 , \lambda _{adv} = 0.05 , \lambda _c = 0.3, \lambda _d = 0.7]\). For \(G_d\), we sample z from a 16-dimensional multivariate normal distribution with zero mean and unit standard deviation.
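For concreteness, the optimizer and schedule settings above translate roughly into the following setup; the function and module names are placeholders.

```python
import itertools
import torch

def make_optimizers(f, g, G_d, D, D_sr):
    critic_opts = [torch.optim.Adam(m.parameters(), lr=4e-4, betas=(0.0, 0.9))
                   for m in (D, D_sr)]
    gen_opts = [torch.optim.Adam(itertools.chain(f.parameters(), g.parameters()),
                                 lr=1e-4, betas=(0.0, 0.9)),
                torch.optim.Adam(G_d.parameters(), lr=1e-4, betas=(0.0, 0.9))]
    # Halve every learning rate after every 10000 iterations
    schedulers = [torch.optim.lr_scheduler.StepLR(o, step_size=10000, gamma=0.5)
                  for o in critic_opts + gen_opts]
    return critic_opts, gen_opts, schedulers
# Each critic is updated 5 times for every update of its corresponding generator.
```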

4.2 Datasets

We train our network for \(4\times \) super-resolution (\(s=4\)); however, our robustness strategy is not scale dependent. For training, we use two datasets: one with degraded images and the other with clean images. To build the degraded image dataset, we randomly sample 153446 images from the Widerface [31] dataset, which contains face images with a wide range of degradations such as varying degrees of noise, extreme poses and expressions, occlusions, skew and non-uniform blur. We use 138446 of these images for training and 15000 for testing. While compiling the clean dataset, to make sure it is diverse enough in terms of poses, occlusions, skin colours and expressions, we combine the entire AFLW [28] dataset with 60000 images from the CelebAMask-HQ [24] dataset and 100000 images from the VGGFace2 [7] dataset. To obtain clean LR images, we simply downsample images from the clean dataset.

Fig. 6. Comparison of results.

Table 1. Comparison of PSNR/SSIM on the Bicubically-Degraded Dataset

4.3 Results

To assess the accuracy as well as the robustness of our work, we test our network on three different datasets: (i) a Bicubically-Degraded Dataset, (ii) a Synthetically-Degraded Dataset and (iii) a Real-Degraded Dataset.

  1. Bicubically-Degraded Dataset: To compile this dataset, we randomly sample 4000 HR images from the clean facial recognition datasets mentioned above. We bicubically downsample them to obtain paired LR-HR images. Evaluation on this dataset tells us about the reconstruction accuracy of our SR network.

    As shown in Fig. 6a, ESRGAN [29] performs best on this dataset since it was trained on bicubically downsampled images. Interestingly, the results of [4] appear to differ a little from the HR ground truth in terms of identity; we observe this in all our experiments. Also, their outputs contain a lot of undesired artifacts. Our outputs are faithful to the ground-truth HR and contain fewer artifacts, although our method performs slightly worse in terms of PSNR and SSIM, as shown in Table 1. However, since we focus primarily on the robustness part of the problem, the strength of our network becomes evident in the evaluation of robustness.

  2. Synthetically-Degraded Dataset: This dataset contains the same HR images as the Bicubically-Degraded Dataset, but we obtain 5 different synthetically degraded LR versions of each HR image using our Degradation GAN.

    We perform two tests on this dataset:

    • Robustness Test: Here, for each HR image, we put all 5 degraded LR images through our SR network. This test shows us how the output changes for different degradations. The similarity between the outputs tells us how robust our network is to realistic degradations, which is the focus of our work.

      As shown in Fig. 7, these images are extremely degraded. ESRGAN [29] gives the worst performance on this dataset. [4] produces slightly different-looking faces for different degradations. Our method, however, produces outputs that look similar across all these degradations. This shows that our network is robust to realistic degradations.

    • Smoothness Test: This experiment enables us to visualise the smoothness of our feature extractor (f). Here, we combine every degraded LR image in the dataset with its bicubically downsampled counterpart using 5 different values of \(\alpha _{i}\), namely [0.0, 0.2, 0.4, 0.8, 1.0], to create a set of 5 different images \((\{x_{mix}\})\). Since \(\alpha \) is the coefficient we use to mix clean and corresponding degraded images, by gradually varying \(\alpha \) from 0 to 1 and noting the output, we get an idea of how adept our network is at maintaining its output as we gradually move from a clean image, through increasingly degraded images, to one of its realistically degraded versions. If our network manages to maintain its output without altering its overall appearance (changing pose, identity, etc.), it would mean that the learnt features are smooth and robust to degradations.

      $$\begin{aligned} x^i_{mix} = \alpha _i x_c + (1-\alpha _i) \widehat{x_d} \end{aligned}$$
      (17)

      Figure 8 shows a comparison of the output of our network with those of [4] and [29]. The outputs of ESRGAN [29] become increasingly worse as \(\alpha \) decreases. The outputs of [4] change significantly as \(\alpha \) goes from 0 to 1, sometimes even producing different faces. The output of our network does not undergo any visually significant changes. This establishes that the features learnt by the feature extractor are smooth under realistic degradation. Figure 7 shows that ESRGAN [29] consistently performs worse in terms of robustness than the other two methods. This is expected since it was trained with bicubically downsampled LR images only. The behavior of [4] is interesting: in Fig. 7(a), (b) and (g), it generates additional facial components that are unrelated to the content of the input. The performance of our network, as shown in (f) and (g), drops a little when the input is heavily degraded, but the recognizable features do not change much.

  3. Real-Degraded Dataset: This dataset contains 15000 images from the Widerface dataset. Performance on this dataset indicates how effective our method is at super-resolving real degraded facial images.

    As shown in Fig. 6b, our method is able to super-resolve real degraded faces. The outputs of [4] contain undesired artifacts and sometimes exhibit identity discrepancies as well. ESRGAN [29] is able to maintain the identity, but its outputs are not sharp. Since we do not have ground-truth HR images for these LR images, we cannot compute PSNR/SSIM. So, we use the Fréchet Inception Distance (FID) as a metric to assess how close the output is to the target distribution of sharp images. Table 2 shows the FIDs of [4, 29] and our method computed over 15000 images. A lower FID denotes better adherence to the target distribution and hence sharper outputs. As shown in Table 2, our method performs very close to [4] in terms of the realism of the output while, at the same time, maintaining a consistent output under varying degradations. So, our method is robust and, at the same time, effective on real degraded faces.

Fig. 7. Comparison of robustness.

Fig. 8. Visualizing smoothness for \(\alpha = [0, 0.2, 0.4, 0.8, 1.0]\).

Table 2. Comparison of FID.

5 Conclusion

We propose a robust super-resolution network that gives consistent outputs under a wide range of degradations. We train a feature extractor that is able to extract similar features from both bicubically downsampled images and their corresponding realistically degraded counterparts. We perform a robustness test to put our claim of robustness to the test and a smoothness test to visualize the variation in extracted features as we gradually move from a clean to a degraded LR image. There is still room to improve our network for better performance in terms of PSNR/SSIM; we will attempt to address this in future work.