1 Introduction

Recently, the task of single image super-resolution (SISR) has taken an interesting turn. Convolutional neural network (CNN) based models have not only been shown to reduce distortions on full-reference (FR) metrics, e.g., PSNR, SSIM and IFC [3,4,5,6,7,8], but also to produce perceptually better images [4, 9]. Models trained specifically to reduce distortions fail at producing visually compelling results. They suffer from the issue of “regression-to-the-mean”, as they mainly rely on minimizing the mean square error (MSE) between a high resolution image \(I_{HR}\) and an estimated image \(I_{est}\) approximated from its low resolution counterpart \(I_{LR}\). Minimizing the MSE suppresses high frequency details in \(I_{est}\), yielding blurred and over-smoothed images. FR metrics therefore do not conform with the human perception of visual quality, as illustrated in [10, 11] and mathematically analyzed in [12].

The newly proposed methods [4, 9, 13] made substantial progress in improving the perceptual quality of images by building on generative adversarial networks (GANs) [14]. The adversarial setting of a generator and a discriminator network helps the generator hallucinate high frequency textures into the resulting images. Since the goal of the generator is to fool the discriminator, it may hallucinate fake textures which are not entirely faithful to the input image. This fake texture generation can be clearly observed in 8\(\times \) super-resolved images. This behavior of GANs can be reduced using a combination of content-preserving losses, but doing so not only limits the ability of the generator to induce high quality textures, it also makes the generator fall short in reproducing image details in regions with complex and irregular patterns such as tree leaves and rocks (Fig. 1).

Fig. 1. Visual comparison of recent state-of-the-art methods, as measured by distortion and perceptual quality metrics, with our texture based super-resolution network (TSRN) for 4\(\times \) SISR.

In the present paper we show that, in the task of SISR, perceptually high quality textures can be synthesized on the estimated images \(I_{est}\) using the Gram matrix based texture loss [1]. This loss was first employed by Gatys et al. to transfer realistic textures from a style image (\(I_s\)) to a content image (\(I_c\)). Despite the success of this method, the utility of texture transfer for enhancing natural images has not been studied extensively. This is because, while preserving the local spatial information of textures, the texture loss discards the global spatial arrangement of the content image, which makes semantic guidance of texture transfer a difficult problem.

We explore the effectiveness of Gram matrices in transferring and hallucinating realistic texture in the task of SISR. We show that, despite its simplicity in using a single loss function, our proposed network yields favorable results compared to state-of-the-art models that employ a mixture of loss functions and involve GANs, which are notoriously difficult to train. In contrast, our model converges without the need for hand-tuned training schemes. We build on this finding by providing external semantic guidance to control the texture transfer. We show that this scheme prevents the random spread of small features across object boundaries, thereby improving the visual quality of results, especially in the challenging task of 8\(\times \) SISR. Furthermore, we demonstrate that Gram matrices of deep features perform surprisingly well in measuring human-perceived similarity between image patches.

2 Related Work

Super Resolution. Single image super-resolution (SISR) is the problem of approximating a high resolution image \(I_{HR}\) from its corresponding low resolution input image \(I_{LR}\). The task is to fill in the missing information in \(I_{HR}\), which involves the reconstruction and hallucination of textures, edges and low-level image statistics while remaining faithful to the low resolution input \(I_{LR}\). It is an under-determined inverse problem, and different image priors have been explored to guide the upsampling of \(I_{LR}\) [15,16,17]. Some of the earliest methods involve simple interpolation schemes [18], e.g. bicubic or Lanczos. Due to their simplicity and fast inference, these methods have been widely used; however, they suffer from blurriness and cannot predict high frequency details.

Much success has been achieved by recent data-driven approaches, where a large number of training examples are used to set the prior over the empirical distribution of the data. These learning based methods, which try to learn a mapping from \(I_{LR}\) to \(I_{HR}\), can be classified into parametric and non-parametric methods [19]. Non-parametric algorithms include neighborhood embedding algorithms [20,21,22,23], which seek the nearest match in an available database and synthesize an image by blending different patches. Prone to mismatches and misalignment between patches, these methods suffer from rendering artifacts in the HR output [24]. Parametric methods include sparse models [17], regression functions [8] and convolutional neural networks (CNNs). Dong et al. [7] first employed a shallow CNN to perform SISR on a bicubically interpolated image and obtained impressive results; [25] subsequently used a deep residual network. These CNN based methods use the mean square error (MSE) as an optimization objective, which leads to blurriness and fails to reconstruct high frequency details. Methods like [3, 4] try to overcome this issue by minimizing perceptual losses in feature space. Ledig et al. [4] proposed SRResNet to show improvements on full-reference (FR) metrics. Follow-up work used a multi-scale optimized SRResNet architecture to win the NTIRE 2017 SISR Challenge [26] for 4\(\times \) super-resolution. Moreover, [6] uses a coarse-to-fine Laplacian pyramid framework to achieve state-of-the-art results in 8\(\times \) super-resolution with respect to FR metrics.

More recently, GAN-based methods [4, 9, 13] showed promising results by drastically improving the perceptual quality of images. In addition to the perceptual and adversarial losses used by [4], the patch-wise texture loss used by [9] helps synthesize high quality textures. Our approach differs from [9] in that we give up the adversarial and perceptual loss terms. Moreover, we do not use a patch-wise texture loss and show that a globally applied texture loss is sufficient for spatially aligning textures and generating photo-realistic, high quality images. [27] also used patches and manually derived segmentation masks to constrain the texture synthesis in \(I_{est}\). However, it relies heavily on a slow patch matching algorithm and is thus prone to incorrectly matching regions of \(I_{est}\) and \(I_{HR}\), which renders artifacts. The texture loss has also been shown to be an important ingredient of a recent image inpainting method [28]. A recently introduced deep-features based contextual loss [29], which is conceptually similar to the texture loss, is used by [30] to maintain the natural image statistics of \(I_{est}\). More recently, the perceptual image enhancement challenge (PIRM) [31] took a major step toward promoting perceptual enhancement of images.

2.1 Neural Texture Transfer

The concept of neural texture transfer was introduced by Gatys et al. [1]. The method relies on matching Gram matrices of VGG-19 [32] features to transfer the texture of one image to another. Subsequently, much work has been done to improve the speed [3, 33] and quality [34, 35] of style transfer using feed-forward networks and perceptual losses. Building on fast style transfer, [36, 37] proposed models to transfer textures from multiple style images. [35] showed improvements in style transfer by computing cross-layer Gram matrices instead of within-layer Gram matrices. Recently, Li et al. [38] showed that matching Gram matrices for style transfer is equivalent to minimizing MMD with a second order polynomial kernel. In addition to improving the style transfer mechanism, some work has been done to spatially constrain the texture transfer in order to maintain the textural integrity of different regions [39, 40]. Gatys et al. [40] demonstrated spatial control of texture transfer using guided Gram matrices, where binary masks are used as guidance channels to constrain the textures. A similar scheme was used by [34] to constrain style transfer. Instead of enforcing spatial guidance in the feature space of deep networks as these methods do, we enforce it in pixel space via a customized texture loss which, unlike other methods, not only scales easily to multiple style images but also does not require semantic details at test time.

Our main contributions are as follows:

  • We provide a better understanding of the texture constraining mechanism of the texture loss and show that SISR of high perceptual quality can be achieved by using it as the objective function. The results compare well with GAN-based methods on 4\(\times \) SISR and outperform them on 8\(\times \) SISR.

  • Unlike GAN-based methods, our method is easily reproducible and generates faithful textures, especially in the constrained domain of facial images.

  • To further enhance the quality of 8\(\times \) SISR results, we formulate a novel semantically guided texture transfer scheme that avoids the intermixing of inter-class textures such as grass and sky. The method scales easily to multiple style images and does not require semantic details at test time.

  • We also show that Gram matrices provide a better and richer framework for capturing the perceptual quality of images. Using them, off-the-shelf deep classification networks (without any additional training) perform as well as the best performing (tuned and calibrated) LPIPS metrics [2].

3 Texture Loss

The texture transfer loss was first proposed in the context of neural style transfer [1], where both the style image \(I_{s}\) and the content image \(I_{c}\) are mapped into feature space using a VGG-19 architecture [32], pre-trained for image classification on ImageNet. The feature maps of \(I_{s}\) and \(I_{c}\) are denoted by \(F^l \in \mathbb {R}^{N_l \times M_l}\) and \(P^l \in \mathbb {R}^{N_l \times M_l}\) respectively, where \(N_l\) is the number of feature maps in layer l and \(M_l\) is the product of the height and width of the feature maps in layer l, i.e. \(M_l = height \times width\). A Gram matrix is the inner product of vectorized feature maps; the Gram matrices of \(F^l\) and \(P^l\) are therefore computed as \(G_{i,j}^l = \mathbf {F}_{i}^{T} \mathbf {F}_{j}\) and \(A_{i,j}^l = \mathbf {P}_{i}^{T} \mathbf {P}_{j}\). The texture loss \(\mathcal {L}_{texture}\) is defined as the mean squared error between the feature correlations expressed by these Gram matrices:

$$\begin{aligned} \mathcal {L}_{texture} = \frac{1}{4N_l^2M_l^2} \sum _{i=1}^{N_l} \sum _{j=1}^{N_l} (G_{i,j}^l - A_{i,j}^l)^2 \end{aligned}$$
(1)

The loss matches the global statistics of \(I_{c}\) with those of \(I_{s}\), captured by the correlations between feature responses in layer l of the VGG-19. These correlations capture the local spatial information in the feature maps while discarding their global spatial arrangement [41].
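To make the loss concrete, the following is a minimal PyTorch sketch of Eq. 1 for a single layer; the function names and the batched tensor layout are illustrative choices of ours, not the exact code used in our experiments.

```python
import torch

def gram_matrix(feats):
    # feats: (B, N_l, H, W) feature maps of one VGG-19 layer.
    b, n, h, w = feats.shape
    f = feats.view(b, n, h * w)              # vectorized feature maps, (B, N_l, M_l)
    return torch.bmm(f, f.transpose(1, 2))   # inner products, (B, N_l, N_l)

def texture_loss(feats_est, feats_hr):
    # Eq. 1: squared error between Gram matrices, normalized by 4 * N_l^2 * M_l^2.
    b, n, h, w = feats_est.shape
    m = h * w
    g = gram_matrix(feats_hr)    # G^l, computed from the target (style) features
    a = gram_matrix(feats_est)   # A^l, computed from the estimated image's features
    return ((g - a) ** 2).sum(dim=(1, 2)).mean() / (4.0 * n ** 2 * m ** 2)
```

In training, this per-layer term is summed over the selected VGG-19 layers (see Sect. 5).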

3.1 Constraining Texture Transfer

The above loss matches the global-level statistics of \(I_{s}\) and \(I_{c}\) without retaining the spatial arrangement of the content image. However, we observe that if a good feature-space correspondence exists between \(I_{s}\) and \(I_{c}\), then the Gram matrices alone constrain the texture transfer such that it preserves the semantic details of the content image. The composition of Gram matrices exploits the translation invariance of the pre-trained VGG-19 [32] convolutional kernels to map the textures correctly. We shed more light on this texture constraining mechanism and its translation-invariant mapping in the appendix. Thus Gram matrices provide stable spatial control such that the texture from \(I_{s}\) maps to the corresponding features of \(I_{c}\). Figure 2 shows texture transfer onto a non-texture image for different initial approximates of \(I_{c}\), using the iterative optimization approach of [1]. The second column depicts the results of vanilla style transfer [1] on a plain white image, a 4\(\times \) upsampled image and an 8\(\times \) upsampled image, respectively. For the plain white image, the texture is transferred in an uncontrolled fashion; this is a known phenomenon in image style transfer. However, the texture transfer on the 4\(\times \) and 8\(\times \) upsampled images shows consistency in texture mapping, i.e. texture from \(I_{s}\) is mapped to the correct corresponding regions of \(I_{c}\). We observe that the interpolated approximates \(I_{est}\) of \(I_{LR}\) are good enough for establishing feature-space correspondences and thus mapping the textures correctly.

Fig. 2. (a) \(I_{HR}\) (in insets) and a plain white, a 4\(\times \) and an 8\(\times \) upsampled version of \(I_{HR}\) as \(I_{c}\). (b) Vanilla neural texture transfer [1]. (c) Neural texture transfer with semantic guidance.

In Fig. 2, one can observe that the texture transfer for the 4\(\times \) interpolated image is much better than that for the 8\(\times \) one. The ambiguity in texture transfer for an 8\(\times \) upsampled \(I_{LR}\) is due to the absence of enough content features to establish correspondences. To better guide the texture transfer in 8\(\times \) SISR, we therefore devise an external semantic guidance scheme. The third column in Fig. 2 shows the effectiveness of the semantically guided texture transfer: in comparison to the second column, the texture is transferred in a more coherent fashion.

3.2 Texture Loss in SISR

In SISR we try to find a mapping between a low resolution input image \(I_{LR}\) and a high resolution output image \(I_{HR}\). As a function approximator we use a deep CNN. While recent state-of-the-art methods use a combination of various loss functions, our texture super-resolution network (TSRN) is trained specifically to optimize \(\mathcal {L}_{texture}\) in Eq. 1, which yields images of perceptually high quality for 4\(\times \) and 8\(\times \) super-resolution (Figs. 5 and 6).

3.3 SISR via Semantically Constrained Textures

To make full use of texture loss based image super-resolution, we also perform externally controlled semantic texture transfer, enforcing semantic details via the loss function. For the implementation of the semantic control of texture transfer, we use the ground truth segmentation masks provided by the recently released MS-COCO stuff dataset [42].

Additional spatial control is provided by making use of the semantic information present in an image. Instead of matching the global-level statistics of an image, we divide the image semantically into r segments. Each segment exhibits its own local-level statistics, which differ from those of the other segments of the same image. This allows us to match local-level statistics at the level of individual segments. It also helps preserve the global spatial arrangement of the segments, as the relative spatial position of each segment is retained before extracting it from the image.

Fig. 3. Scheme for semantically controlled texture transfer.

Our method draws inspiration from the spatial control of texture transfer based on guided Gram matrices (GGMs) [40], where binary segmentation masks define which region of a style image is mapped to which region of a content image. That method uses r segmentation masks \(I_{seg}^r\) to compute guidance channels \((\mathbf T _l^r)\) for each layer l of a CNN, either by down-sampling the masks to match the dimensions of each layer's feature maps or, for better results, by enforcing spatial guidance only on neurons whose receptive fields lie inside the guidance region. The guidance channels are then used to form spatially guided feature maps by element-wise multiplication of the texture image features with the guidance channels. Computing GGMs in this way is not feasible for training a deep architecture, especially in our case where we have multiple segmentation masks for each image. We propose a simplification of this process by removing the need for guidance channels \((\mathbf T _l^r)\) and the explicit computation of spatially guided feature maps altogether. The r binary segmentation masks \(I_{seg}^r\) (having pixel value 1 for the class of interest and 0 elsewhere), where each mask represents a different categorical region of an image, are element-wise multiplied with the texture image \(I_{HR}\) and the estimated image \(I_{est}\) to yield \(I_{target}^r\) and \(I_{est}^r\) respectively (Fig. 3).

$$\begin{aligned} I_{target}^r = I_{HR} \circ I_{seg}^r \end{aligned}$$
(2)
$$\begin{aligned} I_{est}^r = I_{est} \circ I_{seg}^r \end{aligned}$$
(3)

These segmented images are then propagated through the VGG-19, and the Gram matrices of their feature maps are computed in the usual fashion. The method is a flexible and relatively fast way to enforce spatial guidance of texture transfer, especially when it has to be used for training a deep architecture. The texture loss is then computed individually for all segmented images. Equation 4 gives the objective function of the complete semantically controlled texture transfer. See the abstract for the effectiveness of our proposed semantically controlled fast style transfer.

$$\begin{aligned} \mathcal {L}_{texture} = \sum _{k=1}^{r} \frac{1}{4N_l^2M_l^2} \sum _{i=1}^{N_l} \sum _{j=1}^{N_l} (G_{i,j}^l(I_{target}^k) - A_{i,j}^l(I_{est}^k))^2 \end{aligned}$$
(4)
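A hedged sketch of Eqs. 2–4 follows: each binary mask is applied in pixel space to both \(I_{HR}\) and \(I_{est}\), and the texture loss of Eq. 1 is accumulated over the segments. Here `texture_loss` is the helper sketched in Sect. 3, and `feat_extractor` stands for an assumed callable returning the list of VGG-19 feature maps used for the loss.

```python
def semantic_texture_loss(i_est, i_hr, seg_masks, feat_extractor):
    # i_est, i_hr: (B, 3, H, W) images; seg_masks: (B, r, H, W) binary masks.
    loss = 0.0
    for k in range(seg_masks.shape[1]):
        mask = seg_masks[:, k:k + 1]          # (B, 1, H, W), broadcast over RGB
        target_k = i_hr * mask                # Eq. 2: I_target^k
        est_k = i_est * mask                  # Eq. 3: I_est^k
        # Eq. 4: accumulate the per-layer texture loss (Eq. 1) for this segment.
        for f_hr, f_est in zip(feat_extractor(target_k), feat_extractor(est_k)):
            loss = loss + texture_loss(f_est, f_hr)
    return loss
```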

4 Architecture

For the implementation of TSRN, we employ a fully convolutional neural network architecture inspired by [9]. The architecture is efficient at inference time, as it performs most feed-forward computations on \(I_{LR}\), and it is deep enough to perform texture synthesis. The presence of residual blocks facilitates convergence during training. Similarly to [9], we add a bicubically upsampled version of \(I_{LR}\) to the predicted output, so that the network only has to learn the residual image. This helps reduce color shifts during training, as also reported by [9]. However, instead of nearest neighbor up-sampling, we use a pixel resampling layer [43] because of its recent success in generative networks [44]. The method is also shown to be agnostic to the model's depth. See the appendix for more details.
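The following is a minimal sketch of the computation just described (residual blocks operating on \(I_{LR}\), pixel-shuffle upsampling [43], and a bicubic skip connection); the layer counts and channel widths are arbitrary illustrative choices rather than the exact TSRN configuration, which is given in the appendix.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class TSRNSketch(nn.Module):
    # Illustrative widths/depths only; not the exact architecture of the paper.
    def __init__(self, scale=4, ch=64, n_blocks=10):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(n_blocks)])
        ups = []
        for _ in range(int(scale).bit_length() - 1):        # 4x -> two 2x stages
            ups += [nn.Conv2d(ch, 4 * ch, 3, padding=1),    # pixel resampling [43]
                    nn.PixelShuffle(2), nn.ReLU(inplace=True)]
        self.upsample = nn.Sequential(*ups)
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x_lr):
        residual = self.tail(self.upsample(self.blocks(self.head(x_lr))))
        # Bicubic skip connection: the network only learns the residual image.
        base = F.interpolate(x_lr, scale_factor=self.scale,
                             mode='bicubic', align_corners=False)
        return base + residual
```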

5 Implementation

We train our network on MS-COCO [42], where we center-crop image patches of 256 \(\times \) 256 pixels. The patches are then bicubically down-sampled 4\(\times \) or 8\(\times \) to 64 \(\times \) 64 or 32 \(\times \) 32 pixels, respectively. We first pre-train our network by minimizing the mean square error (MSE) for 10 epochs. We found this pre-training beneficial for the subsequent Gram matrix based optimization, as it facilitates the detection of relevant features for texture transfer. After pre-training, we train our model using only Eq. 1 as the objective function for another 100 epochs. We found that the network converges after approximately 60 epochs.

For the implementation of \(\mathcal {L}_{texture}\), we compute Gram matrices on layers conv2_2, conv3_4, conv4_4 and conv5_2 of a pre-trained VGG-19 architecture. To justify the selection of these VGG-19 layers, we provide a qualitative and quantitative (LPIPS) analysis on the SunHays dataset in Fig. 4. We consider the convolutional layers before each pooling layer, except conv1_2: this layer contains more pixel-level and less structural information and causes artifacts and over-smoothing in images. Selecting only higher layers tends to generate checkerboard artifacts. In Fig. 4, all networks are trained using the same architecture and procedure described in this paper for 100 epochs.

The network is trained with a learning rate of 0.0005 using Adam as the optimizer. We use the PyTorch framework [45] to implement the model on an Nvidia Tesla P40 GPU. Inference for 4\(\times \) and 8\(\times \) SISR takes approximately 41 and 32 ms for a 1 megapixel image and 0.203 and 0.158 s for a 5 megapixel image on the GPU, respectively.
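A sketch of extracting these activations with torchvision's pre-trained VGG-19 is shown below; the numeric indices are assumed to correspond to conv2_2, conv3_4, conv4_4 and conv5_2 in torchvision's layer ordering and should be verified against the installed version, and whether pre- or post-ReLU activations are used is left open here.

```python
import torch
import torchvision

class VGGFeatures(torch.nn.Module):
    # Assumed indices of conv2_2, conv3_4, conv4_4, conv5_2 in vgg19().features.
    LAYERS = (7, 16, 25, 30)

    def __init__(self):
        super().__init__()
        features = torchvision.models.vgg19(pretrained=True).features
        self.vgg = features[:max(self.LAYERS) + 1].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)          # the loss network stays fixed

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.LAYERS:
                feats.append(x)              # conv output at the selected layer
        return feats
```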

Fig. 4. Layer and loss ablation study on the SunHays dataset [24]. Each column shows the effect of different VGG-19 [32] layers on the visual quality of a restored image. Perceptual loss using deep features (F) generates blurred images (left-most column) in comparison to Gram matrix (G) based restoration. The last row shows the mean LPIPS score on the dataset (lower is better).

For our results on segmentation based super-resolution (TSRN-S), we pre-train on the MS-COCO dataset before training on the MS-COCO stuff dataset with Eq. 4 as the objective function. The stuff dataset is particularly suited to our task, as it contains segmentation masks not only of object instances but also of outdoor regions like grass, sky and buildings. Statistically, such regions cover more than 60% [46] of images showing natural scenes. To reduce the computation time, we consider the binary segmentation masks of only the six maximally represented classes in each image (based on their pixel count), while a seventh mask covers an ‘others’ class containing the remaining regions of the image. If there are fewer than six classes in an image, the ‘others’ mask is replicated so that each image has seven masks, as sketched below.
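A sketch of how the seven masks could be assembled from a per-pixel label map is given below; the function and variable names are illustrative, and the MS-COCO stuff annotation is assumed here to be a (H, W) integer class map.

```python
import numpy as np

def build_masks(seg_map, n_keep=6):
    # seg_map: (H, W) integer class labels. Returns (n_keep + 1, H, W) binary
    # masks: the n_keep maximally represented classes plus an 'others' mask.
    classes, counts = np.unique(seg_map, return_counts=True)
    keep = classes[np.argsort(counts)[::-1][:n_keep]]
    masks = [(seg_map == c).astype(np.float32) for c in keep]
    others = (~np.isin(seg_map, keep)).astype(np.float32)
    masks.append(others)
    while len(masks) < n_keep + 1:           # fewer classes: replicate 'others'
        masks.append(others)
    return np.stack(masks)
```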

6 Experimental Results

We evaluate both of our proposed models: one with globally computed Gram matrices (TSRN-G) and one with semantically guided Gram matrices (TSRN-S).

Table 1. Top-1 and Top-5 image recognition accuracy on 4\(\times \) SISR images
Table 2. Top-1 and Top-5 image recognition accuracy on 8\(\times \) SISR images

6.1 Quantitative Evaluation

For quantitative comparison we follow [9] and report performance on object recognition as a proxy for perceived image quality. Additionally, we report numbers for a recently proposed no-reference metric [12] and for the learned full-reference image quality metric [2] that approximates perceptual similarity.

Object Recognition Performance. The perceptual quality of an image correlates well with its performance on object recognition models trained on the large ImageNet corpus, as corroborated by [9]. Recently, the same methodology of assessing image quality has been adopted by a competition (Footnote 1). We therefore compare with other methods using standard image classification models trained on ImageNet. We randomly pick 1000 images from the ILSVRC 2012 validation dataset and super-resolve their downsampled versions using the different super-resolution models. The performance is evaluated by how much recognition accuracy each model retains compared to the baseline accuracy on the original images. Tables 1 and 2 show that our proposed TSRN model outperforms all other state-of-the-art SISR methods for both 4\(\times \) and 8\(\times \) super-resolution.
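The evaluation reduces to comparing a classifier's top-k accuracy on super-resolved images against its accuracy on the original images; a hedged sketch follows, where the choice of ResNet-50 and the batch variable names are illustrative assumptions rather than the exact models used in Tables 1 and 2.

```python
import torch
import torchvision

@torch.no_grad()
def topk_accuracy(classifier, images, labels, k=5):
    # images: (B, 3, 224, 224) ImageNet-normalized batch; labels: (B,) class ids.
    preds = classifier(images).topk(k, dim=1).indices          # (B, k)
    return (preds == labels.unsqueeze(1)).any(dim=1).float().mean().item()

# classifier = torchvision.models.resnet50(pretrained=True).eval()
# retained = topk_accuracy(classifier, sr_images, labels) / \
#            topk_accuracy(classifier, hr_images, labels)
```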

No-reference Image Quality Measure. A no-reference image quality assessment proposed by [12] is based on NIQE [47, 48]. Under this measure, our method obtains a perceptual index of 2.227.

LPIPS. The Learned Perceptual Image Patch Similarity (LPIPS) metric [2] is a recently introduced full-reference image quality metric which tries to measure the perceptual similarity between two images. The metric uses linearly calibrated off-the-shelf deep classification networks to measure the perceptual similarity of images. The networks are calibrated on the very large Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset [2], which contains human perceptual judgments. We use the pre-trained, linearly calibrated AlexNet and SqueezeNet networks (Footnote 2). Since these networks are trained on patches of 64 \(\times \) 64 pixels, we also divide the images into patches of size 64 \(\times \) 64 pixels. For each image, we take its shorter dimension and find the largest value v divisible by 64 that fits, then center-crop the image to a resolution of v \(\times \) v. The cropped image is then divided into patches of size 64 \(\times \) 64, and we report the perceptual similarity averaged over those patches, as sketched below.
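The cropping and tiling step can be sketched as follows (the function name is ours); the resulting 64 \(\times \) 64 patches are then scored with the calibrated LPIPS networks and the scores are averaged per image.

```python
def tile_for_lpips(img, patch=64):
    # img: (C, H, W) tensor. Center-crop to the largest v x v square with v a
    # multiple of `patch`, then split into non-overlapping patch x patch tiles.
    c, h, w = img.shape
    v = (min(h, w) // patch) * patch
    top, left = (h - v) // 2, (w - v) // 2
    crop = img[:, top:top + v, left:left + v]
    tiles = crop.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, n, n, p, p)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)
```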

In Table 3 we use the recommended AlexNet (linear) and SqueezeNet (linear) models to measure perceptual quality. The quantitative evaluations are consistent across numerous models, both those trained to improve PSNR or SSIM scores, such as SRResNet, LapSRN and SRCNN, and those trained to improve perceptual quality, such as SRGAN and ENet-PAT. TSRN consistently achieves better perceptual similarity scores than the other methods (Table 4).

Table 3. Comparison for 4\(\times \) SISR on pre-trained AlexNet-linear and SqueezeNet-linear LPIPS metric [2]. Lower score is better.
Table 4. Comparison for 8\(\times \) SISR on pre-trained AlexNet-linear and SqueezeNet-linear LPIPS Perceptual Similarity Metric models. Lower score is better.

6.2 Visual Comparison

In Figs. 5 and 6 we show visual comparisons with recently proposed state-of-the-art models for both 4\(\times \) and 8\(\times \) super-resolution. Our TSRN model manages to hallucinate realistic textures and image details and compares favorably with the state-of-the-art.

Fig. 5. Visual comparison of recent state-of-the-art methods, based on distortion metrics and perceptual quality, with our texture based 4\(\times \) image super-resolution.

Fig. 6. Visual comparison of recent state-of-the-art methods, based on distortion metrics and perceptual quality, with our texture based 8\(\times \) image super-resolution.

6.3 TSRN-Faces on CelebA Dataset

In addition to training on the MS-COCO dataset [42], we also evaluate our proposed texture based super-resolution method on the CelebA faces dataset [52]. Our method yields visible improvements over other methods. More specifically, we compare with EnhanceNet-PAT [9], which employs a GAN for enhancing textures. We observe that such a method has a tendency to manipulate the overall facial features and thus does not maintain the integrity of the input image. In comparison, our method learns the texture mapping between a low resolution image (\(I_{LR}\)) and its high resolution counterpart (\(I_{HR}\)) and thus generates visually plausible results.

7 Using Texture as a Perceptual Metric

In this section, we propose an improvement to LPIPS [2], a recently proposed perceptual similarity metric based on deep features. The method computes the distance between the deep features of two images in order to determine their perceptual similarity. We argue that Gram matrices, which measure the correlations of these same deep features, provide a richer and better framework for capturing the perceptual representation of images than the features themselves. Therefore, instead of computing the distances between the features of a given convolutional layer, we compute the distance between their Gram matrices. For a pair of reference and distorted patches \((x,x_0)\), we compute their normalized Gram matrices \(\hat{G}^l, \hat{A}^l \in \mathbb {R}^{C_l \times C_l}\), where \(C_l\) is the number of channels in layer l. We compute the distance between them using the same formulation as in Eq. 1 and sum it across all layers l, i.e.

$$\begin{aligned} d(x,x_0) = \sum _{l} \frac{1}{C_l^2} \sum _{i=1}^{C_l} \sum _{j=1}^{C_l} (\hat{G}_{i,j}^l - \hat{A}_{i,j}^l)^2 \end{aligned}$$
(5)

Using the features of “uncalibrated” pre-trained image classification networks, this Gram matrix distance achieves better 2AFC scores on the BAPPS validation dataset than distances based on the features themselves. In Fig. 8, our results (Net-G) are comparable to the “calibrated” LPIPS models (which are specifically trained on the BAPPS training sets) and even outperform them on some benchmarks. For comparison, we adopt the same configuration of three reference models (SqueezeNet [50], AlexNet [51] and VGG-16 [32]) used by [2]. However, to get the best results we changed the number of layers used for the distance computation; specifically, we do not use the feature activations before the first pooling layer or after the penultimate pooling layer of each model. This is because the textures from the lowest layers do not contain any structure, whereas the last layers capture abstract and semantically more meaningful representations but lack the ability to capture perceptual details [41] (Fig. 7 and Table 5).
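A sketch of the distance in Eq. 5 is given below, reusing the gram_matrix helper from Sect. 3. How exactly the Gram matrices are normalized is an assumption here (we divide by the number of spatial positions), and feats_x, feats_x0 stand for the selected layer activations of the off-the-shelf network.

```python
def gram_distance(feats_x, feats_x0):
    # feats_x, feats_x0: lists of (B, C_l, H, W) activations from the selected
    # layers of an uncalibrated, pre-trained classification network.
    d = 0.0
    for fx, fx0 in zip(feats_x, feats_x0):
        b, c, h, w = fx.shape
        g = gram_matrix(fx) / (h * w)        # assumed normalization of G^l
        a = gram_matrix(fx0) / (h * w)       # assumed normalization of A^l
        d = d + ((g - a) ** 2).sum(dim=(1, 2)) / (c ** 2)   # Eq. 5, per layer
    return d                                  # (B,) Gram-based distance per pair
```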

Fig. 7. Visual comparison of different networks trained on the CelebA dataset [52] for 4\(\times \) SISR. TSRN yields results visually faithful to the original input image.

Fig. 8. Quantitative comparison between different methods for determining perceptual similarity on the BAPPS validation dataset [2]. Our Gram matrix based distance (Net-G) scores better than the feature based method (Net-F). Net-G results are comparable to the calibrated *LPIPS metrics, which are specifically trained on the BAPPS training dataset and thus have an advantage.

Table 5. 2AFC scores (higher is better) for different methods using disparity in deep feature representations [2] and in texture representations (ours) on the BAPPS validation dataset. Our texture based scores from networks without perceptual calibration consistently perform better than the feature based scores and are comparable to the *LPIPS metrics, which are specifically trained on the BAPPS training dataset and thus have an advantage over the other untrained methods.

8 Conclusion

Transferring texture by matching Gram matrices has been very successful in image style transfer; however, its utility for natural image enhancement has not been studied extensively. In this work we demonstrate that Gram matrices are very powerful in capturing the perceptual representation of images, which makes them a natural candidate for use in a perceptual similarity metric like LPIPS. Exploiting this ability, we obtain image reconstructions of high perceptual quality for the task of 4\(\times \) and 8\(\times \) single image super-resolution. We further devise a scheme of external semantic guidance for controlling texture transfer, which is particularly helpful for 8\(\times \) super-resolution. Our method is simple, easily reproducible and yet effective. We believe that the texture loss can have far-reaching implications for future research in image restoration.