1 Introduction

Single image super-resolution aims to recover a high-resolution (HR) image (Fig. 1(a)) from a given low-resolution (LR) image (Fig. 1(b)), and plays a key role in image enhancement [8, 12, 29]. Although numerous image super-resolution approaches have been proposed recently [4, 9,10,11, 25], their performance remains unsatisfactory in practice. This is because much of the high-frequency information in the high-resolution image is lost during degradation caused by extreme illumination conditions, motion blur, etc. This motivates us to develop a robust super-resolution approach that specifically recovers high-frequency information and enhances the visual quality of super-resolved static images.

Based on the type of optimization loss, image super-resolution methods can be roughly divided into distortion-based [11, 22, 31] and perceptual-based [9, 10, 21, 25] approaches. Specifically, distortion-based methods aim to generate high-PSNR images, typically by minimizing the pixel-wise discrepancy between the super-resolved images and the ground-truth images. One major issue with these methods is that the pixel-level reconstruction loss tends to produce blurred textures, since it ignores high-frequency details (Fig. 1(c)). To address this issue, perceptual-based methods have been proposed to improve the visual quality of super-resolved images. For example, Wang et al. [25] developed ESRGAN, a generative model that uses a perceptual loss and an adversarial loss to improve perceptual quality. However, these methods tend to generate fake textures and unnatural artifacts when recovering super-resolved images (Fig. 1(d),(e)). The underlying reason is that the discriminator tends to produce biased supervision signals during optimization and thus hardly captures texture details accurately. Moreover, existing loss functions (e.g., the perceptual loss [7] and the pixel-wise MSE loss) are hand-crafted and provide only local perceptual supervision signals.

Fig. 1. Image super-resolution results of different methods. The distortion-based method SRCNN [4] generates blurred textures. The perceptual-based methods, including ESRGAN [25] and TPSR [10], generate unnatural artifacts. Our method recovers visibly sharper boundaries and finer textures than the others.

Besides a well-defined perceptual objective function, making full use of the self-similarity information in the image itself is also effective in improving perceptual quality [18]. For example, Yang et al. [27] proposed to explicitly transfer similar high-frequency features from a given reference image, so that the produced textures are more plausible than conventionally hallucinated ones. However, the performance of this method is semantically sensitive to the reference and degrades seriously when irrelevant reference images are given. Besides, the local features come from fixed-location neighborhoods and thus cannot adapt to spatially relevant textures. To fully exploit the global cues of the input image itself, we introduce an offset learning strategy that incorporates non-local information by utilizing the self-similarity of the inputs. In this way, feature similarity is propagated between non-local pixels to explore high-frequency information (such as edges). In parallel, this reduces the geometric distortions produced by GAN-based methods [9, 25].

In this work, we propose to jointly optimize both procedures: learning an optimal perceptual loss function and searching for a reliable network architecture, which can further improve the perceptual quality of super-resolved images (Fig. 1(f)). To achieve this, we propose a non-local network routing (NNR) method for perceptual image super-resolution. Specifically, we leverage neural architecture search optimized with a reinforcement-learning-based algorithm. To improve the visual quality of the super-resolved images, we develop a learnable reward to learn an optimal perceptual loss for image super-resolution. Moreover, we design an offset learning strategy to adaptively capture spatial boundary information in a non-local manner. Extensive experiments on widely used datasets demonstrate the effectiveness of the proposed method both quantitatively and qualitatively.

2 Related Work

Single Image Super-Resolution. Low-resolution images are affected by many degradation factors during the imaging process, such as motion blur, noise, and downsampling. Shallow single image super-resolution approaches can be roughly divided into two categories: interpolation-based [13, 15] and reconstruction-based [6, 19]. Interpolation-based methods recover high-resolution images with interpolation algorithms such as bicubic interpolation. However, these methods usually suffer from limited accuracy. To address this limitation, reconstruction-based methods adopt prior knowledge to restrict the space of possible solutions, which can restore sharp details. Nevertheless, these methods are usually time-consuming.

Recent years have witnessed deep networks being applied to the nonlinear mapping problem in image super-resolution [4, 9, 16, 18, 25], learning a nonlinear mapping from low-resolution to high-resolution images in an end-to-end manner. Distortion-based methods aim to improve the fidelity of images, typically by minimizing the mean square error between predicted and ground-truth pixels. For example, Dong et al. [4] proposed SRCNN, the first work to apply deep learning to image super-resolution, and Mao et al. [16] proposed an encoder-decoder design to super-resolve images. Although these methods have achieved promising performance, one major issue is that the pixel-wise loss produces smooth images lacking high-frequency details. To address this issue, perceptual-based methods have been proposed to improve visual quality. For example, SRGAN [9] used an adversarial loss to restore high-frequency details for perceptual satisfaction. However, generative models tend to produce geometrically distorted textures. Besides, hand-designed perceptual losses are optimal neither for perceptual evaluation nor for efficient training. To address these problems, our method learns an optimal perceptual objective function.

Neural Architecture Search. Recently, neural architecture search (NAS) [33] has been gradually introduced to many computer vision applications, including coarse-grained tasks such as image classification [20] and object detection [34], and fine-grained tasks such as semantic segmentation [14] and image super-resolution [3, 10]. Auto-DeepLab [14] searches the network-level structure in addition to the cell-level structure, aiming to discover the outer network structure automatically for semantic segmentation. Ulyanov et al. [23] argued that the structure of a network can itself serve as a structured image prior. Building on this, Chen et al. [3] proposed to search for neural architectures that capture stronger structured image priors for image restoration tasks. However, these methods mainly focus on searching for a network architecture while ignoring image visual quality. Lee et al. [10] incorporated NAS into GAN-based image super-resolution to improve perceptual quality while considering computation cost, but this method cannot fully exploit the global cues of the image itself. To do so, we exploit an offset learning strategy based on the self-similarity of images and add the offset operation to the search space when searching for the perceptual super-resolution network. Besides, GAN-based super-resolution methods tend to produce fake textures due to unstable training. Thus, our approach optimizes the perceptual loss function and the perceptual super-resolution network simultaneously.

3 Methodology

In image super-resolution, we aim to restore a high-resolution image, denoted by \({I}^\text {SR}\), from a given low-resolution input, denoted by \({I}^\text {LR}\). As illustrated in Fig. 2, we develop a non-local network routing method. Technically, we leverage reinforcement learning and incorporate neural architecture search into the image super-resolution task. Furthermore, we design a learnable perceptual reward as the loss function to produce optimal supervision signals for efficient training. Besides, we develop a search space containing a spatially adaptive offset operation, which aims to derive a reliable network for perceptual image super-resolution.

Fig. 2. Overall framework of our NNR method. We design a reward (LPIPS) and exploit a neural architecture search algorithm (reinforcement learning with an LSTM controller) over a search space. The offset operation is deployed in the search space to seek a reliable network architecture.

3.1 Non-local Network Routing

Although traditional perceptual-based methods can significantly improve the perceptual quality of super-resolved images, they tend to produce inconsistent artifacts and fake textures. Moreover, the hand-designed perceptual loss function easily falls into local optima and cannot serve as a strong supervision signal for training an optimal super-resolution network. To address this, we introduce NAS into the image super-resolution task. The search algorithm is based on reinforcement learning, with an LSTM as the controller. The action a specifies the generation of a neural network architecture, and the state s is defined by the set of observed network architectures. We design a learnable reward (LPIPS) to jointly optimize both procedures: learning an optimal perceptual loss and routing a reliable super-resolution network architecture denoted by \(\boldsymbol{\omega }\). The LPIPS reward measures image patch similarity in feature space and is defined as follows:

$$\begin{aligned} r(s^t,a^t)= -{l}_\text {LPIPS} \end{aligned}$$
(1)

Specifically, we define the LPIPS function [30] by the following equation:

$$\begin{aligned} {l}_\text {LPIPS}=\sum _{l}\frac{1}{H_lW_l}\sum _{h,w}\Vert w_l \odot (\hat{I}^\text {SR}_{l,hw} - \hat{I}^\text {HR}_{l,hw}) \Vert _2^{2}, \end{aligned}$$
(2)

where \(\hat{I}^\text {SR}_{l}, \hat{I}^\text {HR}_{l} \in \mathbb {R}^{H_l \times W_l \times C_l} \) are the features from the l-th layer of the pre-trained network, and \(w_l \in \mathbb {R}^{C_l}\) scales the channel-wise activations.
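To make Eq. (2) concrete, below is a minimal PyTorch sketch of the LPIPS loss using torchvision's pretrained VGG16. The layer indices are our assumption (the five VGG ReLU stages used by [30]), and the learned channel weights \(w_l\) are fixed to 1 here, so this approximates rather than reproduces the official implementation:

```python
import torch
from torchvision.models import vgg16

# Minimal LPIPS sketch (Eq. (2)). The official implementation learns the
# channel weights w_l; here they are fixed to 1 for illustration, and input
# normalization is omitted for brevity. Assumes torchvision >= 0.13.
class LPIPSLoss(torch.nn.Module):
    def __init__(self, layer_ids=(3, 8, 15, 22, 29)):  # VGG16 ReLU stages
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features.eval()
        self.slices, prev = torch.nn.ModuleList(), 0
        for idx in layer_ids:
            self.slices.append(torch.nn.Sequential(*features[prev:idx + 1]))
            prev = idx + 1
        for p in self.parameters():
            p.requires_grad_(False)

    @staticmethod
    def _unit_norm(feat, eps=1e-8):
        # Unit-normalize along the channel dimension, as in [30].
        return feat / (feat.norm(dim=1, keepdim=True) + eps)

    def forward(self, sr, hr):
        loss, x, y = 0.0, sr, hr
        for block in self.slices:
            x, y = block(x), block(y)
            diff = (self._unit_norm(x) - self._unit_norm(y)) ** 2
            # Sum over channels, average over the H_l x W_l spatial positions.
            loss = loss + diff.sum(dim=1).mean(dim=(1, 2))
        return loss.mean()
```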

The traditional PSNR is a distortion-based metric and is insufficient for estimating the visual quality of images, because the pixel-wise constraint produces over-smoothed results without sufficient high-frequency details. In contrast, LPIPS is a perceptual metric that measures image patch similarity in feature space. As shown in [30], this perceptual similarity measure is more consistent with human judgment than PSNR. Therefore, we use LPIPS as the perceptual reward to learn an optimal perceptual loss.

Aside from learning an optimal perceptual loss function, we introduce an offset learning strategy to fully exploit the global cues of the input image itself, since non-local feature representations are also effective in improving the perceptual quality of super-resolved images. We explore boundary information through the self-similarity of images; the captured high-frequency information, such as spatial textures and edges, reinforces the visual quality, and the boundary information further resolves geometric distortion. The offset strategy can be written as follows:

$$\begin{aligned} y(p_0) = \mathop { {\sum }}\limits _{p_n\in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n), \end{aligned}$$
(3)

where x is the input feature map, y is the output feature map, and \(p_n\) enumerates the locations in a regular grid R. \(\left\{ \Delta p_n \mid n=1,\dots ,N \right\} \) are the learnable offsets, and sampling is performed at the offset locations \(p_n + \Delta p_n\).
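In practice, Eq. (3) corresponds to a deformable convolution [32]. A minimal sketch using torchvision.ops.deform_conv2d follows; the module name OffsetOp, the channel count, and the kernel size are illustrative assumptions:

```python
import torch
from torchvision.ops import deform_conv2d

# Sketch of the offset operation in Eq. (3) via deformable convolution [32].
class OffsetOp(torch.nn.Module):
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Predict 2 offsets (dy, dx) for each of the k*k sampling locations.
        self.offset_pred = torch.nn.Conv2d(
            channels, 2 * kernel_size * kernel_size, kernel_size, padding=pad)
        self.weight = torch.nn.Parameter(
            torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)
        self.pad = pad

    def forward(self, x):
        offsets = self.offset_pred(x)  # learned offsets, from the input itself
        return deform_conv2d(x, offsets, self.weight, padding=self.pad)

feat = torch.randn(1, 64, 48, 48)
out = OffsetOp()(feat)                 # same spatial size: (1, 64, 48, 48)
```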

To automatically search for a network architecture with promising perceptual performance, we plug the offset operation into the search space. The offset operation adaptively learns a set of offsets from the input image itself. Our search space follows the micro-cell approach, where the normal cell can be regarded as a feature extractor. Since we aim to obtain high-frequency feature representations, which are crucial for perceptual image super-resolution, our model focuses on selecting the best architecture for the normal cell. The candidate operations \(Op\_{normal}\) of the normal cell search subspace are as follows:

$$\begin{array}{l} \text {Op}\_\text {normal} = \{ \text {Offset}, \\ \text {Dilated}\ \text {Conv} \left( k, n\right) \ \text {with}\ k=3,5, \\ \text {Separable}\ \text {Conv} \left( k, n\right) \ \text {with}\ k=3,5, \\ \text {Residual}\ \text {Channel}\ \text {Attention}\ \text {Block}\ (\text {RCAB}), \\ \text {Identity} \} \end{array}$$

For the normal cell, the search space is composed of the offset operation [32] and several other commonly used candidate operations, including 3 \(\times \) 3 and 5 \(\times \) 5 dilated convolutions, 3 \(\times \) 3 and 5 \(\times \) 5 separable convolutions, the residual channel attention block [31], and the skip connection.

The upsampling cell is used to recover images at a higher spatial resolution. We develop a search subspace with several upsampling operations. The candidate operations \(Op\_{upsampling}\) of the upsampling search subspace can be expressed as follows (a sketch of both subspaces as operation registries is given after the list):

$$\begin{array}{l} \text {Op}\_\text {upsampling} = \{ \text {Pixel}\ \text {Shuffle}\ \text {Layer}, \\ \text {Deconvolution}\ \text {Layer}, \\ \text {Nearest{-}neighbor}\ \text {Interpolation}, \\ \text {Bilinear}\ \text {Interpolation} \} \end{array}$$
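As a concrete illustration, both subspaces can be organized as registries mapping operation names to constructors. All names and hyperparameters below are hypothetical stand-ins (OffsetOp is the earlier sketch; the RCAB here is a minimal version in the spirit of [31]):

```python
import torch.nn as nn

class RCAB(nn.Module):
    """Minimal residual channel attention block, in the spirit of [31]."""
    def __init__(self, n, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, 3, padding=1))
        self.att = nn.Sequential(                      # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(n, n // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n // reduction, n, 1), nn.Sigmoid())

    def forward(self, x):
        res = self.body(x)
        return x + res * self.att(res)

# Hypothetical registries; each entry maps an op name to a constructor
# taking the channel count n.
OP_NORMAL = {
    "offset":       lambda n: OffsetOp(n),             # sketched above
    "dil_conv_3x3": lambda n: nn.Conv2d(n, n, 3, padding=2, dilation=2),
    "dil_conv_5x5": lambda n: nn.Conv2d(n, n, 5, padding=4, dilation=2),
    "sep_conv_3x3": lambda n: nn.Sequential(
        nn.Conv2d(n, n, 3, padding=1, groups=n), nn.Conv2d(n, n, 1)),
    "sep_conv_5x5": lambda n: nn.Sequential(
        nn.Conv2d(n, n, 5, padding=2, groups=n), nn.Conv2d(n, n, 1)),
    "rcab":         lambda n: RCAB(n),
    "identity":     lambda n: nn.Identity(),
}

OP_UPSAMPLING = {
    "pixel_shuffle": lambda n: nn.Sequential(
        nn.Conv2d(n, 4 * n, 3, padding=1), nn.PixelShuffle(2)),
    "deconv":        lambda n: nn.ConvTranspose2d(n, n, 4, stride=2, padding=1),
    "nearest":       lambda n: nn.Upsample(scale_factor=2, mode="nearest"),
    "bilinear":      lambda n: nn.Upsample(scale_factor=2, mode="bilinear",
                                           align_corners=False),
}
```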
Algorithm 1. The training procedure of our NNR (described in Sect. 3.2).

3.2 Model Learning

For our optimization, we use a small number of epochs and a larger batch size as the proxy task. In detail, we first leverage the proxy task to search for the optimal architecture. We then exploit a weight-sharing strategy, using the weights at step t to initialize the model at step t + 1. We evaluate each searched network architecture by computing the LPIPS reward between the ground-truth image and the super-resolved image. With this learnable perceptual reward, we apply the policy gradient [26] to train the LSTM controller. Based on the learned policy \(\pi \)(\(\cdot \)), we obtain the best-performing network architecture and the optimal reward loss function simultaneously. Finally, we retrain the best-performing super-resolution network from scratch on the full task, again using the LPIPS loss. Previous work [2] noted that constraining the network with perceptual quality alone may produce undesirable artifacts. Hence, we incorporate an \(\ell _1\) loss into our final optimization. The overall loss for training the searched network architecture can be expressed as follows:

$$\begin{aligned} {l}_\text {total}=\alpha \frac{1}{N}\sum _{i=1}^{N} \Vert {I}^\text {SR}_i-{I}^\text {HR}_i \Vert _1+\beta \, {l}_\text {LPIPS}, \end{aligned}$$
(4)

where \(\alpha \) and \(\beta \) are trade-off weights and N is the total number of images. We set \(\alpha \)=0.8 and \(\beta \)=0.2 in our model.
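A minimal sketch of Eq. (4), assuming the LPIPSLoss module sketched in Sect. 3.1 and the paper's weights \(\alpha \)=0.8, \(\beta \)=0.2:

```python
import torch

l1 = torch.nn.L1Loss()
lpips_loss = LPIPSLoss()   # sketched in Sect. 3.1
alpha, beta = 0.8, 0.2

def total_loss(sr, hr):
    # Eq. (4): weighted sum of the pixel-wise l1 term and the LPIPS term.
    return alpha * l1(sr, hr) + beta * lpips_loss(sr, hr)
```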

Algorithm 1 details the training procedure of our NNR.
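Since Algorithm 1 is not reproduced here, the following sketch outlines the search loop as described above: the LSTM controller samples one operation per decision, the sampled architecture is trained briefly on the proxy task, and the negative LPIPS serves as the REINFORCE reward, with the entropy regularization mentioned in Sect. 4.2. The helpers build_and_train and eval_lpips are hypothetical, as are all hyperparameter values:

```python
import torch

class Controller(torch.nn.Module):
    """LSTM controller: emits one operation index per architecture decision."""
    def __init__(self, num_ops, num_decisions, hidden=64):
        super().__init__()
        self.embed = torch.nn.Embedding(num_ops, hidden)
        self.lstm = torch.nn.LSTMCell(hidden, hidden)
        self.head = torch.nn.Linear(hidden, num_ops)
        self.num_decisions, self.hidden = num_decisions, hidden

    def sample(self):
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        inp = torch.zeros(1, self.hidden)
        actions, log_probs, entropies = [], [], []
        for _ in range(self.num_decisions):
            h, c = self.lstm(inp, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            a = dist.sample()
            actions.append(a.item())
            log_probs.append(dist.log_prob(a))
            entropies.append(dist.entropy())
            inp = self.embed(a)
        return actions, torch.stack(log_probs).sum(), torch.stack(entropies).sum()

controller = Controller(num_ops=7, num_decisions=8)   # 7 ops in Op_normal
ctrl_opt = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline, ent_coef = 0.0, 1e-3                        # illustrative values

for step in range(300):                               # 300 search epochs (Sect. 4.2)
    arch, log_prob, entropy = controller.sample()
    # build_and_train / eval_lpips are hypothetical helpers: assemble the cells
    # from `arch`, train briefly on the proxy task with weight sharing, and
    # return the validation LPIPS of the resulting network.
    model = build_and_train(arch)
    reward = -eval_lpips(model)                       # Eq. (1): r = -l_LPIPS
    baseline = 0.9 * baseline + 0.1 * reward          # moving-average baseline
    loss = -(reward - baseline) * log_prob - ent_coef * entropy
    ctrl_opt.zero_grad(); loss.backward(); ctrl_opt.step()
```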

4 Experiments

4.1 Evaluation Dataset and Metric

In our experiments, we used the DIV2K dataset as the training data, and the commonly used SR benchmarks Set5 [1], Set14 [28], BSD100 [17], and Urban100 [5] as the testing datasets. All experiments were performed with a scale factor of 4\(\times \) between low-resolution and high-resolution images. For data augmentation, we applied random horizontal flips, vertical flips, and rotations.

We evaluated the trained model using the learned perceptual image patch similarity (LPIPS) and the peak signal-to-noise ratio (PSNR). LPIPS measures the perceptual quality of the super-resolved images, where a lower LPIPS value indicates better visual quality. PSNR is a distortion-based measure that focuses on the fidelity of images: the higher the PSNR value, the smaller the image distortion. Following the standard settings in [10], we evaluated PSNR on the Y channel and LPIPS on the RGB image.
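For reference, a minimal sketch of the PSNR computation on the Y channel, assuming uint8-range RGB arrays and the ITU-R BT.601 luma coefficients commonly used in SR evaluation:

```python
import numpy as np

def psnr_y(sr_rgb, hr_rgb):
    """PSNR on the Y (luma) channel; inputs are HxWx3 arrays in [0, 255]."""
    coeffs = np.array([65.481, 128.553, 24.966]) / 255.0  # BT.601 luma
    y_sr = sr_rgb.astype(np.float64) @ coeffs + 16.0
    y_hr = hr_rgb.astype(np.float64) @ coeffs + 16.0
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```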

4.2 Implementation Details

Our model was built on the popular deep learning toolbox PyTorch. We conducted all experiments on an NVIDIA Tesla V100 GPU, with 300 epochs for searching network architectures and 300 epochs for training the networks. The batch size was set to 16. We used the ADAM optimizer for searching and SGD for training. Moreover, we used entropy regularization on the controller's sampling distribution for robust and fast convergence.

4.3 Derived Architecture

Figure 3 shows the normal cell and the upsampling cell searched by our method. As the figure shows, each cell contains four intermediate nodes, and every node takes two operations applied to previous nodes. In each cell, the nodes represent feature maps and the edges represent the searched operations. The cell structure selection is driven by our proposed reward, and the reported cells achieve the highest reward over the optimization iterations (a minimal sketch of such a cell is given below).
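To make the cell structure concrete, here is a minimal sketch of a cell with four intermediate nodes, each summing two searched operations applied to earlier nodes; the genotype format and the final aggregation are our assumptions, and OP_NORMAL is the registry sketched in Sect. 3.1:

```python
import torch

class Cell(torch.nn.Module):
    """Four intermediate nodes; each sums two ops applied to earlier nodes.

    `genotype` is a list of four ((src1, op1), (src2, op2)) tuples, where src
    indexes previously computed nodes (0 = cell input). The format and the
    final aggregation (a plain sum) are illustrative assumptions.
    """
    def __init__(self, genotype, channels):
        super().__init__()
        self.genotype = genotype
        self.ops = torch.nn.ModuleList(
            torch.nn.ModuleList([OP_NORMAL[op1](channels),
                                 OP_NORMAL[op2](channels)])
            for (_, op1), (_, op2) in genotype)

    def forward(self, x):
        states = [x]
        for ((s1, _), (s2, _)), (op_a, op_b) in zip(self.genotype, self.ops):
            states.append(op_a(states[s1]) + op_b(states[s2]))
        return sum(states[1:])   # aggregate the intermediate nodes
```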

Table 1. Quantitative results of our method on different datasets, including Set5, Set14, BSD100, and Urban100. Note that higher is better for PSNR and lower is better for LPIPS. Our method achieves compelling performance, especially under the perceptual LPIPS metric.
Fig. 3. Resulting routing of our NNR.

4.4 Comparison with State-of-the-Art Methods

Quantitative Comparison. We compared our approach with a range of state-of-the-art perceptual-driven super-resolution methods. In Table 1, we report PSNR and LPIPS on Set5, Set14, BSD100, and Urban100 under this evaluation setting. From the results, we draw two conclusions: (1) Our learnable perceptual reward captures sufficient high-frequency details, which improves the visual quality of the super-resolved images. We also use the perceptual reward to route a reliable super-resolution network. In this manner, our method provides optimal supervision for texture recovery and trains the super-resolution network efficiently. In contrast, the traditional perceptual loss is hand-crafted and may be too local to capture high-frequency information, while hand-designed super-resolution architectures are redundant for recovering visual quality. (2) Benefiting from the offset operation, our method captures non-local similar feature representations and further improves visual quality. Since the Urban100 dataset contains many images of building structures, our NNR method generalizes better on it, with a gain of 0.015 in LPIPS. Methods such as NatSR [21] and TPSR [10], which rely on local features, fail to exploit global information, so their LPIPS performance is poorer than ours. In addition, our NNR achieves competitive performance on PSNR and better performance on LPIPS, i.e., on the perceptual dimension.

Besides, we performed an ablation study to validate the effectiveness of the different components of our NNR method. Specifically, we used only PSNR as the reward to search the super-resolution network, yielding the Baseline model. We first validated the importance of the learnable perceptual reward by using only the LPIPS reward to constrain network training (the w/ Reward model). We then added the offset operation to the search space on top of the w/ Reward model, which further validates the effectiveness of the offset learning strategy. Table 1 presents the quantitative comparison of our ablation study.

Fig. 4. Visualized results of our NNR versus conventional super-resolution approaches.

We see that w/ Reward improves significantly over the Baseline model, which demonstrates the effectiveness of the learnable perceptual reward. The reason is that the optimal loss function efficiently guides the routing of the optimal super-resolution network. Furthermore, our full NNR method shows additional improvements over w/ Reward, especially on LPIPS: the offset operation adaptively captures relevant features from complex images, especially spatial textures and edges. In summary, the results confirm the effectiveness of integrating both the learnable perceptual reward and the offset learning strategy.

Qualitative Comparison. Finally, we compared our NNR method with other methods qualitatively. Traditional perceptual-based super-resolution methods produce inconsistent fake textures due to biased supervision signals. As shown in Fig. 4, our results are more realistic than the others. For example, on the stone statue's head, the compared methods produce unpleasant artifacts, while our method generates sharper textures. This is mainly due to the optimal perceptual loss and the reliable super-resolution network architecture. Besides, the offset learning strategy captures non-local edge information, which reduces geometric distortion and enhances the discriminability of boundary and texture information.

5 Conclusion

In this paper, we have proposed a non-local network routing (NNR) method for perceptual image super-resolution. We have designed a learnable reward to select a reliable super-resolution network architecture together with an offset learning strategy. Quantitative and qualitative results have shown the effectiveness of our NNR. Exploring differential convolutions in the frequency domain with NAS is a promising direction for future work.