1 Introduction

Single image super resolution (SISR) is a classic ill-posed problem in the computer vision community that aims at recovering a high resolution (HR) image from only one low resolution (LR) image. High resolution means that the pixel density of an image is higher than that of its LR counterpart, so an HR image can offer more details. Such details are often critical and greatly desired in various applications such as medical imaging [1, 2], aerial spectral imaging [3], remote sensing imaging [4, 5], face recognition [6], and security and surveillance [7].

Fig. 1. The structure of several residual blocks. C, B, R and + represent conv, batch normalization, ReLU and element-wise addition, respectively. (a) The original residual block [18]. (b) SRResNet [14]. (c) EDSR/MDSR [16]. (d) The proposed residual block.

In recent years, many image super resolution (SR) methods based on deep learning techniques [8], especially convolutional neural networks (CNNs) and residual learning, have emerged and greatly advanced the state of the art in SR. Some of the most representative are SRCNN [9], DRCN [10], DRRN [11], VDSR [12], ESPCNN [13], SRResNet [14], LapSRN [15], EDSR/MDSR [16] and RDN [17]. Residual learning [18, 19] is a technique for increasing the depth of networks and thus improving model performance. It was first proposed for image recognition and has been widely shown to ease gradient propagation and model convergence, making it possible to build extremely deep networks. With the increased depth, the expressive power and generalization ability of the models have also improved. Although many methods based on residual learning (e.g., SRResNet [14], EDSR/MDSR [16] and RDN [17]) have achieved much better results than previous methods, further performance gains become more and more expensive as the depth of the network increases. Therefore, it is useful to improve the parameter efficiency of a model when resources are limited.

A key factor of residual learning that affects model training and performance is the residual connection (also called skip connection or shortcut [18]). Previous methods share a common feature in network structure design: residual learning is usually applied to the overall structure of the network or to its building blocks, but not within the information paths of a residual block. Normally, a residual block is composed of a residual path (an identity mapping) and a main path (Fig. 1). In this work, we present a novel multilevel residual learning pattern for SISR, which we term ML-ResNet. In our model, residual connections are applied not only to the outermost layers and the internal residual blocks, but also to the main path within a residual block (Fig. 1(d)). Thus, the whole network exhibits the characteristic of multilevel residual learning.

We evaluate the proposed model on several benchmark datasets and compare it with some common block structures. The experimental results show that the multilevel residual structure yields a consistent performance improvement over the compared methods with the same number of model parameters. Moreover, we empirically show that simply increasing the number of building blocks does not achieve the expected performance gain, which implies that networks of different depths may reach their optimal performance with different building block structures. This observation might shed some light on the structural design of deep networks and building blocks.

2 Related Work

2.1 Super Resolution with Deep Learning

Dong et al. proposed the first CNN-based SR model in the modern sense [9] and built an end-to-end mapping between (bicubic) interpolated LR images and their HR counterparts. Early improvements on this pioneering work mainly aimed at increasing network depth or sharing network weights [10,11,12]. These methods use the interpolated version of the LR image as the input of their model, which is convenient for keeping the size of the output image consistent with the target HR image and works well for fractional scaling factors. However, it hinders establishing an end-to-end mapping from the original LR image to the corresponding HR image and suffers from computational and memory constraints, since the feature maps are processed in the HR image space. This problem can be solved by performing the nonlinear mapping in the LR image space. There are currently two options for this purpose: transposed convolution (or deconvolution [20]) and the efficient sub-pixel convolutional neural network (ESPCNN) [13]. Since the amount of computation and memory occupancy are thereby greatly reduced, Lim et al. [16] aggressively increased the depth and the width (the number of feature map channels) of their networks (32 residual blocks for EDSR and 80 residual blocks for MDSR).
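
As a rough illustration, the following sketch (PyTorch assumed; the channel count, kernel sizes and layer hyperparameters are arbitrary choices, not taken from any of the cited models) contrasts the two LR-space upsampling options for a scale factor of 2.

```python
import torch
import torch.nn as nn

r = 2    # upscaling factor
c = 64   # number of feature channels in LR space (arbitrary here)

# Option 1: transposed convolution ("deconvolution")
deconv_up = nn.ConvTranspose2d(c, c, kernel_size=4, stride=r, padding=1)

# Option 2: sub-pixel convolution (ESPCNN-style): a conv that produces c*r^2
# channels, followed by a pixel-shuffle rearrangement into HR space.
subpixel_up = nn.Sequential(
    nn.Conv2d(c, c * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),
)

x = torch.randn(1, c, 24, 24)   # an LR-space feature map
print(deconv_up(x).shape)       # torch.Size([1, 64, 48, 48])
print(subpixel_up(x).shape)     # torch.Size([1, 64, 48, 48])
```

Both options keep the convolutions in the LR space and only expand the spatial resolution at the end, which is what makes the aggressive depth/width increase of EDSR/MDSR affordable.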

Although these networks have made great breakthroughs in improving SR results, their performance gains are mainly achieved by increasing network depth and adjusting the structure of the entire network. Changes in the structure of residual blocks also, to a certain extent, aim at increasing the network depth. In contrast, the goal of this work is to promote the information flow through the entire network and improve the efficiency of the model parameters.

2.2 Residual Learning for Super Resolution

The residual network (ResNet) [18] was initially proposed for image recognition and has since been applied to a wide range of computer vision problems such as image classification, object detection, image segmentation and image generation. Most of the methods mentioned in Sect. 2.1 apply residual learning, e.g., DRRN [11], VDSR [12], SRResNet [14], EDSR [16] and RDN [17]. An impressive work, named HelloSR, was presented in [21]. Inspired by the effectiveness of learning high-frequency residuals for SR, HelloSR introduced a stacked residual refinement network that generates the HR image by explicitly learning multilevel residuals in the HR image space.

These methods employ residual learning in different ways. However, most of them adopt residual connections only between the outermost layers or the middle modules of their networks, not within the information paths of a building block. In this work, the outermost layers, the intermediate building blocks and the information paths within a block are viewed as different levels of a network, and residual learning is applied to all of these levels. Experiments show that this multilevel residual structure helps improve model performance when the network is relatively shallow.

Fig. 2. The overall structure of the two networks used in this work. (a) The same structure as EDSR [16], but with the number of residual blocks limited to 4. (b) The extension of (a) with an external skip connection.

3 Multilevel Residual Networks

3.1 Overall Network Structure

The overall structure of ML-ResNet is outlined in Fig. 2. The network consists of three typical parts: a feature extraction network (FEN), a nonlinear mapping network (NMN) and an HR image reconstruction network (HRN). The FEN represents the input image as shallow features. These shallow features are then fed into a set of cascaded building blocks, i.e., the NMN, which produces deep features. Next, a pixel shuffle layer upsamples the deep features to the expected size (e.g., SR\(\times \)2 or SR\(\times \)4). Finally, the upsampled features are delivered to the HRN to generate the HR output.

Denote \(\mathbf {x}\) and \(\mathbf {y}\) as the input and the output of the entire network, and \(\mathbf {x}_{i}\) and \(\mathbf {y}_{i}\) as the inputs and outputs of the sub-networks or building blocks. Formally, the operation of shallow feature extraction can be expressed as:

$$\begin{aligned} \mathbf {y}_{0} = {F}_\mathrm{{e}}(\mathbf {x}) \end{aligned}$$
(1)

where \({F}_\mathrm{{e}}(\cdot )\) denotes the feature extraction network FEN. It extracts shallow features and expands the dimension along the channel direction. The output of FEN is directly fed into the NMN (\(\mathbf {x}_{0} = \mathbf {y}_{0}\)). Similarly, the operation of the whole nonlinear feature mapping network can be denoted as:

$$\begin{aligned} \mathbf {y}_{n} = {F}_\mathrm{{m}}(\mathbf {x}_{0}) \end{aligned}$$
(2)

where n denotes the number of building blocks, and \(\mathbf {y}_{n}\) is the output of the nonlinear feature mapping function \({F}_\mathrm{{m}}(\cdot )\). Here, \({F}_\mathrm{{m}}(\cdot )\) includes all the building blocks within the nonlinear feature mapping network and the subsequent conv layer, as shown in Fig. 2.

After the global skip connection (GSC), the input of the HR image reconstruction network is \(\mathbf {x}_{n+1} = \mathbf {y}_{n} + \mathbf {x}_{0}\). In EDSR/MDSR [16], the final output of the entire network is as follows (Fig. 2(a)):

$$\begin{aligned} \mathbf {y} = {F}_\mathrm{{r}}(\mathbf {x}_{n+1}) = {F}_\mathrm{{r}}(\mathbf {y}_{n} + \mathbf {x}_{0}) \end{aligned}$$
(3)

where \({F}_\mathrm{{r}}(\cdot )\) denotes the HR reconstruction function, which consists of a pixel shuffle layer followed by a conv layer. In the proposed ML-ResNet, however, there is an external skip connection (ESC) before the final output, as shown in Fig. 2(b):

$$\begin{aligned} \mathbf {y} = \mathbf {x} + {F}_\mathrm{{r}}(\mathbf {x}_{n+1}) = \mathbf {x} + {F}_\mathrm{{r}}(\mathbf {y}_{n} + \mathbf {x}_{0}) \end{aligned}$$
(4)
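
The sketch below (PyTorch assumed, not the authors' code) shows how Eqs. (1)-(4) compose. The names MLResNetSketch, F_e, F_m and F_r are placeholders for the FEN, NMN and HRN, the NMN body is a stand-in for the residual blocks described in Sect. 3.2, and the bicubic upsampling of \(\mathbf {x}\) before the ESC is an assumption made here so that the two terms of Eq. (4) have matching sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLResNetSketch(nn.Module):
    def __init__(self, scale=4, n_colors=3, n_feats=256):
        super().__init__()
        self.scale = scale
        # FEN, Eq. (1): shallow feature extraction and channel expansion
        self.F_e = nn.Conv2d(n_colors, n_feats, 3, padding=1)
        # NMN, Eq. (2): stand-in for the stacked residual blocks and the
        # trailing conv layer of Fig. 2
        self.F_m = nn.Sequential(
            *[nn.Conv2d(n_feats, n_feats, 3, padding=1) for _ in range(4)]
        )
        # HRN, Eq. (3): pixel shuffle upsampling plus a reconstruction conv
        self.F_r = nn.Sequential(
            nn.Conv2d(n_feats, n_feats * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(n_feats, n_colors, 3, padding=1),
        )

    def forward(self, x):
        y0 = self.F_e(x)           # Eq. (1)
        yn = self.F_m(y0)          # Eq. (2), with x0 = y0
        out = self.F_r(yn + y0)    # Eq. (3): GSC, then HR reconstruction
        # Eq. (4): ESC. The input x is bicubically upsampled here so it can
        # be added to the HR output (an assumption of this sketch).
        up = F.interpolate(x, scale_factor=self.scale, mode='bicubic',
                           align_corners=False)
        return out + up
```
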
Fig. 3. Detailed illustration of the proposed residual block. Each residual block consists of several sub residual blocks, which are composed of basic Conv + ReLU operations.

3.2 Building Residual Blocks

ResNet is usually modularized and consists of a series of stacked blocks. In a residual block, the main path augments the expressive ability of the model, while the residual path promotes the information propagation through the entire network. Denote the input and the output of a residual block \(\mathcal {B}_{l}\) as \(\mathbf {x}_{l}\) and \(\mathbf {y}_{l}\) respectively. Then \(\mathcal {B}_{l}\) can be expressed in a general form [18]:

$$\begin{aligned} \begin{array}{l} \displaystyle \mathbf {y}_{l} = h(\mathbf {x}_{l}) + \mathcal {F}_\mathrm{{B}}(\mathbf {x}_{l}, \mathcal {W}_{l}) \\ \displaystyle \mathbf {x}_{l+1} = f(\mathbf {y}_{l}) \end{array} \end{aligned}$$
(5)

where \(h(\cdot )\) and \(\mathcal {F}_\mathrm{{B}}(\cdot )\) are the mapping functions of the residual path and the main path, respectively, and \(f(\cdot )\) converts the output of \(\mathcal {B}_{l}\) into the input of \(\mathcal {B}_{l+1}\). He et al. [19] theoretically explained that a compact information path (the identity mapping in Fig. 1) eases optimization, i.e., \(h(\mathbf {x}_{l}) = \mathbf {x}_{l}\) and \(f(\mathbf {y}_{l}) = \mathbf {y}_{l}\). This is viewed as a contiguous memory mechanism [17], and most current SR models follow this principle.

However, most previous methods adopt a direct nonlinear mapping in the main path \(\mathcal {F}_\mathrm{{B}}(\cdot )\). In this work, residual learning is also applied within the main path \(\mathcal {F}_\mathrm{{B}}(\cdot )\) of a residual block, as shown in Fig. 3 and Fig. 1(d). We call this ResNet-in-ResNet structure fine-grained residual learning, which is expected to promote data flow in the main path of a residual block. One can adjust the number of sub residual blocks (SRB) in a residual block (RB) and thus change the density of residual learning. If the NMN includes x residual blocks and each residual block contains y sub residual blocks, we call the model ML-ResNet (BxSy).
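
As a concrete illustration, the following is a minimal sketch (PyTorch assumed; the class names are hypothetical, and the assumption that each SRB is a single Conv + ReLU pair wrapped by an identity skip is ours) of the nested structure suggested by Fig. 3.

```python
import torch.nn as nn

class SubResidualBlock(nn.Module):
    """SRB: one Conv + ReLU wrapped by an identity skip (assumed layout)."""
    def __init__(self, n_feats, kernel_size=5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size // 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)      # fine-grained residual learning

class ResidualBlock(nn.Module):
    """RB: several SRBs in the main path, plus the usual identity skip."""
    def __init__(self, n_feats, n_sub=2, kernel_size=5):
        super().__init__()
        self.body = nn.Sequential(
            *[SubResidualBlock(n_feats, kernel_size) for _ in range(n_sub)]
        )

    def forward(self, x):
        return x + self.body(x)      # block-level residual learning
```

Under this sketch, ML-ResNet (BxSy) corresponds to stacking x ResidualBlock modules, each built with n_sub = y.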

3.3 Multilevel Residual Pattern

In addition to the fine-grained residual learning within residual blocks, we also introduce an external skip connection (ESC) between the outermost layers of the entire network, which we call coarse-grained residual learning. Thus, the residual pattern is applied at multiple levels of abstraction and the whole network displays the characteristic of multilevel residual learning from fine to coarse grain. This multilevel residual structure proves effective in our experiments, probably because it is related to (multilevel) manifold simplification [22], although a rigorous theoretical argument is still lacking.

Interestingly, the experiments show that the external skip connection seems to have no obvious effect on the performance of the network with EDSR residual blocks (Fig. 1(c)), but it can slightly improve the performance of the model built with the proposed residual blocks (Fig. 1(d)). This also shows the validity of the multilevel residual structure to some extent.

4 Experiments

In this section, we first introduce the experiment settings. Next, we study the impact of residual density and the external skip connection on model performance. The overall structure of the network is shown in Fig. 2(b) and the reference structure in Fig. 2(a); the residual blocks shown in Fig. 1(b)−(d) are used for comparison. Finally, we compare the proposed model with several previous methods quantitatively and qualitatively. Performance is evaluated with PSNR and SSIM [23]. During quick validation, they are calculated with the built-in functions of the Python skimage module; in the testing phase, we use different calculations for fair comparison.
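
For quick validation, the two metrics can be computed roughly as in the sketch below (scikit-image >= 0.19 API assumed; older versions expose compare_psnr/compare_ssim under skimage.measure and use multichannel=True instead of channel_axis).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quick_eval(sr, hr):
    """sr, hr: uint8 RGB arrays of identical shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, data_range=255, channel_axis=-1)
    return psnr, ssim
```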

4.1 Training Settings

The DIV2K dataset [21, 24] is used to train and quickly validate the models (only the first 10 validation images of DIV2K are used for quick validation). Several standard benchmark datasets are used for testing, including Set5 [25], Set14 [26], B100 [27], Urban100 [28] and the DIV2K validation set. For training, the HR images are randomly cropped into 96 \(\times \) 96 RGB patches and the size of the LR patches is adjusted according to the SR scale. Data augmentation and mean removal are the same as in EDSR/MDSR [16].
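
A minimal sketch of this patch sampling (hypothetical helper, NumPy assumed; the augmentation is paraphrased from the EDSR/MDSR scheme as random flips and a 90-degree rotation) is given below.

```python
import random
import numpy as np

def random_patch_pair(lr, hr, scale, hr_patch=96):
    """lr, hr: (H, W, 3) arrays; hr is `scale` times larger than lr."""
    lr_patch = hr_patch // scale
    h, w = lr.shape[:2]
    x = random.randrange(0, w - lr_patch + 1)
    y = random.randrange(0, h - lr_patch + 1)
    lr_crop = lr[y:y + lr_patch, x:x + lr_patch]
    hr_crop = hr[y * scale:(y + lr_patch) * scale,
                 x * scale:(x + lr_patch) * scale]
    # augmentation: random horizontal/vertical flips and 90-degree rotation
    if random.random() < 0.5:
        lr_crop, hr_crop = lr_crop[:, ::-1], hr_crop[:, ::-1]
    if random.random() < 0.5:
        lr_crop, hr_crop = lr_crop[::-1], hr_crop[::-1]
    if random.random() < 0.5:
        lr_crop, hr_crop = np.rot90(lr_crop), np.rot90(hr_crop)
    return np.ascontiguousarray(lr_crop), np.ascontiguousarray(hr_crop)
```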

Given a training dataset \(\mathcal {D} = \{\mathbf {x}^{i}, \mathbf {y}^{i}\}_{i = 1}^{|\mathcal {D}|}\), where \(|\mathcal {D}|\) is the number of training samples, the \(l_1\) loss function is used for model training:

$$\begin{aligned} L(\varvec{\theta }) = \frac{1}{|\mathcal {D}|}\sum _{i = 1}^{|\mathcal {D}|}||\mathbf {y}^{i} - \hat{\mathbf {y}}^{i}||_{1} \end{aligned}$$
(6)

where \(\hat{\mathbf {y}}\) is the estimate of the model and \(\mathbf {y}\) is the corresponding target. \(\varvec{\theta }\) denotes the set of model parameters. It is worth noting that the number of parameters is the same for the compared architectures.
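
In code, Eq. (6) reduces to a per-batch mean absolute error (PyTorch assumed; the default 'mean' reduction also averages over pixels and channels, which only rescales the loss and does not change training).

```python
import torch
import torch.nn.functional as F

def l1_loss(sr_batch: torch.Tensor, hr_batch: torch.Tensor) -> torch.Tensor:
    # Eq. (6) up to a constant factor from averaging over pixels and channels
    return F.l1_loss(sr_batch, hr_batch, reduction='mean')
```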

Fig. 4. The validation performance of the models with different residual densities on the first 10 validation images of DIV2K (SR\(\times \)4).

Fig. 5. The validation performance of the models with the different residual blocks shown in Fig. 1(b)−(d). Only the first 10 validation images of DIV2K are used for comparison (SR\(\times \)4).

The minibatch size is 32 and the filter size is \(5\times 5\). The numbers of residual blocks and feature maps are 4 and 256, respectively. We train the models with the ADAM optimizer [30], setting \({\beta }_{1} = 0.9\), \({\beta }_{2} = 0.999\) and \(\epsilon = 10^{-8}\). A piecewise constant decay is used for the learning rate: it is initialized to \(10^{-4}\) and halved every \(10^{5}\) iterations. All models are trained for \(5 \times 10^5\) iterations.
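
These settings translate roughly into the following sketch (PyTorch assumed; `model` and `train_batches` are hypothetical placeholders, and `l1_loss` is the helper sketched above).

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# piecewise constant decay: halve the learning rate every 1e5 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000,
                                            gamma=0.5)

for step in range(500_000):
    lr_batch, hr_batch = next(train_batches)   # hypothetical data iterator
    optimizer.zero_grad()
    loss = l1_loss(model(lr_batch), hr_batch)
    loss.backward()
    optimizer.step()
    scheduler.step()                           # per-iteration schedule update
```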

4.2 Residual Density

In our settings, there are multiple combinations of residual blocks and sub residual blocks when the total number of conv layers is fixed, which yields different residual densities in the NMN. For comparison, we use the structure in Fig. 2(a) and set the total number of conv layers in the NMN to 8. Thus, we have 4 combinations: B1S8, B2S4, B4S2 and B8S1, where B and S denote the number of residual blocks and sub residual blocks respectively. However, B8S1 is invalid because the block structure degenerates, as can be seen from Fig. 3.
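
Using the hypothetical ResidualBlock sketched in Sect. 3.2 (and thus under the same assumption of one conv layer per SRB), the three valid settings would be built as follows.

```python
import torch.nn as nn
# ResidualBlock: the class sketched in Sect. 3.2 (assumed to be in scope)

def make_nmn(n_blocks, n_sub, n_feats=256):
    # stack of residual blocks forming the NMN (trailing conv omitted)
    return nn.Sequential(*[ResidualBlock(n_feats, n_sub=n_sub)
                           for _ in range(n_blocks)])

nmn_b1s8 = make_nmn(1, 8)   # B1S8: 1 block  x 8 sub blocks
nmn_b2s4 = make_nmn(2, 4)   # B2S4: 2 blocks x 4 sub blocks
nmn_b4s2 = make_nmn(4, 2)   # B4S2: 4 blocks x 2 sub blocks
# B8S1 (8 blocks x 1 sub block) degenerates: the inner and outer skips coincide.
```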

From Fig. 4, it can be seen that B2S4 and B4S2 perform almost the same, but both are clearly better than B1S8. The result is stable across our repeated experiments. This is probably because a residual network can be viewed as a collection of many paths of differing lengths [29], and different residual densities lead to different effective depths of the entire network. This implies that networks of different depths may reach their optimal performance with different building block structures, and that simply increasing the number of building blocks to deepen the network may not achieve the expected performance improvement.

Fig. 6. The validation performance of the models with and without ESC. Only the first 10 validation images of the DIV2K dataset are used for comparison (SR\(\times \)4).

Fig. 7. Visual comparison with some previous SISR methods. (1) The first row shows image “butterfly” in Set5 with scale \(\times \)4. (2) The second and the third rows show images “img003” and “img043” in Urban100 with scale \(\times \)3.

4.3 Different Residual Blocks

Because the structure in Fig. 1(a) is mainly used for classification, detection and other high-level computer vision problems, we exclude it from our experiments. For a fair comparison, all compared structures use 4 residual blocks in the entire network, with two convolutional layers in each block.

As shown in Fig. 5 and Table 1, the proposed residual structure achieves the best SR performance. The residual block used in SRResNet [14] is clearly inferior to the others. This is probably because the batch normalization layer is not suitable for low-level computer vision problems. Although [16, 17] removed the batch normalization layer and pointed out its shortcomings (e.g., it requires more computational and memory resources), they did not verify this experimentally.

4.4 External Skip Connection

The impact of ESC on the performance of the models is studied in this subsection. EDSR and ML-ResNet residual blocks are used for comparison. The validation performance of different architectures on the first 10 validation images of DIV2K is shown in Fig. 6.

Figure 6 exhibits an interesting phenomenon, i.e., ESC seems to have no obvious effect on the performance of the network with EDSR residual blocks but it can slightly improve the performance of the model built with the proposed residual blocks. This shows the validity of the multilevel residual structure to some extent.

Table 1. Quantitative comparison between some previous methods and the proposed ML-ResNet. SRResNet (block\(\times \)4) and EDSR (block\(\times \)4) are also included. The best values are in bold and the second best are underlined (PSNR/SSIM).

4.5 Comparison with Other Methods

In this section, we compare the proposed method with several typical methods quantitatively and qualitatively. When evaluating on DIV2K-val, we follow EDSR/MDSR [16] to compute PSNR and SSIM; when testing on the other datasets, i.e., Set5, Set14, B100 and Urban100, we follow the calculation of DRCN [10]. Table 1 collects the quantitative results of the compared methods on the benchmark datasets, where SRResNet (block\(\times \)4) and EDSR (block\(\times \)4) are built with the structure shown in Fig. 2(a) and the residual blocks shown in Fig. 1(b) and (c) respectively, but the number of residual blocks is limited to 4. The visual comparison is shown in Fig. 7. As can be seen, ML-ResNet is superior to the compared methods. It is worth noting that we only use the B4S2 structure without ESC for this comparison; in fact, the B2S4 structure performs better than B4S2, and ESC can further improve the performance of the model.

However, when we increase the network depth so that the model has the same number of parameters as the original EDSR, the performance of the proposed method is slightly worse than that of the original EDSR. This indicates that directly increasing the number of residual blocks to deepen the network does not bring the desired performance improvement, and that the multilevel residual structure promotes the propagation and equilibrium of information flow through the network only when the network is relatively shallow.

5 Conclusion

In this paper, we studied several commonly used residual blocks for single image super resolution. Based on this study, we proposed a new residual block structure and a multilevel residual learning pattern (ML-ResNet). The proposed ML-ResNet introduces fine-grained residual learning into the main path of a residual block and coarse-grained residual learning (ESC) between the outermost layers of the entire network. This multilevel residual structure appears to help simplify the structure of the feature maps at multiple abstraction levels of a deep model and to promote the propagation and equilibrium of information flow throughout the entire network. It shows superior performance over several compared structures when the entire network is relatively shallow. However, directly adding residual blocks cannot achieve the desired performance improvement, which may imply that the depth and the internal structure of a network are related.