1 Introduction

The task of single image super-resolution aims at restoring a high-resolution (HR) image from a given low-resolution (LR) one. Super-resolution has wide applications in fields where image details are in demand, such as medical imaging, remote sensing, video surveillance, and entertainment. In the past decades, super-resolution has attracted much attention from the computer vision community. Early methods include bicubic interpolation [5], Lanczos resampling [9], statistical priors [15], neighbor embedding [4], and sparse coding [23]. However, super-resolution is highly ill-posed, since the process from HR to LR involves non-invertible operations such as low-pass filtering and subsampling.

Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in computer vision tasks such as image classification [20], object detection [10], and image enhancement [3]. Recently, CNNs have been widely used to address the ill-posed inverse problem of super-resolution, and have demonstrated superiority over traditional methods [4, 9, 15, 23] in both reconstruction accuracy and computational efficiency. Dong et al. [6, 7] design a super-resolution convolutional neural network (SRCNN), demonstrating that a CNN can learn the mapping from LR to HR in an end-to-end manner. The fast super-resolution convolutional neural network (FSRCNN) [8] accelerates SRCNN [6, 7] by taking the original LR image as input and replacing the bicubic interpolation with a deconvolution layer. In [19], an efficient sub-pixel convolution layer is introduced to achieve real-time performance. Kim et al. [14] use a very deep super-resolution (VDSR) network with 20 convolutional layers, which greatly improves the accuracy of the model.

Previous CNN-based methods have achieved great progress in both restoration quality and efficiency. However, they suffer from limitations in the following aspects:

  • CNN-based methods strive to enlarge the receptive field of the models by stacking more layers. They reconstruct all types of content from LR images using a single-scale region, thus ignoring that different details live at different scales. For instance, restoring detail in the sky probably relies on a large image region, while tiny text may only be relevant to a small patch.

  • Most previous approaches learn a specific model for a single up-scale factor, so a model learned for one up-scale factor cannot work well for another. That is, many scale-specific models must be trained for different up-scale factors, which is inefficient in both time and memory. Though [14] trains one model for multiple up-scale factors, it ignores the fact that a single receptive field may contain different amounts of information at different resolutions.

In this paper, we propose a multi-scale super-resolution (MSSR) convolutional neural network to address these problems. The term multi-scale carries two meanings here. First, the proposed network combines multi-path subnetworks of different depths, which correspond to multi-scale regions in the input image. Second, the multi-scale network is capable of selecting a proper receptive field for each up-scale factor to restore the HR image, so only a single model is trained for multiple up-scale factors via multi-scale training.

2 Multi-scale Super-Resolution

Given a low-resolution image, super-resolution aims at restoring its high-resolution version. For this ill-posed recovery problem, an effective strategy is to estimate a target pixel by taking into account more context information in its neighborhood. In [6, 7, 14], the authors found that a larger receptive field tends to achieve better performance owing to richer structural information. However, we argue that the restoration process should not depend only on single-scale regions with a large receptive field.

Different components of an image may be relevant to neighborhoods of different scales. In [26], multi-scale neighborhoods were proven effective for super-resolution by simultaneously integrating local and non-local sparse priors. Multi-scale feature extraction [3, 24] is also effective for representing image patterns. For example, the inception architecture in GoogLeNet [21] uses parallel convolutions with varying filter sizes to better handle objects at multiple scales, achieving state-of-the-art performance in object recognition. Motivated by this, we propose a multi-scale super-resolution convolutional neural network (see Fig. 1): the low-resolution image is first up-sampled to the desired size by bicubic interpolation, and MSSR then predicts the details.

Fig. 1. The network architecture of MSSR. Convolutional layers and nonlinear layers (ReLU) are cascaded repeatedly. An interpolated low-resolution image passes through MSSR and is transformed into a high-resolution image. MSSR consists of two convolution modules (Module-L and Module-S), streams of three different scales (Small/Middle/Large-Scale), and a reconstruction module with residual learning.

2.1 Multi-scale Architecture

With a fixed filter size larger than 1, the receptive field grows as the network stacks more layers. The proposed architecture is composed of two parallel paths, as illustrated in Fig. 1. The upper path (Module-L) stacks \(N_L\) convolutional layers to capture a large region of information in the LR image. The other path (Module-S) contains \(N_S\) (\(N_S<N_L\)) convolutional layers to ensure a relatively small receptive field. The response of the k-th convolutional layer in Module-L/S for input \(h^k\) is given by

$$\begin{aligned} h^{k+1}=f^{k+1}\left( {h^k}\right) =\sigma \left( {W^{k+1}*h^{k}+b^{k+1}}\right) , \end{aligned}$$
(1)

where \(W^{k+1}\) and \(b^{k+1}\) are the weights and bias respectively, and \(\sigma \left( \cdot \right) \) represents nonlinear operation (ReLU). Here we denote the interpolated low-resolution image as x. The output of Module-L is \(H_L(x)=f^{N_L}(f^{N_L-1}(...f^{1}(x)))\), and the output of Module-S is \(H_S(x)=f^{N_S}(f^{N_S-1}(...f^{1}(x)))\).

To save parameters, Module-S shares its weights with the front part of Module-L. The outputs of the two modules are fused into one, which can take various functional forms (e.g. concatenation, weighting, or summation). We find that simple summation is efficient enough for our purpose, and the fusion result is generated as \(H_f (x)=H_L (x)+H_S (x)\). To further vary the spatial scales of the ensemble architecture, a similar subnetwork is cascaded to the previous one, giving \(F\left( x\right) =H_f(H_f(x))\). A final reconstruction module with \(N_r\) convolutional layers makes the prediction. Following [20], all convolutional kernels are of size \(3 \times 3\) with zero-padding. With respect to the local information involved in the LR image, there are streams of three scales (Small/Middle/Large-Scale) whose receptive fields are \(2 \times (N_S+N_S+N_r) + 1\), \(2 \times (N_S+N_L+N_r) + 1\) and \(2 \times (N_L+N_L+N_r) + 1\), respectively. Each layer consists of 64 filters, except the last reconstruction layer, which contains a single filter without nonlinear operation.
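To make the composition concrete, the following PyTorch sketch reflects our reading of Fig. 1 and the text above; the class and helper names are ours, not the authors' code, and the hyper-parameters are those later fixed in Sect. 3.2 (\(N_L=9\), \(N_S=2\), \(N_r=2\)).

```python
import torch.nn as nn

def conv3x3(in_ch=64, out_ch=64):
    # 3x3 convolution with zero-padding, as specified in the text
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

class FusionBlock(nn.Module):
    """One subnetwork: Module-S shares the first n_s layers of Module-L."""
    def __init__(self, in_ch, n_l=9, n_s=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [conv3x3(in_ch)] + [conv3x3() for _ in range(n_l - 1)])
        self.relu = nn.ReLU(inplace=True)
        self.n_s = n_s

    def forward(self, x):
        h = x
        h_s = None
        for k, conv in enumerate(self.convs):
            h = self.relu(conv(h))
            if k + 1 == self.n_s:
                h_s = h            # Module-S output (shared front layers)
        return h + h_s             # fuse by summation: H_f = H_L + H_S

class MSSR(nn.Module):
    def __init__(self, n_l=9, n_s=2, n_r=2):
        super().__init__()
        self.block1 = FusionBlock(in_ch=1, n_l=n_l, n_s=n_s)   # luminance input
        self.block2 = FusionBlock(in_ch=64, n_l=n_l, n_s=n_s)  # cascaded subnetwork
        self.recon = nn.ModuleList(
            [conv3x3() for _ in range(n_r - 1)] + [conv3x3(64, 1)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        h = self.block2(self.block1(x))        # F(x) = H_f(H_f(x))
        for conv in self.recon[:-1]:
            h = self.relu(conv(h))
        r = self.recon[-1](h)                  # single-filter layer, no ReLU
        return x + r                           # residual learning (Sect. 2.2)
```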

2.2 Multi-scale Residual Learning

High-frequency content is particularly important for HR restoration, as reflected by the gradient features used in [1, 2, 4]. Since the input is highly similar to the output in the super-resolution problem, the proposed network (MSSR) focuses on estimating high-frequency details through multi-scale residual learning.

The given training set \(\{ {x_s^{\left( i \right) },{y^{\left( i \right) }}} \}_{\{i,s\} = \{1,1\}}^{\{N,S\}}\) includes N pairs of multi-scale LR images \(x_s^{\left( i \right) }\) with S scale factors and the corresponding HR images \({y^{\left( i \right) }}\). The multi-scale residual image for each sample is computed as \(r_s^{\left( i \right) } = {y^{\left( i \right) }} - x_s^{\left( i \right) }\). The goal of MSSR is to learn the nonlinear mapping \(F\left( x \right) \) from the multi-scale LR images \(x_s^{\left( i \right) }\) to the residual images \(r_s^{\left( i \right) }\). The network parameters \(\varTheta = \left\{ {{W^k},{b^k}} \right\} \) are learned by minimizing the loss function

$$\begin{aligned} \begin{array}{l} L\left( \varTheta \right) = \dfrac{1}{{2NS}}\sum \limits _{i = 1}^N {\sum \limits _{s = 1}^S {{{\left\| {r_s^{\left( i \right) } - F\left( {x_s^{\left( i \right) };\varTheta } \right) } \right\| }^2}} } \\ = \dfrac{1}{{2NS}}\sum \limits _{i = 1}^N {\sum \limits _{s = 1}^S {{{\left\| {{y^{\left( i \right) }} - \left( {x_s^{\left( i \right) } + F\left( {x_s^{\left( i \right) };\varTheta } \right) } \right) } \right\| }^2}} } \end{array} \end{aligned}$$
(2)

With multi-scale residual learning, we train only one general model for multiple up-scale factors. For LR images \(x_s^{\left( i \right) }\) with different down-sampling scales s, even regions of the same size may contain different amounts of information. As observed by Dong et al. [8], a small patch in LR space can cover almost all the information of a large patch in HR space. For samples of multiple up-scale factors, a model with only a single receptive field cannot make the best of them all simultaneously, whereas our multi-scale network can handle this problem. The advantages of multi-scale learning thus include not only memory and time savings, but also a way to adapt the model to different down-sampling scales.
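A minimal sketch of how Eq. (2) can be realized in training, assuming the MSSR module above (which already adds the input back, i.e. returns \(x + F(x)\)); batches mix LR images of all scale factors, so averaging over the batch realizes the \(1/(NS)\) average up to a constant pixel-count factor.

```python
import torch.nn.functional as F

def mssr_loss(model, x, y):
    """x: interpolated multi-scale LR batch; y: corresponding HR targets."""
    # Eq. (2): || y - (x + F(x; Theta)) ||^2, averaged over the batch
    return 0.5 * F.mse_loss(model(x), y)
```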

3 Experiments

3.1 Datasets

Training Dataset. The model is trained on 91 images from Yang et al. [23] and 200 images from the training set of the Berkeley Segmentation Dataset (BSD) [17], which are widely used for the super-resolution problem [7, 8, 14, 18]. As in [8], to make full use of the training data, we apply data augmentation in two ways: (1) rotate the images by \(90^{\circ }\), \(180^{\circ }\) and \(270^{\circ }\); (2) downscale the images by factors of 0.9, 0.8, 0.7 and 0.6. Following the sample cropping in [14], training images are cropped into non-overlapping sub-images of size \(41 \times 41\). In addition, to train a general model for multiple up-scale factors, we combine the LR-HR pairs of three up-scale factors (\(\times 2, \times 3, \times 4\)) into one training set.
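The following Pillow-based sketch illustrates this augmentation and multi-scale pair construction; the function name and exact resampling calls are our assumptions, not the authors' pipeline.

```python
from PIL import Image

SCALES = (2, 3, 4)   # up-scale factors combined into one training set
PATCH = 41           # non-overlapping sub-image size (cropping not shown)

def make_pairs(hr_img):
    """hr_img: a PIL HR training image; returns interpolated LR / HR pairs."""
    pairs = []
    for angle in (0, 90, 180, 270):              # rotation augmentation
        for shrink in (1.0, 0.9, 0.8, 0.7, 0.6): # downscale augmentation
            img = hr_img.rotate(angle, expand=True)
            w, h = int(img.width * shrink), int(img.height * shrink)
            img = img.resize((w, h), Image.BICUBIC)
            for s in SCALES:
                # LR: bicubic downscale by s, then bicubic upscale back,
                # since MSSR takes the interpolated LR image as input
                lr = img.resize((w // s, h // s), Image.BICUBIC)
                lr = lr.resize((w, h), Image.BICUBIC)
                pairs.append((lr, img))
    return pairs
```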

Test Dataset. The proposed method is evaluated on four publicly available benchmark datasets: Set5 [1] and Set14 [25] provide 5 and 14 images respectively; B100 [17] contains 100 natural images collected from BSD; Urban100 [12] consists of 100 high-resolution images rich in real-world structures. Following previous works [8, 12, 14], we transform the images to the YCbCr color space and apply the algorithm only on the luminance channel, since human vision is more sensitive to details in intensity than in color.
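A minimal sketch of the luminance extraction, using Pillow's built-in YCbCr conversion (the helper name is ours):

```python
from PIL import Image

def luminance(path):
    """Return the Y channel of an image; Cb/Cr are typically left to bicubic."""
    y, cb, cr = Image.open(path).convert('YCbCr').split()
    return y
```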

3.2 Experimental Settings

In the experiments, the Caffe [13] package is used to train the proposed MSSR with Adam [16]. To ensure varying receptive-field scales, we set \(N_L=9\), \(N_S=2\) and \(N_r=2\). That is, each Module-L in Fig. 1 stacks 9 convolutional layers, while Module-S stacks 2 layers, and the reconstruction module is built of 2 layers. Thus, the longest path in the network consists of 20 convolutional layers in total, and the three streams correspond to receptive fields of 13, 27 and 41 pixels. Model weights are initialized according to the approach described in [11]. The learning rate is initially set to \(10^{-4}\) and decreased by a factor of 10 after 80 epochs; training stops at 100 epochs. The batch size, momentum and weight decay are set to 64, 0.9 and \(10^{-4}\), respectively.
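As a quick sanity check on these numbers, a stack of k zero-padded \(3 \times 3\) convolutions has a receptive field of \(2k + 1\) pixels:

```python
# Receptive-field arithmetic for the three streams of Sect. 2.1
N_L, N_S, N_r = 9, 2, 2
rf = lambda n: 2 * n + 1

assert rf(N_S + N_S + N_r) == 13   # small scale
assert rf(N_S + N_L + N_r) == 27   # middle scale
assert rf(N_L + N_L + N_r) == 41   # large scale
assert N_L + N_L + N_r == 20       # longest path (layers)
```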

3.3 Results

To quantitatively assess the proposed model, MSSR is evaluated for three up-scale factors, from 2 to 4, on the four test datasets mentioned above. We compute the Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) of the results and compare with recent competitive methods, including A+ [22], SelfEx [12], SRCNN [7], FSRCNN [8] and VDSR [14]. As shown in Table 1, the proposed MSSR outperforms the other methods on almost every up-scale factor and test set. The only suboptimal result is the PSNR on B100 at up-scale factor 4, which is slightly lower than that of VDSR [14] but still competitive, with a higher SSIM. Visual comparisons can be found in Figs. 2 and 3.
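For reference, PSNR on the luminance channel can be computed as follows; this is a standard formulation, and the border-shaving amount is method-specific rather than taken from the paper.

```python
import numpy as np

def psnr(hr, sr, shave=0):
    """hr, sr: uint8 Y-channel arrays of equal size; shave crops borders."""
    if shave:
        hr = hr[shave:-shave, shave:-shave]
        sr = sr[shave:-shave, shave:-shave]
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```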

Fig. 2. Super-resolution results of img099 (Urban100) with scale factor ×3. The line is straightened and sharpened by MSSR, whereas other methods give blurry or distorted lines.

Fig. 3. Super-resolution results of ppt3 (Set14) with scale factor ×3. Text in the MSSR result is sharp and legible, while character edges are blurry in the other methods.

As for efficiency, we evaluate the execution time using the public code of state-of-the-art methods. The experiments are conducted with an Intel CPU (Xeon E5-2620, 2.1 GHz) and an NVIDIA GPU (GeForce GTX 1080). Figure 4 shows the PSNR of several state-of-the-art super-resolution methods versus their execution time. The proposed MSSR network achieves better super-resolution quality than existing methods while being tens of times faster.

Table 1. Average PSNR/SSIM for scale factors ×2, ×3 and ×4 on datasets Set5 [1], Set14 [25], B100 [17] and Urban100 [12]. Colored entries mark the best and second-best performance. (All output images are cropped to the same size as in SRCNN [7] for fair comparison.)
Fig. 4. Our MSSR achieves more accurate and efficient results for scale factor ×3 on dataset Set5 in comparison to the state-of-the-art methods.

4 Conclusion

In this paper, we highlight the importance of scale in the super-resolution problem, which has been neglected in previous work. Instead of simply enlarging the size of input patches, we propose a multi-scale convolutional neural network for single image super-resolution. Combining paths of different scales enables the model to synthesize a wider range of receptive fields. Since different components of an image may be relevant to a diversity of neighborhood sizes, the proposed network benefits from multi-scale features. Our model also generalizes well across different up-scale factors. Experimental results show that our approach achieves state-of-the-art results on standard benchmarks at relatively high speed.