1 Introduction

The task of single image super-resolution aims at restoring a high-resolution (HR) image from a given low-resolution (LR) one. Super-resolution has wide applications in fields where image details are in demand, such as medical imaging, remote sensing, video surveillance, and entertainment. In the past decades, super-resolution has attracted much attention from the computer vision community. Early methods include bicubic interpolation [5], Lanczos resampling [9], statistical priors [15], neighbor embedding [4], and sparse coding [23]. However, super-resolution is highly ill-posed, since the process from HR to LR involves non-invertible operations such as low-pass filtering and subsampling.

Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in computer vision tasks such as image classification [20], object detection [10], and image enhancement [3]. Recently, CNNs have been widely used to address the ill-posed inverse problem of super-resolution, and have demonstrated superiority over traditional methods [4, 9, 15, 23] in both reconstruction accuracy and computational efficiency. Dong et al. [6, 7] design a super-resolution convolutional neural network (SRCNN), demonstrating that a CNN can learn the mapping from LR to HR in an end-to-end manner. The fast super-resolution convolutional neural network (FSRCNN) [8] accelerates SRCNN [6, 7] by taking the original LR image as input and replacing the bicubic interpolation with a deconvolution layer. In [19], an efficient sub-pixel convolution layer is introduced to achieve real-time performance. Kim et al. [14] use a very deep super-resolution (VDSR) network with 20 convolutional layers, which greatly improves the accuracy of the model.

Previous CNN-based methods have achieved great progress in both restoration quality and efficiency. However, they suffer from limitations in the following aspects:

  • CNN-based methods strive to enlarge the receptive field of the models by stacking more layers. They reconstruct all types of content from LR images using a single-scale region, thus ignoring that different details live at different scales. For instance, restoring detail in the sky probably relies on a large image region, while tiny text may only be relevant to a small patch.

  • Most previous approaches learn a specific model for a single up-scale factor, so a model learned for one up-scale factor cannot work well for another. That is, many scale-specific models must be trained for different up-scale factors, which is inefficient in both time and memory. Though [14] trains one model for multiple up-scale factors, it ignores the fact that a single receptive field may contain different amounts of information at different resolutions.

In this paper, we propose a multi-scale super-resolution (MSSR) convolutional neural network to address these problems. The term multi-scale carries two meanings here. First, the proposed network combines multi-path subnetworks of different depths, which correspond to multi-scale regions in the input image. Second, the multi-scale network is capable of selecting a proper receptive field for each up-scale factor to restore the HR image, so only a single model is trained for multiple up-scale factors via multi-scale training.

2 Multi-scale Super-Resolution

Given a low-resolution image, super-resolution aims at restoring its high-resolution version. For this ill-posed recovery problem, an effective strategy is to estimate a target pixel by taking into account more context information in its neighborhood. In [6, 7, 14], the authors found that a larger receptive field tends to achieve better performance owing to richer structural information. However, we argue that the restoration process should not depend only on single-scale regions with a large receptive field.

Different components of an image may be relevant to neighborhoods of different scales. In [26], multi-scale neighborhoods were proven effective for super-resolution by simultaneously integrating local and non-local sparse priors. Multi-scale feature extraction [3, 24] is also effective for representing image patterns. For example, the inception architecture in GoogLeNet [21] uses parallel convolutions with varying filter sizes to better handle objects at multiple scales, achieving state-of-the-art performance in object recognition. Motivated by this, we propose a multi-scale super-resolution convolutional neural network (see Fig. 1): the low-resolution image is first up-sampled to the desired size by bicubic interpolation, and MSSR then predicts the details.

Fig. 1. The network architecture of MSSR. Convolutional layers and nonlinear layers (ReLU) are cascaded repeatedly. An interpolated low-resolution image passes through MSSR and is transformed into a high-resolution image. MSSR consists of two convolution modules (Module-L and Module-S), streams of three different scales (Small/Middle/Large-Scale), and a reconstruction module with residual learning.

2.1 Multi-scale Architecture

With a fixed filter size larger than 1, the receptive field grows as the network stacks more layers. The proposed architecture is composed of two parallel paths, as illustrated in Fig. 1. The upper path (Module-L) stacks \(N_L\) convolutional layers to capture a large region of information in the LR image. The other path (Module-S) contains \(N_S\) (\(N_S<N_L\)) convolutional layers to ensure a relatively small receptive field. The response of the k-th convolutional layer in Module-L/S for input \(h^k\) is given by

$$\begin{aligned} h^{k+1}=f^{k+1}\left( {h^k}\right) =\sigma \left( {W^{k+1}*h^{k}+b^{k+1}}\right) , \end{aligned}$$
(1)

where \(W^{k+1}\) and \(b^{k+1}\) are the weights and bias respectively, and \(\sigma \left( \cdot \right) \) represents nonlinear operation (ReLU). Here we denote the interpolated low-resolution image as x. The output of Module-L is \(H_L(x)=f^{N_L}(f^{N_L-1}(...f^{1}(x)))\), and the output of Module-S is \(H_S(x)=f^{N_S}(f^{N_S-1}(...f^{1}(x)))\).

To save parameters, Module-S shares its weights with the front part of Module-L. The outputs of the two modules are fused into one, which can take various functional forms (e.g. concatenation, weighting, or summation). We find that simple summation is efficient enough for our purpose, and the fusion result is generated as \(H_f (x)=H_L (x)+H_S (x)\). To further vary the spatial scales of the ensemble architecture, a similar subnetwork is cascaded to the previous one, giving \(F\left( x\right) =H_f(H_f(x))\). A final reconstruction module with \(N_r\) convolutional layers makes the prediction. Following [20], all convolutional kernels are of size \(3 \times 3\) with zero-padding. With respect to the local information involved in the LR image, there are streams of three scales (Small/Middle/Large-Scale) whose receptive fields are \(2 \times (N_S+N_S+N_r) + 1\), \(2 \times (N_S+N_L+N_r) + 1\) and \(2 \times (N_L+N_L+N_r) + 1\), respectively. Each layer consists of 64 filters, except the last reconstruction layer, which contains a single filter without nonlinear operation.
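To make the composition concrete, the following PyTorch sketch reflects our reading of Fig. 1 and the text above; the class and helper names are ours, not the authors' code, and the hyper-parameters are those later fixed in Sect. 3.2 (\(N_L=9\), \(N_S=2\), \(N_r=2\)).

```python
import torch.nn as nn

def conv3x3(in_ch=64, out_ch=64):
    # 3x3 convolution with zero-padding, as specified in the text
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

class FusionBlock(nn.Module):
    """One subnetwork: Module-S shares the first n_s layers of Module-L."""
    def __init__(self, in_ch, n_l=9, n_s=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [conv3x3(in_ch)] + [conv3x3() for _ in range(n_l - 1)])
        self.relu = nn.ReLU(inplace=True)
        self.n_s = n_s

    def forward(self, x):
        h = x
        h_s = None
        for k, conv in enumerate(self.convs):
            h = self.relu(conv(h))
            if k + 1 == self.n_s:
                h_s = h            # Module-S output (shared front layers)
        return h + h_s             # fuse by summation: H_f = H_L + H_S

class MSSR(nn.Module):
    def __init__(self, n_l=9, n_s=2, n_r=2):
        super().__init__()
        self.block1 = FusionBlock(in_ch=1, n_l=n_l, n_s=n_s)   # luminance input
        self.block2 = FusionBlock(in_ch=64, n_l=n_l, n_s=n_s)  # cascaded subnetwork
        self.recon = nn.ModuleList(
            [conv3x3() for _ in range(n_r - 1)] + [conv3x3(64, 1)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        h = self.block2(self.block1(x))        # F(x) = H_f(H_f(x))
        for conv in self.recon[:-1]:
            h = self.relu(conv(h))
        r = self.recon[-1](h)                  # single-filter layer, no ReLU
        return x + r                           # residual learning (Sect. 2.2)
```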

2.2 Multi-scale Residual Learning

High-frequency content is particularly important for HR restoration, as reflected by the gradient features used in [1, 2, 4]. Since the input is highly similar to the output in the super-resolution problem, the proposed network (MSSR) focuses on estimating high-frequency details through multi-scale residual learning.

The given training set \(\{ {x_s^{\left( i \right) },{y^{\left( i \right) }}} \}_{\{i,s\} = \{1,1\}}^{\{N,S\}}\) includes N pairs of multi-scale LR images \(x_s^{\left( i \right) }\) with S scale factors and the corresponding HR images \({y^{\left( i \right) }}\). The multi-scale residual image for each sample is computed as \(r_s^{\left( i \right) } = {y^{\left( i \right) }} - x_s^{\left( i \right) }\). The goal of MSSR is to learn the nonlinear mapping \(F\left( x \right) \) from the multi-scale LR images \(x_s^{\left( i \right) }\) to the residual images \(r_s^{\left( i \right) }\). The network parameters \(\varTheta = \left\{ {{W^k},{b^k}} \right\} \) are learned by minimizing the loss function

$$\begin{aligned} \begin{array}{l} L\left( \varTheta \right) = \dfrac{1}{{2NS}}\sum \limits _{i = 1}^N {\sum \limits _{s = 1}^S {{{\left\| {r_s^{\left( i \right) } - F\left( {x_s^{\left( i \right) };\varTheta } \right) } \right\| }^2}} } \\ = \dfrac{1}{{2NS}}\sum \limits _{i = 1}^N {\sum \limits _{s = 1}^S {{{\left\| {{y^{\left( i \right) }} - \left( {x_s^{\left( i \right) } + F\left( {x_s^{\left( i \right) };\varTheta } \right) } \right) } \right\| }^2}} } \end{array} \end{aligned}$$
(2)

With multi-scale residual learning, we train only one general model for multiple up-scale factors. For LR images \(x_s^{\left( i \right) }\) with different down-sampling scales s, even regions of the same size may contain different amounts of information. As observed by Dong et al. [8], a small patch in LR space can cover almost all the information of a large patch in HR space. For samples of multiple up-scale factors, a model with only a single receptive field cannot make the best of them all simultaneously, whereas our multi-scale network can handle this problem. The advantages of multi-scale learning thus include not only memory and time savings, but also a way to adapt the model to different down-sampling scales.
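A minimal sketch of how Eq. (2) can be realized in training, assuming the MSSR module above (which already adds the input back, i.e. returns \(x + F(x)\)); batches mix LR images of all scale factors, so averaging over the batch realizes the \(1/(NS)\) average up to a constant pixel-count factor.

```python
import torch.nn.functional as F

def mssr_loss(model, x, y):
    """x: interpolated multi-scale LR batch; y: corresponding HR targets."""
    # Eq. (2): || y - (x + F(x; Theta)) ||^2, averaged over the batch
    return 0.5 * F.mse_loss(model(x), y)
```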

3 Experiments

3.1 Datasets

Training Dataset. The model is trained on 91 images from Yang et al. [23] and 200 images from the training set of the Berkeley Segmentation Dataset (BSD) [17], which are widely used for the super-resolution problem [7, 8, 14, 18]. As in [8], to make full use of the training data, we apply data augmentation in two ways: (1) rotate the images by \(90^{\circ }\), \(180^{\circ }\) and \(270^{\circ }\); (2) downscale the images by factors of 0.9, 0.8, 0.7 and 0.6. Following the sample cropping in [14], training images are cropped into non-overlapping sub-images of size \(41 \times 41\). In addition, to train a general model for multiple up-scale factors, we combine the LR-HR pairs of three up-scale factors (\(\times 2, \times 3, \times 4\)) into one training set.
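The following Pillow-based sketch illustrates this augmentation and multi-scale pair construction; the function name and exact resampling calls are our assumptions, not the authors' pipeline.

```python
from PIL import Image

SCALES = (2, 3, 4)   # up-scale factors combined into one training set
PATCH = 41           # non-overlapping sub-image size (cropping not shown)

def make_pairs(hr_img):
    """hr_img: a PIL HR training image; returns interpolated LR / HR pairs."""
    pairs = []
    for angle in (0, 90, 180, 270):              # rotation augmentation
        for shrink in (1.0, 0.9, 0.8, 0.7, 0.6): # downscale augmentation
            img = hr_img.rotate(angle, expand=True)
            w, h = int(img.width * shrink), int(img.height * shrink)
            img = img.resize((w, h), Image.BICUBIC)
            for s in SCALES:
                # LR: bicubic downscale by s, then bicubic upscale back,
                # since MSSR takes the interpolated LR image as input
                lr = img.resize((w // s, h // s), Image.BICUBIC)
                lr = lr.resize((w, h), Image.BICUBIC)
                pairs.append((lr, img))
    return pairs
```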

Test Dataset. The proposed method is evaluated on four publicly available benchmark datasets: Set5 [1] and Set14 [25] provide 5 and 14 images respectively; B100 [17] contains 100 natural images collected from BSD; Urban100 [12] consists of 100 high-resolution images rich in real-world structures. Following previous works [8, 12, 14], we transform the images to the YCbCr color space and apply the algorithm only on the luminance channel, since human vision is more sensitive to details in intensity than in color.
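A minimal sketch of the luminance extraction, using Pillow's built-in YCbCr conversion (the helper name is ours):

```python
from PIL import Image

def luminance(path):
    """Return the Y channel of an image; Cb/Cr are typically left to bicubic."""
    y, cb, cr = Image.open(path).convert('YCbCr').split()
    return y
```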

3.2 Experimental Settings

In the experiments, the Caffe [13] package is used to train the proposed MSSR with Adam [16]. To ensure varying receptive-field scales, we set \(N_L=9\), \(N_S=2\) and \(N_r=2\). That is, each Module-L in Fig. 1 stacks 9 convolutional layers, while Module-S stacks 2 layers, and the reconstruction module is built of 2 layers. Thus, the longest path in the network consists of 20 convolutional layers in total, and the three streams correspond to receptive fields of 13, 27 and 41 pixels. Model weights are initialized according to the approach described in [11]. The learning rate is initially set to \(10^{-4}\) and decreased by a factor of 10 after 80 epochs; training stops at 100 epochs. The batch size, momentum and weight decay are set to 64, 0.9 and \(10^{-4}\), respectively.
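As a quick sanity check on these numbers, a stack of k zero-padded \(3 \times 3\) convolutions has a receptive field of \(2k + 1\) pixels:

```python
# Receptive-field arithmetic for the three streams of Sect. 2.1
N_L, N_S, N_r = 9, 2, 2
rf = lambda n: 2 * n + 1

assert rf(N_S + N_S + N_r) == 13   # small scale
assert rf(N_S + N_L + N_r) == 27   # middle scale
assert rf(N_L + N_L + N_r) == 41   # large scale
assert N_L + N_L + N_r == 20       # longest path (layers)
```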

3.3 Results

To quantitatively assess the proposed model, MSSR is evaluated for three up-scale factors, from 2 to 4, on the four test datasets mentioned above. We compute the Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) of the results and compare with recent competitive methods, including A+ [22], SelfEx [12], SRCNN [7], FSRCNN [8] and VDSR [14]. As shown in Table 1, the proposed MSSR outperforms the other methods on almost every up-scale factor and test set. The only suboptimal result is the PSNR on B100 at up-scale factor 4, which is slightly lower than that of VDSR [14] but still competitive, with a higher SSIM. Visual comparisons can be found in Figs. 2 and 3.
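For reference, PSNR on the luminance channel can be computed as follows; this is a standard formulation, and the border-shaving amount is method-specific rather than taken from the paper.

```python
import numpy as np

def psnr(hr, sr, shave=0):
    """hr, sr: uint8 Y-channel arrays of equal size; shave crops borders."""
    if shave:
        hr = hr[shave:-shave, shave:-shave]
        sr = sr[shave:-shave, shave:-shave]
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```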

Fig. 2. Super-resolution results of img099 (Urban100) with scale factor ×3. The line is straightened and sharpened by MSSR, whereas other methods give blurry or distorted lines.

Fig. 3. Super-resolution results of ppt3 (Set14) with scale factor ×3. Text in the MSSR result is sharp and legible, while character edges are blurry in the other methods.

As for efficiency, we evaluate the execution time using the public code of state-of-the-art methods. The experiments are conducted with an Intel CPU (Xeon E5-2620, 2.1 GHz) and an NVIDIA GPU (GeForce GTX 1080). Figure 4 shows the PSNR of several state-of-the-art super-resolution methods versus their execution time. The proposed MSSR network achieves better super-resolution quality than existing methods while being tens of times faster.

Table 1. Average PSNR/SSIM for scale factors ×2, ×3 and ×4 on datasets Set5 [1], Set14 [25], B100 [17] and Urban100 [12]. Colored entries mark the best and second-best performance. (All output images are cropped to the same size as in SRCNN [7] for fair comparison.)
Fig. 4. Our MSSR achieves more accurate and efficient results for scale factor ×3 on dataset Set5 in comparison to the state-of-the-art methods.

4 Conclusion

In this paper, we highlight the importance of scale in the super-resolution problem, which has been neglected in previous work. Instead of simply enlarging the size of input patches, we propose a multi-scale convolutional neural network for single image super-resolution. Combining paths of different scales enables the model to synthesize a wider range of receptive fields. Since different components of an image may be relevant to a diversity of neighborhood sizes, the proposed network benefits from multi-scale features. Our model also generalizes well across different up-scale factors. Experimental results show that our approach achieves state-of-the-art results on standard benchmarks at relatively high speed.