1 Introduction

Single image super-resolution (SISR) has been widely applied in many computer vision tasks, including image compression [14, 25, 40], visual tracking [24, 30, 37], and intelligent surveillance [23, 26, 41]. The variety and uncertainty of the mapping make recovering a high-resolution (HR) image from a low-resolution (LR) input an ill-posed problem with many plausible solutions. Large scale factors are especially challenging: local features become hard to recover, and the stubborn information loss also causes the absence of high-frequency details. Interpolation methods [22] are easy to implement, but such linear models rarely produce the ideal high-resolution output. Sparse-representation techniques [13, 39] enhance image priors by assuming that any natural image patch can be sparsely represented over a dictionary of atoms; the dictionaries are typically generated from a pool of image patches, and the conventional orthogonal matching pursuit (OMP) reconstruction procedure incurs heavy computational costs. Other techniques that model the LR-to-HR mapping in a similar spirit include neighbor embedding [7] and random forests [32].

In this paper, our approach develops dense connections inspired by DenseNet [13]. Dense connections have the following advantages: they effectively alleviate the vanishing-gradient problem, promote feature propagation, and reduce the number of parameters. Through dense connections, each layer obtains more contextual information and models complex nonlinear functions, while the efficiency of gradient transmission through the network and the utilization of contextual information are enhanced. Further, employing a reasonable number of dense connections in a very deep convolutional neural network (CNN) [35] can potentially increase reconstruction performance; this is discussed in Section 3.2. Because all feature outputs are concatenated at the input of the reconstruction network, the resulting dimension is large, so 1 × 1 convolutions are employed to reduce the dimension before generating HR pixels. It should also be noted that our experimental results run counter to the common belief that network depth and performance improvement are positively related.

1.1 Contribution

Different from previous works, the newly introduced dense connections integrate low-level features into high-level features to provide rich information. With the proposed structure, contextual information from large image regions can be utilized accurately.

In the image reconstruction network, a parallel structure of 1 × 1 convolutions reduces the dimension of the outputs from previous layers to speed up computation. The 1 × 1 convolutions aggregate information across channels and increase the expressive power of the model. Processing the original image directly reduces computational complexity, and optimizing the number of layers and filters of the CNN is essential for keeping computational costs low. The method has been evaluated on three publicly available benchmark datasets and outperforms current state-of-the-art methods. Figure 1 compares the PSNR and running time of various methods; as shown, DCDRN performs better than the state-of-the-art methods.

Fig. 1

Compared with the state-of-the-art methods [4, 6, 11, 33, 36], our DCDRN significantly improves both PSNR and computational cost at scale factor ×2 on the bsd100 dataset [29]

2 Related work

Before the advent of convolutional neural networks, mainstream methods for reconstructing high-resolution images were dominated by matrix operations that traverse dictionary atoms for fast super-resolution; representative works are A+ [36] and SelfEx [11]. RFL [33] then obtains a high-resolution image by training a random forest whose leaf nodes carry regression functions.

SRCNN [4], the earliest such method, uses 2 to 4 convolutional layers for super-resolution [5] to achieve significantly better performance, and demonstrates that a CNN can learn the LR-to-HR mapping in an end-to-end manner. Although SRCNN achieved good results by combining deep learning with super-resolution, it still has some limitations:

  • It over-relies on information from small image regions;

  • It converges poorly during training;

  • The network structure is trained for only a single scale.

Hence, FSRCNN [6] replaced the large kernels with small ones and fine-tuned the final deconvolution layer, making it much faster than SRCNN. The authors of VDSR [16] concluded that increasing the network depth is more sensible than employing larger convolution filters, and accelerated training by introducing residual learning and gradient clipping [2] into the network. VDSR proves that residual networks are particularly suitable for solving super-resolution problems.

After the extensive application of residual learning, network structures tended to become deeper. DRCN [15] deepened the network while reducing the number of parameters by sharing weights, but this noticeably increased the computational complexity. In Residual Encoder-Decoder Networks (RED) [27], a symmetric structure in which each convolutional layer is paired with a corresponding deconvolution layer has a strong capacity to train very deep networks and achieves prominent performance: back-propagated signals are transmitted directly to the bottom layers, which settles the vanishing-gradient problem. However, this structure makes the computation more complex and the training time longer than other methods. ResNet [10] and Highway Networks [34] train networks of more than 100 layers by adding bypass paths between convolutional layers. Stochastic depth [12] randomly discards certain layers and features to improve the performance of residual networks. While the above studies reflect the trend of “deeper is better”, Romano et al. proposed Rapid and Accurate Image Super Resolution (RAISR) [31], a shallow and faster learning-based method, although its performance is slightly lower than [15, 16, 27, 38].

Recently, a growing number of methods have constructed novel networks by combining several convolutional neural networks for super-resolution. For instance, image super-resolution using a dilated convolutional neural network (DCNN) [19] and a two-channel convolutional neural network for image super-resolution (SDSR) [20] use dilated and multi-channel convolutions, respectively, and both achieve remarkable performance. The recently proposed FractalNet [17] combines parallel networks of several different depths and creates many short paths in the network. Symmetric dense connections have been introduced in an encoder-decoder network [28] for image restoration tasks. All of these methods have something in common: dense connections are established between layers for efficient training.

3 Proposed method

3.1 Model overview

As shown in Fig. 2, the network is mainly divided into two parts: a feature extraction network and an image reconstruction network. In the feature extraction network, we cascade pairs of layers (convolution, activation). After local and global features are captured adequately, all outputs are stacked in the concat layers and fed into the reconstruction network as inputs. Deconvolution layers reconstruct the details of the image, and the low-resolution (LR) image is finally converted into a high-resolution (HR) image through the CNN layers.
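To make the data flow concrete, the following PyTorch code is a minimal sketch of one possible reading of Fig. 2; the single deconvolution step, the residual addition of a bicubic-interpolated copy of the input, and the exact layer arrangement are our assumptions, not the authors' released code. For clarity this sketch chains the feature layers plainly and applies the dense concatenation only at the reconstruction input; the fully dense form is shown in Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCDRNSketch(nn.Module):
    """A minimal sketch of the two-part layout (our reading, not the authors' code)."""
    def __init__(self, num_layers=7, num_filters=64, scale=2):
        super().__init__()
        self.scale = scale
        # Feature extraction: cascaded (convolution, PReLU) pairs on the Y channel.
        layers, in_ch = [], 1
        for _ in range(num_layers):
            layers.append(nn.Sequential(
                nn.Conv2d(in_ch, num_filters, 3, padding=1),
                nn.PReLU(num_filters)))
            in_ch = num_filters
        self.features = nn.ModuleList(layers)
        # Reconstruction: 1x1 reduction of the concatenated maps, a learned
        # deconvolution up-sampling step, and a final output convolution.
        self.reduce = nn.Conv2d(num_layers * num_filters, num_filters, 1)
        self.up = nn.ConvTranspose2d(num_filters, num_filters,
                                     kernel_size=scale * 2, stride=scale,
                                     padding=scale // 2)  # sized for even scales
        self.out = nn.Conv2d(num_filters, 1, 3, padding=1)

    def forward(self, x):
        feats, h = [], x
        for layer in self.features:
            h = layer(h)
            feats.append(h)                         # keep every layer's output
        h = self.reduce(torch.cat(feats, dim=1))    # concat layer + 1x1 reduction
        h = self.out(self.up(h))
        bicubic = F.interpolate(x, scale_factor=self.scale,
                                mode='bicubic', align_corners=False)
        return bicubic + h                          # add the interpolated image
```

For a 24 × 24 Y-channel patch, `DCDRNSketch()(torch.randn(1, 1, 24, 24))` returns a 48 × 48 estimate.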

Fig. 2

Our network structure. A low-resolution (LR) image is converted to a high-resolution (HR) image through the CNNs. We employ 64 filters for every convolutional layer in the feature extraction network. The blue area in the figure uses 3 × 3 convolutions, while the yellow part indicates the 1 × 1 convolutions

Figure 3 shows the feature information lost by the model during training. Each up-sampled pixel undergoes many calculations and loses semantic information, especially as the scale factor grows. In the initial feature extraction network, we optimize the number of filters at each layer and transfer the extracted features directly to the image reconstruction part through dense connections; each output is used as an input to all subsequent layers. In this paper, an activation layer is added after each convolutional layer to enhance the expressive ability of the network while increasing nonlinearity. The PReLU activation function is used to address the possible “dying ReLU” problem [8]: it retains more local features while preventing the network from learning a large negative bias term, which leads to better performance. Table 1 shows that 7 groups of convolution layers give the best performance; this is discussed further in Section 4.2.

Fig. 3

The feature information lost on Set5

Table 1 Comparison with different parameters (scale factor = 3, test dataset = Set5). Italic indicates the best PSNR

The deconvolution layer proposed by Zeiler et al. [42] is commonly used to learn up-sampling filters in the reconstruction network. The input dimension is quite large since all the features are concatenated before the reconstruction network. Stacking deconvolution layers is essential to obtain better reconstruction performance. The proposed structure is similar to Network in Network [21] and contains one or more 1 × 1 convolutions. Figure 4b shows the feature maps extracted in the image reconstruction section. Like the rest of deep learning networks, the proposed model focuses on learning the residual output, even with a shallow module (7 layers or fewer).
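As a quick illustration, a single transposed convolution with stride equal to the scale factor performs a learned ×2 up-sampling; the kernel size and padding below are illustrative choices, not values reported in the paper.

```python
import torch
import torch.nn as nn

# Output size of a transposed convolution: (N - 1) * stride - 2 * padding + kernel.
# With kernel=4, stride=2, padding=1, an HxW map becomes exactly 2Hx2W.
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=64,
                            kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 64, 24, 24)   # a batch of 64-channel feature maps
print(deconv(x).shape)           # torch.Size([1, 64, 48, 48])
```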

Fig. 4

Features obtained from different layers

3.2 Benefit of feature combination

As shown in Fig. 4, the output feature maps are demonstrated accurately. The formula is

$$ {x}_l={H}_l\left(\left[{x}_0,{x}_1,\cdots, {x}_{l-1}\right]\right) $$

where \( {x}_l \) represents the input of the l-th layer, \( {H}_l \) comprises the weights of a 3 × 3 convolution and its PReLU, and \( \left[{x}_0,{x}_1,\cdots, {x}_{l-1}\right] \) denotes the concatenated outputs of layers 0 through l − 1. Each layer thus obtains the gradient directly from the loss function as well as the original input signal, which reduces the loss of features.
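A minimal sketch of this recurrence, assuming each \( {H}_l \) is a 3 × 3 convolution followed by PReLU as described above; the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """x_l = H_l([x_0, x_1, ..., x_{l-1}]): each H_l sees all earlier outputs."""
    def __init__(self, in_channels=64, growth=64, num_layers=7):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # H_l: a 3x3 convolution + PReLU over the concatenation so far
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, 3, padding=1),
                nn.PReLU(growth)))
            channels += growth
        self.out_channels = channels

    def forward(self, x0):
        feats = [x0]
        for H in self.layers:
            feats.append(H(torch.cat(feats, dim=1)))  # gradient reaches every layer
        return torch.cat(feats, dim=1)                # all features, densely combined
```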

Dense connections improve network performance from the perspective of feature reuse: the features extracted from each layer are spliced together, linking low-level and high-level feature maps, and the result is used as input to the image reconstruction network. The output of every earlier layer serves as an input to every subsequent layer. In addition, the combination of feature maps generated by the feature extraction network is used for SR reconstruction. In the experimental part, SR results with different network parameters are compared in Table 1.

3.3 The structure of dimension reduction

As explained in the previous section, the transposed convolution layer (also known as the deconvolution layer) proposed by [42] is widely used for up-sampling. To obtain better reconstruction performance, it is necessary to increase the depth of the deconvolution layers.

We propose a parallel structure consisting of three 1 × 1 convolutions to avoid this expensive computation. In Fig. 2, A1 and B1 after the concat layer mainly reduce the input dimension and the number of network parameters before the HR image is generated. The last occurrence of L performs a final dimensionality reduction on the high-level features to facilitate adding the interpolated image when reconstructing the final HR image. The advantages of the structure (see the sketch after this list) are:

  • It reduces the dimension of the previous layer's output, performing fewer calculations while transferring information more accurately with less loss, and provides additional nonlinear combinations that enhance the representational capacity of the network.

  • It significantly reduces the number of filters needed in the convolution and transposed-convolution layers.

  • A 1 × 1 convolution requires at least 9 times fewer calculations than a 3 × 3 convolution.
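The saving in the last point can be checked by counting weights: a 1 × 1 convolution over C input channels with C′ outputs uses C·C′ weights per position, versus 9·C·C′ for a 3 × 3 kernel. A small sketch, with the channel sizes as illustrative assumptions:

```python
import torch.nn as nn

concat_channels = 448          # e.g. 7 layers x 64 feature maps, concatenated
reduced = 64

reduce_1x1 = nn.Conv2d(concat_channels, reduced, kernel_size=1)
same_3x3 = nn.Conv2d(concat_channels, reduced, kernel_size=3, padding=1)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(reduce_1x1))      # 448*64 + 64   =  28,736 weights
print(params(same_3x3))        # 9*448*64 + 64 = 258,112 (9x the multiplications)
```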

3.4 Model for multiple and large scales

Table 3 shows the results of different methods at large scale factors. DCDRN recovers more details than existing methods, indicating that the algorithm is superior when the scale factor becomes large. Most methods work well at small scale factors and gradually deteriorate as the scale factor increases, because sub-pixel edge recovery becomes more difficult and the information loss more serious. Almost all deep-learning-based methods rely on bicubic interpolation; the proposed algorithm combines it with dense connections to make the recovered image more distinct. As a result, we observe that multi-scale training improves large-scale performance.
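A sketch of how multi-scale training pairs might be generated, assuming bicubic down-sampling as the degradation (the paper specifies bicubic interpolation but not the full pipeline); the function name and scale set are ours:

```python
import random
from PIL import Image

def make_pair(hr: Image.Image, scales=(2, 3, 4)):
    """Create one (LR, HR) training pair at a randomly chosen scale factor."""
    s = random.choice(scales)
    w, h = hr.size
    w, h = w - w % s, h - h % s                       # crop so the size divides evenly
    hr = hr.crop((0, 0, w, h))
    lr = hr.resize((w // s, h // s), Image.BICUBIC)   # bicubic down-sampling
    return lr, hr, s
```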

4 Experimental results

This section shows the concrete performance of our method on several datasets. First, we introduce the datasets used for training and testing and give the detailed parameters of the algorithm. Finally, we compare our method with several state-of-the-art methods.

4.1 Datasets for training and testing

4.1.1 Training dataset

Different deep learning methods use various training datasets. Our dataset consists of the 91 images from Yang et al. [39] and 200 images from the Berkeley Segmentation Dataset [1]. We take these 291 images as training data and flip them horizontally and vertically, as shown in Fig. 5. During the training phase, the Set5 dataset [3] is employed to evaluate performance and to monitor whether the model is over-fitting or has converged. The proposed method converts a color (RGB) image to YCbCr and processes only the Y channel.
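The flip augmentation is straightforward to reproduce; this PIL sketch is our illustration (the file path is hypothetical):

```python
from PIL import Image

img = Image.open('train/15088.png')         # hypothetical path to a training image
augmented = [
    img,
    img.transpose(Image.FLIP_LEFT_RIGHT),   # horizontal flip
    img.transpose(Image.FLIP_TOP_BOTTOM),   # vertical flip
]
```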

Fig. 5

Augmentation results for “15088” (bsd100), flipped horizontally and vertically

4.1.2 Testing dataset

Three benchmark datasets, Set5, Set14 [43], and bsd100, are commonly used to evaluate networks. Timofte et al. used the Berkeley segmentation dataset of natural images in [29]. All super-resolution methods in Table 3 are tested on these three datasets to make the comparison more convincing.

4.2 Training parameters

The learning rate is initialized to 0.001 and halved whenever the loss stops decreasing for 5 training epochs; training terminates once the learning rate falls below 0.00002. Unlike other deep-learning super-resolution models, we fix 7 groups of 3 × 3 convolutions in the feature extraction network with 64 outputs per layer in order to reduce the number of parameters. Figure 4a shows the 64 feature maps extracted by the first layer during training. The detailed training process and specific parameters are given in Table 2. Each training image is divided into 32 patches, and 64 patches are used as a mini-batch; a total of 118,384 batches of data were generated. We optimize the regression target by back-propagated gradient descent [18]. The L2 norm inevitably produces fuzzy predictions, and the recovered high-resolution images tend to be overly smooth when it is used for training. Hence, the proposed DCDRN employs the L1 norm (with the penalty factor multiplied by 0.0001) to optimize the network model, and the momentum parameter is set to 0.9. The L1 norm, the sum of the absolute values of the elements of a vector, is known as the “sparse rule operator”; it improves generalization by preventing over-fitting. L1 regularization reflects the mapping from LR to HR and obtains better hierarchical features.
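The optimizer settings above can be sketched as follows; the plateau factor of 0.5 (our reading of the halving schedule), the gradient-clipping threshold, and the exact placement of the L1 penalty are assumptions, not released code:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = DCDRNSketch()                  # the sketch from Section 3.1
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# Halve the learning rate when the validation loss plateaus for 5 epochs;
# training stops once the learning rate drops below 2e-5.
scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
l1_loss = nn.L1Loss()                  # L1 avoids the over-smoothing of L2

def train_step(lr_batch, hr_batch):
    optimizer.zero_grad()
    loss = l1_loss(model(lr_batch), hr_batch)
    loss = loss + 1e-4 * sum(p.abs().sum() for p in model.parameters())  # L1 penalty
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    return loss.item()

# After each epoch: scheduler.step(validation_loss)
```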

Table 2 The training parameters of each layer (scale factor = 2)

Let x represent the interpolated low-resolution image and y the corresponding high-resolution image. Given a training set, the aim is to learn a model f that produces an estimate \( \hat{y}=f(x) \), where \( \hat{y} \) is the estimated high-resolution image. Most deep-learning-based super-resolution methods use the Rectified Linear Unit (ReLU) as the activation function:

$$ \max \left(0,{x}_i\right)=\left\{\begin{array}{ll}0, & \mathrm{if}\ {x}_i\le 0\\ {x}_i, & \mathrm{if}\ {x}_i>0\end{array}\right. $$

For inputs less than zero, ReLU discards everything, causing many features to be lost and increasing the uncertainty of image recovery. Therefore, the PReLU activation function introduced by He et al. [9] is used in our model. Its expression is:

$$ {y}_{ji}=\left\{\begin{array}{ll}{x}_{ji}, & \mathrm{if}\ {x}_{ji}\ge 0\\ {a}_{ji}{x}_{ji}, & \mathrm{if}\ {x}_{ji}<0\end{array}\right. $$

where

$$ {a}_{ji}\sim U\left(l,u\right),\quad l<u,\ \ l,u\in \left[0,1\right) $$

Comparing the two equations, PReLU introduces a new parameter \( {a}_{ji} \), whose value is often set to 0.01, so that values from the negative part of the distribution are retained. This method keeps the original advantages of ReLU while capturing as many characteristics as possible; accordingly, we employ PReLU as the activation function of the model. Each CNN layer is initialized following He et al. [9], and the biases and PReLU slopes are initialized to zero. Training took nearly 3 h on a GTX 1080Ti.
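A sketch of the initialization described above; following the text, we zero the biases and PReLU slopes (PyTorch's own PReLU default would be 0.25):

```python
import torch.nn as nn

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # He et al. init
        if m.bias is not None:
            nn.init.zeros_(m.bias)          # biases start at zero
    elif isinstance(m, nn.PReLU):
        nn.init.zeros_(m.weight)            # the learnable slope a_{ji}

layer = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.PReLU(64))
layer.apply(init_weights)
```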

4.3 Comparisons with state-of-the-art methods

Peak signal-to-noise ratio (PSNR) is used to compare the accuracy of the proposed method with other SR algorithms. Figure 6 shows the test results for different scale factors and the relationship between the number of iterations (k) and PSNR (dB). The numerous experiments run while searching for the best model, varying the number of convolution layers, the number of filters, and other settings, are summarized in Table 1. Table 3 reports the PSNR for ×2, ×3, and ×4, respectively; italic indicates the best performance and bold the second best. The proposed algorithm clearly performs best on these datasets. Dense connections increase the amount of feature-map learning, which makes the runtime slightly longer than FSRCNN: the image quality is improved, but the test time also increases. Considering both accuracy and time, our model has the best overall performance. Since each algorithm is implemented with different platforms and libraries, it would be unfair to compare training and testing times directly; overall performance is needed to judge the quality of a model. We keep all other settings the same as the benchmark model.
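For reference, PSNR is derived from the mean squared error on the Y channel; this helper is a standard formulation (cropping a scale-factor border before measuring, as some benchmarks do, is left to the caller):

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between two equally sized Y-channel images."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak * peak / mse)
```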

Fig. 6

Steps (k) vs Performance (PSNR (dB))

Table 3 PSNR for scale factors ×2, ×3 and ×4 on the Set5, Set14, and bsd100 datasets. Italic indicates the best performance and bold the second best

The restoration of buildings is very representative work in super-resolution. In Figs. 7, 8, and 9, DCDRN restores more pronounced edge profiles of the buildings. Because of the dense connections, features are learned repeatedly, guaranteeing the quality of the high-resolution image. The feature maps are concatenated twice to ensure that residual learning is fully performed, so DCDRN obtains more contextual information that is conducive to restoring the images. The performance obtained by connecting the features of all layers indicates that feature maps at different levels carry complementary information for SR while having strong similarities. Finally, our method produces the clearest outlines of the bird and the green peppers, as shown in Figs. 10 and 11. In these images, the edges produced by other methods are blurred, showing that our method recovers better high-resolution images.

Fig. 7

Super-resolution results of “img_045” (bsd100) with scale factor ×2. For the outline of the building, our method is more accurate, and the structure between the panes of glass can be seen more clearly

Fig. 8

Super-resolution results of “img_095” (bsd100) with scale factor ×3. Our method restores better details and better reconstructs the outline of the windows; the structure between the wall and the windows is clearly visible

Fig. 9

Super-resolution results of “img_021” (bsd100) with scale factor ×2. Our method recovers sharp lines

Fig. 10

Super-resolution results of “img_002” (Set5) with scale factor ×3. The veins in the image are sharpest in our results

Fig. 11

Super-resolution results of “img_0012” (Set14) with scale factor ×4. For the edge of the green pepper, our method is best

As expected, our network achieves better results than other networks after employing dense connections and dimension reduction, which shows that combining shallow and deep features significantly improves SR reconstruction performance.

5 Conclusion and future works

The accurate super-resolution method proposed in this paper is based on a convolutional neural network and makes use of dense connections and a 1 × 1 CNN structure. The feature extraction network is structurally optimized, and both local and global features are merged through dense connections and transmitted to the reconstruction part of the network.

In the reconstruction network, the 1 × 1 CNN architecture enables the model to obtain better reconstruction capability with less computation than alternatives. In addition, no image preprocessing is needed, and the original-size image is handled directly. Under these conditions, the proposed model achieves better performance with less calculation. After numerous experiments and failures, the final model was put forward. Another noteworthy aspect of this study is the use of an integrated learning network: we combine two hierarchical network structures. Deep learning itself has a good ability to solve complex problems, and classic ensemble learning achieves better results with less computation even when the problem is non-negligible. Besides, the integrated model makes parallelization easier and faster. Gradient clipping is used to ensure training stability. Our method greatly outperforms existing methods on the benchmark images. We believe the proposed DCDRN can easily be applied to other image recovery problems such as denoising and defogging.