1 Introduction

Image Super-Resolution (SR) is the process of reconstructing a high-resolution (HR) image from one or more low-resolution (LR) images. SR is an inherently challenging, ill-posed problem, since multiple HR images correspond to a single LR image. SR has applications in various image processing and computer vision tasks, ranging from surveillance, medical imaging, object detection and satellite imaging to different image restoration and recognition tasks.

SR algorithms can be broadly classified along two axes: the number of input LR images, and the principle used to construct the HR image. Based on the construction principle, SR algorithms fall into three categories [24]: interpolation-based, model-based, and deep learning-based algorithms. Interpolation-based algorithms such as bi-linear or bi-cubic interpolation use local information in an LR image to compute pixel values in the corresponding SR image and are characterized by high computational efficiency. Model-based algorithms such as Maximum a Posteriori (MAP) estimation use prior knowledge to constrain the solution space, improving performance over the interpolation-based approach.

Supervised machine learning approaches learn, from a large number of examples, the mapping function from LR images to their corresponding HR images. The mapping function learned during training is the inverse of the down-sampling function, known or unknown, that transforms an HR image into its LR counterpart. With a known downgrade function such as bi-cubic or bi-linear down-sampling, LR-HR pairs can be generated automatically. This allows the creation of large training data-sets from the vast amount of freely available HR images, which can be used for self-supervised learning (Figs. 1 and 2).
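
To make this self-supervised data generation concrete, a minimal sketch follows; the use of Pillow, the 4\(\times \) scale factor and the function name are illustrative assumptions rather than details of our pipeline.

```python
# A minimal sketch (assumed setup): build an LR-HR training pair from a
# freely available HR image using a known bi-cubic downgrade function.
from PIL import Image

def make_lr_hr_pair(hr_path, scale=4):
    hr = Image.open(hr_path).convert("RGB")
    # Crop so that the HR size is divisible by the scale factor.
    w, h = hr.size
    hr = hr.crop((0, 0, w - w % scale, h - h % scale))
    # Bi-cubic downsampling plays the role of the known downgrade function.
    lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    return lr, hr
```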

Fig. 1. Comparison of reconstructed super-resolution images with the original high-resolution and low-resolution images.

This work proposes a GAN-based SISR model that produces HR images with realistic texture details. Our contributions can be summarized as follows.

  1. The generator architecture consists of multiple ECA blocks that emphasize informative channels; alongside them, a 3-layer CNN is added to the generator to extract sufficient low-level features. The ECA block avoids the dimensionality reduction step, which destroys the direct correspondence between a channel and its weight; instead, a 1D convolution captures cross-channel interaction. The generator achieves state-of-the-art PSNR when trained alone, without discriminators.

  2. The proposed SISR framework utilizes a dual discriminator network inspired by the SRFeat [17] architecture: one discriminator works in the image domain with a Mean Square Error (MSE) loss, the other in the feature domain with a perceptual feature loss.

  3. During the GAN training phase, the generator and the feature discriminator are trained with a perceptual loss [12], which uses a pre-trained VGG19 network to compute the difference between feature maps extracted from the original HR image and the generated HR image.

  4. The performance of the system is measured with objective evaluation indicators, PSNR/SSIM, on several public benchmark data-sets: Set5, Set14, and BSD100 [16]. To identify the perceptually better image, a Mean Opinion Score (MOS) is collected from 10 people.

A comparative study shows that the proposed architecture, despite its low model complexity, significantly improves performance in terms of PSNR/SSIM and MOS over the state-of-the-art methods [12, 13, 17, 22, 25] on benchmark data-sets when trained with the same parameters.

Fig. 2. Examples of super-resolution images from different models.

2 Related Work

Before deep learning became popular in computer vision from 2012 onward, the problem of super-resolution was approached using traditional computer vision techniques. Chang et al. [3] used bi-cubic interpolation for super-resolution. Bi-cubic interpolation and Lanczos re-sampling [15] are computationally very efficient techniques, but they suffer from the drawback that they cannot produce accurate super-resolution images. The Markov Random Field (MRF) approach [23] was first adopted by Freeman et al. to exploit available real images. Similarly, researchers applied sparse coding techniques to SISR problems [8]. Random forests [14] also achieved considerable progress in SISR reconstruction. Many works combine reconstruction-based and learning-based approaches to reduce the artifacts produced by external training examples.

In recent years, deep learning-based image super-resolution models have demonstrated noteworthy improvements over reconstruction-based and other learning-based methods. Dong et al. [5] first applied a CNN to SISR; since then there have been various enhancements to SISR methods using deep learning-based approaches [6, 10, 12, 13, 17, 21, 22, 25]. SRCNN had its downside: it used a shallow three-layer architecture, so high-level features could not be captured. Kim et al. proposed VDSR [10], which improved on SRCNN by increasing the number of convolution layers to 20. Influenced by the ResNet architecture, Lim et al. [13] presented the EDSR model, which has long and short skip connections that help train deep SISR networks. By removing the Batch Normalization (BN) layers from the residual network, they also gained computational efficiency. Even though these models produced high Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) scores, they could not produce SR images with realistic feature details.

Ledig et al. [12] were probably the first to use a GAN for SR. A GAN [7] generally consists of two networks: one generates a fake/new image while the other tries to determine whether it is fake. Ledig et al. [12] used SRResNet as the generator of their GAN. SFTGAN [21] demonstrated that it is possible to recover realistic textures by modulating the features of a few intermediate layers in a single network conditioned on semantic segmentation probability maps. SRFeat [17] is another GAN-based SISR method whose authors first proposed the dual discriminator, one operating in the image domain and the other in the feature domain, to produce perceptually realistic HR images. Xintao et al. introduced ESRGAN [22], an improved version of SRGAN that uses a Residual-in-Residual (RIR) generator architecture. They also added a relativistic discriminator that estimates the probability that the given real data is more realistic than fake data, and they enhanced the SRGAN model by using features before activation to calculate the perceptual loss of the generator during the adversarial learning phase.

Attention Mechanism. Recent studies show that the use of attention mechanisms enhances the performance of CNNs on various tasks. Attention mechanisms were first introduced in CNNs for image classification [9, 19, 20]. Hu et al. [9] exploited channel-wise inter-dependencies among feature channels, and [4, 25] used the channel attention mechanism for SISR. The dimensionality reduction step used in the channel attention mechanism makes the correspondence between a channel and its weight indirect. This limitation was overcome by [18] with an attention mechanism that introduces local cross-channel interaction using 1D convolution.

Fig. 3. ECASR architecture.

3 Network Architecture

The network architecture, shown in Fig. 3, follows a GAN-based approach with a generator that produces super-resolution images and a dual discriminator architecture that classifies the produced SR images as real or fake, which helps the generator produce realistic-looking images. The efficient channel attention mechanism used by the generator allows the network to focus on certain channels by modeling their inter-dependencies. The feature maps of the CNN are then passed through two sub-pixel convolution layers that upsample them to generate the HR image.

3.1 Generator

After passing through the first convolutional layer of the generator, the LR image enters the high-level and low-level feature extractors:

$$\begin{aligned} F_{0}=H_{SF}(I_{LR}), \end{aligned}$$
(1)

where \(F_{0}\), \(H_{SF}\) and \(I_{LR}\) denote the initial feature map, the first convolutional layer and the input low-resolution image, respectively.

$$\begin{aligned} F_{HF}&= H_{ECA}(F_{0}) \end{aligned}$$
(2)
$$\begin{aligned} F_{LF}&= H_{3CN}(F_{0}) \end{aligned}$$
(3)

\(H_{ECA}\) denotes a deep feature extractor consisting of multiple ECA blocks that extract high-level features (\(F_{HF}\)), and \(H_{3CN}\) denotes the 3-layer shallow convolutional feature extractor that controls the flow of low-level features (\(F_{LF}\)) to the final output.

$$\begin{aligned} F_{Total}=F_{HF}+F_{LF} \end{aligned}$$
(4)

\(F_{Total}\) is the total feature map obtained by adding the two feature maps element-wise; it is then passed through the upsampling layer.

$$\begin{aligned} I_{SR}&=H_{REC}(H_{UP}(F_{Total})) \end{aligned}$$
(5)

\(H_{UP}\) and \(H_{REC}\) denote the upsampling and reconstruction layers, respectively, which generate the super-resolution image \(I_{SR}\).
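
The data flow of Eqs. (1)-(5) can be summarized in the following PyTorch sketch. The block count, channel width, kernel sizes and Leaky ReLU activations are illustrative assumptions, and `eca_block` stands in for the ECA block detailed below.

```python
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the generator under assumed hyper-parameters."""
    def __init__(self, eca_block, n_blocks=16, ch=64):
        super().__init__()
        self.h_sf = nn.Conv2d(3, ch, 3, padding=1)           # H_SF, Eq. (1)
        self.h_eca = nn.Sequential(                          # H_ECA, Eq. (2)
            *[eca_block(ch) for _ in range(n_blocks)])
        self.h_3cn = nn.Sequential(                          # H_3CN, Eq. (3)
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.h_up = nn.Sequential(                           # two sub-pixel stages (4x)
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2),
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2))
        self.h_rec = nn.Conv2d(ch, 3, 3, padding=1)          # H_REC

    def forward(self, i_lr):
        f0 = self.h_sf(i_lr)                                 # Eq. (1)
        f_total = self.h_eca(f0) + self.h_3cn(f0)            # Eqs. (2)-(4)
        return self.h_rec(self.h_up(f_total))                # Eq. (5)
```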

Channel Attention. The resultant weight of the channel attention mechanism introduced in SENet [9], also utilized by [4, 25], can be expressed as

$$\begin{aligned} \omega =\sigma (\mathcal {W}_{2}\delta (\mathcal {W}_{1}\mathcal {G}(\chi ))), \end{aligned}$$
(6)

where \(\mathcal {G}(\chi )\) is the channel-wise global average pooling (GAP) of the output of a convolution block \(\chi \in \mathbb {R}^{W\times H\times C}\), with W, H and C the width, height and channel dimension. GAP can be expressed as

$$\begin{aligned} \mathcal {G}(\chi )=\frac{1}{WH}\sum _{i=1,j=1}^{W,H}\chi _{ij} \end{aligned}$$
(7)

where \(\delta (\cdot {})\) and \(\sigma (\cdot {})\) indicate the Rectified Linear Unit [1] and Sigmoid activation functions, respectively. To avoid high computational complexity, the channel dimension is first reduced to \(\frac{C}{r}\) and then transformed back to C. This step of reducing the channels to a low-dimensional space and then mapping them back to the original dimension makes the correspondence between a channel and its weight indirect, which degrades performance by losing valuable information. In this paper we use Leaky ReLU instead of ReLU activation: it fixes the “dying ReLU” problem, since it has no zero-slope parts, and it also provides an additional benefit in training speed.
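
For contrast with ECA below, a minimal sketch of this SENet-style attention (Eqs. (6)-(7)) follows; the reduction ratio r = 16 is an assumed value commonly used with [9].

```python
import torch.nn as nn

class SEAttention(nn.Module):
    """Sketch of squeeze-and-excitation attention with the C/r bottleneck."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)       # G(x), Eq. (7)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # reduce to C/r
            nn.ReLU(),                           # delta in Eq. (6)
            nn.Linear(channels // r, channels),  # map back to C
            nn.Sigmoid())                        # sigma in Eq. (6)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.gap(x).view(b, c)).view(b, c, 1, 1)  # Eq. (6)
        return x * w
```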

Efficient Channel Attention (ECA) [18]. Instead of using dimensionality reduction to reduce model complexity, which destroys the direct correspondence between a channel and its weight, this work uses a channel attention mechanism that efficiently employs a 1D convolution of kernel size k. Given the output of a convolution block, \(\chi \), the resultant weight of efficient channel attention can be expressed as

$$\begin{aligned} \omega =\sigma (\mathcal {W}\mathcal {G}(\chi )), \end{aligned}$$
(8)

where \(\mathcal {W}\) is a C\(\times \)C parameter matrix. Let \(y=\mathcal {G}(\chi )\) with \(y\in \mathbb {R}^{C}\); the weight \(\omega _{i}\) of the \(i^{th}\) channel is calculated by considering the interaction among \(y_{i}\) and its k neighbors, given as

$$\begin{aligned} \omega _{i}=\sigma \bigg (\sum _{j=1}^{k}\omega ^{j}y_{i}^{j}\bigg ), y_{i}^{j}\in \varOmega _{i}^{k} \end{aligned}$$
(9)

where \(\varOmega _{i}^{k}\) indicates the set of k adjacent channels of \(y_{i}\), \(\omega _{i}\) is the weight of \(y_{i}\), and \(\omega ^{j}\) are the convolution kernel weights. Equation (9) can be efficiently realized with the 1D convolution shown in Eq. (10):

$$\begin{aligned} \omega =\sigma (Conv1D_{k}(y)), \end{aligned}$$
(10)

The kernel size can be determined with Eq. (11), as mentioned in [18]:

$$\begin{aligned} k=\bigg |\frac{log_{2}(C)}{\gamma }+\frac{b}{\gamma }\bigg |_{odd}, \end{aligned}$$
(11)

where \(\gamma \) and b are constants; we use 2 and 1, respectively, in our training process. \(|x|_{odd}\) denotes the nearest odd integer to x. Let \(F_{ECA-1}\) be the input to our ECA block; the weighted output of the channel attention mechanism can finally be expressed as

$$\begin{aligned} F_{ECA}=\omega \otimes (F_{ECA-1}) \end{aligned}$$
(12)
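
The ECA attention of Eqs. (8)-(12), including the adaptive kernel size of Eq. (11) with \(\gamma =2\) and \(b=1\), can be sketched as follows; the convolutions and residual wiring surrounding this attention inside the full ECA block are omitted.

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels, gamma=2, b=1):
    k = int(abs((math.log2(channels) + b) / gamma))  # Eq. (11)
    return k if k % 2 else k + 1                     # nearest odd integer

class ECAAttention(nn.Module):
    """Sketch of efficient channel attention without dimensionality reduction."""
    def __init__(self, channels):
        super().__init__()
        k = eca_kernel_size(channels)
        self.gap = nn.AdaptiveAvgPool2d(1)
        # 1D convolution over the channel axis captures local cross-channel
        # interaction while keeping a direct channel-weight correspondence.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = self.gap(x)                                # y = G(chi), (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))   # Eq. (10)
        w = torch.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * w                                   # Eq. (12)
```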

3.2 Discriminator

Fig. 4. Visual comparison of 4\(\times \) SR on BSD100.

The generator is coupled with a dual discriminator network similar to the architecture proposed in SRFeat [17]: one discriminator works in the image domain and the other in the feature domain. The image discriminator is similar to the discriminator network used in SRGAN [12]. The feature discriminator has the same architecture; the only difference is that its inputs are the feature maps of the HR image and of the super-resolved image, extracted from Conv5 of a VGG-19 network. Both discriminators try to classify an image as real or fake.
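
A sketch of how the inputs to the feature discriminator can be obtained is given below; slicing torchvision's VGG19 at index 36, the end of its conv5 stack, is our assumption for "Conv5".

```python
import torch.nn as nn
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Sketch of a frozen VGG19 feature extractor for the feature discriminator."""
    def __init__(self, cut=36):
        super().__init__()
        self.body = vgg19(pretrained=True).features[:cut].eval()
        for p in self.body.parameters():
            p.requires_grad = False   # the VGG network stays frozen

    def forward(self, img):
        # Returns phi^m(img): the Conv5 feature maps fed to d^f.
        return self.body(img)
```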

3.3 Loss Functions

A network that minimizes the MSE loss tends to produce excessively smooth images. The generator is therefore pre-trained with a Mean Absolute Error (MAE) loss, which yields images that look perceptually better to human eyes:

$$\begin{aligned} L_{MAE}=\frac{1}{WH}\sum _{i}^{W}\sum _{j}^{H}\vert I_{i,j}^{h}-I_{i,j}^{g}\vert \end{aligned}$$
(13)

Here W and H indicate the dimensions of the image, and \(I^{h}\) and \(I^{g}\) denote the ground-truth HR image and the generated image, respectively. The objective of utilizing the GAN framework here is to improve perceptual quality. The GAN framework can be explained as a min-max game in which the generator tries to fool the discriminator while the discriminator tries to distinguish real samples from generated ones, characterized as:

$$\begin{aligned} \min _{g} \max _{d}(E_{y\sim P_{data}(y)}[log(d(y))]+E_{x\sim P_{x}(x)}\nonumber \\ [log(1-d(g(x)))]) \end{aligned}$$
(14)

Here g(x) is the output of the generator network g for input x, where x is random noise, and d(y) is the output of the discriminator for y, a sample from the real data distribution. \(P_{data}(y)\) is the distribution of the real data and \(P_{x}(x)\) is the distribution of the generator input.

The discriminators \(d^{i}\) and \(d^{f}\) denote a pair of dual discriminators operating on the image and feature domains, respectively. The generative adversarial procedure with a pre-trained generator and the discriminators uses the loss function defined as:

$$\begin{aligned} L_{g}=L_{p}+\lambda (L_{a}^{i}+L_{a}^{f}) \end{aligned}$$
(15)

where \(L_{p}\) is the perceptual similarity loss, \(L_{a}^{i}\) is the image GAN loss for the generator, \(L_{a}^{f}\) is the feature GAN loss for the generator, and \(\lambda \) is the weight of the GAN loss terms. To train the discriminators \(d^{i}\) and \(d^{f}\), we minimize the losses \(L_{d}^{i}\) and \(L_{d}^{f}\), which correspond to \(L_{a}^{i}\) and \(L_{a}^{f}\). The generator and discriminators are thus trained by minimizing \(L_{g}\), \(L_{d}^{i}\) and \(L_{d}^{f}\). For adversarial learning, the discriminators use the loss functions introduced in [21]. In total, three loss terms contribute to the overall loss: the perceptual similarity loss, the image GAN loss, and the GAN loss calculated on features obtained by passing the super-resolution and high-resolution images through a VGG19 network. Each loss is described below.

Perceptual Similarity Loss \(L_{p}\). This loss computes the difference between two images in the feature domain rather than in pixel space, leading to perceptually more satisfying results. The perceptual loss is defined as:

$$\begin{aligned} L_{p}=\frac{1}{W_{m}H_{m}C_{m}}\sum _{i}^{W_{m}}\sum _{j}^{H_{m}}\sum _{k}^{C_{m}}(\phi _{i,j,k}^{m}(I^{h})-\phi _{i,j,k}^{m}(I^{g}))^{2} \end{aligned}$$
(16)

where \(W_{m}\), \(H_{m}\), \(C_{m}\) describe the dimensions of the m-th feature map extracted from a pre-trained VGG network, with \(\phi _{i,j,k}^{m}\) indicating the feature map obtained by the j-th convolution (after activation) before the i-th maxpooling layer within the VGG network.
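
A minimal sketch of Eq. (16), reusing the hypothetical `VGGFeatures` extractor from Sect. 3.2 to supply \(\phi ^{m}\):

```python
import torch.nn.functional as F

def perceptual_loss(vgg_features, i_h, i_g):
    phi_h = vgg_features(i_h)   # phi^m(I^h)
    phi_g = vgg_features(i_g)   # phi^m(I^g)
    # mse_loss averages over all elements, i.e. it divides by
    # W_m * H_m * C_m (and the batch), matching Eq. (16).
    return F.mse_loss(phi_g, phi_h)
```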

Image GAN Losses \(L_{a}^{i}\) and \(L_{d}^{i}\). The image GAN loss term \(L_{a}^{i}\) for the generator and the loss function \(L_{d}^{i}\) for the image discriminator are characterized as:

$$\begin{aligned} L_{a}^{i}&=-log(d^{i}(I^{g})), \end{aligned}$$
(17)
$$\begin{aligned} L_{d}^{i}&=-log(d^{i}(I^{h}))-log(1-(d^{i}(I^{g}))), \end{aligned}$$
(18)

where \(d^{i}(I)\) is the output of the image discriminator \(d^{i}\).

Feature GAN Losses \(L_{a}^{f}\) and \(L_{d}^{f}\). The feature GAN loss term \(L_{a}^{f}\) for the generator and the loss function \(L_{d}^{f}\) for the feature discriminator are defined as:

$$\begin{aligned} L_{a}^{f}&=-log(d^{f}(\phi ^{m}(I^{g}))),\end{aligned}$$
(19)
$$\begin{aligned} L_{d}^{f}&=-log(d^{f}(\phi ^{m}(I^{h})))-log(1-d^{f}(\phi ^{m}(I^{g}))), \end{aligned}$$
(20)

where \(d^{f}(\phi ^{m})\) is the output of the feature discriminator \(d^{f}\).
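
Equations (17)-(20) share one pattern, sketched below under the assumption that both discriminators end in a sigmoid; the small `eps` guarding log(0) is our numerical-stability addition. For the image losses the inputs are images, for the feature losses the VGG19 feature maps \(\phi ^{m}(I)\); when minimizing the discriminator term, the generated inputs should be detached from the generator graph.

```python
import torch

def adversarial_losses(d_real, d_fake, eps=1e-8):
    # Generator term, Eqs. (17)/(19): -log(d(fake))
    l_a = -torch.log(d_fake + eps).mean()
    # Discriminator term, Eqs. (18)/(20): -log(d(real)) - log(1 - d(fake))
    l_d = (-torch.log(d_real + eps)
           - torch.log(1 - d_fake + eps)).mean()
    return l_a, l_d
```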

4 Experiments

Experiments are done in two phases: in the first phase the generator alone is trained with the MAE loss; in the second phase the generator is trained with the perceptual loss along with the dual discriminators. The performance of the pre-trained generator is evaluated and compared with other state-of-the-art approaches in terms of PSNR and SSIM. Finally, the results obtained from the GAN-based SISR network are evaluated and compared on PSNR/SSIM and MOS, which demonstrates the efficiency of our network.

Table 1. Comparison of PSNR/SSIM scores with and without the low-level feature extractor after 64000 iterations.

The training data-sets are obtained by bi-cubic down-sampling of the HR images. The cropped HR images are of size 296 \(\times \) 296 and the LR images of size 74 \(\times \) 74; both are normalized to [–1,1] intensity. The DIV2K [2] data-set, consisting of 800 HR training images and 100 HR validation images, is used for training. It is augmented to 160000 images by random cropping, rotation (\(90^{\circ }, 180^{\circ }\), and \(270^{\circ }\)) and horizontal flipping. The publicly available data-sets Set5, Set14 and BSD100 are used for validation. We train the generator network on an Nvidia Titan X GPU for \(3.2\times 10^6\) iterations with a batch size of 16, optimized by the Adam [11] optimizer. The learning rate is initialized to \(1\times 10^{-4}\) with a decay of 0.5 every \(1\times 10^{5}\) iterations.
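
A sketch of this optimization setup, assuming a `generator` as in Sect. 3.1 and a DIV2K `loader` yielding (LR, HR) crops with batch size 16 already exist; stepping the scheduler once per iteration is our reading of the decay schedule.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=100_000, gamma=0.5)  # halve the LR every 1e5 iterations

for i_lr_batch, i_hr_batch in loader:
    optimizer.zero_grad()
    loss = F.l1_loss(generator(i_lr_batch), i_hr_batch)  # MAE pre-training, Eq. (13)
    loss.backward()
    optimizer.step()
    scheduler.step()
```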

Table 2. Comparison of the PSNR-oriented pre-trained models SRGAN, EDSR, ESRGAN, SRFeat, RCAN [12, 13, 17, 22, 25] and our proposed model ECASR on benchmark data [4\(\times \) upscaling].
Table 3. Comparison of the GAN-trained models SRGAN [12], SRFeat [17], ESRGAN [22] and the proposed model ECASR on benchmark data-sets.

5 Results

To demonstrate the effectiveness of the shallow low-level feature extractor, we perform an ablation study: the network is trained and tested with and without the low-level feature extractor. The results, tabulated in Table 1, show that the low-level feature extractor helps achieve a higher PSNR/SSIM score.

Table 4. MOS ratings of SRGAN, ESRGAN, SRFeat [12, 17, 22] and our proposed model ECASR on BSD100 [16].

The performance of the generator network is evaluated and compared with bi-cubic interpolation and other state-of-the-art models, demonstrating the efficiency of our proposed network. Quantitative results are summarized in Table 2, which shows that our model produces comparable results. EDSR and RCAN have higher values on Set14 and BSD100 because of their deeper architectures; stacking more ECA blocks in our network should likewise yield higher benchmark scores. Although the generator produces images with high PSNR/SSIM values, they lack perceptual quality. We therefore compare the performance of our GAN-trained network in Table 3, and Fig. 4 provides visual examples showing considerably better perceptual quality. The dip in the PSNR/SSIM scores of our GAN-trained network is presumably due to the competition between the MSE-based content loss and the adversarial loss. We further obtained MOS ratings from a group of 10 people, who rated the images from 1 to 5 based on reconstruction quality, the higher the better. Table 4 shows that our proposed model has better MOS ratings than SRGAN, SRFeat, and ESRGAN.

6 Conclusion

In this paper, we have proposed a new GAN-based SR scheme in which the ECA method is used, probably for the first time, to solve the SR problem. ECA tends to produce higher PSNR with a smaller number of parameters. By considering low-level features together with high-level features for super-resolution, both the fine texture details and the overall description of the image are captured accurately. The dual discriminator helps the generator create photo-realistic SR images. Comparisons with various models using the same training data-set and parameters show the superiority of our proposed method. There is always a trade-off between model complexity and quality: stacking more ECA layers into our model is expected to result in better SR images, but it will also increase training complexity.