1 Introduction

Infrared imaging systems, which are widely used in military, medical, and public security applications, can record environmental information under challenging conditions such as darkness, rain, and fog. Compared with the megapixel resolution of visible-light sensors, the resolution of infrared imaging systems is usually far lower than that required for practical applications. However, increasing the resolution of infrared imaging systems through hardware, such as reducing pixel size or enlarging the detector array, significantly increases production cost. More importantly, in some scenarios, such as military applications where volume and weight are typically limiting factors, increasing the resolution through hardware is not practical. Therefore, improving the resolution of infrared images through software is the most promising technical approach.

Currently, visible image super-resolution (SR) methods have progressed markedly owing to the rapid development of deep learning [1]. Single-image SR methods based on deep learning can be grouped into four categories according to input image characteristics, network structure, feature extraction, and information processing. (1) The first category is interpolation-based: the input image is first scaled to the output size by interpolation, and a deep network then refines the details. SRCNN [2] was the first to use a deep neural network for SR reconstruction; it employed only a three-layer network, yet its results were far better than those of traditional methods. VDSR [3] adopted residual learning to build a 20-layer model with an enlarged receptive field that can handle multiple scale factors. Based on this idea, several strong models emerged, such as IRCNN [4], MemNet [5], DRCN [6], DRRN [7], and SDSR [8]. (2) The second category operates directly on the low-resolution (LR) image, avoiding the detail loss caused by interpolation and drastically reducing computation time. FSRCNN [9] built a fast SR network with a deconvolution layer, small convolution kernels, and shared deep layers. RED [10] employed symmetric convolution–deconvolution layers. ESPCN [11] extracted features in LR space and enlarged the image to the target size with a sub-pixel convolutional layer. SRResNet [12] and EDSR [13] extracted features in LR space with residual learning and enlarged the LR features with a sub-pixel convolutional layer. Li et al. [14] proposed unsupervised face super-resolution via gradient enhancement and semantic guidance. (3) The third category adopts dense networks, which counteract the layer-by-layer sparsification of effective features. SRDenseNet [15] used a dense network to obtain SR images with better visual quality. The work in [16] proposed a joint restoration convolutional neural network for low-quality image super-resolution. The work in [17] proposed a single-image SR method based on local biquadratic splines with edge constraints and adaptive optimization in the transform domain. Zhang et al. [18] extracted local features with a residual dense network. The method of [19] augmented the dense network with high-frequency information (SRDN), paying more attention to high-frequency regions such as edges and textures. (4) The fourth category employs generative adversarial networks (GANs). SRGAN [20] raised SR quality to a new level with a GAN model, and ESRGAN [21] increased training speed and further improved SR quality.

However, several problems arise when the above methods are used directly for infrared image SR, because these visible image SR methods do not account for the unique characteristics of infrared images: infrared images often have low resolution, weak contrast, and few details [22, 23]; the fine geometric structures in infrared images are easily destroyed during super-resolution, resulting in distortion [24, 25]; and water vapor absorption and atmospheric scattering cause blur, so infrared images show haze-like characteristics.

Considering the above problems, this paper presents a new method for infrared image SR. The main contributions of this paper are as follows:

  1. (1)

    To obtain a high-quality SR image from a single-frame infrared image, we propose a dual-branch deep neural network. The image SR branch reconstructs the SR image from the initial infrared image using a basic structure similar to ESRGAN. The gradient SR branch removes haze, extracts the gradient map, and reconstructs the high-resolution gradient map. To reduce complexity and computation, the gradient SR branch directly reuses the intermediate-level features extracted in the image SR branch.

  2. (2)

    Since infrared images have lower contrast and less detail than visible images, enhancing the detail information in the original image is important for infrared image SR. To enhance the contrast of the initial IR image, a haze removal method based on the dark-channel prior model [26] is applied before the gradient extraction block.

  3. (3)

    To fuse the gradient SR map into the image SR map more naturally, this paper adopts a fusion block based on an attention mechanism.

  4. (4)

    To preserve fine geometric structure, we design a gradient L1 loss and a gradient GAN loss, which supervise generator training as second-order constraints.

2 Methods

2.1 Overall process

The method includes two branches, the image SR branch and the gradient SR branch, as shown in Fig. 1. The image SR branch reconstructs an SR image from the initial infrared image using a basic structure similar to ESRGAN; the gradient SR branch first removes haze, then extracts the gradient map and reconstructs the SR gradient map. To obtain a more natural SR image, a fusion block based on an attention mechanism is adopted. To reduce computation during gradient SR, several intermediate-level features from the image SR branch are reused directly.

Fig. 1
figure 1

Overall framework of our method

The image SR branch uses a network similar to ESRGAN [21], constructed with residual-in-residual dense block (RRDB) modules, to reconstruct an SR image from the initial infrared image. The gradient extraction block extracts the gradient map using the Sobel or Laplacian operator. The following sections introduce the haze removal module, the gradient SR branch, and the fusion block in detail. To better preserve structure, we add a gradient L1 loss and a gradient GAN loss.
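The paper does not give an implementation of the gradient extraction block; a minimal PyTorch sketch using fixed Sobel kernels and returning the gradient magnitude (the module name and the use of the magnitude rather than separate x/y maps are our assumptions) might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientExtraction(nn.Module):
    """Extracts a gradient map from a single-channel image with fixed Sobel kernels."""
    def __init__(self):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]]).view(1, 1, 3, 3)
        sobel_y = torch.tensor([[-1., -2., -1.],
                                [ 0.,  0.,  0.],
                                [ 1.,  2.,  1.]]).view(1, 1, 3, 3)
        # Fixed (non-learnable) filters registered as buffers.
        self.register_buffer("kx", sobel_x)
        self.register_buffer("ky", sobel_y)

    def forward(self, x):              # x: (B, 1, H, W) infrared image
        gx = F.conv2d(x, self.kx, padding=1)
        gy = F.conv2d(x, self.ky, padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)   # gradient magnitude map
```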

2.2 Haze removal of infrared image

Since infrared images have lower contrast and less detail than visible images [27], enhancing the detail information in the original image is important for infrared image SR. However, infrared images are usually blurred and visually exhibit haze-like characteristics because of water vapor absorption and atmospheric scattering [28]. Therefore, haze removal based on the dark-channel prior model is applied before the gradient extraction block to enhance the contrast of the initial IR image.

The dark-channel prior model [26] is a haze removal method for visible images with three RGB channels. Its basic hypothesis is that, in most non-sky patches of a haze-free outdoor image, at least one color channel has very low intensity at some pixels; in other words, the minimum intensity in such a patch should be very low. It is expressed mathematically as follows:

$$ J^{d} (x) = \mathop {\min }\limits_{c \in (r,g,b)} \left( {\mathop {\min }\limits_{y \in \Omega (x)} J^{c} (y)} \right) \to 0 $$
(1)

where \(J^{c}\) is a color channel of the image \(J\), \(\Omega (x)\) is a local patch centered at \(x\), \(c \in \{r,g,b\}\) indexes the color channels, and \(y\) is a pixel in the patch \(\Omega (x)\). The dark-channel prior model says that if \(J\) is a haze-free outdoor image, then, except for the sky region, the intensity of \(J^{d}\) is low and tends to zero. Since an infrared image has only one channel, Eq. (1) can be simplified as follows:

$$ J^{d} (x) = \mathop {\min }\limits_{y \in \Omega (x)} J(y) \to 0 $$
(2)

According to the above dark-channel prior, the transmission can be modeled and its estimate simplified as follows:

$$ t(x) = 1 - \omega \mathop {\min }\limits_{y \in \Omega (x)} \frac{I(y)}{A} $$
(3)

where \(I(y)\) is the original hazy image, which is known, \(A\) is the global atmospheric light value, which is unknown, and \(\omega\) is the rate of haze removal in the interval [0, 1], with a default value of 0.95.

In practice, a simple method can be used to estimate the atmospheric light \(A\) with the following steps:

  1. (1)

    Pick the top 0.1% brightest pixels from the dark channel; these pixels are usually the most haze-opaque.

  2. (2)

    Among these pixels, the pixel with the highest intensity in the input image is selected as the atmospheric light \(A\).

After \(A\) and \(t(x)\) are estimated, the haze-free image can be recovered from the haze removal model using the following formula:

$$ J(x) = \frac{I(x) - A}{{\max \left( {t(x),t_{0} } \right)}} + A $$
(4)

where \(I(x)\) is the input image, \(A\) is the estimated global atmospheric light, and \(t(x)\) is the transmission estimated within the window using Eq. (3). \(t_{0}\) is a lower bound on the transmission, which means that a small amount of haze is preserved in very dense haze regions. A typical value of \(t_{0}\) is 0.1.
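Combining Eqs. (2)–(4), the haze removal step can be sketched in a few lines of NumPy. The sketch below assumes a single-channel image normalized to [0, 1]; the function name and the use of grey erosion to implement the windowed minimum are ours, not the authors':

```python
import numpy as np
from scipy.ndimage import grey_erosion

def dehaze_ir(I, radius=5, omega=0.95, t0=0.1):
    """Dark-channel-prior haze removal for a single-channel IR image in [0, 1]."""
    win = 2 * radius + 1
    # Eq. (2): dark channel = windowed minimum of the single channel.
    dark = grey_erosion(I, size=(win, win))
    # Atmospheric light A: brightest input pixels among the top 0.1% of the dark channel.
    n = max(1, int(0.001 * dark.size))
    idx = np.argsort(dark.ravel())[-n:]
    A = I.ravel()[idx].max()
    # Eq. (3): transmission estimated from the windowed minimum of I / A.
    t = 1.0 - omega * grey_erosion(I / A, size=(win, win))
    # Eq. (4): recover the haze-free image, keeping a little haze in dense regions.
    J = (I - A) / np.maximum(t, t0) + A
    return np.clip(J, 0.0, 1.0)
```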

2.3 Gradient SR branch

The gradient SR branch recovers the SR gradient map from the LR gradient map with the help of several intermediate-level features from the image SR branch. The recovered SR gradient map is then sent into the fusion block to produce the final SR image. The network structure of the gradient SR branch is shown in Fig. 2.

Fig. 2
figure 2

The structure of gradient SR branch

Corresponding to the 23 RRDBs of the image SR branch, the gradient SR branch consists of 22 Grad–Conv blocks, three independent 3 × 3 Conv blocks, and one 4× upsampling block. Each Grad–Conv block integrates the output of the previous Grad–Conv block and the output of the corresponding RRDB to produce the next-level gradient feature. The motivation for this scheme is that the well-designed ESRGAN features carry rich structural information, which is important for recovering the gradient map.

Each Grad–Conv block is located between two RRDBs and extracts high-level features from the gradient map. The Grad block can be either a residual block or a bottleneck block; neither structure has an obvious advantage, and both can be used in practice. The two network structures of the Grad block are shown in Fig. 3, and a minimal sketch of a Grad–Conv block is given after the figure.

Fig. 3
figure 3

Two structures of Grad block
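A minimal PyTorch sketch of the residual variant of the Grad–Conv block is given below; the channel count and the 1 × 1 fusion convolution that merges the previous gradient feature with the corresponding RRDB feature are our assumptions, since the paper only specifies the structure at the level of Fig. 3:

```python
import torch
import torch.nn as nn

class GradConvBlock(nn.Module):
    """Residual-style Grad-Conv block: fuses the previous gradient feature
    with the intermediate feature taken from the corresponding RRDB."""
    def __init__(self, channels=64):
        super().__init__()
        # 1x1 convolution to merge gradient features with the RRDB feature.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, grad_feat, rrdb_feat):
        x = self.fuse(torch.cat([grad_feat, rrdb_feat], dim=1))
        return x + self.body(x)        # residual connection
```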

2.4 Fusion block

To fuse the gradient SR map into the image SR map more naturally, this paper adopts a fusion block based on an attention mechanism, shown in Fig. 4. First, the gradient SR map and the image SR map are fed into the two attention blocks to obtain the corresponding weights; then, each map is enhanced with its weights to obtain a fused map per attention block; finally, the two fused maps are averaged to obtain the final fused image.

Fig. 4
figure 4

Fusion block

The calculation equation is as follows:

$$ I_{{{\text{ir}}}} = w_{{{\text{ir}}}} \times I_{{{\text{ESRGAN}}}}^{{{\text{SR}}}} + w_{{{\text{gr}}}} \times I_{{{\text{Grad}}}}^{{{\text{SR}}}} $$
(5)
$$ I_{{{\text{gradient}}}} = \varphi_{{{\text{ir}}}} \times I_{{{\text{ESRGAN}}}}^{{{\text{SR}}}} + \varphi_{{{\text{gr}}}} \times I_{{{\text{Grad}}}}^{{{\text{SR}}}} $$
(6)
$$ I_{{{\text{final}}}}^{{{\text{SR}}}} = (I_{{{\text{ir}}}}^{{}} + I_{{{\text{gradient}}}} )/2 $$
(7)

where \(w_{{{\text{ir}}}}\) and \(w_{{{\text{gr}}}}\) denote the weights of \(I_{{{\text{ESRGAN}}}}^{{{\text{SR}}}}\) and \(I_{{{\text{Grad}}}}^{{{\text{SR}}}}\), respectively, in the IR image attention block, and similarly, \(\varphi_{{{\text{ir}}}}\) and \(\varphi_{{{\text{gr}}}}\) denote the weights of \(I_{{{\text{ESRGAN}}}}^{{{\text{SR}}}}\) and \(I_{{{\text{Grad}}}}^{{{\text{SR}}}}\), respectively, in the gradient image attention block. \(I_{{{\text{ir}}}}\) and \(I_{{{\text{gradient}}}}\) denote the enhanced maps produced by the IR image attention block and the gradient image attention block, respectively. \(I_{{{\text{final}}}}^{{{\text{SR}}}}\) is the final fused SR image.
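Equations (5)–(7) translate directly into a few tensor operations. In the sketch below, the two attention blocks are assumed to be callables that return the pair of weights (their internal structure follows Fig. 5 and is sketched after it):

```python
def fuse_sr_maps(sr_img, sr_grad, ir_attention, grad_attention):
    """Attention-based fusion of the image SR map and the gradient SR map, Eqs. (5)-(7).
    sr_img and sr_grad are torch tensors of shape (B, 1, H, W)."""
    w_ir, w_gr = ir_attention(sr_img, sr_grad)        # weights from the IR image attention block
    phi_ir, phi_gr = grad_attention(sr_img, sr_grad)  # weights from the gradient attention block
    I_ir = w_ir * sr_img + w_gr * sr_grad             # Eq. (5)
    I_gradient = phi_ir * sr_img + phi_gr * sr_grad   # Eq. (6)
    return (I_ir + I_gradient) / 2                    # Eq. (7)
```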

The detail of the attention block is shown in Fig. 5. It is worth noting that the large convolution kernel is decomposed into a depth-wise convolution (DWC), a depth-wise dilation convolution (DWDC), and a 1 × 1 convolution, which reduces the number of parameters while maintaining a large receptive field and improves efficiency. To calculate the weights, softmax and average pooling are adopted in the attention block.

Fig. 5
figure 5

Attention block
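A possible realization of the attention block is sketched below. The kernel sizes (5 for the DWC, 7 with dilation 3 for the DWDC), the embedding width, and the way average pooling and softmax turn the features into two normalized weights are our assumptions; the paper specifies the block only at the level of Fig. 5:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Large-kernel attention decomposed into DWC + DWDC + 1x1 conv,
    followed by average pooling and softmax to produce two fusion weights."""
    def __init__(self, mid=16):
        super().__init__()
        self.embed = nn.Conv2d(2, mid, kernel_size=1)
        self.dwc = nn.Conv2d(mid, mid, 5, padding=2, groups=mid)            # depth-wise conv
        self.dwdc = nn.Conv2d(mid, mid, 7, padding=9, dilation=3, groups=mid)  # depth-wise dilated conv
        self.pw = nn.Conv2d(mid, 2, kernel_size=1)                          # 1x1 conv -> two weight maps
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, sr_img, sr_grad):
        x = torch.cat([sr_img, sr_grad], dim=1)        # (B, 2, H, W)
        x = self.pw(self.dwdc(self.dwc(self.embed(x))))
        w = torch.softmax(self.pool(x), dim=1)         # two weights per image, summing to 1
        return w[:, 0:1], w[:, 1:2]                    # weights for the image and gradient maps
```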

2.5 Loss function

To preserve geometric structure, we add gradient losses, namely a gradient L1 loss and a gradient GAN loss. The overall loss function therefore includes image branch losses and gradient branch losses. Its composition is as follows (Fig. 6):

  1. (1)

    L1 loss

The \(L_{I}^{{{\text{pix}}}}\) loss represents the absolute error between the SR image and the real high-resolution image.

$$ L_{I}^{{{\text{pix}}}} = E_{I} ||G(I^{{{\text{LR}}}} ) - I^{{{\text{HR}}}} ||_{1} $$
(8)

where \(I^{{{\text{LR}}}}\) is the initial LR image, \(I^{{{\text{HR}}}}\) is the high-resolution (HR) image, \(G( \cdot )\) is the generator whose output is the SR image, \(\left\| \cdot \right\|_{1}\) is the L1 norm, and \(E_{I} ( \cdot )\) denotes the mean over all pixels of the image \(I\).

  2. (2)

    Perceptual loss

The perceptual loss \(L_{I}^{{{\text{per}}}}\) characterizes the error between the i-th layer feature \(\phi_{i}\) of the SR image and the corresponding layer feature of the real high-resolution image:

$$ L_{I}^{{{\text{per}}}} = E_{I} ||\phi_{i} (G(I^{{{\text{LR}}}} )) - \phi_{i} (I^{{{\text{HR}}}} )||_{1} $$
(9)

where \(\phi_{i} ( \cdot )\) denotes the i-th layer output of the image SR model.
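ESRGAN-style methods typically compute the features \(\phi_{i}\) with a fixed, pre-trained network; the sketch below assumes a VGG19 feature extractor (the layer index, the grayscale-to-RGB replication, and the omission of ImageNet normalization are our simplifications):

```python
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """L1 distance between deep features of the SR and HR images.
    phi_i is assumed here to be a fixed pre-trained VGG19 feature layer."""
    def __init__(self, layer=34):
        super().__init__()
        self.features = vgg19(pretrained=True).features[:layer].eval()
        for p in self.features.parameters():
            p.requires_grad = False   # the feature extractor is not trained

    def forward(self, sr, hr):
        # Single-channel IR images are repeated to three channels for VGG;
        # ImageNet normalization is omitted for brevity.
        sr3, hr3 = sr.repeat(1, 3, 1, 1), hr.repeat(1, 3, 1, 1)
        return nn.functional.l1_loss(self.features(sr3), self.features(hr3))
```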

  3. (3)

    Image GAN loss

The image SR branch is trained as a generative adversarial network (GAN). Its discriminator \(D\) and generator \(G\) are optimized in a two-player game as follows:

$$ L_{I}^{{{\text{Dis}}}} = - E_{I} [\log (1 - D(I^{{{\text{SR}}}} ))] - E_{I} [\log (D(I^{{{\text{HR}}}} ))] $$
(10)
$$ L_{I}^{{{\text{Adv}}}} = - E_{I} [\log D(G(I^{{{\text{LR}}}} ))] $$
(11)

where \(I^{{{\text{SR}}}}\) is the SR infrared image.

  4. (4)

    Gradient L1 loss

Since the gradient map reflects structural information, we use it as a second-order constraint to supervise generator training. With supervision in both the image and gradient domains, the generator can not only learn fine appearance but also avoid distortion of fine geometric structure. The gradient L1 loss \(L_{{{\text{GM}}}}^{{{\text{pix}}}}\) characterizes the absolute error between the generated SR gradient map and the HR gradient map.

$$ L_{{{\text{GM}}}}^{{{\text{pix}}}} = E_{{{\text{GM}}}} ||M(G(I^{{{\text{LR}}}} )) - M(I^{{{\text{HR}}}} )||_{1} $$
(12)

where \(M( \cdot )\) denotes the gradient extraction operator.

  5. (5)

    Gradient GAN loss

To discriminate whether a gradient map comes from the HR image, the gradient discriminator loss is defined as follows:

$$ L_{{{\text{GM}}}}^{{{\text{Dis}}}} = - E_{{{\text{GM}}}} [\log (1 - D(M(I^{{{\text{SR}}}} )))] - E_{{{\text{GM}}}} [\log (D(M(I^{{{\text{HR}}}} )))] $$
(13)

To supervise the generation of SR results by adversarial learning, the adversarial loss of the gradient branch is defined as follows:

$$ L_{{{\text{GM}}}}^{{{\text{Adv}}}} = - E_{{{\text{GM}}}} [\log D(M(G(I^{{{\text{LR}}}} )))] $$
(14)

  6. (6)

    Overall loss

Combining the above various types of losses, the overall loss function is obtained as follows:

$$ \begin{aligned} L & = aL_{I}^{{{\text{pix}}}} + bL_{I}^{{{\text{per}}}} + cL_{I}^{{{\text{Dis}}}} + dL_{I}^{{{\text{Adv}}}} \\ & \quad + eL_{{{\text{GM}}}}^{{{\text{pix}}}} + fL_{{{\text{GM}}}}^{{{\text{Dis}}}} + gL_{{{\text{GM}}}}^{{{\text{Adv}}}} \\ \end{aligned} $$
(15)

where \(a,b,c,d,e,f,g\) are the weighting parameters which meet the condition:

$$ a + b + c + d + e + f + g = 1 $$
(16)

Fig. 6
figure 6

Composition structure of the loss function
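The loss terms of Eqs. (8)–(16) can be assembled as in the sketch below. This is an illustration of the composition only, not the authors' training code: the discriminators are assumed to output probabilities, the gradient operator \(M(\cdot)\) is the extraction block of Sect. 2.1, and the weighting values are placeholders that sum to 1.

```python
import torch
import torch.nn.functional as F

def overall_loss(G, D_img, D_grad, M, perceptual, lr, hr,
                 w=(0.2, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1)):
    """Overall loss of Eq. (15); the weights w satisfy Eq. (16) (they sum to 1)."""
    a, b, c, d, e, f, g = w
    sr = G(lr)
    # Image-branch terms, Eqs. (8), (9), (11).
    l_pix = F.l1_loss(sr, hr)
    l_per = perceptual(sr, hr)
    l_adv = -torch.log(D_img(sr) + 1e-8).mean()
    # Gradient-branch terms, Eqs. (12), (14).
    l_gm_pix = F.l1_loss(M(sr), M(hr))
    l_gm_adv = -torch.log(D_grad(M(sr)) + 1e-8).mean()
    # Discriminator terms, Eqs. (10) and (13); in practice these update
    # the discriminators, and are shown here only to mirror Eq. (15).
    l_dis = -(torch.log(1 - D_img(sr.detach()) + 1e-8).mean()
              + torch.log(D_img(hr) + 1e-8).mean())
    l_gm_dis = -(torch.log(1 - D_grad(M(sr).detach()) + 1e-8).mean()
                 + torch.log(D_grad(M(hr)) + 1e-8).mean())
    return (a * l_pix + b * l_per + c * l_dis + d * l_adv
            + e * l_gm_pix + f * l_gm_dis + g * l_gm_adv)
```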

3 Dataset and experiment analysis

3.1 Dataset and experimental settings

In our experiments, we use the public infrared image dataset titled “A dataset for infrared detection and tracking of dim-small aircraft targets under ground/air background,” downloaded from http://www.csdata.org/p/387/. The dataset contains 22 image sequences from 22 independent videos, totaling 16,177 frames. Each frame is in the 3–5 μm mid-infrared band, with 256 × 256 resolution, 8-bit depth, 193 KB size, and bmp format.

Before training, we extracted 3235 frames at five-frame intervals from the 16,177 frames. These samples are down-sampled by 1/4 to a resolution of 64 × 64 to form the LR images, while the original 256 × 256 images serve as the corresponding HR images. In this way, we built a dataset containing 3235 LR–HR image pairs. In the experiments, 70% of the image pairs are randomly selected for model training, and the remaining 30% are used for testing.
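The construction of the LR–HR pairs can be reproduced with a short script; the directory layout, bicubic down-sampling, and random seed below are our assumptions:

```python
import glob, os, random
from PIL import Image

frames = sorted(glob.glob("dataset/*/*.bmp"))[::5]        # every 5th frame
random.seed(0)
random.shuffle(frames)
split = int(0.7 * len(frames))                            # 70% train / 30% test
for subset, names in (("train", frames[:split]), ("test", frames[split:])):
    os.makedirs(f"{subset}/LR", exist_ok=True)
    os.makedirs(f"{subset}/HR", exist_ok=True)
    for path in names:
        hr = Image.open(path).convert("L")                # 256 x 256 HR frame
        lr = hr.resize((64, 64), Image.BICUBIC)           # 1/4 down-sampling -> 64 x 64 LR
        name = os.path.basename(path)
        lr.save(os.path.join(subset, "LR", name))
        hr.save(os.path.join(subset, "HR", name))
```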

3.2 Infrared image haze removal experiment

The window radius \(\Omega\) for haze removal is 5, the haze removal rate \(\omega\) is 0.95, and the lowest transmission \(t_{0}\) is 0.1. The first and third columns of Fig. 7 show images before and after haze removal, respectively. As seen from the results, the contrast of the image after haze removal is obviously improved compared with the original. More importantly, some blurred details become clearer.

Fig. 7
figure 7

Experiment results of haze removal

Gradient maps are also extracted using the Sobel operator. The second and fourth columns of Fig. 7 show gradient maps before and after haze removal. The fourth column evidently contains more gradient information, which suggests that more details are recovered by removing the haze.

3.3 Qualitative comparison

We compare our method with other single-frame SR methods: bicubic interpolation, SRResNet [12], SRGAN [20], and ESRGAN [21]. Figure 8 shows the 484th frame of data15, which contains a tower. The tower in the bicubic SR result is still blurry. Although the tower becomes progressively clearer in the results of SRResNet, SRGAN, and ESRGAN, these results show different degrees of geometric distortion. Our method preserves the geometric structure of the tower better than the above methods. In terms of visual effect, our results are more natural and realistic than those of the other methods.

Fig. 8
figure 8

A steel tower in 484th frame of data15 [4 × super-resolution]

Figure 9 shows the 35th frame of data20, which contains a house. The bicubic SR result is still blurry. Although the house edges in the SRGAN and ESRGAN results become sharp, there are still noticeable geometric distortions, as indicated by the red ellipses. The house edges in the SRResNet result and in our result are clear and without distortion. On finer objects, however, such as the distant buildings marked with red boxes, our method is cleaner and more natural.

Fig. 9
figure 9

A house in 35th frame of data20 [4 × super-resolution]

More comparisons are shown in Figs. 10, 11, and 12. The better performance of our method is primarily attributable to three aspects: (1) more details are recovered by haze removal; (2) the gradient map is extracted and used to guide image SR; and (3) the gradient losses are added, which impose second-order constraints on image SR to preserve structural information. As a result, our output is more natural and realistic and preserves geometric structure better.

Fig. 10
figure 10

An aircraft in 348th frame of data16 [4 × super-resolution]

Fig. 11
figure 11

A road in 26th frame of data6 [4 × super-resolution]

Fig. 12
figure 12

A steel tower and a road in 35th frame of data19 [4 × super-resolution]

3.4 Quantitative comparison

We use PSNR (peak signal-to-noise ratio), SSIM (structural similarity [29]), and PI (perceptual index [30]) to quantitatively evaluate SR performance. The value range of SSIM is [−1, 1]. Higher PSNR and SSIM indicate better quality, while a lower PI is better.
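PSNR and SSIM can be computed with scikit-image as in the sketch below (PI requires the separate no-reference models of [30] and is omitted here):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr, hr):
    """PSNR and SSIM between an SR result and its HR reference (uint8 grayscale arrays)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, data_range=255)
    return psnr, ssim
```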

A traditional method (cubic interpolation), a self-exemplar-based method (SelfExSR [31]), and three deep learning methods (SRResNet [12], SRGAN [20], and ESRGAN [21]) are chosen for comparison. The results are summarized in Table 1. The best result for each indicator is shown in bold, and the second best in italic. As indicated in the table, our method is best for PSNR and SSIM, whereas ESRGAN is best for PI. For PI, our method is slightly inferior to ESRGAN, but the difference is small. Our method outperforms the other methods on SSIM, about 0.03 higher than the second-best method, SRResNet.

Table 1 Quantitative comparison performed on the public infrared image dataset

3.5 Ablation experiments

To verify the effectiveness of each part of the model, ablation experiments are conducted on the dataset introduced in Sect. 3.1. We again use the three indicators (PSNR, SSIM, and PI) to quantitatively evaluate SR performance. The best result for each indicator is shown in bold, and the second best in italic.

In the first experiment, we cut off the gradient branch and remove the fusion block, leaving essentially a single-branch network, i.e., ESRGAN. In the second experiment, we remove only the haze removal block. In the third experiment, we remove only the gradient SR branch. In the fourth experiment, we replace the fusion block with a concat block, that is, the SR results of the two branches are simply added together. The last configuration is the complete version of our method. The results of the ablation experiments are shown in Table 2.

Table 2 Results of the ablation experiment

When the haze removal module is removed, the PSNR declines by 0.31, the SSIM declines only slightly, by 0.0085, and the PI rises by 0.15. This shows that haze removal is beneficial for infrared image SR. When the gradient SR branch is removed, all three indicators become obviously worse. This may be because the gradient map remains LR while the image becomes HR, creating a wrong correspondence between the LR gradient map and the HR image. When the fusion block is replaced with the concat block, the PSNR declines by 0.16, the SSIM declines by 0.0166, and the PI rises by 0.124, showing that the fusion block is also beneficial for infrared image SR.

3.6 Computational cost analysis

Our experiments were carried out in the following environment: 2080Ti GPU, Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20 GHz, 32 GB RAM, 64-bit Windows OS. When reconstructing an image from 256 × 256 pixels to 1024 × 1024 pixels, the average time required by ESRGAN is 0.233 s, while ours is 0.257 s. Although the computational cost of our method is slightly higher than that of ESRGAN, it is still within an acceptable range.

4 Conclusions

Visible image SR methods do not perform well when directly applied to infrared image SR, because infrared images have weaker contrast and fewer details than visible images. To obtain high-quality SR results from single-frame infrared images, this paper proposes a dual-branch deep neural network. The image SR branch reconstructs the SR image from the initial infrared image using a basic structure similar to the enhanced SR generative adversarial network (ESRGAN). The gradient SR branch removes haze, extracts the gradient map, and reconstructs the SR gradient map. To obtain a more natural SR image, a fusion block based on an attention mechanism is adopted between the two branches. To preserve geometric structure, a gradient L1 loss and a gradient GAN loss are defined and added.

Experimental results on a public infrared image dataset demonstrate that, compared with current SR methods, the proposed method produces more natural and realistic results and better preserves structures.

In the future, we will study how to generate SR images with only a single branch, using the gradient map as the guide of a guided filter.