1 Introduction

As an information-enhancement technique, image fusion plays a vital role in computer vision [1, 2]. Specifically, image fusion aims to extract and integrate feature information from different source images to generate a single fused image, providing more comprehensive information for the human visual system and downstream applications [3, 4]. Based on the type of source images, image fusion can be mainly divided into five categories [5]: multi-focus image fusion [6], infrared and visible image fusion [7], multi-exposure image fusion [8], multi-modal medical image fusion [9], and remote sensing image fusion [10]. Among these, infrared and visible image fusion is one of the essential branches owing to the high complementarity between the source images [11].

Generally speaking, visible images have high resolution and abundant texture details, but they are susceptible to environmental conditions, such as low illumination and fog, which significantly degrade their ability to depict targets [12] (Fig. 1a). Compared with visible images, infrared images are advantageous in representing salient targets because they capture thermal radiation [13]. However, infrared images suffer from shortcomings such as poor texture and low resolution (Fig. 1b). Therefore, fusing infrared and visible images can provide more comprehensive scene information, as shown in Fig. 1c. The fused image contains significant infrared targets and preserves abundant visible detail information, potentially improving the recognizability of targets in various environments [14].

Fig. 1
figure 1

a Visible image; b infrared image; c fusion result of our method

Infrared and visible image fusion methods can be divided into pixel-level, feature-level, and decision-level methods [15]. Some traditional pixel-level methods can fuse infrared and visible images, but artifacts still remain [16]. With the development of multiscale theory, many corresponding algorithms have been proposed, such as TE-MST [17], the discrete wavelet transform (DWT) [18], MGFF [19], and ADF [20]. Meanwhile, some subspace-based methods have also been proposed, such as FPDE [21], PCA [22], and ICA [23].

Some feature-level methods are also applied to infrared and visible fusion tasks. Li et al. proposed a novel fusion strategy (IVFusion) [24], which obtains the feature weight map by calculating the structural similarity between the pre-fusion image and the source images. In our previous work, we extracted salient features from infrared images through a significant target analysis method to highlight the saliency of fused targets [25]. Other feature-extraction filters, such as CBF [26] and GFF [27], are also applied to infrared and visible image fusion tasks.

Decision-level methods can facilitate the process of designing fusion rules. Zhao et al. proposed an infrared and visible image fusion method based on a Bayesian model [28], which can automatically obtain adaptive fusion weights. Panigrahy et al. [29] proposed a parameter adaptive unit-link dual-channel PCNN model to automatically measure the activity level and demonstrated that the proposed model is more effective than the corresponding manual settings. Meanwhile, with the development of deep learning theory, many fusion methods based on neural networks have been proposed in recent years. Li et al. [30] proposed a novel end-to-end residual fusion network (named RFN-Nest) in which detail-preserving and feature-enhancing loss functions are carefully designed to preserve the details of the fused image. Ma et al. [31] proposed FusionGAN, which first applied GANs to infrared and visible image fusion tasks, avoiding the difficulty of manually designing fusion rules. Li et al. [32] proposed a novel coupled GAN structure to conduct the fusion task, using the pre-fused image as the label to guide the training process. Tan et al. [33] designed a flexible spatiotemporal fusion model based on a conditional GAN, which can alleviate the time dependence between reference and prediction images. Based on GANs and dense blocks, Fu et al. [34] proposed a perceptual FusionGAN network; by concatenating the visible image at each layer and designing structural similarity and gradient loss functions, the fused image is kept perceptually consistent. These deep learning-based methods can automatically extract deep features. However, challenges still exist, such as the lack of training data and ground truth.

Infrared and visible image fusion aims to obtain a fused image with salient targets while preserving rich texture details. However, the quality of infrared and visible images collected by sensors usually differs; in particular, visible images captured under low-illumination conditions lose many details and thus cannot yield a good fusion result. Moreover, some current infrared and visible image fusion algorithms, such as MGFF [19], IVFusion [24], and our previous work [25], mainly focus on the decomposition mode and fusion rules, so their robustness to low-illumination visible images needs to be improved.

Based on the above analysis, a novel infrared and visible image fusion scheme based on adaptive visual enhancement and high-significance target detection is proposed. First, the illumination component is estimated by fast bright-pass bilateral filtering, and the visible image is then enhanced by an adaptive gamma correction method. Second, the infrared patch-tensor (IPT) model is constructed to extract the coarse infrared targets, and the accurate salient region map is then generated by an iterative optimization method based on the guided filter. Third, the infrared and visible images are decomposed into base and detail layers by the efficient hybrid \(\ell_{1} - \ell_{0}\) model; the base layers are fused by a visual saliency map strategy, and the detail layers are fused by a visual difference feature map strategy. The final fused image is obtained by merging the fused base layer, detail layers, and infrared targets.

The main contributions of our work can be summarized as follows:

  1. (1)

    A visible image enhancement algorithm (named BFAC) is proposed. BFAC is based on bright-pass bilateral filter and adaptive-gamma correction. Unlike some previous fusion algorithms that directly fuse infrared and visible images, BFAC is applied to preprocess the visible image before fusion. Qualitative and quantitative experimental results demonstrate that BFAC can effectively enhance the hidden details in visible images while avoiding the overexposure problem, leading to the final fusion results containing rich scene information and consistent with the human visual system.

  2. (2)

    An infrared targets extraction algorithm (named IGPT) is proposed. The algorithm first obtains coarse saliency regions by calculating the tensor rank through the IPT model and then generates the accurate saliency region map through iterative guided filtering and convergence conditions. Compared with previous saliency detection networks, this algorithm can extract infrared targets accurately, thereby enhancing the saliency of infrared targets in the final fused image.

  3. (3)

    Based on BFAC and IGPT, we propose a novel infrared and visible image fusion scheme based on the hybrid \(\ell_{1} - \ell_{0}\) decomposition model. The qualitative and quantitative analysis demonstrates that the final fusion results are superior to some state-of-the-art methods with richer texture details and salient infrared targets, which can potentially improve the recognizability of the target under complex environments.

The rest of this paper is organized as follows. In Sect. 2, corresponding preliminaries and methods are discussed. Section 3 presents the proposed method in detail. The experimental results and discussions are provided in Sect. 4. Section 5 draws a conclusion of this work.

2 Preliminaries

This section briefly introduces the preliminary theories and methods, including the Retinex model and the infrared patch-tensor (IPT) model.

2.1 Retinex model

The Retinex model is widely applied in image enhancement tasks [35, 36], as it is consistent with the human visual system. Retinex-based methods can generally be summarized in three steps. First, the observed image is decomposed into an illumination component and a reflectance component. Second, the illumination component is enhanced. Finally, the enhanced image is obtained by combining the enhanced illumination component with the reflectance component.

Bright-pass bilateral filtering (BPBF) [37] belongs to the Retinex family and can be used to estimate the illumination component \(L_{s}\) of a source image \(VI_{s}\), which can be mathematically expressed as:

$$ L_{s} (i) = \frac{1}{{\eta_{i} }}\sum\limits_{j \in \Omega } \omega (j)\Phi (VI_{s} (i - j) - VI_{s} (i))VI_{s} (i - j), $$
(1)

where

$$ \eta_{i} = \sum\limits_{j \in \Omega } {\omega (j)\Phi } (VI_{s} (i - j) - VI_{s} (i)). $$
(2)

\(\Omega\) is the neighborhood over which the aggregation is performed.

The spatial kernel \(\omega\) is:

$$ \omega (i) = \exp \left( - \frac{{||i||^{2} }}{{2\theta^{2} }}\right), $$
(3)

The range kernel \(\Phi\) in (1) is:

$$ \Phi (t) = \left\{ {\begin{array}{*{20}l} {\exp \left( { - \frac{{t^{2} }}{{2\sigma^{2} }}} \right)} & {{\text{ if}}\; \, t \ge 0,} \\ {0} & {\text{otherwise,}} \\ \end{array} } \right. $$
(4)

where \(\theta\) and \(\sigma\) are the standard deviations of the spatial and range kernels, respectively.
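For concreteness, the following minimal NumPy sketch (our illustration, not the authors' implementation) evaluates (1)-(4) by brute force on a grayscale image; the window radius and the values of \(\theta\) and \(\sigma\) are assumed defaults.

```python
import numpy as np

def bright_pass_bilateral(img, radius=7, theta=10.0, sigma=25.0):
    """Brute-force evaluation of Eqs. (1)-(4): bright-pass bilateral estimate of L_s.
    img: 2-D grayscale array with values in [0, 255]."""
    img = img.astype(np.float64)
    H, W = img.shape
    pad = np.pad(img, radius, mode='edge')
    L = np.zeros((H, W))
    # Spatial kernel omega(j), Eq. (3)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    omega = np.exp(-(ys ** 2 + xs ** 2) / (2 * theta ** 2))
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            diff = patch - img[i, j]
            # Bright-pass range kernel Phi(t), Eq. (4): darker neighbours get zero weight
            phi = np.where(diff >= 0, np.exp(-diff ** 2 / (2 * sigma ** 2)), 0.0)
            w = omega * phi
            L[i, j] = np.sum(w * patch) / (np.sum(w) + 1e-12)  # Eqs. (1)-(2)
    return L
```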

Based on BPBF, a Fourier approximation algorithm called FBPBF is proposed in [38], which replaces the range kernel \(\Phi\) with a \(K\)-term Fourier approximation. As a result, the FBPBF can effectively estimate the illumination component with a low computing cost.

2.2 Infrared-patch-tensor model

The infrared patch-tensor (IPT) model converts the detection of infrared targets into a matrix decomposition and optimization problem, which assumes that the background is low-rank and the salient regions are sparse. Accurately approximating the background tensor rank is key to the IPT model.

In [39], Kong et al. proposed a novel IPT model based on a nonconvex tensor fibered rank approximation to detect the salient regions of infrared images. The proposed model can be mathematically defined as:

$$ \begin{gathered} \mathop {\min }\limits_{{{\mathcal{B}},{\mathcal{T}}}} ||{\mathcal{B}}||_{LogTFNN} + \lambda ||{\mathcal{T}} \odot \mathcal{W}||_{1} + \beta ||{\mathcal{B}}||_{HTV} , \hfill \\ {\text{s}}.{\text{t}}{.}||{\mathcal{I}} - {\mathcal{B}} - {\mathcal{T}}||_{F} \le \delta \hfill \\ \end{gathered} $$
(5)

where \({\mathcal{I}}\) is the infrared patch tensor, \({\mathcal{B}}\) is the background patch tensor, and \({\mathcal{T}}\) is the target patch tensor. In (5), the infrared patch images are converted to tensors to improve the data processing efficiency. Then, the tensor fibered nuclear norm based on the Log operation (\(LogTFNN\)) is used to approximate the tensor fibered rank. In order to suppress the sparse interference of corners and edges, prior information is added through the local structure tensor \(\mathcal{W}\), where \(\odot\) denotes the Hadamard product. The factor \(\lambda\) balances the sparse and low-rank components. Besides, a hyper-total variation regularization term (HTV) [40] is applied to constrain background noise, and \(\beta\) is the corresponding regularization factor. The model in (5) can be further expanded as:

$$ \begin{aligned} & \mathop {\min }\limits_{{{\mathcal{B}},{\mathcal{T}}}} \sum\limits_{k = 1}^{3} {\alpha_{k} {\mathcal{H}}_{k} ({\mathcal{X}}_{k} ) + \lambda ||{\mathcal{T}} \odot {\mathcal{W}}} ||_{1} + \beta ||{\mathcal{Z}}||_{HTV} \\ & \quad + \frac{1}{2\mu }||{\mathcal{I}} - {\mathcal{B}} - {\mathcal{T}}||_{F}^{2} , \\ & {\text{s}}.{\text{t}}{.}\left\{ {\begin{array}{*{20}l} {{\mathcal{B}} = {\mathcal{X}}_{k} ,k = 1,2,3} \hfill \\ {{\mathcal{Z}} = {\mathcal{B}}} \hfill \\ \end{array} } \right. \end{aligned} $$
(6)

where the mode-\(k\) t-SVD [41] is used to calculate the tensor rank, the variable \(k\) ranges from 1 to 3, \(\alpha_{k}\) is a weighting factor satisfying \(\sum\nolimits_{k = 1}^{3} {\alpha_{k} = 1}\), \({\mathcal{H}}_{k} ({\mathcal{X}}_{k} ) = ||{\mathcal{B}}||_{{\text{log-mode-}}k{\text{-TFNN}}}\), \({\mathcal{Z}}\) is the auxiliary variable, and \(\mu\) is the penalty factor.

In order to solve this optimization problem, the alternating direction method of multipliers (ADMM) [42] framework is applied, which converts problem (6) into five subproblems that are solved iteratively; the coarse background and salient targets are then obtained by reconstructing \({\mathcal{B}}\) and \({\mathcal{T}}\).
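To illustrate the alternating-update pattern behind this kind of solver, the sketch below applies ADMM to the simpler matrix analogue of (5), robust PCA (\(\min ||B||_{*} + \lambda ||T||_{1}\) s.t. \(I = B + T\)). It is a stand-in of our own for exposition, not the nonconvex fibered-rank solver of [39]; the default \(\lambda\) and \(\mu\) follow common RPCA heuristics.

```python
import numpy as np

def soft(x, tau):
    """Elementwise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_admm(D, lam=None, mu=None, n_iter=200, tol=1e-7):
    """ADMM for min ||B||_* + lam*||T||_1  s.t.  D = B + T  (matrix analogue of Eq. (5))."""
    D = np.asarray(D, dtype=np.float64)
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))        # common RPCA heuristic
    mu = mu if mu is not None else 1.25 / (np.linalg.norm(D, 2) + 1e-12)
    B, T, Y = np.zeros_like(D), np.zeros_like(D), np.zeros_like(D)
    for _ in range(n_iter):
        # B-subproblem: singular-value thresholding of (D - T + Y/mu)
        U, s, Vt = np.linalg.svd(D - T + Y / mu, full_matrices=False)
        B = U @ np.diag(soft(s, 1.0 / mu)) @ Vt
        # T-subproblem: elementwise shrinkage (the sparse "target" component)
        T = soft(D - B + Y / mu, lam / mu)
        # Dual ascent on the constraint D = B + T
        R = D - B - T
        Y += mu * R
        if np.linalg.norm(R) / (np.linalg.norm(D) + 1e-12) < tol:
            break
    return B, T
```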

3 Methodology

The framework of the proposed method is shown in Fig. 2; it consists of three sub-modules: the visible image adaptive enhancement module, the infrared target extraction module, and the decomposition and fusion module.

Fig. 2
figure 2

The framework of the proposed method

First, the illumination component is estimated by fast bright-pass bilateral filtering, and the brightness of the visible image is then corrected by calculating an adaptive gamma value so as to obtain an enhanced visible image with an excellent visual effect. Second, the infrared patch-tensor model is constructed to extract the coarse infrared targets; the accurate target map is then generated through iterative guided filtering and a convergence condition. Third, the infrared and visible images are decomposed into base and detail layers by the efficient hybrid \(\ell_{{1}} { - }\ell_{{0}}\) model; the base layers are fused by the visual saliency map, and the detail layers are fused by the visual difference feature map. The final fused image is obtained by merging the fused base layer, detail layers, and infrared targets.

3.1 Bright-pass bilateral filter and adaptive-gamma correction-based image enhancement (BFAC)

As mentioned in the preliminaries section, given an observed image \(VI_{s}\), the FBPBF algorithm [38] can estimate the illumination component \(L_{s}\) by a Fourier approximation of bright-pass bilateral filtering. Gamma correction is then applied to the illumination component, and the result is multiplied by the reflectance to obtain the enhanced image \(VI\), as presented in (7):

$$ {\text{VI}} = \frac{{VI_{s} }}{{L_{s} }} \times 255\left( {\frac{{L_{s} }}{255}} \right)^{\rm T} , $$
(7)

where \({\rm T} = 1/\gamma\). In the above-mentioned FBPBF algorithm, \(\gamma\) is fixed at 2.2, so all images are enhanced uniformly. However, for the infrared and visible image fusion task, the characteristics of the collected visible images are diverse, and it is impractical to enhance all visible images with a constant coefficient, as shown in (b2), (b3), (b4), (b5), and (b6) in Fig. 3.

Fig. 3
figure 3

The result of visible image enhancement. The first row is the original visible images; the second row is the enhanced results of FBPBF [38]; the third row is the enhanced results of the proposed method (BFAC)

Overexposure problems therefore arise in FBPBF. Accordingly, we propose an efficient algorithm to achieve adaptive enhancement of visible images.

In (7), the illumination component \(L_{s}\) is enhanced by the gamma correction function:

$$ F_{\gamma } (x) = (x)^{{\frac{1}{\gamma }}} , $$
(8)

where \(x\) is the normalized pixel value in \(L_{s}\), which ranges from 0 to 1, and the range of \(F_{\gamma }\) is also \((0,1)\). Figure 4 (p1) plots the gamma correction function \(F_{\gamma } (x)\) for different values of \(\gamma\): as \(\gamma\) increases, the value of \(F_{\gamma }\) increases and the enhancement effect strengthens. Therefore, we can control the degree of luminance by adaptively adjusting \(\gamma\).

Fig. 4
figure 4

(p1): The plot of \(F_{\gamma } (x)\) with different \(\gamma\). When \(x\) remains constant, \(F_{\gamma } (x)\) increases as \(\gamma\) increases. (p2): The relationship between \(I_{Mean}\) and \(\gamma_{ad}\), where the highlighted dots represent the visible images (a1)–(a6) in Fig. 3

Specifically, we define \(I_{Mean}\) as the normalized mean pixel value. Given a low-illumination image with a small \(I_{Mean}\), \(\gamma\) should be larger to produce a stronger enhancement. Given a normal-illumination image with a large \(I_{Mean}\), \(\gamma\) should be smaller to avoid overexposure. On the other hand, according to the Weber-Fechner law [43], the brightness perception of the human visual system is logarithmically related to illuminance. Therefore, we define the mapping in (9) to obtain the adaptive \(\gamma_{ad}\), and the adaptively enhanced image is then obtained according to (7).

$$ \gamma_{ad} = \ln \left( {1 + \frac{1}{{I_{Mean} }}} \right). $$
(9)

In this paper, we select the visible images of 21 image pairs captured in different scenes to verify the effectiveness of the proposed algorithm. The relationship between \(I_{Mean}\) and \(\gamma_{ad}\) is illustrated in Fig. 4 (p2): as \(I_{Mean}\) increases, \(\gamma_{ad}\) gradually decreases and the enhancement effect weakens, which means our algorithm can adaptively enhance visible images according to the actual situation.
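A minimal sketch of (7)-(9), assuming the illumination estimate \(L_{s}\) comes from an FBPBF implementation (or the brute-force bright-pass bilateral sketch given earlier); the clipping and the small \(eps\) guard are our own additions.

```python
import numpy as np

def bfac_enhance(vi, L_s, eps=1e-6):
    """Adaptive enhancement of a visible image following Eqs. (7)-(9).
    vi: grayscale visible image in [0, 255]; L_s: its estimated illumination component."""
    vi = vi.astype(np.float64)
    L_s = np.clip(L_s.astype(np.float64), eps, 255.0)
    I_mean = np.mean(vi) / 255.0                     # normalized mean pixel value
    gamma_ad = np.log(1.0 + 1.0 / max(I_mean, eps))  # Eq. (9): adaptive gamma
    T = 1.0 / gamma_ad
    # Eq. (7): reflectance times gamma-corrected illumination
    vi_enh = (vi / L_s) * 255.0 * (L_s / 255.0) ** T
    return np.clip(vi_enh, 0.0, 255.0)
```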

Figure 3 shows the enhancement results of different methods. Figure 3 (a1)–(a6) are the visible images (\(I_{Mean}\) from low to high), and Fig. 3 (b1)–(b6) and (c1)–(c6) show the enhanced results of the FBPBF algorithm and the proposed method, respectively. We can observe that FBPBF cannot enhance the visible images adaptively, resulting in overexposure (such as Fig. 3 (b2)–(b6)). In contrast, our method has two distinct advantages. First, for the low-illumination images (Fig. 3 (a1)–(a3)), our method enhances the detailed information while maintaining the naturalness of the images. Second, for the normal-illumination visible images (Fig. 3 (a4)–(a6)), our results maintain properties consistent with the original visible images, avoiding the distortion of detail and contrast introduced by the FBPBF algorithm (Fig. 4).

Figure 5 shows the quantitative comparison results on the 21 visible images. The lightness order error (LOE) measures the lightness distortion of the enhanced results [44]; the smaller the LOE value, the more natural the enhanced result. VIFP is consistent with the human visual system and is used to measure the distortion of visual information [45]; the larger the value, the more visual information is preserved.

Fig. 5
figure 5

Quantitative comparison results on the 21 visible images

From the statistical results, our method achieves the best average values of LOE and VIFP, which demonstrates that it can enhance the detailed information in low-illumination images while maintaining the naturalness of the enhanced images, thus achieving adaptive enhancement of visible images. The total calculation process is summarized in Algorithm 1.

figure f

3.2 Iterative guided and infrared patch-tensor based target extraction (IGPT)

As highlighted by the rectangles in Fig. 6, infrared images can efficiently represent salient targets compared with visible images. Therefore, the preservation of infrared targets generally needs to be considered during the fusion process. Based on this motivation, an iterative guided and infrared patch-tensor based target extraction algorithm (IGPT) is proposed.

Fig. 6
figure 6

Extracted infrared targets. The first row is the visible images, the second row is the infrared images, the third row is the results of IGPT, the fourth row is the result of IPT [39], and the last row is the result of BASNet [49]

As discussed in the preliminaries section, we first apply the infrared patch-tensor (IPT) model to coarsely extract infrared targets, as shown in the fourth row of Fig. 6. It can be seen that problems such as isolated pixels and discontinuous edges exist in the results of the IPT model.

In order to address the above problems, a threshold strategy [46] is devised to remove small and insignificant pixel regions: when the area of a pixel region is smaller than the threshold, those pixels are removed. The threshold is a user-defined value that can be adjusted for different tasks, and we analyze it in Sect. 4.5. In this paper, we set the threshold to \(0.0002 \times H \times W\), where \(H\) and \(W\) are the height and width of the infrared image, respectively. In addition, considering the excellent structure-transferring property of the guided filter [47], we further propose an iterative guided scheme to preserve as much edge detail of the infrared targets as possible. The total calculation process of IGPT is summarized in Algorithm 2.

figure g
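As a possible implementation of the area-threshold step described above (the paper does not specify one), the sketch below uses SciPy's connected-component labelling; the 4-connectivity default is our assumption.

```python
import numpy as np
from scipy import ndimage

def remove_small_regions(coarse_map, t=0.0002):
    """Remove connected regions whose area is below t * H * W (see also Eq. (28))."""
    H, W = coarse_map.shape
    th = t * H * W
    mask = coarse_map > 0
    labels, num = ndimage.label(mask)           # connected components (4-connectivity)
    if num == 0:
        return np.zeros_like(coarse_map)
    areas = ndimage.sum(mask, labels, index=np.arange(1, num + 1))
    kept_ids = 1 + np.flatnonzero(areas >= th)  # labels of sufficiently large regions
    return np.where(np.isin(labels, kept_ids), coarse_map, 0.0)
```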

In the iterative guided scheme, the dynamic optimization model between the guidance image \(IR\) and the infrared target map \(T_{IRi}^{n}\) in the \(n\)-th iteration is defined as follows:

$$ T_{IRi}^{n} = a_{k}^{n} IR_{i} + b_{k}^{n} \, \forall i \in \omega_{k} , $$
(10)

where \(i\) is the pixel index, \(n\) is the iteration number, \(\omega_{k}\) is a square window with radius \(r\), and \(a_{k}^{n}\) and \(b_{k}^{n}\) are coefficients in the \(n\)-th iteration, which can be obtained by solving the following cost function:

$$ E(a_{k}^{n} ,b_{k}^{n} ) = \sum\limits_{{i \in \omega_{k} }} {\left( {(a_{k}^{n} IR_{i} + b_{k}^{n} - T_{IRi}^{n - 1} )^{2} + \varepsilon (a_{k}^{n} )^{2} } \right)} , $$
(11)

where \(T_{IRi}^{n - 1}\) is the filtering input and \(\varepsilon\) is a regularization parameter. The solutions \((a_{k}^{n} ,b_{k}^{n} )\) are given in (12) and (13), which are based on the previous output \(T_{IR}^{n - 1}\):

$$ a_{k}^{n} = \frac{{\frac{1}{|\omega |}\sum\nolimits_{{i \in \omega_{k} }} {IR_{i} T_{IRi}^{n - 1} - \mu_{k} \overline{{T_{IRi}^{n - 1} }} } }}{{\sigma_{k}^{2} + \varepsilon }}, $$
(12)
$$ b_{k}^{n} = \overline{{T_{IRi}^{n - 1} }} - a_{k}^{n} \mu_{k} , $$
(13)

where \(|\omega |\) is the number of pixels in \(\omega_{k}\), \(\overline{{T_{IRi}^{n - 1} }}\) is the mean of \(T_{IRi}^{n - 1}\) in \(\omega_{k}\), and \(\mu_{k}\) and \(\sigma_{k}^{2}\) are the mean and variance of \(IR\) in \(\omega_{k}\).
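The coefficients in (12)-(13) can be computed with box (window-mean) filters, following the standard guided-filter formulation of [47]. The sketch below is our simplification, with the window radius and \(\varepsilon\) as assumed values; as in [47], it averages \(a_{k}^{n}\) and \(b_{k}^{n}\) over all windows covering a pixel before applying (10).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter_step(IR, T_prev, r=8, eps=1e-3):
    """One pass of Eqs. (10)-(13): filter the target map T_prev guided by IR."""
    box = lambda x: uniform_filter(x, size=2 * r + 1)  # mean over the window omega_k
    mu = box(IR)                                       # mu_k
    var = box(IR * IR) - mu * mu                       # sigma_k^2
    mean_T = box(T_prev)
    cov = box(IR * T_prev) - mu * mean_T               # numerator of Eq. (12)
    a = cov / (var + eps)                              # Eq. (12)
    b = mean_T - a * mu                                # Eq. (13)
    # Average a_k, b_k over all windows covering each pixel, then apply Eq. (10)
    return box(a) * IR + box(b)
```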

By analyzing the above definition, it can be concluded that if a pixel lies in a "high variance" area, i.e., an edge area, then \(\sigma_{k}^{2} > > \varepsilon\), so \(a_{k}^{n}\) is far from 0, and the edge structure of the infrared target in \(IR\) gradually transfers to \(T_{IR}^{n}\) as \(n\) increases. If a pixel lies in a "flat patch" area, where the pixel values are almost constant, then \(\sigma_{k}^{2} < < \varepsilon\), so \(a_{k}^{n}\) is approximately 0 (more details can be found in [47]); then, based on (10) and (13), the infrared target map \(T_{IR}^{n}\) can be obtained by:

$$ T_{IRi}^{n} \approx \overline{{T_{IRi}^{n - 1} }} ,\quad {\text{if }}a_{k}^{n} \approx 0. $$
(14)

According to (14), it can be further deduced that:

$$ \left. {\begin{array}{*{20}c} {T_{IR}^{n} \approx \overline{{T_{IR}^{n - 1} }} } \\ {T_{IR}^{n - 1} \approx \overline{{T_{IR}^{n - 2} }} } \\ \end{array} } \right\} \Rightarrow T_{IR}^{n} \approx T_{IR}^{n - 1} ,\quad {\text{if }}n{\text{ is large enough}}. $$
(15)

which means that the edge details of the target in the \(IR\) image have been transferred to the infrared target map \(T_{IR}^{n}\) as much as possible, and the final infrared target map \(T_{IR}\) can be obtained.

As shown in Fig. 7 (a2)–(b4), as the number of iterations increases, the targets become more and more salient, and their edges become sharper. In order to find the convergence condition, we define an evaluation function as (16).

$$ {\text{SSIM}}_{x,y} = \sum\limits_{x,y} {\frac{{2\mu_{x} \mu_{y} + c_{1} }}{{\mu_{x}^{2} + \mu_{y}^{2} + c_{1} }} \cdot \frac{{2\sigma_{x} \sigma_{y} + c_{2} }}{{\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2} }} \cdot \frac{{\sigma_{xy} + c_{3} }}{{\sigma_{x} \sigma_{y} + c_{3} }}} , $$
(16)
Fig. 7
figure 7

(a1): Infrared image; (a2)–(b4): finely extracted target map \(T^{iter}\) with different number of iterations. The scatter plot represents the structural similarity of two images with adjacent iterations, and the highlighted dots represent the target map \(T^{iter}\) on the left side; as iter increases, \(SSIM\) increases too and then begins to converge

where SSIM (structural similarity index measure) [48] is used to measure the structural similarity between images \(x\) and \(y\), \(\mu\) is the mean value, \(\sigma_{x}\) and \(\sigma_{y}\) are the standard deviations of \(x\) and \(y\), \(\sigma_{xy}\) is the covariance, and \(c_{1}\), \(c_{2}\), and \(c_{3}\) are stability factors. A larger \(SSIM\) indicates that the two images are more similar in terms of brightness, structure, and contrast. The curve of \(SSIM\) over iterations is shown in Fig. 7: as the iteration count increases, \(SSIM\) increases and then begins to converge. We conducted extensive experiments with different infrared images and found that the curve tends to converge within 50 iterations.
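Putting the pieces together, here is a hedged sketch of the iterative scheme with an SSIM-based stopping rule; scikit-image's structural_similarity stands in for (16), the tolerance is our assumption, and guided_filter_step refers to the sketch given earlier.

```python
from skimage.metrics import structural_similarity as ssim

def iterative_guided_targets(IR, T0, max_iter=50, tol=1e-4, r=8, eps=1e-3):
    """Refine the coarse target map T0 guided by IR until adjacent iterations converge."""
    T_prev = T0
    for _ in range(max_iter):
        T_cur = guided_filter_step(IR, T_prev, r=r, eps=eps)  # sketch given earlier
        rng = float(T_cur.max() - T_cur.min()) + 1e-12
        # Stop when consecutive maps are structurally almost identical (cf. Eq. (16))
        if ssim(T_cur, T_prev, data_range=rng) > 1.0 - tol:
            return T_cur
        T_prev = T_cur
    return T_prev
```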

Figure 6 shows the infrared target maps obtained by different methods. Compared with IPT [39] and BASNet [49], our method has two distinct advantages. First, the proposed method can accurately extract the infrared targets; for example, in Kaptein_1123 (the third column of Fig. 6), the result obtained by our method contains both the chimney and the man. Second, because the iterative guided scheme is adopted, our method is superior in edge retention. As highlighted by the red rectangles, our results have a better visual effect, with continuous and fine boundaries preserved, which can effectively increase the target saliency of the fused images.

3.3 Decomposition and fusion

In this section, we discuss how the infrared and visible images are decomposed and how the final fused image is obtained. For convenience, the adaptively enhanced visible image is denoted by \(VI\), the infrared image by \(IR\), the decomposed base layers of the visible and infrared images by \(B_{VI}\) and \(B_{IR}\), and the detail layers of the visible and infrared images by \(D_{VI}^{i}\) and \(D_{IR}^{i}\), respectively, where the scale index \(i\) ranges from 1 to 2 in this paper. The whole process of decomposition and fusion can be summarized in the following four steps:

Step 1: Decompose \(VI\) and \(IR\) into base layers and detail layers (\(B_{VI} ,B_{IR} ,D_{VI}^{1} ,D_{VI}^{2} ,D_{IR}^{1} ,D_{IR}^{2}\)) by the hybrid \(\ell_{1} { - }\ell_{{0}}\) decomposition model.

The hybrid \(\ell_{{1}} { - }\ell_{{0}}\) decomposition model can be defined as:

$$ \left\{ {\begin{array}{*{20}l} {b^{{1}} = {\text{model}}_{{\ell_{1} { - }\ell_{{0}} }} (s)} \hfill \\ {d^{{1}} = s - b^{{1}} } \hfill \\ {b^{{2}} = {\text{model}}_{{\ell_{{1}} }} (b^{{1}} )} \hfill \\ {d^{{2}} = b^{{1}} - b^{{2}} } \hfill \\ \end{array} } \right.{ ,} $$
(17)

This model decomposes an image into one base layer \(b^{{2}}\) and two detail layers \(d^{1}\) and \(d^{2}\), where \(s\) represents the source image and \({\text{model}}_{{\ell_{1} { - }\ell_{{0}} }} ( \cdot )\) represents the optimization in (18).

$$ \min_{b} \sum\limits_{i = 1}^{N} {\{ (s_{i} - b_{i} )^{2} + \lambda_{1} \sum\limits_{{j = \{ x,y\} }} {|\partial_{j} b_{i} | + \lambda_{2} \sum\limits_{{j = \{ x,y\} }} {F(|\partial_{j} d_{i} |)} \} } } , $$
(18)

where \(i\) is the pixel index, \(N\) is the number of pixels in the image, and \(\partial_{j}\) is the partial derivative in the \(x\) and \(y\) directions. The first term \((s_{i} - b_{i} )^{2}\) keeps the base layer close to the source image, the second term is the \(\ell_{1}\) gradient prior imposed on the base layer, and the third term is the \(\ell_{0}\) gradient prior imposed on the detail layer through the function \(F(x)\):

$$ F(x) = \left\{ {\begin{array}{*{20}c} {1,} & {x \ne 0} \\ {0,} & {else} \\ \end{array} .} \right. $$
(19)

\({\text{model}}_{{\ell_{{1}} }} {(} \cdot {)}\) is a simplified form of \({\text{model}}_{{\ell_{1} { - }\ell_{{0}} }} ( \cdot )\), which can be represented by (20).

$$ \min_{{b^{{2}} }} \sum\limits_{i = 1}^{N} {\{ (b_{i}^{1} - b_{i}^{2} )^{2} + \lambda_{3} \sum\limits_{{j = \{ x,y\} }} {|\partial_{j} b_{i}^{2} |} \} } , $$
(20)

In the hybrid decomposition model, two main parameters (\(\lambda_{1}\) and \(\lambda_{2}\)) affect the decomposition result; we set \(\lambda_{1}\) to 0.3 and \(\lambda_{2}\) to 0.003 according to the previously published study [50].
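Structurally, (17) is two nested smoothing passes with residuals. The sketch below shows this structure only; a total-variation denoiser stands in for the actual \(\ell_{1} - \ell_{0}\) optimizer of [50], and the smoothing weights are illustrative, so it is not the decomposition used in the paper.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def hybrid_decompose(s, w1=0.05, w2=0.2):
    """Two-level base/detail split following the structure of Eq. (17).
    NOTE: denoise_tv_chambolle is only a stand-in smoother, not the l1-l0 solver of [50];
    w1 and w2 are illustrative smoothing weights."""
    s = s.astype(np.float64) / 255.0
    b1 = denoise_tv_chambolle(s, weight=w1)    # stands in for model_{l1-l0}(s)
    d1 = s - b1                                # first (fine) detail layer
    b2 = denoise_tv_chambolle(b1, weight=w2)   # stands in for model_{l1}(b1)
    d2 = b1 - b2                               # second (coarser) detail layer
    return b2, d1, d2                          # base layer plus two detail layers
```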

Step 2: Fusion of base layers.

To obtain a fused image with a natural visual effect consistent with the human visual system, the base layer of the fused image should keep a proper contrast. However, most traditional multi-scale fusion algorithms adopt the "average" fusion rule [17, 51], which causes energy loss and leads to poor fusion results. In this paper, we adopt a fusion strategy based on the visual saliency map [52] to address these problems.

The visual saliency map can be defined as follows:

$$ V = |I_{q} - I_{1} | + |I_{q} - I_{2} | + \cdots + |I_{q} - I_{n} |, $$
(21)

where \(I_{q}\) is the intensity of pixel \(q\) in image \(I\), and the pixel index ranges from 1 to \(n\). \(V\) is then normalized to \([0,1]\). The visual saliency map retains the main regions of interest in the image.

The fusion weight is obtained based on the visual saliency map \(V\), and the base layer of the fused image is obtained by:

$$ B_{F} = \frac{{(1 + V_{IR} - V_{VI} )B_{IR} + (1 + V_{VI} - V_{IR} )B_{VI} }}{2}. $$
(22)
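A compact sketch of (21)-(22); evaluating (21) through the intensity histogram is an implementation choice of ours for 8-bit images, not something specified in the paper.

```python
import numpy as np

def visual_saliency_map(img):
    """Eq. (21): V(q) = sum over all pixels p of |I_q - I_p|, via the intensity histogram."""
    img = np.clip(img, 0, 255).astype(np.uint8)
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)
    sal_per_level = np.abs(levels[:, None] - levels[None, :]) @ hist  # saliency of each gray level
    V = sal_per_level[img]
    return (V - V.min()) / (V.max() - V.min() + 1e-12)                # normalize to [0, 1]

def fuse_base(B_ir, B_vi, V_ir, V_vi):
    """Eq. (22): saliency-weighted average of the two base layers."""
    return ((1 + V_ir - V_vi) * B_ir + (1 + V_vi - V_ir) * B_vi) / 2.0
```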

Step 3: Fusion of detail layers.

In Sect. 3.2, an infrared target extraction algorithm (IGPT) was proposed to maintain the salient infrared targets in the final fused image. Therefore, the detail layer fusion rules should preserve as much valuable visible detail information as possible while retaining the basic saliency of infrared features. Based on this analysis, we propose a fusion strategy based on the visual difference feature map [53], which is defined as follows:

$$ R^{i} = \left\{ {\begin{array}{*{20}l} {|D_{IR}^{i} | - |D_{VI}^{i} |,} \hfill & {|D_{IR}^{i} | - |D_{VI}^{i} | > 0,} \hfill \\ {0,} \hfill & {{\text{otherwise}},} \hfill \\ \end{array} } \right. $$
(23)

where \(D_{IR}^{i}\) and \(D_{VI}^{i}\) are the detail layers obtained by (17), the scale \(i\) ranges from 1 to 2, and \(R^{i}\) is normalized to \([0,1]\). The fusion weight of the detail layers is obtained by (24), where \(Gaus( \cdot )\) is a Gaussian filter used to suppress noisy pixels. The fused detail layers are then obtained by (25).

$$ C^{i} = Gaus(R^{i} ), $$
(24)
$$ D_{F}^{i} = C^{i} D_{IR}^{i} + (1 - C^{i} )D_{VI}^{i} ,\quad i = 1,2. $$
(25)
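A sketch of (23)-(25); the Gaussian smoothing width is an assumed value, as the paper does not specify it.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fuse_detail(D_ir, D_vi, sigma=2.0):
    """Eqs. (23)-(25): fuse one pair of detail layers via the visual difference feature map."""
    R = np.maximum(np.abs(D_ir) - np.abs(D_vi), 0.0)    # Eq. (23)
    R = (R - R.min()) / (R.max() - R.min() + 1e-12)     # normalize to [0, 1]
    C = gaussian_filter(R, sigma=sigma)                 # Eq. (24)
    return C * D_ir + (1.0 - C) * D_vi                  # Eq. (25)
```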

Step 4: Obtain the final fused image.

After obtaining the fused base layer and detail layers, the pre-fused image can be reconstructed by

$$ F_{pre} = B_{F} + D_{F}^{1} + D_{F}^{2} $$
(26)

The final fused image \(F\) is obtained by

$$ F = (1 - T_{IR} ) \times F_{pre} + T_{IR} \times IR $$
(27)

where \(T_{IR}\) is the infrared target map obtained by the proposed IGPT method described in Algorithm 2.
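Finally, a minimal sketch wiring (26)-(27) together; the function names refer to the earlier sketches in this section, which are our illustrations rather than the released implementation.

```python
def fuse_images(B_f, D_f1, D_f2, T_ir, IR):
    """Eqs. (26)-(27): reconstruct the pre-fused image, then inject infrared target pixels."""
    F_pre = B_f + D_f1 + D_f2                # Eq. (26)
    return (1.0 - T_ir) * F_pre + T_ir * IR  # Eq. (27), with T_ir in [0, 1]
```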

4 Experiments and analysis

4.1 Experimental configuration

The experimental images consist of 21 pairs of infrared and visible images taken from the classical TNO Image Fusion Dataset (TNO). The proposed method is compared with nine related state-of-the-art fusion methods, including IVFusion [24], RFN-NEST [30], Bayesian [28], perceptual FusionGAN (abbreviated as PerGAN) [34], FusionGAN [31], MGFF [19], FPDE [21], ADF [20] and CBF [26]. The relevant parameters of the algorithms above are set according to the original papers. All experiments are implemented in MATLAB R2019b on a 3.40 GHz Intel(R) Core (TM) i7–6700 CPU with 8.00 GB RAM.

In the quantitative comparison section, five evaluation metrics are adopted: the mean value (ME), pixel-based visual information fidelity (VIFP) [45], the gradient-based fusion metric (\(Q^{AB/F}\)) [54], the Chen-Blum metric (\(Q_{CB}\)) [55], and the Chen-Varshney metric (\(Q_{CV}\)) [56]. These metrics objectively evaluate the performance of the fusion algorithms. Since this paper first enhances the visible image and then fuses it with the infrared image, the adaptively enhanced visible image, the infrared image, and the fused image are used to calculate these metrics.

Among these indicators, ME calculates the arithmetic mean of all pixels, representing the average brightness that human eyes can perceive. VIFP measures the visual fidelity between the fused image and the source images. \(Q^{AB/F}\) estimates the edge information in the fused image; the greater the value, the more edge information from the source images is retained. \(Q_{CB}\) is designed based on the human visual system and closely matches human perceptual evaluations. For the above indicators, a greater value means a better-quality fused image. As for \(Q_{CV}\), a human-perception-inspired indicator that measures the visual difference between the source image and the fused image, a smaller value indicates better fusion performance.

4.2 Visual comparison and analysis

We first exhibit some intuitive experimental results on three pairs of classic images ('Road', 'Kaptein_1654', and 'Nato_camp'), as shown in Figs. 8, 9, and 10, respectively. The results of the other six groups are shown in Fig. 11. From these results, it can be seen that each algorithm has its own characteristics, and our method has three distinct advantages. First, the proposed method can depict scene information as fully as possible; especially for images captured in low-illumination conditions, our method can effectively enhance the hidden details. Second, the proposed method can also maintain the saliency of the infrared targets, which is beneficial to object tracking tasks. Third, our method can obtain natural fused images that suit the human visual system.

Fig. 8
figure 8

Fusion results of the “Road” by different methods

Fig. 9
figure 9

Fusion results of the “Kaptein_1654” by different methods

Fig. 10
figure 10

Fusion results of the “Nato_camp” by different methods

Fig. 11
figure 11

Fusion results of six pairs of infrared and visible images by different methods

Specifically, Fig. 8 shows the experimental results of ‘Road’, where the visible image is taken in a low illumination condition. The infrared targets in the fusion results of RFN-Nest, Bayesian, FusionGAN, MGFF, FPDE, and ADF are not significant enough. At the same time, although the fusion results of IVFusion, PerGAN, and CBF maintain the high contrast of the infrared targets, some valuable visible details are lost (such as the billboard and traffic signs). Compared with these algorithms, our method can highlight the targets and contain more visible details, providing comprehensive scene information.

Moreover, our method also performs well on video frames. Figure 10 shows one frame of the 'Nato_camp' video sequence. The infrared image conveys that a man is approaching, and the visible image depicts rich scene information (such as the trees and fence). Our method is advantageous because more natural details are retained, which can improve the reliability of tracking tasks. Fusion results for the other groups are exhibited in Fig. 11, where the same phenomena can be observed (as magnified by the rectangles).

4.3 Quantitative comparisons

As mentioned in the experimental configuration section, five well-known evaluation metrics (ME, VIFP, \(Q^{AB/F}\), \(Q_{CV}\), and \(Q_{CB}\)) are adopted to assess the fusion performance objectively. Among these metrics, the larger the values of ME, VIFP, \(Q^{AB/F}\), and \(Q_{CB}\), the better the quality of the fused image. As for \(Q_{CV}\), which measures the difference between the source image and the fused image, the smaller the value, the better the fusion result.

The detailed scores for each metric on ‘Road’, ‘Kaptein_1654’, and ‘Nato_camp’ are tabulated in Tables 1, 2, and 3, respectively. The best value of each metric is shown in bold, and the second best is underlined and in italic. In most cases, our results rank in a leading position.

Table 1 Quantitative comparison results on ‘Road.’
Table 2 Quantitative comparison results on ‘Kaptein_1654’
Table 3 Quantitative comparison results on ‘Nato_camp’

The results of the quantitative comparisons on 21 pairs of infrared and visible images are summarized in Fig. 12, and the average values are given in Table 4. From the statistical results, our method achieves the best average scores on ME, VIFP, \(Q^{AB/F}\), \(Q_{CV}\), and \(Q_{CB}\). These results demonstrate that our method can minimize image distortion while enhancing visible details and highlighting infrared targets, making the fusion results consistent with the human visual system.

Fig. 12
figure 12

Comparison of our method with nine state-of-the-art methods on 21 infrared and visible image pairs

Table 4 Average quantitative comparison results on 21 testing image pairs

4.4 Ablation study

Infrared and visible image fusion aims to generate a fused image with prominent infrared targets and abundant scene texture information, which motivates the proposed BFAC and IGPT algorithms. BFAC adaptively enhances details in the visible images, and IGPT highlights the salient infrared targets. An ablation study is conducted to verify the effectiveness of the proposed modules, with results shown in Fig. 13 and Table 5, where 'a' denotes the fusion method with both the adaptive visible image enhancement module (BFAC) and the infrared target extraction module (IGPT) removed, 'b' denotes the method with only the BFAC module removed, 'c' denotes the method with only the IGPT module removed, and 'd' is the proposed method.

Fig. 13
figure 13

Effectiveness illustration of BFAC and IGPT

Table 5 Quantitative comparison results of the different ablation experiments on ‘Kaptein_1654’

As seen in Fig. 13, when the BFAC module is removed, the detailed information in the grass is reduced compared with the proposed method (as highlighted by the yellow arrows). When the IGPT module is removed, the saliency of the infrared target is reduced compared with the proposed method (as highlighted by the red rectangles). The objective indicators of the fused images are shown in Table 5; the proposed method achieves the best values on ME, VIFP, \(Q^{AB/F}\), and \(Q_{CB}\). The results on all test images are summarized in Fig. 14, and the average values are shown in Table 6. From the statistical results, our method achieves the best scores on ME, VIFP, \(Q^{AB/F}\), \(Q_{CV}\), and \(Q_{CB}\), which objectively confirms the above analysis.

Fig. 14
figure 14

Quantitative comparison of the different ablation experiments on 21 pairs of infrared and visible images

Table 6 Average quantitative comparison results of the different ablation experiments on 21 pairs of infrared and visible images

4.5 Parameter analysis

In Sect. 3.2, we applied a threshold strategy to remove isolated pixels from the coarse result: when the area of a pixel region is smaller than the threshold, those pixels are removed. The threshold \(th\) can be expressed as

$$ th = t \times H \times W $$
(28)

where \(H\) and \(W\) are the height and width of the infrared image, respectively, and \(t\) is a user-defined value. In this paper, we conduct a series of experiments to obtain a proper value of \(t\) and find that its useful range is from 0 to 0.01 on our experimental dataset. Due to space constraints, only two sets of infrared images are exhibited for the analysis. In Fig. 15, when \(t\) equals 0, all pixels are retained; as \(t\) increases, the isolated pixels gradually decrease. In particular, when \(t\) is set to 0.0002, all the isolated pixels are removed, and the extracted infrared targets remain stable until \(t\) is enlarged to 0.01. Therefore, we set the threshold \(th\) to \(0.0002 \times H \times W\). It is worth noting that \(th\) is mainly used to remove isolated pixels, and it is a user-defined value that can be adjusted for different tasks.

Fig. 15
figure 15

The changes of isolated pixel points when \(t\) gradually increases

5 Conclusion

In this paper, a novel fusion scheme based on adaptive visual enhancement and salient target analysis is proposed. First, to address the problem that visible images are easily affected by illumination, a visual adaptive enhancement algorithm for visible images (named BFAC) is proposed. Based on the bright-pass bilateral filter, the algorithm corrects the brightness of the visible image by calculating an adaptive gamma value, obtaining an enhanced visible image with an excellent visual effect. Second, an infrared target extraction algorithm (named IGPT) is proposed, which obtains the infrared targets by calculating the tensor rank, applying iterative guided filtering, and checking a convergence condition. Compared with the previous IPT model and BASNet, IGPT produces more accurate results, containing significant infrared targets with continuous edges. Third, an efficient hybrid \(\ell_{1} { - }\ell_{{0}}\) model decomposes the infrared and visible images into base and detail layers, which are then fused with weighted strategies. The final fused image is obtained by merging the fused base layer, detail layers, and infrared targets.

Qualitative and quantitative experimental results demonstrate that the proposed method is superior to nine state-of-the-art image fusion methods, as more valuable texture details and more significant infrared targets are preserved. However, in the infrared target extraction stage, an ADMM framework is applied to iteratively solve the optimization model, which consumes considerable computing resources; the average running time of our method is about 7 s. In future work, we will focus on lightweight networks and efficient optimization algorithms to make the proposed method more suitable for real-time industrial applications.