1 Introduction

Image fusion plays an essential part in many applications, such as computer vision, satellite cloud imaging, medical imaging, target detection, military surveillance and remote sensing [12, 22]. The fusion of visible and infrared images is a significant research focus in the image fusion field. The infrared (IR) image records thermal radiation, so it can capture targets hidden under low-light conditions and reveal camouflaged objects. Although an infrared imaging sensor is not affected by lighting variations or bad weather, the obtained image lacks adequate background detail. In contrast, the visible image is formed by the spectral reflection of objects; it usually contains more texture and background detail and has higher spatial resolution, so it has better visual quality than the infrared image [37]. Image fusion extracts meaningful information from multiple images of the same scene, acquired by different kinds of image sensors or under different modes, and combines it into a single composite image. The composite image combines the advantages of the visible and infrared images and highlights the location of the target seen in the infrared image.

Currently, multi-scale geometric transform methods for image fusion have been studied extensively. The tools of multi-scale geometric transform include the discrete wavelet transform (DWT) [20], the Laplacian pyramid (LAP) [31] and the contourlet transform (CT) [5]. To achieve better frequency selectivity and regularity than CT, and to suppress pseudo-Gibbs phenomena along edges to some extent, the non-subsampled contourlet transform (NSCT) was proposed by Da Cunha et al. [2]. In comparison with other decomposition methods, however, NSCT requires a larger amount of computation. To reduce this computational complexity, the non-subsampled shearlet transform (NSST) was proposed by Easley et al. [7]. NSST has the shift-invariance of non-subsampled processing and inherits the favorable properties of shearlets and wavelets, such as anisotropy and computing speed. Therefore, NSST has an advantage in capturing more information for image fusion.

In addition, artificial neural networks have become a research hotspot [24,25,26,27]. The pulse coupled neural network (PCNN) is a new generation of artificial neural network, which was developed by Johnson et al. [14] and has some superior characteristics, such as coupling and pulse synchronization. It has been widely applied in image segmentation, image enhancement, pattern recognition, and so on [41]. Xin Jin et al. [11] proposed an image fusion method based on NSST and PCNN. However, PCNN has a large number of parameters, which are usually set as constants by experience, so the method lacks universality. To address this problem, Kong et al. [16] devised a scheme based on NSST and a modified neural network model called the spiking cortical model (SCM), which eased the parameter-setting problem and used the intensity distribution of pixels to optimize the number of iterations. Meanwhile, many intelligent algorithms have been applied to the parameter optimization of neural networks, such as genetic algorithm-PCNN (GA-PCNN) [43], particle swarm optimization-PCNN (PSO-PCNN) [13], and artificial bee colony-PCNN (ABC-PCNN) [3]. These single-objective optimization algorithms have only one fitness function and therefore ignore the influence of other factors, so they do not achieve the best result in image fusion.

Recently, visual saliency detection and super-resolution methods have also been widely used in image processing [30, 38]. Jinlei Ma et al. [23] used a visual saliency map to fuse the base layers. Zhang et al. [44] presented a fusion method based on NSST and visual saliency. Although the visual saliency map improved performance, the background information was processed too simply, so details were lost.

To alleviate the aforementioned problems and obtain better fusion performance, a novel image fusion scheme using visual saliency detection and an optimized SCM in the NSST domain is proposed. First, NSST decomposes each source image into a low-frequency subband and a series of high-frequency subbands. Then the visual saliency map of the low-frequency subband and the modified spatial frequency of the high-frequency subbands are used as the external stimulation of the SCM, respectively. To overcome the disadvantage of single-objective optimization, we optimize the parameters of the SCM with a multi-objective artificial bee colony algorithm, and the number of iterations is set by the time matrix. Finally, the fused image is obtained with the optimized parameters. Experimental results show that the proposed method performs well in the fusion of infrared and visible images and preserves both the spectral information of the visible image and the thermal target information of the infrared image, so the fused result has high contrast and rich background detail.

The remainder of this paper is organized as follows. Section 2 presents an overview of the proposed fusion scheme and reviews the theory of the related algorithms. Section 3 describes the image fusion strategies and steps in detail. Experimental results and discussions are given in Section 4. Conclusions are summarized in Section 5.

2 The proposed fusion scheme

Figure 1 sketches the main scheme of the proposed fusion method. First, the infrared image and the visible image are each decomposed by NSST into a low-frequency subband and a series of high-frequency subbands. Then, the modified frequency-tuned algorithm extracts the saliency map of the low-frequency subband, which serves as the external stimulation of SCM; meanwhile, the modified spatial frequency (MSF) of the high-frequency subbands is used to stimulate the SCM. Next, the multi-objective artificial bee colony technique optimizes the parameters of SCM according to suitable fitness functions. Finally, the fused image is obtained by taking the inverse NSST.

Fig. 1 Schematic diagram of the proposed image fusion framework

2.1 Non-subsampled shearlet transform

NSST, which was proposed by Easley et al. [7], is an extension of the wavelet to multidimensional space and combines the non-subsampled pyramid (NSP) filter with the shearlet transform to provide a multi-scale decomposition. The shearlet transform (ST) gives a nearly optimal sparse representation. The affine system with composite dilations is described as follows:

$$ {\Lambda}_{AB}\left(\psi \right)=\left\{{\psi}_{j,l,k}(x)={\left|\det A\right|}^{j/2}\psi \left({B}^l{A}^jx-k\right):j,l\in Z,k\in {Z}^2\right\}, $$
(1)

where \( {\psi}_{j,l,k} \) is a composite wavelet, A denotes the anisotropic dilation matrix for multi-scale decomposition, B is the shear matrix for directional analysis, and j, l and k are the scale, direction and shift parameters, respectively. When \( A=\left[\begin{array}{cc}4& 0\\ {}0& 2\end{array}\right] \) and \( B=\left[\begin{array}{cc}1& 1\\ {}0& 1\end{array}\right] \), the composite wavelet becomes a shearlet. The structure of the frequency tiling by the shearlet is shown in Fig. 2.

Fig. 2 The structure of the frequency tiling by the shearlet

The NSST decomposition consists of two major steps. (I) Multi-scale decomposition: a k-level non-subsampled pyramid filter produces (k + 1) subbands of the same size as the source image, namely one low-frequency map and k high-frequency maps. (II) Direction localization: the standard shearlet is computed with a Meyer window function on the pseudo-polar grid, which requires a subsampling operation and therefore loses shift-invariance. In contrast, the direction localization of NSST uses a modified shearlet filter that is mapped from the pseudo-polar grid to the Cartesian coordinate system through the inverse Fourier transform, which avoids the subsampling operation, so NSST is shift-invariant.

2.2 Saliency detection of infrared image

Achanta et al. [29] introduced a frequency-tuned (FT) approach to estimate center-surround contrast using color and luminance features. For an image I of width W and height H pixels, the saliency map S is formulated as follows

$$ S\left(x,y\right)=\left\Vert {I}_{\mu }-{I}_{\omega hc}\left(x,y\right)\right\Vert, $$
(2)

where \( {I}_{\mu } \) is the arithmetic mean pixel value of the image, \( {I}_{\omega hc}\left(x,y\right) \) is the corresponding pixel value in the Gaussian-blurred version of the source image (using a 5×5 separable binomial kernel), and ‖·‖ is the Euclidean distance.
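As a minimal sketch of Eq. (2) for a single-channel image, the following Python code blurs the image with the separable 5-tap binomial kernel and takes the distance to the global mean. It assumes a gray-scale image stored as a list of rows (so the Euclidean distance reduces to an absolute value) and replicated borders; the function names `binomial_blur` and `ft_saliency` are illustrative and do not come from the original work.

```python
def binomial_blur(img):
    """Blur a 2-D list with the separable binomial kernel [1, 4, 6, 4, 1] / 16.

    Assumption: pixels outside the image are replaced by the nearest border pixel.
    """
    k = [1, 4, 6, 4, 1]
    h, w = len(img), len(img[0])

    def clamp(v, hi):
        return max(0, min(v, hi))

    # Horizontal pass
    tmp = [[sum(k[t] * img[y][clamp(x + t - 2, w - 1)] for t in range(5)) / 16.0
            for x in range(w)] for y in range(h)]
    # Vertical pass
    return [[sum(k[t] * tmp[clamp(y + t - 2, h - 1)][x] for t in range(5)) / 16.0
             for x in range(w)] for y in range(h)]


def ft_saliency(img):
    """Eq. (2): S(x, y) = |I_mu - I_whc(x, y)| for a gray-scale image."""
    h, w = len(img), len(img[0])
    mean = sum(map(sum, img)) / (h * w)
    blurred = binomial_blur(img)
    return [[abs(mean - blurred[y][x]) for x in range(w)] for y in range(h)]


# Usage: a single bright pixel in a dark 7x7 image is the most salient location
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 256.0
saliency = ft_saliency(image)
```

A uniform image yields zero saliency everywhere, because the blurred value equals the mean at every pixel.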

The guided filter, proposed by He et al. [8], is a linear translation-variant filter. The filtering output at a pixel i is expressed as a weighted average:

$$ {q}_i=\sum \limits_j{W}_{ij}(I){p}_j, $$
(3)

where i and j are pixel indexes, \( {W}_{ij} \) is the filter kernel, I is the guidance image, p is the filtering input image and q is the output image. The guidance image I is set according to the application and can be taken directly as the input image p.

The filter kernel weights are expressed by

$$ {W}_{ij}(I)=\frac{1}{{\left|\omega \right|}^2}\sum \limits_{k:\left(i,j\right)\in {\omega}_k}\left(1+\frac{\left({I}_i-{\mu}_k\right)\left({I}_j-{\mu}_k\right)}{\sigma_k^2+\varepsilon}\right), $$
(4)

where |ω| is the number of pixels in the window, \( {\omega}_k \) is the window centered at pixel k, \( {\mu}_k \) and \( {\sigma}_k^2 \) are the mean and variance of the guidance image I in \( {\omega}_k \), respectively, and ε denotes the smoothing factor.

The conventional FT algorithm uses a Gaussian blur filter to process the input image. The guided filter kernel, however, uses the pixel mean and variance of the neighborhood as a local estimate and adjusts the output weights adaptively according to the image content, which makes it better at retaining edge information and enhancing detail. This paper therefore improves the FT approach by replacing the Gaussian blur with the guided filter:

$$ S\left(x,y\right)=\left\Vert {I}_{\mu }-{I}_G\left(x,y\right)\right\Vert, $$
(5)

where \( {I}_G\left(x,y\right) \) is the guided filter output for the input image, with the guidance image I set equal to the input image p.
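The modified saliency of Eq. (5) can be sketched with the usual box-filter form of the self-guided filter (guidance image equal to the input), which is equivalent to the kernel in Eq. (4). This is an illustrative sketch, not the authors' implementation: the window radius `r`, the value of `eps`, the clipped-window border handling and all function names are assumptions.

```python
def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, clipped at the image border."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            row.append(sum(img[j][i] for j in ys for i in xs) / (len(ys) * len(xs)))
        out.append(row)
    return out


def self_guided_filter(img, r=1, eps=1e-2):
    """Guided filter with guidance image I equal to the input p."""
    h, w = len(img), len(img[0])
    mean = box_mean(img, r)
    mean_sq = box_mean([[v * v for v in row] for row in img], r)
    a = [[0.0] * w for _ in range(h)]
    b = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            var = max(0.0, mean_sq[y][x] - mean[y][x] ** 2)  # sigma_k^2
            a[y][x] = var / (var + eps)                      # close to 1 near edges
            b[y][x] = (1.0 - a[y][x]) * mean[y][x]           # close to mu_k in flat areas
    mean_a, mean_b = box_mean(a, r), box_mean(b, r)
    return [[mean_a[y][x] * img[y][x] + mean_b[y][x] for x in range(w)]
            for y in range(h)]


def guided_saliency(img, r=1, eps=1e-2):
    """Eq. (5): S(x, y) = |I_mu - I_G(x, y)| for a gray-scale image."""
    h, w = len(img), len(img[0])
    mu = sum(map(sum, img)) / (h * w)
    q = self_guided_filter(img, r, eps)
    return [[abs(mu - q[y][x]) for x in range(w)] for y in range(h)]
```

With a small `eps`, windows that straddle an edge get a coefficient close to 1, so the edge passes through almost unchanged, while flat windows are replaced by their mean. This is the edge-preserving behavior that motivates the substitution for the Gaussian blur.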

Compared with well-known saliency detection methods, such as the Itti model [19], saliency using natural statistics (SUN) [42] and the spectral residual approach (SR) [10], our modified method has advantages in extracting the target information of the infrared image, keeping edge details and fully suppressing the background information of the infrared image at the same time, as shown in Figs. 3 and 4. In the three-dimensional diagram of the gray-scale image, the X-axis and the Y-axis represent the position of the pixel, and the Z-axis represents the gray value.

Fig. 3 Saliency detection of the infrared image

Fig. 4 Three-dimensional diagram of the gray-scale image

2.3 Multi-objective ABC algorithm

Artificial bee colony (ABC) is a swarm intelligence optimization algorithm initially proposed by Karaboga [15]. It imitates the foraging behavior of honey bees, in which bees with different divisions of labor share information during the search process.

The ABC algorithm uses three groups of bees: employed bees, onlooker bees and scout bees. Each nectar position represents a possible solution, and the profitability of the nectar corresponds to the fitness of the solution.

First, the ABC algorithm generates the initial population randomly, where N denotes the number of bees and also the number of nectar sources. At the start of the algorithm, all bees are scout bees.

Secondly, each solution is a D-dimensional vector, where D denotes the number of neural network parameters to be optimized. The nectar positions are refined iteratively by the three kinds of bees. Employed bees search the neighborhood of the position held in memory and calculate the profitability (the fitness) of each new location. According to the greedy rule, if the new location has higher fitness than the original one, the new location replaces it.

Finally, when the employed bees have finished searching, they pass the obtained information to the waiting onlooker bees through a figure-eight (‘8’) dance. The onlooker bees then choose nectar positions according to this information: the higher the fitness of a nectar position, the greater the probability that it is chosen.

The probability formula is as follows

$$ {p}_i=\frac{fit_i}{\sum \limits_{n=1}^N{fit}_n}, $$
(6)

in which \( {fit}_i \) denotes the value of the fitness function of the ith solution, and N is the number of nectar sources, which equals the number of employed bees.

The ith employed bee and the onlooker bees search for a new nectar position according to

$$ {V}_{ij}={X}_{ij}+{\varphi}_{ij}\left({X}_{ij}-{X}_{kj}\right), $$
(7)

where k ∈ {1, 2, ⋯, N}, j ∈ {1, 2, ⋯, D}, k ≠ i, and \( {\varphi}_{ij} \) is a random number in [−1, 1] that controls the size of the perturbation around the nectar location \( {X}_{ij} \).

Equation (7) shows that the smaller the difference between \( {X}_{kj} \) and \( {X}_{ij} \), the smaller the perturbation. As the search approaches the optimal solution, the step size therefore shortens adaptively, which gives the algorithm adaptive convergence.

If the fitness of a nectar source cannot be improved within a certain number of cycles, the source is discarded, and a scout bee searches by generating a new random nectar position.
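The two update rules, Eqs. (6) and (7), can be sketched as follows. The sketch assumes, as is common in ABC implementations but not stated explicitly above, that only one randomly chosen dimension j is perturbed per candidate; the function names and the example population are illustrative.

```python
import random


def selection_probabilities(fitness):
    """Eq. (6): probability of each nectar source being chosen by an onlooker bee."""
    total = sum(fitness)
    return [fit / total for fit in fitness]


def neighbor_candidate(population, i, rng):
    """Eq. (7): V_ij = X_ij + phi_ij * (X_ij - X_kj) with k != i.

    Assumption: a single random dimension j is changed; the other dimensions
    are copied from X_i.
    """
    n, d = len(population), len(population[0])
    k = rng.choice([idx for idx in range(n) if idx != i])
    j = rng.randrange(d)
    phi = rng.uniform(-1.0, 1.0)
    candidate = list(population[i])
    candidate[j] = population[i][j] + phi * (population[i][j] - population[k][j])
    return candidate


# Usage: three hypothetical nectar sources, each a pair of SCM parameters (f, g)
rng = random.Random(0)
sources = [[0.2, 0.6], [0.4, 0.1], [0.9, 0.5]]
probabilities = selection_probabilities([1.0, 3.0, 4.0])
candidate = neighbor_candidate(sources, 0, rng)
```

Because the step is proportional to the difference between two sources, candidates move less as the population contracts around a good solution, which is the adaptive behavior noted for Eq. (7).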

Pareto domination is one of the effective methods for judging the merits of individuals in low-dimensional multi-objective optimization [1, 28, 34]. Based on the concepts of Pareto non-dominated ranking and crowding distance from multi-objective evolutionary algorithms, we present the MOABC algorithm. Its pseudo-code is shown in Table 1, where N is the number of employed bees, MCN is the maximum number of iterations, Limit is the limit on the number of update attempts for a nectar source, and ‘archive’ represents the external population.

Table 1 The pseudo-code of MOABC algorithm
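Independently of Table 1, the two Pareto concepts that MOABC builds on can be illustrated with a short sketch. It is not the authors' MOABC code: it assumes that every objective is maximized and uses the crowding distance in its common normalized form; all names and the sample points are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def non_dominated(points):
    """Return the points that no other point dominates (the first Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def crowding_distance(front):
    """Crowding distance of each point in a front; boundary points get infinity."""
    n = len(front)
    dist = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda idx: front[idx][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            dist[order[pos]] += gap / (hi - lo)
    return dist


# Usage with hypothetical (fitness_1, fitness_2) pairs
points = [(1, 5), (3, 3), (5, 1), (2, 2), (1, 1)]
front = non_dominated(points)
distances = crowding_distance(front)
```

In an archive-based scheme, the non-dominated solutions would be kept in the external population, and the crowding distance would help to keep that population spread along the front.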

2.4 Spiking cortical model

SCM, presented by Zhan et al. [40], has a simple structure and few parameters, as shown in Fig. 5. It consists of multiple neurons, and each neuron contains three main functional units: the receptive field, the modulation field and the pulse generator. Moreover, it needs no learning or training and can extract useful information from a complex background. The mathematical expressions of the model are as follows

$$ {F}_{ij}(n)={S}_{ij}(n), $$
(8)
$$ {U}_{ij}(n)=f{U}_{ij}\left(n-1\right)+{S}_{ij}\sum \limits_{kl}{W}_{kl}{Y}_{kl}\left(n-1\right), $$
(9)
$$ {E}_{ij}(n)=g{E}_{ij}\left(n-1\right)+{V}_{\theta }{Y}_{ij}\left(n-1\right), $$
(10)
$$ {X}_{ij}(n)=\frac{1}{1+{e}^{\left({E}_{ij}(n)-{U}_{ij}(n)\right)}}, $$
(11)
$$ {Y}_{ij}(n)=\left\{\begin{array}{l}1,\kern2.25em \mathrm{if}\ {X}_{ij}(n)>0.5\\ {}0,\kern2.25em \mathrm{otherwise}\ \end{array}\right., $$
(12)

where n denotes the iteration number, (i, j) is the location of the image pixel, \( {F}_{ij}(n) \) is the feedback input signal of the neuron, \( {S}_{ij}(n) \) is the input excitation signal, \( {U}_{ij}(n) \) is the internal activity of the neuron, \( {W}_{kl} \) is the weight matrix of the links between neurons, \( {E}_{ij}(n) \) is the dynamic threshold, \( {V}_{\theta } \) is the threshold amplification factor, \( {Y}_{ij}(n) \) is the output signal of the neuron at the nth iteration, and f and g are the decay coefficients of the internal activity and the dynamic threshold, respectively (Fig. 5).

Fig. 5 The structure of the basic SCM neuron

To distinguish firing strengths within the ignition range, the sigmoid function is used to shape the neuron output signal [39], as shown in (11), where \( {X}_{ij}(n) \) denotes the pulse ignition output amplitude of the pixel. When \( {X}_{ij}(n)>0.5 \), the neuron produces a pulse, which is counted as one firing. The pulse is passed to the neighboring neurons through the linking matrix \( {W}_{kl} \), so that adjacent neurons release pulses synchronously. \( {T}_{ij}(n) \) denotes the firing-times matrix after the nth iteration, which is described as follows

$$ {T}_{ij}(n)={T}_{ij}\left(n-1\right)+{Y}_{ij}(n). $$
(13)
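A minimal sketch of the iteration in Eqs. (9) to (13) is given below, using the linking matrix and the conventional parameter values (f = 0.2, g = 0.6, Vθ = 20, 20 iterations) quoted in Section 4.1. One point is an assumption rather than part of Eq. (9) as printed: starting from the all-zero state, Eq. (9) alone keeps U at zero, so the sketch adds the stimulus S_ij to the internal activity, as in commonly cited SCM formulations. Function and variable names are illustrative.

```python
import math

# Linking weights of the conventional SCM (Section 4.1)
W = [[0.1091, 0.1409, 0.1091],
     [0.1409, 0.0, 0.1409],
     [0.1091, 0.1409, 0.1091]]


def scm_firing_times(S, f=0.2, g=0.6, v_theta=20.0, iterations=20):
    """Run the SCM on stimulus S (2-D list) and return the firing-times matrix T."""
    h, w = len(S), len(S[0])
    U = [[0.0] * w for _ in range(h)]
    E = [[0.0] * w for _ in range(h)]
    Y = [[0] * w for _ in range(h)]
    T = [[0] * w for _ in range(h)]
    for _ in range(iterations):
        Y_prev = Y
        Y = [[0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                # Weighted sum of neighboring outputs from the previous iteration
                link = sum(W[a][b] * Y_prev[i + a - 1][j + b - 1]
                           for a in range(3) for b in range(3)
                           if 0 <= i + a - 1 < h and 0 <= j + b - 1 < w)
                # Eq. (9) plus the assumed stimulus term S_ij
                U[i][j] = f * U[i][j] + S[i][j] * link + S[i][j]
                # Eq. (10): dynamic threshold
                E[i][j] = g * E[i][j] + v_theta * Y_prev[i][j]
                # Eqs. (11) and (12): sigmoid output and firing decision
                X = 1.0 / (1.0 + math.exp(E[i][j] - U[i][j]))
                Y[i][j] = 1 if X > 0.5 else 0
                # Eq. (13): accumulate firing times
                T[i][j] += Y[i][j]
    return T


# Usage: strongly stimulated neurons fire more often than weakly stimulated ones
stimulus = [[0.9, 0.1],
            [0.1, 0.9]]
firing_times = scm_firing_times(stimulus)
```

In the fusion scheme, the firing-times matrices obtained from the two source images are compared pixel by pixel to decide which coefficient to keep.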

2.5 Multi-objective artificial bee colony optimization SCM

Commonly, the quality of image fusion needs to be evaluated comprehensively using several evaluation metrics. However, the single-objective optimization algorithms presented by Jin Xin et al. [13] and Banharnsakun [3] have only one fitness function and therefore ignore the influence of other factors in image fusion. To achieve better fused results, we introduce a multi-objective optimization algorithm.

The main task is to optimize the parameters of SCM. This is equivalent to finding the optimal solution set of a two-dimensional problem, in which each bee corresponds to a pair of SCM parameters f and g.

Selecting a suitable fitness function is a key point, so we introduce several objective evaluation metrics as candidates for the hybrid fitness function of the MOABC algorithm. These metrics include mutual information (MI) [9], mean structural similarity (MSSIM) [33], standard deviation (SD) [11], spatial frequency (SF) [11], image entropy (IE) [11] and edge information retention (\( {Q}^{AB/F} \)) [32].

1) MI measures the correlation between two events. The MI of U and V is defined as follows:

$$ MI\left(U,V\right)=\sum \limits_{v\in V}\sum \limits_{u\in U}p\left(u,v\right){\log}_2\frac{p\left(u,v\right)}{p(u)p(v)}, $$
(14)

where p(u, v) is the joint probability distribution of U and V, and p(u) and p(v) are the marginal probability distributions of U and V, respectively. The sum of the mutual information between the fused image and each of the two source images is used to measure fusion quality, so the mutual information metric is described as follows:

$$ MI\left(A,B,F\right)= MI\left(A,F\right)+ MI\left(B,F\right), $$
(15)

Eq. (15) reflects the total amount of information that the fused image F(i, j) contains about source image A(i, j) and source image B(i, j). A larger value of the mutual information metric indicates that the fused image contains more information from the sources and that the fusion effect is better.

2) SD measures the dispersion of the pixel values around the image mean. The standard deviation of an image is calculated as

$$ SD=\sqrt{\frac{1}{M\times N}\sum \limits_{i=1}^M\sum \limits_{j=1}^N{\left(F\left(i,j\right)-\mu \right)}^2}, $$
(16)

where F(i, j) is the pixel value of the fused image at the location (i, j), and μ is the mean value.

3) SF is composed of the row frequency (RF) and the column frequency (CF), and is described as follows

$$ SF=\frac{1}{MN}\sum \limits_{i=1}^M\sum \limits_{j=1}^N\left( RF+ CF\right), $$
(17)

in which M is the number of rows of the image and N is the number of columns.

4) IE represents the amount of information in the fused image. It is given by

$$ IE=-\sum \limits_{l=0}^LP(l){\log}_2P(l), $$
(18)

where P(l) is the probability of gray level l in the fused image and L is the maximum gray level.

5) MSSIM is an effective measure of the similarity of two images, which is calculated as follows

$$ MSSIM=\frac{SSIM\left(A,F\right)+ SSIM\left(B,F\right)}{2}, $$
(19)

where SSIM(A, F) and SSIM(B, F) are the structural similarity indexes between the infrared image and the fused image, and between the visible image and the fused image, respectively. For two images i and j, SSIM(i, j) is defined as follows

$$ SSIM\left(i,j\right)=\frac{\left(2{\mu}_i{\mu}_j+{C}_1\right)\left(2{\sigma}_{ij}+{C}_2\right)}{\left({\mu_i}^2+{\mu_j}^2+{C}_1\right)\left({\sigma_i}^2+{\sigma_j}^2+{C}_2\right)}, $$
(20)

where \( {\mu}_i \), \( {\sigma}_i \) and \( {\sigma}_{ij} \) denote the mean, standard deviation and cross-covariance, respectively. \( {C}_1 \) and \( {C}_2 \) are constants that ensure stability when the means and variances are close to zero. A rotationally symmetric Gaussian window with standard deviation 1.5 is used in MSSIM.

6) \( {Q}^{AB/F} \) represents the degree to which edge information is transferred from the source images to the fused image. It is defined as follows

$$ {Q}^{AB/F}=\frac{\sum \limits_{i=1}^N\sum \limits_{j=1}^M\left({Q}^{AF}\left(i,j\right){w}^A\left(i,j\right)+{Q}^{BF}\left(i,j\right){w}^B\left(i,j\right)\right)}{\sum \limits_{i=1}^N\sum \limits_{j=1}^M\left({w}^A\left(i,j\right)+{w}^B\left(i,j\right)\right)}, $$
(21)

where \( {Q}^{AF}\left(i,j\right)={Q}_g^{AF}\left(i,j\right){Q}_o^{AF}\left(i,j\right) \), and \( {Q}_g^{AF}\left(i,j\right) \) and \( {Q}_o^{AF}\left(i,j\right) \) are the edge strength and orientation preservation values at the location (i, j), respectively. N and M give the size of the image, \( {Q}^{BF}\left(i,j\right) \) is defined analogously to \( {Q}^{AF}\left(i,j\right) \), and \( {w}^A\left(i,j\right) \) and \( {w}^B\left(i,j\right) \) are the weights of \( {Q}^{AF}\left(i,j\right) \) and \( {Q}^{BF}\left(i,j\right) \), respectively.
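Three of the metrics above, MI (Eqs. 14 and 15), SD (Eq. 16) and IE (Eq. 18), are simple enough to sketch directly. The sketch assumes images with discrete gray levels stored as lists of rows and estimates all probabilities from histograms; the function names are illustrative and the tiny images are made-up examples, not data from the experiments.

```python
import math
from collections import Counter


def mutual_information(u, v):
    """Eq. (14): MI in bits between two equally sized gray-level images."""
    a = [p for row in u for p in row]
    b = [p for row in v for p in row]
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in joint.items())


def fusion_mi(src_a, src_b, fused):
    """Eq. (15): MI(A, B, F) = MI(A, F) + MI(B, F)."""
    return mutual_information(src_a, fused) + mutual_information(src_b, fused)


def standard_deviation(img):
    """Eq. (16): standard deviation of the pixel values."""
    pixels = [p for row in img for p in row]
    mu = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mu) ** 2 for p in pixels) / len(pixels))


def image_entropy(img):
    """Eq. (18): entropy in bits of the gray-level histogram."""
    pixels = [p for row in img for p in row]
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


# Usage with tiny made-up images
a = [[0, 0], [255, 255]]
b = [[0, 255], [0, 255]]
mi_same = mutual_information(a, a)         # equals the entropy of a
mi_independent = mutual_information(a, b)  # zero for independent patterns
```

Only gray levels that actually occur contribute to the sums, which avoids evaluating the logarithm at zero probability.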

For all of these objective evaluation metrics, a larger value indicates better fusion performance [13, 36], so we adopt two multi-criteria fitness functions, as follows

$$ {fitness}_1=\max \left( MI+ SD+ IE\right), $$
(22)
$$ {fitness}_2=\max \left({Q}^{AB/F}\right). $$
(23)

3 Fusion strategies and specific steps

3.1 Low-frequency subband fusion strategy

Commonly, the low-frequency subband contains the main components of the source image, whereas the high-frequency subbands contain the details of the image [7]. Most low-frequency coefficients are fused by simple weighted averaging or maximum-based strategies [35], which do not consider the relationship between pixels. To obtain a better fusion effect, we propose a novel method in which an improved edge saliency map is used as the external excitation of SCM. We define the edge saliency map as Map, which is described as follows:

$$ {Map}_{IR}\left(i,j\right)=\max \left[{S}_{IR}\left(i,j\right),{E}_1\left(i,j\right)\right], $$
(24)
$$ {Map}_V\left(i,j\right)=\max \left[{S}_V\left(i,j\right),{E}_2\left(i,j\right)\right], $$
(25)
$$ {E}_1\left(i,j\right)=\left({L}_{IR}\ast F\right)\left(i,j\right), $$
(26)
$$ {E}_2\left(i,j\right)=\left({L}_V\ast F\right)\left(i,j\right), $$
(27)
$$ F=\left[\begin{array}{ccc}0.2& 0.2& 0.2\\ {}0.2& 0.5& 0.2\\ {}0.2& 0.2& 0.2\end{array}\right], $$
(28)

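where ∗ denotes convolution, \( {L}_{IR}\left(i,j\right) \) and \( {L}_V\left(i,j\right) \) are the low-frequency coefficients, \( {E}_1\left(i,j\right) \) and \( {E}_2\left(i,j\right) \) are the corresponding images filtered with the convolution kernel F, and \( {S}_{IR}\left(i,j\right) \) and \( {S}_V\left(i,j\right) \) are the visual saliency maps of the source images, which can be calculated using (5).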
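Equations (24) to (28) amount to a 3 × 3 convolution followed by a pixel-wise maximum, as the following sketch illustrates. Zero padding at the image border is an assumption (the border handling is not specified above), and the function names are illustrative.

```python
# Kernel F from Eq. (28)
F = [[0.2, 0.2, 0.2],
     [0.2, 0.5, 0.2],
     [0.2, 0.2, 0.2]]


def convolve3x3(img, kernel):
    """2-D convolution with a 3x3 kernel; pixels outside the image count as zero."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for a in range(3):
                for b in range(3):
                    yy, xx = y + 1 - a, x + 1 - b  # flipped kernel index
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[a][b] * img[yy][xx]
            out[y][x] = acc
    return out


def edge_saliency_map(saliency, low_freq):
    """Eqs. (24)-(27): Map(i, j) = max[S(i, j), (L * F)(i, j)]."""
    filtered = convolve3x3(low_freq, F)
    return [[max(s, e) for s, e in zip(s_row, e_row)]
            for s_row, e_row in zip(saliency, filtered)]


# Usage: a made-up 3x3 low-frequency block and a saliency map with one salient pixel
low = [[1.0] * 3 for _ in range(3)]
sal = [[0.0] * 3 for _ in range(3)]
sal[0][0] = 5.0
edge_map = edge_saliency_map(sal, low)
```

The maximum keeps whichever cue is stronger at each pixel: the saliency value or the locally filtered low-frequency coefficient.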

3.2 High frequency subband fusion strategy

Existing high-frequency fusion strategies include the largest absolute value, regional energy, variance and gradient [7], but these strategies cannot extract detail information from the image adequately because they consider only individual pixels or regional characteristics. If the gray value of a single pixel is used as the excitation of the neural network, image edges and texture features may be lost. Kong et al. [17] introduced the modified spatial frequency, which adds gradient calculations along the diagonal directions and can be used to extract more information from infrared image sets.

Suppose H(i, j) denotes the high-frequency coefficient at the location (i, j). MSF is measured over a sliding window (of size 3 × 3) of the coefficients, and the MSF in each subband is used to stimulate the neuron. It is defined as follows:

$$ MSF=\frac{1}{MN}\sum \limits_{i=1}^M\sum \limits_{j=1}^N{\left( RF+ CF+ MDF+ SDF\right)}^{1/2}, $$
(29)
$$ RF={\left[H\left(i,j\right)-H\left(i,j-1\right)\right]}^2, $$
(30)
$$ CF={\left[H\left(i,j\right)-H\left(i-1,j\right)\right]}^2, $$
(31)
$$ MDF={\left[H\left(i,j\right)-H\left(i-1,j-1\right)\right]}^2, $$
(32)
$$ SDF={\left[H\left(i,j\right)-H\left(i-1,j+1\right)\right]}^2, $$
(33)

where RF, CF, MDF and SDF denote the frequencies along the rows, columns, main diagonal and auxiliary diagonal, respectively, and M and N give the size of the sliding window.
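The MSF of Eqs. (29) to (33) can be sketched in two passes: a per-pixel gradient energy, followed by an average over the sliding window. Border handling is an assumption here (neighbors outside the subband are replaced by the nearest coefficient, and windows are clipped at the border), and the function names are illustrative.

```python
import math


def gradient_energy(H):
    """sqrt(RF + CF + MDF + SDF) at every position, following Eqs. (30)-(33)."""
    h, w = len(H), len(H[0])

    def at(i, j):
        # Assumption: out-of-range neighbors take the nearest valid coefficient
        return H[max(0, min(i, h - 1))][max(0, min(j, w - 1))]

    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rf = (H[i][j] - at(i, j - 1)) ** 2       # row frequency
            cf = (H[i][j] - at(i - 1, j)) ** 2       # column frequency
            mdf = (H[i][j] - at(i - 1, j - 1)) ** 2  # main diagonal
            sdf = (H[i][j] - at(i - 1, j + 1)) ** 2  # auxiliary diagonal
            out[i][j] = math.sqrt(rf + cf + mdf + sdf)
    return out


def msf_map(H, r=1):
    """Eq. (29): mean gradient energy over a (2r+1) x (2r+1) sliding window."""
    g = gradient_energy(H)
    h, w = len(H), len(H[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rows = range(max(0, i - r), min(h, i + r + 1))
            cols = range(max(0, j - r), min(w, j + r + 1))
            out[i][j] = sum(g[a][b] for a in rows for b in cols) / (len(rows) * len(cols))
    return out


# Usage: a horizontal ramp has unit differences along rows and both diagonals
ramp = [[float(j) for j in range(5)] for _ in range(5)]
msf = msf_map(ramp)
```

For the ramp, every interior position has RF = MDF = SDF = 1 and CF = 0, so the MSF at the center equals the square root of 3.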

3.3 Specific image fusion steps

Assume that the infrared and visible images have been accurately registered and resized to the same size. The steps of the image fusion algorithm based on SCM are as follows.

Step 1: Decompose the infrared and visible images using NSST to obtain their low-frequency subbands {\( {L}_{IR}^K \),\( {L}_V^K \)} and a series of high-frequency subbands {\( {H}_{IR}^{l,k} \),\( {H}_V^{l,k} \)} at each scale k and direction l, where 1 ≤ k ≤ K.

Step 2: Use SCM to process the low-frequency subbands, with the edge saliency maps as the feedback inputs of SCM.

(a) Calculate \( {Map}_{IR} \) and \( {Map}_V \) according to (24) and (25), and normalize all coefficients.

(b) Set the initial values as follows: \( {U}_{ij}(0)={T}_{ij}(0)={E}_{ij}(0)=0 \). In the initial state, all the neurons are inactive, so \( {Y}_{ij}(0)=0 \).

(c) Calculate \( {U}_{ij}(n) \), \( {E}_{ij}(n) \) and \( {Y}_{ij}(n) \) by (9), (10) and (12), respectively, and then compute the firing times \( {T}_{ij}(n) \) of each neuron according to (13). The fusion coefficients are selected according to \( {T}_{ij}(N) \), where N is the maximum number of iterations, and the rule is described as:

$$ {L}_F^K\left(i,j\right)=\left\{\begin{array}{l}{L}_{IR}^K\left(i,j\right),\kern0.75em {T_{ij}}^{IR}(N)\ge {T_{ij}}^V(N)\\ {}{L}_V^K\left(i,j\right),\kern0.75em {T_{ij}}^{IR}(N)<{T_{ij}}^V(N)\end{array}\right.. $$
(34)

Step 3: Measure the MSF using (29) and use it as the external excitation of SCM. Following step 2, use SCM to fuse the high-frequency subbands {\( {H}_{IR}^{l,k} \),\( {H}_V^{l,k} \)}. The fused coefficients are determined as follows:

$$ {H}_F^{l,k}\left(i,j\right)=\left\{\begin{array}{l}{H}_{IR}^{l,k}\left(i,j\right),\kern0.75em {T_{ij}}^{IR}(N)\ge {T_{ij}}^V(N)\\ {}{H}_V^{l,k}\left(i,j\right),\kern0.75em {T_{ij}}^{IR}(N)<{T_{ij}}^V(N)\end{array}\right.. $$
(35)

Step 4: Optimize the parameters of SCM using the multi-objective artificial bee colony algorithm. First, initialize the bee population and set the maximum number of iterations. Then, find the optimal solution set according to the two fitness functions in (22) and (23). Finally, select the final solution based on the selection principle.

Step 5: Set SCM with the optimal parameters and perform the inverse NSST on the fused low-frequency and high-frequency coefficients to obtain the fused image.
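The selection rule shared by Eqs. (34) and (35) can be sketched as a pixel-wise comparison of the two firing-times matrices. The function name and the sample values are illustrative; the firing times would come from running SCM on each source separately.

```python
def fuse_by_firing_times(coef_ir, coef_v, t_ir, t_v):
    """Eqs. (34)/(35): keep the IR coefficient where its neuron fired at least as
    often as the visible one, otherwise keep the visible coefficient."""
    return [[c_ir if n_ir >= n_v else c_v
             for c_ir, c_v, n_ir, n_v in zip(r_ir, r_v, r_t_ir, r_t_v)]
            for r_ir, r_v, r_t_ir, r_t_v in zip(coef_ir, coef_v, t_ir, t_v)]


# Usage with made-up coefficients and firing times
fused = fuse_by_firing_times(
    coef_ir=[[0.8, 0.1], [0.3, 0.5]],
    coef_v=[[0.2, 0.9], [0.4, 0.6]],
    t_ir=[[3, 1], [2, 2]],
    t_v=[[2, 3], [2, 4]],
)
```

Ties are resolved in favor of the infrared coefficient, matching the "≥" in the first branch of both rules.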

4 Experimental results and analysis

The simulation experiments were conducted in MATLAB 2014a on a PC with an Intel E5 2670 2.6 GHz processor and 16 GB RAM. Several pairs of accurately registered infrared and visible images were used for testing. All of them have 256 or 512 gray levels. The source infrared and visible images were collected from http://www.imagefusion.org/ and https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029.

4.1 Experiment parameters setting

According to Ref. [1], we initialize the bee population as follows: the number of feasible solutions is 2, the total number of bees is 20 (10 employed bees and 10 onlooker bees), the search limit is set to 10, and the maximum number of iterations is 50. The two parameters are initialized randomly with f ∈ [0, 1] and g ∈ [0, 1].

To show the effect of the optimization, the un-optimized SCM fusion method is also used for comparison. Its high-frequency coefficients adopt the modified spatial frequency as the fusion strategy, and its low-frequency coefficients use the saliency map of the image. According to Ref. [8], the parameters of the conventional SCM are set as follows:

f = 0.2, g = 0.6, Vθ = 20, \( W=\left[\begin{array}{ccc}0.1091& 0.1409& 0.1091\\ {}0.1409& 0& 0.1409\\ {}0.1091& 0.1409& 0.1091\end{array}\right] \), n = 20.

In our method, the parameters f and g and the number of iterations n are set adaptively by the optimized SCM, and the remaining parameters are the same as those of the conventional SCM. The proposed fusion method is compared with three representative conventional fusion methods and two state-of-the-art fusion methods, namely the wavelet-based method (DWT) [20], the Laplacian pyramid (LAP) [31], the multiscale transform-based method (NSST-SCM) [18], multiscale transform and sparse representation (MST-SR) [21], and the guided filtering-based method (GFF) [6]. The discrete wavelet decomposition uses the ‘db2’ wavelet; NSST uses the non-subsampled pyramid ‘maxflat’ filter and its decomposition directions are set as [12, 20, 37].

4.2 Parameters optimization

To verify the rationality of the parameters and the fusion strategies in the proposed method, several experiments were conducted on the image sets; “UN Camp” is selected here for specific analysis. First, four groups of fusion strategies are compared. Group 1: the low-frequency subbands are fused by a simple weighted averaging strategy, and the high-frequency subbands adopt SF as the external excitation of SCM. Group 2: the low-frequency subbands are also fused by a simple weighted averaging strategy, and the high-frequency subbands adopt MSF as the external excitation of SCM. Group 3: the saliency map is used as the external stimulus of SCM in the low-frequency subbands, and the high-frequency subbands are fused by the largest absolute value strategy. Group 4: the low-frequency subbands are fused using the modified saliency map as the external excitation of SCM, and the high-frequency subbands are fused by the largest absolute value strategy. The fifth experiment uses the conventional SCM, so the number of iterations cannot be optimized. The last experiment is our fusion method; the ten sets of solutions for the SCM parameters are listed in Table 2.

Table 2 The optimal solution sets of UN Camp

However, it is difficult to decide which of the optimal solutions should provide the final parameters of SCM, so we introduce the concept of the best compromise solution [4]. Generally, \( {Q}^{AB/F} \) better reflects the object edge information in the fused image, so the solution with the maximum value of this index, f = 0.2209 and g = 0.5805, is selected as the final solution. In this way, the parameters are selected adaptively.

In Fig. 6, panels c to f correspond to the four groups of comparative experiments, and panels g and h show the fused results of the un-optimized SCM and the proposed method, respectively. In terms of visual effect, the methods that use a simple weighted averaging strategy give poor fusion results, and the modified FT method preserves more edge information than the original FT method, such as the details of the eaves in the fused image. The un-optimized SCM clearly lacks the details of the fence in the regions marked by the yellow rectangle. We then use objective evaluation indexes [9, 32, 33] to measure the fused results, and the data show that the modified strategies improve MI and \( {Q}^{AB/F} \) to a certain extent, as shown in Fig. 7.

Fig. 6 Comparison of different fusion strategies

Fig. 7 The chart of evaluation indexes

4.3 Subjective evaluations

The fused results of the different methods are illustrated in Figs. 8, 9, 10, and 11, in which the red and yellow rectangles mark the enlarged detail region and the contrast region, respectively. For Fig. 8, the key point of image fusion is to transfer the information about pedestrians and vehicles fully into the final image while maintaining as much environmental information as possible. In terms of visual effect, although all of these algorithms fuse most of the information of the source images, both the DWT method and the GFF method lose details of the infrared characters, as shown in the region marked by the yellow rectangle. Our method retains the details of the tarpaulins in the region marked by the red rectangle well because of its uniform gray-scale distribution, whereas the fusion effects of the NSST-SCM and MST-SR methods are inferior to that of our method. The next image set is “Kaptein”, shown in Fig. 9. The DWT, LAP and GFF methods do not fuse the sky area properly because it is contaminated by the dark IR spectral information, while the sky in our result is brighter and less noisy. Moreover, in the marked region, the edges of the street lamp and the trees show some shadow in the result of the NSST-SCM method. Both MST-SR and the proposed method achieve good visual effects compared with the other methods.

Fig. 8 Image fusion results of “Bristol Queen’s Road”

Fig. 9 Image fusion results of “Kaptein”

Fig. 10 Image fusion results of “Street”

Fig. 11 Image fusion results of “Heather”

Figure 10 shows a scene that contains multiple targets and complex light sources, similar to Fig. 8. Compared with the proposed method, the fused results of the NSST-SCM and GFF methods have slightly lower bulb luminance and contrast, and the details of the two lamps in the upper right are not fused; the background scenery in the MST-SR result contains more IR noise, and the contrast of the DWT result is low. In Fig. 11, the GFF result resembles the visible image and has lost the infrared information, whereas the result of the proposed method contains more details of the natural scene and an obvious target.

In summary, at the visual level the proposed method is superior to the other methods both in inheriting the characteristics of the source images and in preserving background details.

4.4 Quantitative comparison

Quantitative comparison is another essential evaluation criterion, so the image fusion quality is also measured with the objective evaluation indexes mentioned above [9, 11, 32, 33]. Tables 3, 4, 5 and 6 report the objective evaluation results of the six methods. Moreover, the two important relative evaluation metrics, MI and QAB/F, are plotted in Fig. 12 for a more intuitive comparison. On these two indexes the proposed method is superior to the other methods, which indicates that its fused image contains more significant information from the source images and reflects the details of the two source images more accurately. The remaining metrics are slightly better than those of the comparison methods, which shows that the fusion quality of the proposed method is also better objectively.
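To make the MI index concrete, the following is a minimal sketch rather than the exact implementation used in the experiments. It assumes the common definition of fusion MI as the sum of the mutual information between each source image and the fused image, estimated from 256-bin joint histograms of 8-bit gray-scale images. The function names `mutual_information` and `fusion_mi` are hypothetical and are introduced only for illustration.

```python
import numpy as np


def mutual_information(x, y, bins=256):
    """Mutual information (in bits) between two same-sized 8-bit images.

    Assumption: probabilities are estimated from a joint gray-level
    histogram with one bin per intensity in [0, 255].
    """
    joint, _, _ = np.histogram2d(
        x.ravel(), y.ravel(), bins=bins, range=[[0, 256], [0, 256]]
    )
    pxy = joint / joint.sum()              # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(y), shape (1, bins)
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


def fusion_mi(ir, vis, fused, bins=256):
    """Assumed fusion MI: I(IR; F) + I(VIS; F)."""
    return mutual_information(ir, fused, bins) + mutual_information(vis, fused, bins)
```

A larger value of `fusion_mi(ir, vis, fused)` means that the fused image shares more gray-level information with the two source images, which is the sense in which a higher MI is read as better in the comparison above.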

Table 3 Quantitative results of experiment on “Bristol Queen’s Road”
Table 4 Quantitative results of experiment on “Kaptein”
Table 5 Quantitative results of experiment on “Street”
Table 6 Quantitative results of experiment on “Heather”
Fig. 12 Objective evaluation results based on Figs. 8, 9, 10 and 11

In conclusion, our proposed method retains the effective information of the source images and plays a significant role in the fusion of infrared and visible images.

5 Conclusions

In this paper, a novel infrared and visible light image fusion scheme is proposed in the NSST domain, in which a visual saliency map improves the low-frequency fusion strategy and the spatial frequency is used as the external incentive of the SCM. The soft limiting function improves the output of the SCM; meanwhile, the parameters of the SCM are further optimized by the multi-objective artificial bee colony algorithm. The experimental results show that the modified SCM has a simple structure with fewer parameters to set and a low computational cost, and that the fused images perform well on the objective indexes, with clear outlines and rich background details. The fusion performance is better than that of the other state-of-the-art methods in both subjective and objective evaluation. Our next research goal is to use parallel computing to reduce the computational cost and to extend the application domain of this scheme.