1 Introduction

With the rapid development of sensor and computing technology, medical imaging has come to play an essential role in clinical applications such as diagnosis, surgical navigation, and treatment planning, and it has become a critical tool for physicians to diagnose diseases accurately [1].

Medical images are commonly generated by different imaging mechanisms, such as X-ray, computed tomography (CT), and magnetic resonance imaging (MRI), each of which focuses on specific tissue or organ information. CT images allow precise localization of dense structures such as bones and implants, whereas MRI images provide rich soft-tissue detail with high-resolution anatomical information [2]. The main task of image fusion is to generate a single comprehensive image that contains the unique characteristics of the multimodal medical images, which can help doctors make accurate decisions for various diagnoses [3].

Over the last few years, multiscale transform (MST) methods have been studied extensively for image fusion. Conventional MST tools include the discrete wavelet transform (DWT) [4], the Laplacian pyramid (LAP) [5], and the contourlet transform (CT) [6]. To achieve better frequency selectivity and regularity than the contourlet transform and to suppress pseudo-Gibbs phenomena along edges to some extent, the non-subsampled contourlet transform (NSCT) was proposed by Da Cunha et al. [7]. Compared with other decomposition methods, however, NSCT is computationally expensive. To reduce this complexity, the non-subsampled shearlet transform (NSST) was proposed by Zhang et al. [8]; NSST provides the shift-invariance of non-subsampled processing and inherits the main properties of the shearlet and wavelet, such as anisotropy and computational speed. Therefore, NSST has a significant advantage in capturing more detail for image fusion.

In addition to choosing a suitable decomposition method, generating a robust weight map is also a key step of image fusion. In conventional transform-domain fusion methods, the weight maps were generated by simple fusion rules such as weighted averaging or choose-max. These rules do not consider the relationship between pixels, which reduces the contrast of the fused image and loses saliency information to a certain degree. To obtain better fusion performance, methods based on the pulse-coupled neural network (PCNN) have also become a research hotspot. PCNN has some superior characteristics, such as coupling and pulse synchronization, which can be used to generate fused coefficients [9]. X. Jin et al. [10] proposed an image fusion method based on NSST and PCNN. K. J. He et al. [11] introduced a fusion method that combines focus-region-level partition and PCNN. However, PCNN has a large number of parameters that are usually set as constants from human experience, which limits its generality. To address these problems, Hou et al. [12] devised a novel scheme based on NSST and the spiking cortical model (SCM), a modified neural network model, which overcame the shortcoming of parameter setting and utilized a saliency map to fuse the low-frequency coefficients. However, this algorithm has a limitation: the saliency detection method only achieves outstanding performance on visible and infrared images.

In recent years, deep learning has achieved many breakthroughs in various computer vision and image processing problems, such as image segmentation, super-resolution restoration, classification, and saliency detection [13]. Y. Liu et al. [14] proposed a novel multi-focus image fusion scheme using convolutional neural networks (CNNs); they used CNNs to classify the focused regions and obtain a decision map, and the fused image was generated by combining the decision map with the source images. Although this deep learning approach achieves good performance, it is only suitable for multi-focus image fusion. Y. Liu et al. [1] later extended the CNN model to medical image fusion and obtained good results.

In this paper, we propose a novel medical image fusion scheme based on a deep learning framework and an improved artificial neural network. First, NSST decomposes each source image into a low-frequency coefficient and a series of high-frequency coefficients. The main task of this paper is to design robust activity level measurements and weight assignment strategies, so CNNs are used to encode a direct mapping from the source images to the weight maps. To enhance the adaptability of the algorithm to different images, we design an effective weight assignment rule, and the low-frequency coefficients are then fused within the deep learning framework. For the high-frequency coefficients, we propose a dual-channel SCM (DCSCM) to fuse the decomposed coefficients. Finally, the fused image is obtained via the inverse NSST. Experimental results show that the proposed method performs well on medical image fusion and preserves not only the dense-structure information of the CT image but also the soft-tissue information of the MRI image; the result therefore contains rich details and has a good visual effect.

The remainder of this paper is organized as follows. Section 2 reviews the theory of the related algorithms and describes the image fusion strategies and steps in detail. Experimental results are given in Section 3. Section 4 discusses the experimental results. Conclusions are summarized in Section 5.

2 Methods

In this section, we briefly review the theory of NSST, CNNs, and DCSCM, which are essential components of the proposed method.

2.1 Non-subsampled shearlet transform

NSST, which was proposed by Zhang et al. [8], is an extension of the wavelet to multidimensional space; it combines the non-subsampled Laplacian pyramid (NSLP) filter with the shearlet transform to provide multiscale decomposition. The shearlet transform (ST) is close to an optimal sparse representation; the affine system with composite dilations is described as follows:

$$ {\varLambda}_{AB}\left(\psi \right)=\left\{{\psi}_{j,l,k}(x)={\left|\det A\right|}^{j/2}\psi \left({B}^l{A}^jx-k\right):j,l\in Z,k\in {Z}^2\right\} $$
(1)

where ψj,l,k is a composite wavelet, A denotes the anisotropy matrix for multiscale decomposition, B is a shear matrix for directional analysis, and j, l, and k are the scale, decomposition direction, and shift parameter, respectively. When \( A=\left[\begin{array}{cc}4& 0\\ {}0& 2\end{array}\right] \) and \( B=\left[\begin{array}{cc}1& 1\\ {}0& 1\end{array}\right] \), the composite wavelet becomes the shearlet. The structure of the frequency tiling by the shearlet is shown in Fig. 1, and Fig. 2 shows a three-level multiscale and multidirectional decomposition of the NSST.
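
As an illustrative instance of Eq. (1) (our own worked example, not part of the original derivation), setting j = l = 1 with the matrices above gives

$$ {B}^{1}{A}^{1}=\left[\begin{array}{cc}1& 1\\ {}0& 1\end{array}\right]\left[\begin{array}{cc}4& 0\\ {}0& 2\end{array}\right]=\left[\begin{array}{cc}4& 2\\ {}0& 2\end{array}\right],\kern1em {\left|\det A\right|}^{1/2}=\sqrt{8}, $$
$$ {\psi}_{1,1,k}(x)=\sqrt{8}\,\psi \left(\left[\begin{array}{cc}4& 2\\ {}0& 2\end{array}\right]x-k\right), $$

so each shearlet atom is stretched anisotropically by A, sheared by B, and then shifted by k.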

Fig. 1 The structure of the frequency tiling by the shearlet

Fig. 2 Three-level multiscale and multidirectional decomposition of the NSST

The NSST decomposition is divided into two major steps:

(I) Multiscale decomposition. Using a k-level non-subsampled pyramid filter, (k + 1) sub-bands of the same size as the source image are obtained, comprising one low-frequency map and a series of high-frequency maps (a minimal sketch of this step is given after this list).

(II) Directional localization. In pseudo-polar grid coordinates, the standard shearlet is computed with a Meyer window function and requires a subsampling operation, so it lacks shift-invariance. NSST instead uses a modified shearlet filter that maps from the pseudo-polar grid to the Cartesian coordinate system, thereby avoiding the subsampling operation via the inverse Fourier transform; NSST therefore has the characteristic of shift-invariance.
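
To make step (I) concrete, the following Python sketch shows a generic à-trous (zero-insertion) non-subsampled pyramid. It is only an illustration under our own assumptions: it uses a B3-spline smoothing kernel rather than the "maxflat" NSLP filter actually used in the experiments, and it omits the directional filtering of step (II).

```python
import numpy as np
from scipy.ndimage import convolve

def nslp_decompose(img, levels=3):
    """Illustrative a-trous (non-subsampled) pyramid: one low-frequency map and
    `levels` high-frequency maps, all of the same size as the input image."""
    h = np.array([1., 4., 6., 4., 1.]) / 16.          # B3-spline taps (our choice)
    low = img.astype(np.float64)
    highs = []
    for k in range(levels):
        # dilate the filter by inserting 2**k - 1 zeros between taps,
        # so its support grows without any subsampling
        h_k = np.zeros(4 * 2**k + 1)
        h_k[::2**k] = h
        smooth = convolve(low, np.outer(h_k, h_k), mode='reflect')
        highs.append(low - smooth)                     # detail (high-frequency) map
        low = smooth                                   # approximation passed on
    return low, highs                                  # (levels + 1) same-size sub-bands
```

Because each detail map is the difference of two successive approximations, summing the final low-frequency map and all high-frequency maps recovers the input exactly, which is the property the fusion steps below rely on.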

2.2 Convolutional neural networks

Recently, CNNs have shown impressive performance across various artificial intelligence tasks. While CNNs have achieved state-of-the-art results in many high-level computer vision tasks such as classification, object detection, and scene understanding, their performance on low-level image processing problems such as filtering and image fusion has not been studied as extensively [15].

CNNs are a type of artificial neural network that combines the ideas of classical artificial neural networks with deep learning. The convolution layer is the key building block of a convolutional neural network and is defined as follows.

$$ {x}_j^l=f\left(\sum \limits_{i\in {M}_j}{x}_i^{l-1}\ast {w}_{ij}^l+{b}_j^l\right) $$
(2)

where l denotes the l-th layer, w denotes the convolution kernel, Mj is the receptive field of the input layer, b is the bias, ∗ denotes the convolution operation, and f represents the activation function.

The structure of a neuron is shown in Fig. 3. Each neuron is connected to others via synapses: it captures input signals through its dendrites, the dendrites transmit the signals to the cell body, and the output signal is finally produced along the axon. A proper activation function is an important part of a neural network; by incorporating the rectified linear unit (ReLU) activation function [16], Eq. (2) can be rewritten as Eq. (3).

$$ {x}_j^l=\max \left(\sum \limits_{i\in {M}_j}{x}_i^{l-1}\ast {w}_{ij}^l+{b}_j^l,0\right) $$
(3)
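
As an illustration of Eq. (3) (our own minimal sketch, not the network used later in this paper), a single convolution-plus-ReLU layer can be written as:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_relu_layer(inputs, kernels, biases):
    """Minimal sketch of Eq. (3): inputs is a list of 2-D maps x_i^{l-1},
    kernels[j][i] is the kernel w_ij^l, biases[j] is the scalar b_j^l."""
    outputs = []
    for j, b_j in enumerate(biases):
        # sum the convolutions over the receptive field M_j (here: all input maps)
        s = sum(convolve2d(x_i, kernels[j][i], mode='same')
                for i, x_i in enumerate(inputs))
        outputs.append(np.maximum(s + b_j, 0.0))       # ReLU: max(., 0)
    return outputs                                      # the output maps x_j^l
```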
Fig. 3 The structure of a neuron

The VGG network increases the network depth to achieve better performance. Gatys et al. [17] proposed a novel image style transfer technique based on the VGG network. First, the VGG-19 network is used to extract multi-layer deep features from the content and style images; then a loss function over style and content is defined, and the generated image, which fuses the style and content of the respective source images, is obtained after a certain number of iterations. There is no doubt that the VGG network is an effective feature extractor whose layers contain different information; the structure of the VGG-19 neural network is shown in Fig. 4. Inspired by the above work, we use a fixed VGG-19 network trained on ImageNet (ILSVRC2012) to extract the multi-layer deep features of the source medical images. Specifically, the training was carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum, and learning was stopped after 74 epochs [18]. Unlike the style transfer and classification tasks, we do not need the network to reconstruct images or to output classification probabilities; we only utilize the deep features of the ReLU_1_1, ReLU_2_1, and ReLU_3_1 activation layers of the VGG-19 network to design robust weight maps. More details are described in Section 2.4.1.
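
The paper does not state which deep learning framework was used; the following sketch assumes PyTorch/torchvision, where ReLU_1_1, ReLU_2_1, and ReLU_3_1 correspond to indices 1, 6, and 11 of the pretrained vgg19().features module. It only shows how the three fixed feature maps could be extracted.

```python
import torch
import torchvision.models as models

# Indices of ReLU_1_1, ReLU_2_1 and ReLU_3_1 inside torchvision's vgg19().features
RELU_LAYERS = {'relu1_1': 1, 'relu2_1': 6, 'relu3_1': 11}

# fixed network, no fine-tuning (newer torchvision versions use the `weights=` argument)
vgg = models.vgg19(pretrained=True).features.eval()

def vgg_features(img):
    """img: 1x3xHxW tensor (a gray medical image replicated to three channels and
    normalized as for ImageNet). Returns the deep feature maps f_k^{n,m}."""
    feats, x = {}, img
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            for name, want in RELU_LAYERS.items():
                if idx == want:
                    feats[name] = x                     # 64-, 128-, 256-channel maps
            if idx >= max(RELU_LAYERS.values()):
                break
    return feats
```

Note that the channel counts of the three captured maps (64, 128, 256) match the dimension m = 64 × 2^(n−1) used in Section 2.4.1.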

Fig. 4 The structure of the VGG-19 neural network

2.3 Dual-channel spiking cortical model

The conventional SCM was presented by K. Zhan et al. [19]; it has a simple structure and few parameters. It consists of multiple neurons, and each neuron contains three main functional units: the receptive field, the modulation field, and the pulse generator. Moreover, it does not need to be learned or trained, and it can extract useful information from a complex background. We modify the conventional SCM into a dual-channel SCM (DCSCM), which enhances its ability to extract details in dark regions. The mathematical expressions of the model are as follows:

$$ {U}_{ij}^1(n)=f{U}_{ij}^1\left(n-1\right)+{\mathrm{S}}_{ij}^1\sum \limits_{kl}{W}_{kl}{Y}_{kl}\left(n-1\right) $$
(4)
$$ {U}_{ij}^2(n)=f{U}_{ij}^2\left(n-1\right)+{\mathrm{S}}_{ij}^2\sum \limits_{kl}{W}_{kl}{Y}_{kl}\left(n-1\right) $$
(5)
$$ {U}_{ij}(n)=\max \left({U}_{ij}^1(n),{U}_{ij}^2(n)\right) $$
(6)
$$ {E}_{ij}(n)=g{E}_{ij}\left(n-1\right)+{V}_{\theta }{Y}_{ij}\left(n-1\right) $$
(7)
$$ {X}_{ij}(n)=\frac{1}{1+{e}^{\left({E}_{ij}-{U}_{ij}\right)}} $$
(8)
$$ {Y}_{ij}(n)=\begin{cases}1, & \text{if}\ {X}_{ij}(n)>0.5\\ {}0, & \mathrm{otherwise}\end{cases} $$
(9)

where n denotes the iteration number, (i, j) is the location of the image pixel, Sij is the input excitation signal, and the superscripts 1 and 2 represent channel 1 and channel 2, respectively. Uij(n) refers to the internal active state of the neuron and is taken as the maximum of \( {U}_{ij}^1(n) \) and \( {U}_{ij}^2(n) \); Wkl is the weighted linking matrix between neurons, Eij(n) is the dynamic threshold, Vθ is the threshold amplification factor, Yij(n) is the output signal of the neuron at the nth iteration, and f and g are the decay coefficients of the internal activity and the dynamic threshold signal, respectively.

In order to reflect differences within the firing range, the sigmoid function is used to improve the neuron output signal [20], as shown in Eq. (8), where Xij(n) denotes the pulse ignition output amplitude of the pixel. When Xij(n) > 0.5, the neuron produces a pulse, which is called one firing; the signal is captured through the linking matrix Wkl, and spatially adjacent neurons fire synchronously. Tij(n) denotes the matrix of neuron firing times after the nth iteration, the structure of the basic SCM neuron is shown in Fig. 5, and the mathematical expression is as follows.

$$ {T}_{ij}(n)=\begin{cases}n, & {Y}_{ij}(n)=1\\ {}{T}_{ij}\left(n-1\right)+{Y}_{ij}(n), & \mathrm{otherwise}\end{cases} $$
(10)
Fig. 5 The structure of the basic SCM neuron
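
For reference, a minimal numpy sketch of the DCSCM iteration in Eqs. (4)–(10) is given below. It uses the parameter values listed later in Section 3.1 as defaults; the iteration count is our own illustrative choice, and S1 and S2 stand for the two normalized excitation maps (in the proposed method, the MAG maps of Section 2.4.2).

```python
import numpy as np
from scipy.ndimage import convolve

def dcscm(S1, S2, n_iter=40, f=0.2, g=0.6, V_theta=20.0, W=None):
    """Illustrative dual-channel SCM iteration, Eqs. (4)-(10)."""
    if W is None:                                     # linking matrix from Section 3.1
        W = np.array([[0.1091, 0.1409, 0.1091],
                      [0.1409, 0.0,    0.1409],
                      [0.1091, 0.1409, 0.1091]])
    U1 = np.zeros_like(S1); U2 = np.zeros_like(S2)
    U  = np.zeros_like(S1); E  = np.zeros_like(S1)
    Y  = np.zeros_like(S1); T  = np.zeros_like(S1)
    for n in range(1, n_iter + 1):
        link = convolve(Y, W, mode='constant')        # sum_kl W_kl * Y_kl(n-1)
        U1 = f * U1 + S1 * link                       # Eq. (4)
        U2 = f * U2 + S2 * link                       # Eq. (5)
        U  = np.maximum(U1, U2)                       # Eq. (6)
        E  = g * E + V_theta * Y                      # Eq. (7), uses Y(n-1)
        X  = 1.0 / (1.0 + np.exp(E - U))              # Eq. (8)
        Y  = (X > 0.5).astype(np.float64)             # Eq. (9)
        T  = np.where(Y == 1, n, T)                   # Eq. (10); Y = 0 leaves T unchanged
    return U1, U2, U, T
```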

2.4 Fusion strategies and specific steps

2.4.1 Low-frequency coefficient fusion strategy

Commonly, the low-frequency part contains the main components of the source image, whereas the high-frequency coefficients preserve more of its details. In traditional methods, the low-frequency coefficients of the source images are fused by simple weighted averaging or maximum-value strategies, which do not consider the relationship between pixels. To address this limitation, we use the VGG-19 model to extract multi-layer features of the source images, and the weight maps are generated by an adaptive selection rule. Finally, the low-frequency coefficients are fused from the source coefficients and the weight maps. The low-frequency coefficient fusion framework is shown in Fig. 6.

Fig. 6 The low-frequency coefficient fusion framework

We now introduce this fusion strategy in detail. Let \( {f}_k^{n,m} \) denote the feature map of the kth source image at the nth layer, where m is the dimension of the feature map, \( m=64\times {2}^{n-1} \), and k ∈ {1, 2}. Fn indicates the corresponding layer of the VGG-19 network, and n ∈ {1, 2, 3} represents the ReLU_1_1, ReLU_2_1, and ReLU_3_1 activation layers, respectively. \( {A}_k^n\left(i,j\right) \) is the activity level map generated by the l1-norm at position (i, j). To make the fusion method robust to misregistration, a block-based averaging operator is used to calculate the final activity level map \( {\widehat{A}}_k^n\left(i,j\right) \), where r denotes the block size; to preserve more details, we set r = 1.

$$ {f}_k^{n,m}={F}_n\left({I}_k\right) $$
(11)
$$ {A}_k^n\left(i,j\right)={\left\Vert {f}_k^{n,m}\left(i,j\right)\right\Vert}_1 $$
(12)
$$ {\widehat{A}}_k^n\left(i,j\right)=\frac{\sum_{\beta =-r}^r{\sum}_{\theta =-r}^r{A}_k^n\left(i+\beta, j+\theta \right)}{{\left(2r+1\right)}^2} $$
(13)

Then we design an adaptive selection rule to make the weight maps more robust. tn(i, j) is the ratio of the two final activity level maps from Eq. (13), as given in Eq. (14). \( {W}_1^n\left(i,j\right) \) is the weight map of the first source: according to Eqs. (15) and (16), the larger tn(i, j) is, the greater the weight assigned to the L1 coefficient, and \( {W}_2^n\left(i,j\right) \) behaves in the opposite way. Because VGG-19 contains pooling operations, an upsampling operation is required to resize the weight maps to the same size as the source images.

$$ {t}_n\left(i,j\right)=\frac{{\widehat{A}}_1^n\left(i,j\right)}{{\widehat{A}}_2^n\left(i,j\right)} $$
(14)
$$ {W}_1^n\left(i,j\right)=\frac{t_n^3\left(i,j\right)}{1+{t}_n^3\left(i,j\right)} $$
(15)
$$ {W}_2^n\left(i,j\right)=\frac{1}{1+{t}_n^3\left(i,j\right)} $$
(16)
$$ {\widehat{W}}_k^n\left(i+p,j+q\right)={W}_k^n\left(i,j\right) $$
(17)
$$ p,q\in \left\{0,1,\cdots, \left({2}^{n-1}-1\right)\right\} $$
(18)

Finally, we carry out the maximum value rule for the initial fused coefficients of the three layers, so as to merge them into the final fused coefficient; the specific expressions are as follows.

$$ {L}_{Fused}^n\left(i,j\right)={L}_1\times {\widehat{W}}_1^n\left(i,j\right)+{L}_2\times {\widehat{W}}_2^n\left(i,j\right) $$
(19)
$$ {L}_{Fused}\left(i,j\right)=\max \left[{L}_{Fused}^n\left(i,j\right),n\in \left\{1,2,3\right\}\right] $$
(20)
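
The following Python sketch (our own illustration, building on the vgg_features helper sketched in Section 2.2 and assuming PyTorch tensors for the feature maps) walks through Eqs. (11)–(20) for two low-frequency coefficient maps L1 and L2:

```python
import numpy as np
import torch.nn.functional as F

def fuse_low_frequency(L1, L2, feats1, feats2, r=1):
    """Illustrative low-frequency fusion, Eqs. (11)-(20). L1, L2: H x W numpy
    arrays; feats1, feats2: dicts of 1xCxHxW VGG feature maps per source."""
    fused_layers = []
    for n, name in enumerate(['relu1_1', 'relu2_1', 'relu3_1'], start=1):
        # Eq. (12): l1-norm across channels gives the activity level map
        A1 = feats1[name].abs().sum(dim=1, keepdim=True)
        A2 = feats2[name].abs().sum(dim=1, keepdim=True)
        # Eq. (13): block-based average with block size r
        A1 = F.avg_pool2d(A1, 2 * r + 1, stride=1, padding=r)
        A2 = F.avg_pool2d(A2, 2 * r + 1, stride=1, padding=r)
        # Eqs. (14)-(16): adaptive weights from the ratio t_n
        t = A1 / (A2 + 1e-12)
        W1 = t ** 3 / (1.0 + t ** 3)
        W2 = 1.0 / (1.0 + t ** 3)
        # Eqs. (17)-(18): replicate each weight 2**(n-1) times per axis
        scale = 2 ** (n - 1)
        W1 = F.interpolate(W1, scale_factor=scale, mode='nearest')
        W2 = F.interpolate(W2, scale_factor=scale, mode='nearest')
        W1 = W1.squeeze().numpy()[:L1.shape[0], :L1.shape[1]]
        W2 = W2.squeeze().numpy()[:L1.shape[0], :L1.shape[1]]
        fused_layers.append(L1 * W1 + L2 * W2)          # Eq. (19)
    return np.maximum.reduce(fused_layers)              # Eq. (20): element-wise max
```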

2.4.2 High-frequency coefficient fusion strategy

Existing high-frequency fusion strategies include the largest absolute value, regional energy [21], variance, and gradient [22], but these strategies consider only individual pixels or regional characteristics and cannot adequately extract detail information from the image. If the gray value of a single pixel is used as the excitation of the neural network, image edges and texture features may be lost. We therefore add two diagonal gradient changes to the conventional average gradient, and the resulting measure is used to stimulate the DCSCM.

Suppose H(i, j) denotes the high-frequency coefficient at location (i, j). The modified average gradient (MAG) is measured over a sliding window (of size 3 × 3) of the coefficients, and the MAG of each coefficient is then used to motivate the corresponding neuron. It is defined as follows:

$$ MAG\left(i,j\right)=\frac{1}{MN}\sum \limits_{i=1}^M\sum \limits_{j=1}^N{\left(\frac{\nabla {H}_h\left(i,j\right)+\nabla {H}_v\left(i,j\right)+\nabla {H}_{md}\left(i,j\right)+\nabla {H}_{vd}\left(i,j\right)}{2}\right)}^{1/2} $$
(21)
$$ \nabla {H}_h\left(i,j\right)={\left[H\left(i,j\right)-H\left(i,j-1\right)\right]}^2 $$
(22)
$$ \nabla {H}_v\left(i,j\right)={\left[H\left(i,j\right)-H\left(i-1,j\right)\right]}^2 $$
(23)
$$ \nabla {H}_{md}\left(i,j\right)={\left[H\left(i,j\right)-H\left(i-1,j-1\right)\right]}^2 $$
(24)
$$ \nabla {H}_{vd}\left(i,j\right)={\left[H\left(i,j\right)-H\left(i-1,j+1\right)\right]}^2 $$
(25)

where ∇Hh(i, j), ∇Hv(i, j), ∇Hmd(i, j), and ∇Hvd(i, j) denote the squared gradient changes in the horizontal, vertical, main diagonal, and oblique diagonal directions, respectively, and M and N are the dimensions of the sliding window.
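
A minimal numpy sketch of Eqs. (21)–(25) is given below; border handling by edge replication and the use of a uniform filter for the window average are our own assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def modified_average_gradient(H, win=3):
    """Illustrative MAG map, Eqs. (21)-(25), over a win x win sliding window."""
    Hp = np.pad(H.astype(np.float64), 1, mode='edge')
    c = Hp[1:-1, 1:-1]                                # H(i, j)
    gh = (c - Hp[1:-1, :-2]) ** 2                     # Eq. (22): horizontal
    gv = (c - Hp[:-2, 1:-1]) ** 2                     # Eq. (23): vertical
    gm = (c - Hp[:-2, :-2]) ** 2                      # Eq. (24): main diagonal
    go = (c - Hp[:-2, 2:]) ** 2                       # Eq. (25): oblique diagonal
    grad = np.sqrt((gh + gv + gm + go) / 2.0)
    # Eq. (21): average over each win x win neighbourhood (M = N = win)
    return uniform_filter(grad, size=win, mode='reflect')
```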

2.4.3 Fusion steps

Assume that the CT and MRI images have been accurately registered and resized to the same dimensions.

The image fusion scheme is shown in Fig. 7.

Fig. 7 Schematic diagram of the proposed image fusion framework

The steps of the image fusion method are described as follows; a high-level sketch of the whole pipeline is given after the list.

Step 1. Decompose the CT and MRI images using NSST to obtain their low-frequency coefficients {L1, L2} and a series of high-frequency coefficients {\( {H}_1^{l,k} \), \( {H}_2^{l,k} \)} at each scale k and direction l, where 1 ≤ k ≤ K.

Step 2. The deep learning framework is used to fuse the low-frequency coefficients. According to Eqs. (11)–(20), the weight maps are generated by the VGG-19 network, which can choose the low-frequency coefficients adaptively.

Step 3. DCSCM is utilized to deal with the high-frequency coefficients. Let the MAG maps be the feedback inputs of the DCSCM.

(a) Calculate the MAG1 and MAG2 maps according to Eq. (21), and normalize all coefficients.

(b) Set the initial values as follows: Uij(0) = Tij(0) = Eij(0) = 0. In the initial state, all the neurons are inactivated, so Yij(0) = 0.

(c) Calculate Uij(n), Eij(n), and Yij(n) by Eqs. (6), (7), and (9), respectively, and then compute the neuron's firing times Tij(n) according to Eq. (10). The fusion coefficients are selected according to Uij(N), where N is the maximum number of iterations, and the rule is described as follows:

$$ {H}_{Fused}^K\left(i,j\right)=\begin{cases}{H}_1^K\left(i,j\right), & {U}_{ij}(N)={U}_{ij}^1(N)\\ {}{H}_2^K\left(i,j\right), & {U}_{ij}(N)={U}_{ij}^2(N)\end{cases} $$
(26)
Step 4. Perform the inverse NSST of the low-frequency and the high-frequency coefficients to obtain the fused image.
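
Putting the steps together, the following high-level Python sketch shows how the pieces sketched earlier could be assembled. nsst_decompose, nsst_reconstruct, and to_tensor are hypothetical placeholders (an NSST toolbox and the ImageNet-style preprocessing are not implemented here); the other helpers are the illustrative sketches from the previous subsections.

```python
import numpy as np

def fuse_ct_mri(ct, mri, levels=3):
    """High-level sketch of Steps 1-4; not a runnable reference implementation."""
    # Step 1: NSST decomposition (nsst_decompose is a hypothetical placeholder)
    L1, H1 = nsst_decompose(ct, levels)
    L2, H2 = nsst_decompose(mri, levels)

    # Step 2: CNN-guided fusion of the low-frequency coefficients (Eqs. 11-20)
    feats_ct = vgg_features(to_tensor(ct))       # to_tensor: hypothetical preprocessing
    feats_mri = vgg_features(to_tensor(mri))
    L_fused = fuse_low_frequency(L1, L2, feats_ct, feats_mri)

    # Step 3: DCSCM fusion of every high-frequency sub-band (Eqs. 21-26)
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    H_fused = []
    for h1, h2 in zip(H1, H2):
        U1, U2, U, _ = dcscm(norm(modified_average_gradient(h1)),
                             norm(modified_average_gradient(h2)))
        H_fused.append(np.where(U == U1, h1, h2))      # Eq. (26)

    # Step 4: inverse NSST (nsst_reconstruct is a hypothetical placeholder)
    return nsst_reconstruct(L_fused, H_fused)
```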

3 Results

The simulation experiments were conducted with MATLAB 2017b and Python 2.7 on a PC with an Intel E5-2670 2.6 GHz CPU, 16 GB of RAM, and a GTX 1080 Ti GPU. We test on several groups of accurately registered CT and MRI images, where each CT/MRI pair belongs to the same patient, as shown in Figs. 8, 9, 10, 11, and 12. The source medical images were collected from http://www.med.harvard.edu/AANLIB/.

Fig. 8 Fusion results of the "Data-1" image set with different methods. a CT, b MRI, c DWT, d LAP, e NSST-SCM, f SR, g GFF, h proposed

Fig. 9 Fusion results of the "Data-2" image set with different methods. a CT, b MRI, c DWT, d LAP, e NSST-SCM, f SR, g GFF, h proposed

Fig. 10 Fusion results of the "Data-3" image set with different methods. a CT, b MRI, c DWT, d LAP, e NSST-SCM, f SR, g GFF, h proposed

Fig. 11 Fusion results of the "Data-4" image set with different methods. a CT, b MRI, c DWT, d LAP, e NSST-SCM, f SR, g GFF, h proposed

Fig. 12 Fusion results of the "Data-5" image set with different methods. a CT, b MRI, c DWT, d LAP, e NSST-SCM, f SR, g GFF, h proposed

3.1 Experiment parameters setting

In this section, extensive experiments on CT and MRI medical images are performed to verify the effectiveness of the proposed method. Our fusion method is compared with three representative conventional fusion methods and two state-of-the-art fusion methods: the wavelet-based method (DWT) [4], the Laplacian pyramid (LAP) [5], a multiscale transform-based method (NSST-SCM) [23], a sparse representation-based method (SR) [24], and the guided filter-based fusion method (GFF) [25]. In the experiments, DWT uses "db2" as the filter; NSST uses the non-subsampled pyramid "maxflat" filter, and its decomposition directions are set as [2, 3, 4]. The fusion rule for the low-frequency coefficients is averaging, while the high-frequency coefficients are fused using the absolute-maximum choosing rule. According to [23] and empirical experience, the parameters of the SCM and DCSCM are set as follows:

$$ f=0.2,g=0.6,{V}_{\theta }=20,W=\left[\begin{array}{ccc}0.1091& 0.1409& 0.1091\\ {}0.1409& 0& 0.1409\\ {}0.1091& 0.1409& 0.1091\end{array}\right]. $$

3.2 Subjective evaluations

For convenience, five pairs of CT and MRI images respectively called “Data-1,” “Data-2,” “Data-3,” “Data-4,” and “Data-5” are selected as representative results to demonstrate the performance of the proposed fusion method. All of them cover 256 gray levels and have the same size 256 × 256.

The fusion results based on the different methods for the "Data-1" image set are shown in Fig. 8. The CT and MRI images are shown in Fig. 8a and b, respectively. The fusion results obtained by DWT, LAP, NSST-SCM, SR, GFF, and the proposed method are shown in Fig. 8c–h, respectively. All of the fusion results retain both the bone structures of the CT image and the soft tissues of the MRI image; however, there are slight differences in contrast and detail preservation.

To show the differences between the compared methods more directly, we mark a region of the experimental results with a yellow rectangle. As shown in Fig. 8c and f, the fused results have very low contrast in the marked region. Although the fused images obtained by LAP and NSST-SCM shown in Fig. 8d and e preserve more information from the CT image, they lose some details of the MRI image. In terms of visual effect, the performance of GFF is similar to that of our proposed method, which fully retains the details of the source images. The next medical image set, "Data-2," is shown in Fig. 9. The DWT result lacks bone structure information and has a poor visual effect, and the problem of low contrast also exists in Fig. 9f; there are no significant differences among the other three groups of results. Figure 10 shows the medical image set "Data-3"; compared with the other algorithms, the fusion result of the proposed method has high contrast and fully retains the soft-tissue information, as shown in Fig. 10h. In Figs. 11 and 12, the DWT, NSST-SCM, and SR methods lose details of the source images, and the bone structure information is insufficient. The results obtained by the proposed method have sharp edges, more details, and enhanced contrast.

3.3 Quantitative comparison

In addition to subjective visual evaluation, quantitative evaluation metrics are an important tool for measuring fusion performance. In this paper, the quantitative evaluation metrics include mutual information (MI) [26], mean structural similarity (MSSIM) [27], standard deviation (SD) [10], spatial frequency (SF) [10], image entropy (IE) [10], and edge information retention (QAB/F) [28], which are used to evaluate the different fusion methods.

1) MI measures the statistical dependence between two random variables. The MI of two discrete random variables U and V is defined as follows:

$$ MI\left(U,V\right)=\sum \limits_{v\in V}\sum \limits_{u\in U}p\left(u,v\right){\log}_2\frac{p\left(u,v\right)}{p(u)p(v)} $$
(27)

where p(u, v) is the joint probability distribution of U and V, and p(u) and p(v) are the marginal probability distributions of U and V, respectively. The sum of the mutual information between the fused image and the two source images is used to quantify fusion quality, and the mutual information metric is then described as follows:

$$ M{I}_F^{AB}= MI\left(A,F\right)+ MI\left(B,F\right) $$
(28)

Equation (28) reflects the total amount of information that the fused image F(i, j) contains about both source image A(i, j) and source image B(i, j). The higher the MI score, the richer the information obtained from the source images.
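
A minimal numpy sketch of Eqs. (27)–(28), estimating the distributions from gray-level histograms (our own implementation choice):

```python
import numpy as np

def mutual_information(a, b, bins=256):
    """Eq. (27): MI between two images via their joint gray-level histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_uv = joint / joint.sum()                       # joint distribution p(u, v)
    p_u = p_uv.sum(axis=1, keepdims=True)            # marginal p(u)
    p_v = p_uv.sum(axis=0, keepdims=True)            # marginal p(v)
    nz = p_uv > 0                                    # skip empty bins (log 0)
    return np.sum(p_uv[nz] * np.log2(p_uv[nz] / (p_u * p_v)[nz]))

def mi_metric(A, B, F):
    """Eq. (28): MI_F^{AB} = MI(A, F) + MI(B, F)."""
    return mutual_information(A, F) + mutual_information(B, F)
```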

2) SD measures the dispersion of the image gray values around their mean. The standard deviation of the fused image is calculated as follows.

$$ SD=\sqrt{\frac{1}{M\times N}\sum \limits_{i=1}^M\sum \limits_{j=1}^N{\left(F\left(i,j\right)-\mu \right)}^2} $$
(29)

where F(i, j) is the pixel value of the fused image at location (i, j), and μ is its mean value. This metric indicates the contrast of the fused image; the higher the score, the higher the image quality.

3) SF is composed of the row frequency (RF) and the column frequency (CF) and is described as follows.

$$ SF=\frac{1}{MN}\sum \limits_{i=1}^M\sum \limits_{j=1}^N\left( RF+ CF\right) $$
(30)
$$ RF={\left[C\left(i,j\right)-C\left(i,j-1\right)\right]}^2 $$
(31)
$$ CF={\left[C\left(i,j\right)-C\left(i-1,j\right)\right]}^2 $$
(32)

where C(i, j) denotes the pixel value of the image, and the size of the image is M × N. The higher the score of this metric, the clearer the fused image.

4) IE represents the amount of information in the fused image. The gray-level distribution of an image is P = {P1, P2, …, Pn}, where Pl denotes the ratio of the number of pixels with gray value l to the total number of pixels, and n is the total number of gray levels. IE is computed by Eq. (33).

$$ IE=-\sum \limits_{l=0}^{L-1}P(l){\log}_2P(l) $$
(33)

where P(l) is the probability of gray level l, and L is the number of gray levels of the image. The higher the IE score, the more information the fused image contains.
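
A minimal sketch of Eq. (33) for 8-bit images (the 256-level range is our assumption, based on the data described above):

```python
import numpy as np

def image_entropy(img, levels=256):
    """Eq. (33): Shannon entropy of the gray-level distribution (8-bit assumed)."""
    hist, _ = np.histogram(img.ravel(), bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                                     # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))
```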

5) MSSIM is an effective measure of the similarity between two images and is calculated as follows.

$$ MSSIM=\frac{SSIM\left(A,F\right)+ SSIM\left(B,F\right)}{2} $$
(34)

where SSIM(A, F) and SSIM(B, F) are the structural similarity values between the CT image and the fused image and between the MRI image and the fused image, respectively. SSIM(i, j) is defined as follows.

$$ SSIM\left(i,j\right)=\frac{\left(2{\mu}_i{\mu}_j+{C}_1\right)\left(2{\sigma}_{ij}+{C}_2\right)}{\left({\mu_i}^2+{\mu_j}^2+{C}_1\right)\left({\sigma_i}^2+{\sigma_j}^2+{C}_2\right)} $$
(35)

where μi and μj are the means, σi and σj the standard deviations, and σij the cross-covariance of the two images. C1 and C2 are used to ensure stability when the mean and the variance are close to zero. A rotationally symmetric Gaussian window with standard deviation 1.5 is used in MSSIM. The higher the MSSIM score, the smaller the distortion of the fused image.
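
Assuming scikit-image as the implementation (our choice; the paper only states the Gaussian window with σ = 1.5), Eq. (34) can be sketched as:

```python
from skimage.metrics import structural_similarity as ssim

def mssim_metric(A, B, F, data_range=255):
    """Eq. (34): average SSIM of (A, F) and (B, F) with a Gaussian window, sigma = 1.5."""
    kwargs = dict(gaussian_weights=True, sigma=1.5,
                  use_sample_covariance=False, data_range=data_range)
    return 0.5 * (ssim(A, F, **kwargs) + ssim(B, F, **kwargs))
```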

6) QAB/F measures the degree to which edge information from the source images is transferred to the fused image. It is defined as follows.

$$ {Q}^{AB/F}=\frac{\sum \limits_{i=1}^N\sum \limits_{j=1}^M\left({Q}^{AF}\left(i,j\right){w}^A\left(i,j\right)+{Q}^{BF}\left(i,j\right){w}^B\left(i,j\right)\right)}{\sum \limits_i^N\sum \limits_j^M\left({w}^A\left(i,j\right)+{w}^B\left(i,j\right)\right)} $$
(36)

where \( {Q}^{AF}\left(i,j\right)={Q}_g^{AF}\left(i,j\right){Q}_o^{AF}\left(i,j\right) \); \( {Q}_g^{AF}\left(i,j\right) \) and \( {Q}_{\mathrm{o}}^{AF}\left(i,j\right) \) are the edge strength and orientation preservation values at location (i, j), respectively. N and M are the dimensions of the image, QBF(i, j) is defined analogously to QAF(i, j), and wA(i, j) and wB(i, j) are the weights of QAF(i, j) and QBF(i, j), respectively. The closer QAB/F is to unity, the less edge information is lost in the fused image.

The objective quantitative assessments based on the image quality metrics are shown in Table 1, with the best results in italics. The proposed method achieves the best scores in terms of MI, SD, MSSIM, and QAB/F, and the remaining metrics differ only slightly among the methods, which demonstrates that the proposed method is highly competent at preserving edge details and saliency information. Moreover, the two most informative relative evaluation metrics, MI and QAB/F, are presented as line charts for intuitive comparison, as shown in Fig. 13.

Table 1 Quantitative assessments comparison of different methods
Fig. 13 Objective evaluation results based on Figs. 8, 9, 10, 11, and 12. a The comparison line chart for MI, b the comparison line chart for QAB/F

4 Discussions

To summarize the experimental results, the fused images based on the SR and DWT methods look unsatisfactory due to low contrast and loss of bone structure information. The visual effect of the results based on LAP, NSST-SCM, and GFF is improved, but texture and edges in the yellow marked regions are not fully preserved. By contrast, the proposed method achieves clear, high-contrast fused results by retaining salient features, synthesizing the soft-tissue and bone structure information to the maximum extent.

Next, we discuss the quantitative assessments in detail. Among the six metrics, IE, SF, and SD reflect the internal features of a single image and are commonly used to measure the quality of the fused image. IE represents the information entropy of the fused image, SF reflects its clarity, and SD describes its contrast: the larger the SD, the more dispersed the gray-level distribution, the greater the contrast, and the better the visualization of the fused image. However, some results contain more redundant information, which can also inflate the scores of these three metrics. To make the analysis more comprehensive and objective, we introduce three further metrics, MI, MSSIM, and QAB/F, which better reflect the relationship between the source images and the fused image. MI describes the similarity of the image intensity distributions of the corresponding image pair and estimates how much information is obtained from the source images; a higher MI score indicates richer information and detail obtained from the CT and MRI images, which also implies a higher activity and clarity level in the fused image. MSSIM represents the degree of distortion of the fused image. QAB/F measures the amount of edge information transferred from the source images to the fused image; this metric is essential for medical image fusion, because a higher score means that more edge details such as bone structure and texture are fused, which is helpful for accurate pathological analysis. According to the quantitative analysis of the experimental results, the proposed method achieves the best performance especially in the MI, SD, and QAB/F metrics, which means that the fused image has high contrast, low distortion, and contains sufficient dense-structure and soft-tissue information as well as texture and edge details. The results of the proposed method are therefore more suitable for assisting doctors in the accurate diagnosis of diseases.

At the present stage, we adopt a deep learning scheme to extract multi-layer features and generate the low-frequency fusion weights, which achieves good performance, but the selection of the deep features still depends on a manually designed rule. In the future, we will continue to study the relationship between multimodal features, try to build an unsupervised deep fusion model, and make the fusion framework more robust and adaptive.

5 Conclusions

In this paper, a novel fusion method for CT and MRI medical images is proposed by integrating CNNs and a DCSCM in the NSST domain. In the proposed method, NSST provides both the multiscale and the directional analysis of the source images. A CNN-based activity level measurement is used to fuse the low-frequency coefficients, and for the high-frequency coefficients, the modified average gradient is utilized as the external stimulus of the DCSCM. Extensive comparative experiments on different pairs of CT and MRI images verify the superiority of the proposed method in both visual effect and objective evaluation. Moreover, the results demonstrate that this fusion scheme has promising application prospects in the field of medical image fusion.