1 Introduction

Infrared sensors can accurately capture the infrared radiation emitted by objects, independent of lighting conditions [16]. Therefore, infrared images can highlight targets with high thermal radiation well (such as the people in the infrared image in Fig. 1) [11]. Unfortunately, the spatial resolution of infrared images is insufficient for displaying detailed textures clearly. Conversely, visible images can provide clear textures and rich details, such as edges or lines. However, under inadequate illumination or in complex scenes, visible images often struggle to display targets effectively (such as the people in the visible image in Fig. 1) [26]. Hence, a single infrared or visible image alone cannot offer comprehensive information about a scene in many computer vision applications. To address this challenge, image fusion technology offers an effective solution capable of synthesizing the information from both infrared and visible source images [41]. As a result, infrared–visible image fusion technology plays a crucial role in various fields, including object recognition [35], remote sensing [37], surveillance [15], and more.

Fig. 1

A pair of typical infrared–visible images. The left one is an infrared image, and the right one is a visible image

To acquire high-quality infrared–visible fusion images, a large number of fusion algorithms have been proposed. Among them, fusion methods based on multi-scale transformation have garnered significant attention and developed rapidly [7, 8, 17]. For instance, the effectiveness of wavelet-based multi-scale transform algorithms has been substantiated through numerous experiments and applications, and they are still studied and used today [9, 32, 40, 45]. However, a limitation of these methods is their inability to represent two-dimensional singular information such as edges or lines effectively [6]. Consequently, multidirectional wavelets [2,3,4], the curvelet [1], contourlet [23], shearlet [14], and other methods [5, 29] have been proposed to solve this problem. Regrettably, these approaches often require designing complex filter banks, and their excellent fusion performance comes at the expense of real-time processing capability.

With the development of technology, hybrid multi-scale methods have emerged to overcome the performance limitations of individual algorithms [12, 13, 42]. For example, Zhou et al. proposed a fusion method based on hybrid multi-scale decomposition, in which the fused image produces enhanced visual effects by incorporating infrared spectral information [12]. Furthermore, with the widespread adoption of deep learning, many deep-learning-based fusion methods have been proposed in recent years [19,20,21, 27]. For example, Ref. [18] presents a fusion framework for infrared–visible images based on deep learning, while Ref. [20] combines a residual network with zero-phase component analysis for the fusion of infrared–visible images.

These deep-learning-based fusion methods are innovative and have achieved notable results. However, according to Ref. [39], extensive fusion experiments suggest that the improvement in fusion performance brought by deep-learning technology is quite limited; in most cases, these methods show no significant advantage over traditional classical algorithms. Additionally, deep-learning-based methods have two prominent drawbacks. First, their fusion performance relies heavily on the availability of training datasets; unfortunately, constructing large-scale datasets is often challenging in many fields, and ground-truth fusion images are frequently unavailable. Second, their generalization capability is severely constrained: fusion models trained on datasets of multi-focus images usually perform poorly when confronted with other image types. As a result, it is crucial to develop a general, high-performance fusion method for infrared–visible images.

In general, the key to obtaining high-quality fusion images lies in extracting and effectively fusing the complementary information present in the source images. The primary sources of complementary information between infrared and visible images are texture details (such as edges and lines) and bright-dark information; fully extracting and efficiently fusing these elements is therefore crucial for achieving optimal fusion results. To this end, this paper proposes a novel fusion method called non-subsampled morphological transformation (NSMT). First, the non-subsampled pyramid (NSP) decomposition is applied to decompose the source image into one low-frequency background image (LF) and a series of high-frequency detail images (HF) with different resolutions. Second, a dual-channel multi-scale morphological decomposition called MTBH is constructed to extract the bright-dark details from the LF image. Simultaneously, a dual-channel multi-direction morphological edge decomposition called MIOE is constructed to decompose the HF images and fully extract the edges in various directions. Subsequently, the edge smooth images and edge detail images are fused by the absolute-maximum and window-gradient-maximum rules, respectively. For the bright-dark information of the background, the rule of "choosing brighter from the bright and darker from the dark" is used to enhance image contrast. Finally, the fusion image is reconstructed by the corresponding inverse transformations.

The rest of this paper is organized as follows: in Sect. 2, we will present and elaborate on the details of the proposed algorithm, covering three key aspects: decomposition, fusion, and reconstruction. Section 3 demonstrates the fusion performance of our method through a series of fusion experiments. Finally, Sect. 4 provides the main conclusions drawn from this research.

2 Proposed Algorithm

The flowchart of our NSMT is shown in Fig. 2, which consists of three processes, i.e., decomposition, fusion, and reconstruction. In Fig. 2, the iNSP, iMTBH, and iMIOE are the inverse transformations of NSP, MTBH, and MIOE, respectively.

Fig. 2

The flowchart of our NSMT

2.1 Decomposition

The decomposition process of NSMT is shown in Fig. 3. First, the source image \(f(x,y)\) is decomposed into K high-frequency detail images \({HF}_{i}\) and one low-frequency background image \(LF\) by K-level NSP. Second, the \(LF\) image is decomposed into the bright-dark smooth images \({T}_{N}\)-\({B}_{N}\) and the bright-dark detail images \({TH}_{n}\)-\({BH}_{n}\) by N-level MTBH. For the high-frequency detail images \({HF}_{i}\), the smooth inner-outer edge images \({I}_{i,D}\)-\({O}_{i,D}\) and the multi-direction inner-outer edge detail images \({IE}_{i,d}\)-\({OE}_{i,d}\) are obtained by D-direction MIOE.

Fig. 3

Decomposition schema of our NSMT

2.1.1 NSP Decomposition

To obtain the shift-invariance property, Zhou et al. realized the NSP decomposition by constructing dual-channel 2-D non-subsampled filter banks [43]. When the decomposition filter banks \(\{{H}_{0}(z), {G}_{0}(z)\}\) and the reconstruction filter banks \(\{{H}_{1}(z), {G}_{1}(z)\}\) satisfy the Bezout identity in Eq. (1), the decomposition satisfies the perfect-reconstruction condition.

$$ H_{0} \left( z \right) \cdot H_{1} \left( z \right) + G_{0} \left( z \right) \cdot G_{1} \left( z \right) = 1 $$
(1)

Moreover, to realize multi-scale decomposition, each stage up-samples the filters of the previous stage according to the sampling matrix \({\varvec{D}}=2{\varvec{I}}\) (\({\varvec{I}}\) denotes the \(2\times 2\) identity matrix). The decomposition process is shown in Eq. (2).

$$ \left\{ {\begin{array}{*{20}l} {f_{0} \left( {x,y} \right) = F} \hfill \\ {f_{i} \left( {x,y} \right) = H_{0} \left( {z^{{2^{{i - 1}} I}} } \right)*f_{{i - 1}} \left( {x,y} \right)} \hfill \\ {h_{i} \left( {x,y} \right) = G_{0} \left( {z^{{2^{{i - 1}} I}} } \right)*f_{{i - 1}} \left( {x,y} \right)} \hfill \\ \end{array} } \right. $$
(2)

where \(F\) represents the source image; \({H}_{0}\left(*\right)\) and \({G}_{0}\left(*\right)\) represent the up-sampled low-pass and high-pass filters, respectively; \({f}_{i}(x,y)\) and \({h}_{i}(x,y)\) represent the low- and high-frequency images at the i-th level.

An image can be decomposed into \(K+1\) sub-images (one low-frequency image and \(K\) high-frequency images) by K-level NSP decomposition. Taking the "Kaptein" image as an example, the results of 4-level NSP decomposition are shown in Fig. 4.

Fig. 4

The decomposed results of the “Kaptein” image by 4-level NSP

According to the decomposition results, it is easy to see that the edges and lines are mainly concentrated in the high-frequency images, and their fineness increases with the level. In addition, the details in the LF image are very blurred; its main content is the bright-dark background information. This makes it feasible to extract multi-direction edges from \({HF}_{i}\) and bright-dark information from \(LF\).
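The following Python sketch illustrates one way to approximate a K-level NSP decomposition of Eq. (2) with an à trous (filter up-sampling) scheme. It is not the authors' implementation: the B3-spline base kernel and the difference-of-lowpass high-frequency channel are illustrative assumptions, not the filter banks of Ref. [43].

```python
import numpy as np
from scipy.ndimage import convolve

def atrous_nsp(image, levels=4):
    """Approximate K-level non-subsampled pyramid (NSP) decomposition.

    Returns (LF, [HF_1, ..., HF_K]).  The high-pass channel is taken as the
    difference of successive low-pass images, so the (approximate)
    reconstruction is simply LF + sum(HF_i).
    """
    # Base low-pass kernel (B3-spline), an assumption of this sketch.
    h = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0
    kernel0 = np.outer(h, h)

    lf = image.astype(float)
    highs = []
    for i in range(levels):
        # Up-sample the kernel by inserting zeros, i.e. apply the
        # sampling matrix D = 2I of the paper i times.
        step = 2 ** i
        size = (len(h) - 1) * step + 1
        kernel = np.zeros((size, size))
        kernel[::step, ::step] = kernel0
        lf_next = convolve(lf, kernel, mode='reflect')
        highs.append(lf - lf_next)   # high-frequency detail image HF_{i+1}
        lf = lf_next                 # low-frequency image for the next level
    return lf, highs
```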

For reconstruction, the low- and high-frequency reconstruction filters are applied to the low- and high-frequency images, respectively. The reconstruction of the i-th level is shown in Eq. (3).

$$ f_{i} \left( {x,y} \right) = H_{1} \left( {z^{{2^{i} I}} } \right)*f_{i + 1} \left( {x,y} \right) + G_{1} \left( {z^{{2^{i} I}} } \right)*h_{i + 1} \left( {x,y} \right) $$
(3)

2.1.2 MTBH Decomposition

The morphological top-hat (TH) and bottom-hat (BH) transforms are defined from the morphological Open and Close operations, as shown in Eq. (4). TH extracts patches that are brighter than their neighborhood, namely bright details, while BH extracts dark details [44].

$$ \left\{ {\begin{array}{*{20}c} {TH\left( F \right) = F - {\text{Open}}\left( F \right)} \\ {BH\left( F \right) = {\text{Close}}\left( F \right) - F} \\ \end{array} } \right. $$
(4)

Since TH and BH at a single scale cannot extract the bright-dark details adequately, multi-scale Open and Close are defined in Eq. (5).

$$ \left\{ {\begin{array}{*{20}l} {{\text{Open}}_{n} \left( F \right) = F^\circ nS = \left( {F{ \ominus }nS} \right) \oplus nS} \hfill \\ {{\text{Close}}_{n} \left( F \right) = F \cdot nS = \left( {F \oplus nS} \right){ \ominus }nS} \hfill \\ \end{array} } \right. $$
(5)

where \(n=1, 2, \dots, N\) is the scale factor, an integer that determines the size of the structuring element; \(\oplus \) and \(\ominus \) denote Dilate and Erode, respectively. In this paper, the shape of the structuring element \(S\) is a disk, and \(nS\) represents the n-th structuring element, whose size is determined by \(2n+1\). Based on the pyramid idea, the multi-scale top-hat (MTH) decomposition is defined in Eq. (6).

$$ \left\{ {\begin{array}{*{20}l} {T_{0} \left( {x,y} \right) = F} \hfill \\ {T_{n} \left( {x,y} \right) = {\text{Open}}_{n} \left( {T_{n - 1} \left( {x,y} \right)} \right)} \hfill \\ {{\text{MTH}}_{n} \left( {x,y} \right) = T_{n - 1} \left( {x,y} \right) - T_{n} \left( {x,y} \right)} \hfill \\ \end{array} } \right. $$
(6)

where \(F\) is the source image and \({MTH}_{n}\left(x,y\right), n=1, 2, \dots , N\) is a set of bright detail images. In addition, the last-level image \({T}_{N}(x,y)\) is the smooth bright background image. By a similar process, the multi-scale bottom-hat (MBH) decomposition is obtained, as shown in Eq. (7).

$$ \left\{ {\begin{array}{*{20}l} {B_{0} \left( {x,y} \right) = F} \hfill \\ {B_{n} \left( {x,y} \right) = {\text{Close}}_{n} \left( {B_{n - 1} \left( {x,y} \right)} \right) } \hfill \\ {{\text{MBH}}_{n} \left( {x,y} \right) = B_{n} \left( {x,y} \right) - B_{n - 1} \left( {x,y} \right)} \hfill \\ \end{array} } \right. $$
(7)

In Eq. (7), \({MBH}_{n}\left(x,y\right)\) represents the dark detail images, and \({B}_{N}(x,y)\) is the smooth dark background image.
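A minimal Python sketch of the N-level MTBH decomposition of Eqs. (6) and (7) is given below; the disk helper and the use of scipy's grey-scale morphology are assumptions of this sketch, not details prescribed by the paper.

```python
import numpy as np
from scipy.ndimage import grey_opening, grey_closing

def disk(radius):
    """Binary disk footprint of size (2*radius + 1) x (2*radius + 1)."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x * x + y * y <= radius * radius

def mtbh(lf, levels=5):
    """N-level multi-scale top-hat / bottom-hat (MTBH) decomposition.

    Returns (T_N, [MTH_1..MTH_N], B_N, [MBH_1..MBH_N]).
    """
    t = lf.astype(float)
    b = lf.astype(float)
    mth, mbh = [], []
    for n in range(1, levels + 1):
        fp = disk(n)                              # structuring element nS
        t_next = grey_opening(t, footprint=fp)    # T_n = Open_n(T_{n-1})
        b_next = grey_closing(b, footprint=fp)    # B_n = Close_n(B_{n-1})
        mth.append(t - t_next)                    # MTH_n: bright details
        mbh.append(b_next - b)                    # MBH_n: dark details
        t, b = t_next, b_next
    return t, mth, b, mbh
```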

Figure 5 shows the decomposition results of the \(LF\) image from Fig. 4 by 5-level MTBH. As can be seen from Fig. 5, the high-layer images contain bright-dark details, and the fineness of the bright-dark information increases with the number of layers. This is mainly because MTBH compares only the pixels covered by the structuring element when extracting the bright-dark information. Therefore, bright-dark details appear in high-layer images, while large areas with bright-dark features are captured by low-layer images or retained directly in the smooth images as background. Taking the pedestrian in the infrared image as an example, the pedestrian area is not extracted in the high-layer bright detail images \({MTH}_{4}^{IR}\) and \({MTH}_{5}^{IR}\), but it is successfully extracted in the low-layer image \({MTH}_{1}^{IR}\). Moreover, the sky in the visible image is retained as background in the smooth image \({T}_{N}^{VIS}\) (the background information after removing the bright-dark details) because its area is too large. In addition, note that in the dark detail images \({MBH}_{n}(x,y)\), even the dark information is displayed with bright pixels.

Fig. 5

The decomposition results of the LF image by 5-level MTBH

According to Eqs. (6) and (7), the source image \(F\) can be perfectly reconstructed from the MTH or MBH decomposition, respectively, as shown in Eq. (8).

$$ \left\{ {\begin{array}{*{20}l} {F = T_{0} \left( {x,y} \right) = T_{N} \left( {x,y} \right) + \mathop \sum \limits_{n = 1}^{N} {\text{MTH}}_{n} \left( {x,y} \right)} \hfill \\ {F = B_{0} \left( {x,y} \right) = B_{N} \left( {x,y} \right) - \mathop \sum \limits_{n = 1}^{N} {\text{MBH}}_{n} \left( {x,y} \right)} \hfill \\ \end{array} } \right. $$
(8)

2.1.3 MIOE Decomposition

Considering the directionality of edges, a series of 7-pixel "line" structuring elements \({S}_{d}\) with different orientations are used to construct the multi-direction inner-edge (MIE) and outer-edge (MOE) operators, which extract the edge details in different directions. Their definition is shown in Eq. (9).

$$ \left\{ {\begin{array}{*{20}l} {{\text{MIE}}_{d} = F - F{ \ominus }S_{d} } \hfill \\ {{\text{MOE}}_{d} = F \oplus S_{d} - F } \hfill \\ \end{array} } \right. $$
(9)

where \(d\) represents the direction index and \(F\) is the source image. Moreover, the inner-outer edge smooth images \({I}_{D}\) and \({O}_{D}\) are defined in Eq. (10).

$$ \left\{ {\begin{array}{*{20}c} {I_{D} = \frac{1}{D}\mathop \sum \limits_{d = 1}^{D} \left( {F\ominus S_{d} } \right)} \\ {O_{D} = \frac{1}{D}\mathop \sum \limits_{d = 1}^{D} \left( {F \oplus S_{d} } \right)} \\ \end{array} } \right. $$
(10)
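A Python sketch of the D-direction MIOE decomposition of Eqs. (9) and (10) is shown below; the line_se helper that rasterizes an oriented 7-pixel line is a hypothetical construction for illustration only.

```python
import numpy as np
from scipy.ndimage import grey_erosion, grey_dilation

def line_se(length=7, angle_deg=0.0):
    """Hypothetical 7-pixel 'line' structuring element at a given angle."""
    se = np.zeros((length, length), dtype=bool)
    c = length // 2
    theta = np.deg2rad(angle_deg)
    for t in np.linspace(-c, c, 4 * length):   # rasterize the oriented line
        r = int(round(c - t * np.sin(theta)))
        q = int(round(c + t * np.cos(theta)))
        se[r, q] = True
    return se

def mioe(hf, directions=4):
    """D-direction inner/outer edge decomposition of a high-frequency image.

    Returns (I_D, O_D, [MIE_1..MIE_D], [MOE_1..MOE_D]).
    """
    f = hf.astype(float)
    mie, moe, eroded, dilated = [], [], [], []
    for d in range(directions):
        se = line_se(7, 180.0 * d / directions)
        e = grey_erosion(f, footprint=se)
        g = grey_dilation(f, footprint=se)
        mie.append(f - e)        # inner edges in direction d, Eq. (9)
        moe.append(g - f)        # outer edges in direction d, Eq. (9)
        eroded.append(e)
        dilated.append(g)
    i_smooth = np.mean(eroded, axis=0)   # I_D, Eq. (10)
    o_smooth = np.mean(dilated, axis=0)  # O_D, Eq. (10)
    return i_smooth, o_smooth, mie, moe
```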

Taking \({HF}_{3}\) as an example, its 4-direction decomposition is shown in Fig. 6. The edges in the decomposed images show significant directionality from \({MIE}_{\mathrm{3,1}}\) to \({MIE}_{\mathrm{3,4}}\) (or from \({MOE}_{\mathrm{3,1}}\) to \({MOE}_{\mathrm{3,4}}\)). Note that the fineness of these multi-direction edge images is consistent; they differ only in the direction of the edges.

Fig. 6

The 4-direction edge detail images of \({HF}_{3}\)

According to Eqs. (9) and (10), the reconstruction can be deduced from MIE or MOE, respectively, as shown in Eq. (11).

$$ \left\{ {\begin{array}{*{20}l} {IE^{R} = \frac{1}{D}\mathop \sum \limits_{d = 1}^{D} \left( {I_{D} + {\text{MIE}}_{d} } \right)} \hfill \\ {OE^{R} = \frac{1}{D}\mathop \sum \limits_{d = 1}^{D} \left( {O_{D} - {\text{MOE}}_{d} } \right)} \hfill \\ \end{array} } \right. $$
(11)

where D represents the total number of directions, and \({IE}^{R}\) and \({OE}^{R}\) represent reconstructed images based on MIE and MOE, respectively.

2.2 Fusion

To make full use of the extracted information, our NSMT adopts different fusion rules for the decomposed results from MTBH and MIOE.

2.2.1 Fusion for MTBH

High contrast is more conducive to highlighting image details. Therefore, for the fusion of the bright-dark information (including the smooth and detail layers), the rule of choosing the brighter from the bright and the darker from the dark is proposed to improve the image contrast. This fusion rule is formulated in Eq. (12).

$$ \left\{ {\begin{array}{*{20}c} {fT = \max \left( {T^{VIS} ,T^{IR} } \right)} \\ {fB = \min \left( {B^{VIS} ,B^{IR} } \right)} \\ \end{array} } \right. $$
(12)

where \(fT\) and \(fB\) represent the fused bright and dark images; \(T\) and \(B\) represent the extracted bright and dark images.

2.2.2 Fusion for MIOE

For the edge smooth images \({I}_{i,D}\) and \({O}_{i,D}\), the absolute-maximum rule is adopted, as shown in Eq. (13).

$$ fIO_{i,D} \left( {x,y} \right) = \left\{ {\begin{array}{*{20}l} {IO_{i,D}^{VIS} \left( {x,y} \right),} \hfill & {\left| {IO_{i,D}^{VIS} \left( {x,y} \right)} \right| > \left| {IO_{i,D}^{IR} \left( {x,y} \right)} \right|} \hfill \\ {IO_{i,D}^{IR} \left( {x,y} \right), } \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$
(13)

where \({fIO}_{i,D}\left(x,y\right)\) represents the fused inner-outer edge smooth image; \({IO}_{i,D}^{VIS}(x,y)\) and \({IO}_{i,D}^{IR}(x,y)\) represent the extracted visible and infrared inner-outer edge smooth images, respectively.

In addition, the window-gradient-maximum rule is applied to fuse the inner-outer edge detail images \({MIE}_{i,d}\) and \({MOE}_{i,d}\). In our NSMT, the window size is \(3\times 3\), and the fusion rule is shown in Eq. (14), where \(WG(x,y)\) is the window gradient at pixel \((x,y)\).

$$ fMIOE_{i,d} \left( {x,y} \right) = \left\{ {\begin{array}{*{20}l} {{\text{MIOE}}_{i,d}^{VIS} \left( {x,y} \right),} \hfill & {{\text{WG}}_{VIS} \left( {x,y} \right) > {\text{WG}}_{IR} \left( {x,y} \right)} \hfill \\ {{\text{MIOE}}_{i,d}^{IR} \left( {x,y} \right),} \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$
(14)

where \(WG\left(x,y\right)=\sqrt{{{G}_{x}}^{2}+{{G}_{y}}^{2}}\), in which the \({G}_{x}\) and \({G}_{y}\) represent horizontal and vertical gradients, and their definitions are shown in Eq. (15).

$$ \left\{ {\begin{array}{*{20}c} {G_{x} = \mathop \sum \limits_{x = 1}^{3} \mathop \sum \limits_{y = 1}^{3} \left[ {f\left( {x,y} \right) - f\left( {x,y - 1} \right)} \right]} \\ {G_{y} = \mathop \sum \limits_{x = 1}^{3} \mathop \sum \limits_{y = 1}^{3} \left[ {f\left( {x,y} \right) - f\left( {x - 1,y} \right)} \right]} \\ \end{array} } \right. $$
(15)
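The two edge fusion rules of Eqs. (13)-(15) can be sketched in Python as follows; approximating the 3 × 3 window sum with a uniform filter and the boundary handling are implementation assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def window_gradient(img):
    """Window gradient of Eq. (15): 3x3 sums of horizontal/vertical differences."""
    f = img.astype(float)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, 1:] = f[:, 1:] - f[:, :-1]          # f(x, y) - f(x, y-1)
    gy[1:, :] = f[1:, :] - f[:-1, :]          # f(x, y) - f(x-1, y)
    gx = uniform_filter(gx, size=3) * 9.0     # sum over the 3x3 window
    gy = uniform_filter(gy, size=3) * 9.0
    return np.sqrt(gx ** 2 + gy ** 2)

def fuse_edge_smooth(vis, ir):
    """Absolute-maximum rule of Eq. (13) for the edge smooth images."""
    return np.where(np.abs(vis) > np.abs(ir), vis, ir)

def fuse_edge_details(vis, ir):
    """Window-gradient-maximum rule of Eq. (14) for the edge detail images."""
    return np.where(window_gradient(vis) > window_gradient(ir), vis, ir)
```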

2.3 Reconstruction

2.3.1 Reconstruction of Background Image \({\varvec{f}}{\varvec{L}}{\varvec{F}}\)

In the process of reconstructing the \(LF\) image, we aim to preserve the integrity of the bright-dark information and to adjust the image contrast flexibly through the proportion of the bright-dark components. Therefore, the smooth images are synthesized by the average method, and the bright-dark detail images by the weighted method. According to Eq. (8), the reconstruction of \(fLF\) can be expressed as shown in Eq. (16),

$$ fLF = \frac{{fT_{N} + fB_{N} }}{2} + {\upalpha }\mathop \sum \limits_{n = 1}^{N} f{\text{MTH}}_{n} - \left( {1 - \alpha } \right)\mathop \sum \limits_{n = 1}^{N} f{\text{MBH}}_{n} $$
(16)

where \(fLF\) is the low-frequency background image of the fused image, and \(\mathrm{\alpha }\) is a weight factor that adjusts the proportion of bright-dark details. In this paper, \(\mathrm{\alpha }=0.5\) in the absence of prior knowledge.
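Under the notation of Eq. (16), the background reconstruction is a one-liner; the sketch below assumes the fused layers are held in Python lists, following the earlier sketches.

```python
def reconstruct_flf(fT_N, fB_N, fMTH, fMBH, alpha=0.5):
    """Eq. (16): average the fused smooth layers and weight the bright/dark details."""
    return (fT_N + fB_N) / 2.0 + alpha * sum(fMTH) - (1.0 - alpha) * sum(fMBH)
```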

2.3.2 Reconstruction of Edge Detail Images \({{\varvec{f}}{\varvec{H}}{\varvec{F}}}_{{\varvec{i}}}\)

The inner and outer edges are equally important, so the average method is used to synthesize \({fHF}_{i}\), as shown in Eq. (17),

$$ f{\text{HF}}_{i} = \frac{{f{\text{IE}}_{i}^{R} + f{\text{OE}}_{i}^{R} }}{2} $$
(17)

where \({fIE}_{i}^{R}\) and \({fOE}_{i}^{R}\) represent the fused inner and outer edge images reconstructed according to Eq. (11).

2.3.3 Reconstruction of Fusion Image

According to Eq. (3), \({f}_{0}(x,y)\) can be reconstructed from \(fLF\) and \({fHF}_{i}\) layer by layer. The fusion image \(F\) is then obtained directly as \(F={f}_{0}(x,y)\).
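Putting the pieces together, the following end-to-end sketch ties the earlier helper sketches (atrous_nsp, mtbh, mioe, fuse_edge_smooth, fuse_edge_details, reconstruct_flf) into a single NSMT-like pipeline. It is not the authors' implementation: the NSP channel is replaced by the à trous approximation, so the final inverse transform reduces to adding the fused low- and high-frequency layers.

```python
import numpy as np

def fuse_nsmt(ir, vis, levels=4, mtbh_levels=5,
              directions=(2, 4, 8, 16), alpha=0.5):
    """Sketch of the NSMT pipeline: decomposition, fusion, reconstruction."""
    # 1. Decomposition (Sect. 2.1)
    lf_ir, hf_ir = atrous_nsp(ir, levels)
    lf_vis, hf_vis = atrous_nsp(vis, levels)
    tN_ir, mth_ir, bN_ir, mbh_ir = mtbh(lf_ir, mtbh_levels)
    tN_vis, mth_vis, bN_vis, mbh_vis = mtbh(lf_vis, mtbh_levels)

    # 2. Fusion of the background layers, Eq. (12), and reconstruction, Eq. (16)
    fT_N = np.maximum(tN_vis, tN_ir)
    fB_N = np.minimum(bN_vis, bN_ir)
    fMTH = [np.maximum(v, r) for v, r in zip(mth_vis, mth_ir)]
    fMBH = [np.minimum(v, r) for v, r in zip(mbh_vis, mbh_ir)]
    fLF = reconstruct_flf(fT_N, fB_N, fMTH, fMBH, alpha)

    # 3. Fusion of each high-frequency layer, Eqs. (13)-(14), and
    #    reconstruction, Eqs. (11) and (17)
    fHF = []
    for hf_v, hf_r, d in zip(hf_vis, hf_ir, directions):
        iD_v, oD_v, mie_v, moe_v = mioe(hf_v, d)
        iD_r, oD_r, mie_r, moe_r = mioe(hf_r, d)
        fI = fuse_edge_smooth(iD_v, iD_r)
        fO = fuse_edge_smooth(oD_v, oD_r)
        fMIE = [fuse_edge_details(a, b) for a, b in zip(mie_v, mie_r)]
        fMOE = [fuse_edge_details(a, b) for a, b in zip(moe_v, moe_r)]
        ie_rec = np.mean([fI + e for e in fMIE], axis=0)   # Eq. (11)
        oe_rec = np.mean([fO - e for e in fMOE], axis=0)
        fHF.append((ie_rec + oe_rec) / 2.0)                # Eq. (17)

    # 4. Inverse NSP (exact under the a-trous approximation)
    return fLF + np.sum(fHF, axis=0)
```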

3 Experiments and Discussions

To explore the fusion performance of our NSMT, some state-of-the-art methods are selected for comparison, namely FPDE [5], HMSD [42], JSR [24], LATLRR [22], MSVD [30], VSMWLS [28], and some deep-learning-based methods (DLF [18], DenseFuse [19], and ResNet [20]). For our NSMT, the 4-level NSP decomposition yields five sub-images, i.e., one low-frequency background image \(LF\) and four high-frequency detail images \({HF}_{i}\ (i=1, 2, 3, 4)\). In addition, to extract as much bright-dark information as possible from the LF image, the number of MTBH decomposition levels is set to 5 (\(N=5\)). For the \({HF}_{i}\) images, the numbers of MIOE directions from \(i=1\) to \(4\) are set to 2, 4, 8, 16 (namely \(D=2, 4, 8, 16\)), respectively.

Ten pairs of typical infrared–visible images have been selected as fusion objects from a public database, and all images have been pre-registered. These images are often selected as fusion objects in infrared–visible image fusion studies because they are representative and contain common objects such as buildings, trees, roads, pedestrians, and grasslands. The source images and fusion results are shown in Fig. 7. The first two rows are the original infrared and visible source images, the last row shows the fusion results of our NSMT, and the remaining nine rows correspond to the fusion results of the nine comparison methods, namely DLF, DenseFuse, FPDE, HMSD, JSR, LATLRR, MSVD, ResNet, and VSMWLS. According to the fusion results, all methods can fuse infrared and visible images successfully.

Fig. 7

Fusion results of ten pairs of infrared–visible images. From top to bottom: infrared images, visible images, and the fusion results of DLF, DenseFuse, FPDE, HMSD, JSR, LATLRR, MSVD, ResNet, VSMWLS, and our NSMT

3.1 Subjective Evaluation

Due to the limited page size, it is difficult to judge from the figure above which result is best or worst. To show their differences more intuitively, taking the first column as an example, the regions in the red rectangles are selected as observation objects, as shown in Fig. 8.

Fig. 8

The fusion images from the first column. From (a) to (j) are the fusion results of DLF, DenseFuse, FPDE, HMSD, JSR, LATLRR, MSVD, ResNet, VSMWLS, and our NSMT

From the selected regions, it is not difficult to find that the results of our NSMT and FPDE are better than those of the other methods in terms of sharpness, and they show a clearer outline of the branches. The edge details mainly come from the visible image; therefore, when the visible image is taken as the reference, the fusion result of NSMT better retains the information from the source image. There are two main reasons for this. On the one hand, NSMT extracts the bright-dark information from the low-frequency background and fuses it with the rule of "choosing brighter from the bright and darker from the dark," which effectively preserves the bright-dark distribution of the source images. On the other hand, the multi-direction edges extracted from the high-frequency images, fused with the absolute-maximum and window-gradient-maximum rules, effectively maintain edge integrity. The combination of these two aspects gives the fusion image generated by NSMT an excellent visual effect.

To further explore the performance of NSMT, the sixth column in Fig. 7 is taken as the observation object, as shown in Fig. 9. First, regarding the pedestrian indicated by the red arrow, the pedestrian in the fusion images of HMSD and NSMT has higher brightness than in the others. However, when we focus on the green rectangular regions, NSMT has obvious advantages over HMSD in terms of sharpness and edge integrity. In addition, NSMT shows better contrast than the other methods in this region; the excellent contrast gives the results generated by NSMT higher sharpness and more notable targets. These results further verify the excellent performance of NSMT in fusing infrared and visible images.

Fig. 9

The fusion images from the sixth column. From (a) to (j) are the fusion results of DLF, DenseFuse, FPDE, HMSD, JSR, LATLRR, MSVD, ResNet, VSMWLS, and our NSMT

3.2 Objective Evaluation

In practice, an ideal fusion image is often associated with specific tasks, so it is not always easy to find a universal quality evaluation method. Although subjective evaluation can judge fusion quality from the human perspective, the quality differences among fusion results are very small in most cases, and it is difficult to give uniform and accurate judgments due to differences in individual mental state, visual acuity, psychological factors, and so on. To circumvent this deficiency, many objective evaluation methods and metrics have been proposed. However, there is no universally accepted method for objectively evaluating image fusion results.

In this paper, to comprehensively evaluate our NSMT from different perspectives, we choose eight commonly used and typical objective metrics, namely standard deviation (SD), information entropy (IE) [34], mutual information (MI) [33], \({Q}_{0}\) [36], \({Q}_{W}\) [31], \({Q}^{AB/F}\) [38], \({Q}_{E}\) [31] and VIF [10]. Among them, the SD can evaluate the image contrast from a statistical viewpoint. The \(IE\) can measure the information amount of the fused image from the viewpoint of information theory. The \(MI\) can measure the information amount that is transferred from source images to the fused image. The \({Q}_{0}\) can evaluate the distortion degree of a fused image, which combines three factors of image distortion related to the human visual system, namely loss of correlation, contrast distortion, and luminance distortion. The \({Q}_{w}\) is used to evaluate the salience of information from source images. The \({Q}^{AB/F}\) can measure the amount of edge details transferred from source images into the fused image. The \({Q}_{E}\) was designed by modifying \({Q}_{w}\), which contains edge information and visual information, simultaneously. The VIF can measure the visual information fidelity of the fusion image relative to the source images.
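As a simple illustration, the two metrics that require no reference definition, SD and IE, can be computed as follows; the sketch assumes 8-bit grayscale fused images, while the remaining metrics follow their cited definitions.

```python
import numpy as np

def sd(fused):
    """Standard deviation of the fused image (contrast indicator)."""
    return float(np.std(fused.astype(float)))

def information_entropy(fused):
    """Shannon entropy of the 8-bit grey-level histogram (information amount)."""
    hist, _ = np.histogram(fused, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))
```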

The evaluation results of all metrics on the ten pairs of typical infrared–visible images are shown in Fig. 10. It can be seen from the figure that NSMT generates the best results in most cases on the six metrics other than \({Q}_{0}\) and \(VIF\). From the SD results, our NSMT is significantly ahead of the other algorithms on the ninth pair of images. This is because the brightness difference between the source images of the ninth pair is very obvious: the infrared image is very dark overall, while the overall brightness of the visible image is high. Therefore, it can be judged that our NSMT makes good use of the bright-dark components to synthesize high-contrast fused images when the brightness difference between the source images is large. Additionally, in terms of MI, the fusion results of NSMT have significant advantages on most images, which demonstrates that NSMT retains the information from the source images better than the other algorithms. The same conclusion can be drawn from the performance of NSMT on the metric \({Q}^{AB/F}\). Moreover, the superiority of NSMT on the metric \({Q}_{E}\) further proves its excellent edge extraction ability. For \({Q}_{0}\) and VIF, our algorithm also shows good fusion performance, although the advantage is not obvious.

Fig. 10

Objective comparisons of the eight metrics

To show the overall fusion performance of our algorithm, the average results of all images for each metric are shown in Table 1, in which the best results are marked in bold. According to this table, our algorithm is ahead of these state-of-the-art algorithms in almost all metrics (excluding \({Q}_{0}\) and VIF), especially in image contrast (SD) and mutual information (MI). Although the performance of NSMT is mediocre on the metrics \({Q}_{0}\) and VIF, it remains satisfactorily competitive.

Table 1 The average results of each metric

3.3 Comparison of Fusion Time

In this section, the fusion processing speed of our method is evaluated by fusing the ten pairs of images from Fig. 7. To ensure a more reasonable comparison of fusion speeds across all methods, we exclude the worst and best fusion results; the mean fusion processing time of the remaining eight results is then calculated and presented in Fig. 11. The experimental platform is Windows 10 with an AMD 3900X CPU and 32 GB of memory. As depicted in Fig. 11, DenseFuse exhibits the fastest fusion speed, while JSR is the slowest. Although our NSMT does not have a significant advantage over DenseFuse, its fusion speed remains competitive.

Fig. 11

The fusion processing time (in seconds) of all methods

Fig. 12

Objective comparisons of the four metrics for noisy image fusion

4 Fusion of Noisy Image

It is well known that infrared thermal radiation imaging is easily polluted by environmental thermal noise. Ref. [25] points out that morphological operations have a certain denoising ability. To test this, white Gaussian noise of different levels is added to the infrared source images. In this paper, the mean of the white Gaussian noise is 0, and the variance is 5, 10, 15, and 20, respectively. In general, noise blurs image details such as edges or lines. In addition, the added Gaussian white noise affects the brightness distribution of the image to a certain extent, thus changing the contrast of the original image. Therefore, for the evaluation of noisy image fusion, four evaluation metrics are selected, namely \({Q}_{w}\), \({Q}_{E}\), \({Q}^{AB/F}\), and \(SD\). The evaluation results of the four metrics on the ten pairs of typical infrared–visible images are shown in Fig. 12. The axes of the radar charts represent the fusion algorithms.
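For reproducibility, the noisy infrared inputs can be generated as in the sketch below; treating the stated variances (5, 10, 15, 20) as 8-bit grey-level units and clipping to [0, 255] are assumptions of this sketch, since the paper does not state them explicitly.

```python
import numpy as np

def add_gaussian_noise(ir, variance, seed=0):
    """Add zero-mean white Gaussian noise with the given variance to an
    8-bit infrared image (values assumed to lie in [0, 255])."""
    rng = np.random.default_rng(seed)
    noisy = ir.astype(float) + rng.normal(0.0, np.sqrt(variance), ir.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Example: noisy_irs = [add_gaussian_noise(ir, v) for v in (5, 10, 15, 20)]
```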

According to Fig. 12, some useful conclusions can be drawn. First, the noise in the infrared image indeed degrades the image quality of the fusion results. Second, the noise has a significant impact on salient features (such as edges and lines), and the salient features decrease as the noise level increases, which can be seen from the metrics \({Q}_{w}\), \({Q}_{E}\), and \({Q}^{AB/F}\). Third, the noise has a relatively small effect on contrast (according to \(SD\)); when the noise intensity is not high, this effect can be ignored, but when the noise increases to a certain level, the image contrast decreases significantly. Even so, the \(SD\) results show that the contrast of the fused images from NSMT, DLF, DenseFuse, HMSD, LATLRR, and ResNet is almost unaffected by noise. Fourth, and most importantly, for image fusion with noise, our NSMT performs the best among all algorithms on every metric, and its performance is significantly superior to that of the other algorithms. In summary, our NSMT can not only successfully fuse noisy images but also effectively reduce the impact of noise on the fusion results and produce very satisfactory results.

5 Conclusions

In this paper, we propose a new image fusion algorithm called NSMT. First, the NSP is used to decompose the source image into low- and high-frequency components, and then the multi-scale MTBH and multi-direction MIOE are constructed to further extract the bright-dark and edge details. In the decomposition process, the MTBH decomposition efficiently extracts the bright-dark details from the low-frequency background image, while the MIOE decomposition extracts the edge details of the high-frequency detail images in different directions. In the fusion process, the rule of "choosing brighter from the bright and darker from the dark" effectively fuses the bright-dark information, improving the image contrast and the visual effect. Moreover, the absolute-maximum and window-gradient-maximum rules retain and highlight the multi-direction edges of the high-frequency detail images well. The fusion experiments show that the proposed method has clear superiority in fusion quality compared with the state-of-the-art methods. Our NSMT can not only generate fused images with high contrast but also preserve the information from the source images well. In addition, our NSMT shows excellent performance in extracting and retaining the edge information of visible images in both subjective and objective evaluations. In summary, the proposed method is a very competitive fusion algorithm for infrared–visible images.