1 Introduction

With the advancement of modern remote sensing technology, various Earth observation satellites continuously provide remote sensing images with different spatial, temporal, and spectral resolutions [1]. Satellite remote sensing cannot actively monitor changes, and its limited spatial resolution often falls short of high-precision requirements. The fixed orbits and observation frequencies of satellites also result in a low frequency of Earth observation, making it difficult to obtain timely information about sudden events and leading to slow data collection and low operational efficiency [2]. Different sensors highlight different information in images: panchromatic (PAN) images have higher spatial resolution, while multispectral (MS) images provide richer spectral information [3]. For many research objectives, the information contained in a single image is insufficient, yet high spatial and spectral resolution is essential for a range of image processing applications [4]. Therefore, the fusion of MS and PAN images, also known as pansharpening, is crucial for obtaining a comprehensive image with abundant spatial and spectral information [5].

Researchers commonly classify fusion methods into three levels based on the stage in the processing workflow and the abstraction level of the information, namely pixel-level, feature-level, and decision-level fusion [6]. Pixel-level fusion methods directly manipulate the pixels of the source images, improving image quality and reference value through different rules. Feature-level fusion methods extract objects of interest from the source images and combine their features using different fusion rules to create a multi-modal fused image with highlighted features. Decision-level fusion operates on interpreted or labeled data, selecting and consolidating information according to decision rules. The main advantage of this approach is that higher-level representations make multi-modal fusion more robust and reliable [7].

Pixel-level image fusion methods can be categorized into three types: spatial domain-based, transform domain-based, and other methods. Spatial domain methods operate directly on the pixels of the source images; examples include weighted averaging, principal component analysis (PCA), and intensity-hue-saturation (IHS) transforms [8]. One such method is the multi-scale exposure fusion method with detail enhancement in the YUV color space proposed by Qiantong Wang et al. [9]. Spatial domain fusion methods are simple, efficient, and computationally fast. However, the simple overlay operations commonly used as fusion rules often significantly reduce the signal-to-noise ratio and contrast of the resulting images. Classical transform domain operations include pyramid transforms and wavelet transforms [10]. Algorithms based on multi-scale transform methods, such as Laplacian pyramids [11, 12], wavelet transforms [13, 14], hybrid methods, and neural networks [15, 16], play a crucial role in image fusion and have proven to be effective decomposition algorithms [17]. Fusion methods based on multiscale transform (MST) are the most common traditional methods; they typically apply traditional weighted fusion rules to the base layers, overlooking global contrast [18]. To address these limitations, many scholars have proposed improvements. For instance, Yu Zhang et al. introduced a multi-modal brain image fusion method guided by local extreme value maps, using two guided image filters with local minimum and maximum mappings, respectively [19]. Veshki et al. presented a multi-modal image fusion method based on coupled feature learning [20], separating the images to be fused into correlated and uncorrelated components using sparse representations with identical supports and a Pearson correlation constraint; however, their fused images suffer from severe color loss. Jie et al. proposed a multi-modal medical image fusion method based on multiple dictionaries and truncated Huber filtering [21], using multiple dictionaries and truncated Huber filters to separate images into different layers; however, the color fidelity of the fused images is not high, and there is significant spectral loss.

Addressing the shortcomings of the aforementioned algorithms, this paper proposes a multi-scale remote sensing image fusion model based on local extreme maps-guided image filters. Its innovations lie in the following aspects:

  1. (1)

    The use of local extreme maps-guided image filters to decompose the image into base and detail layers, combined with wavelet transforms for further extraction of details from the base layer.

  2. (2)

    For low-frequency coefficients of the base layer, an improved regional energy fusion rule is applied, enhancing pixel connectivity within regions to maintain color consistency.

  3. (3)

    For high-frequency coefficients of the base layer, an improved AGPCNN fusion rule is applied to better preserve edge and texture information.

2 Multiscale theory

In this section, we introduce general multiscale models.

Scale-space image processing is a fundamental technique in computer vision for object recognition and low-level feature extraction [22]. The term "scale-space" was introduced by Witkin when proposing a method for one-dimensional signal processing through convolution with a Gaussian kernel [23]. Scale-space can be considered as an alternative to traditional statistical smoothing methods [24].

Li et al. proposed an image fusion algorithm based on the Laplacian pyramid and principal component analysis transforms [12]. The Laplacian pyramid model reduces the resolution of the image through downsampling, forming a multi-scale transformation space that better extracts image features. The principal component analysis method uses dimensionality reduction to rank and compress complex data, seeking a coordinate system that reflects the underlying patterns as fully as possible. Both the Laplacian pyramid and principal component analysis are multi-scale spatial transformation techniques. The multi-scale spatial transformation model employed in this paper also involves downsampling to enhance the extraction of image features.
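For illustration, a Laplacian pyramid of the kind used in such multi-scale transforms can be sketched as follows (a minimal Python/OpenCV sketch; the four-level depth is an arbitrary choice for this example and not a setting from [12]):

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Build a Laplacian pyramid of a grayscale image by repeated downsampling."""
    img = img.astype(np.float32)
    gaussian = [img]
    for _ in range(levels):
        gaussian.append(cv2.pyrDown(gaussian[-1]))       # blur and downsample by 2
    laplacian = []
    for i in range(levels):
        h, w = gaussian[i].shape[:2]
        up = cv2.pyrUp(gaussian[i + 1])[:h, :w]          # upsample back to this scale
        laplacian.append(gaussian[i] - up)               # band-pass detail at scale i
    laplacian.append(gaussian[-1])                       # coarsest low-pass residual
    return laplacian

def reconstruct(laplacian):
    """Invert the pyramid: start from the residual and add the details back in."""
    img = laplacian[-1]
    for detail in reversed(laplacian[:-1]):
        h, w = detail.shape[:2]
        img = cv2.pyrUp(img)[:h, :w] + detail
    return img
```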

The fundamental idea behind multiscale analysis is to map points from a high-dimensional space to a low-dimensional space while preserving the similarity between the two as much as possible.

The general steps of multiscale analysis are as follows:

  1. (1)

    Evaluate the nature of the data and, considering the data acquisition method, choose an appropriate analysis approach.

  2. (2)

    Determine the appropriate dimension based on the criteria for evaluation.

  3. (3)

    Assess the effectiveness and reliability of the results obtained from multiscale analysis.

  4. (4)

    Name the coordinate axes and classify objects based on the spatial map of the data.

The general steps of multiscale analysis are illustrated in Fig. 1.

Fig. 1 The general flow of multi-scale analysis theory

3 Proposed fusion model

This section introduces the fusion model proposed in this paper, termed the local extreme maps-guided fusion model (LEMG). The fusion process of LEMG is outlined as follows:

  1. (1)

    Smooth the source image using an image filter guided by local extrema maps. Generate a detail layer by subtracting the smoothed image from the source image, where the smoothed image serves as the base layer.

  2. (2)

    Perform multi-scale decomposition on both the base and detail layers of two source images to extract additional details.

  3. (3)

    Because the base layer carries the global information of the source image, it largely determines spectral fidelity. To minimize spectral loss and ensure the fused image retains the color fidelity of the source images, the base layers at each scale are decomposed by wavelet transform into low-frequency and high-frequency components. An improved regional energy fusion rule is applied to merge the low-frequency components of each layer, and an enhanced AGPCNN rule is applied to merge the high-frequency components. For the detail layers, a weighted average fusion rule is used to create multiple levels of merged detail layers.

  4. (4)

    For each scale, the inverse wavelet transform is applied to form the fused base layers. The fused base layers of the 3rd and 4th scales and the corresponding detail layers are combined using a maximum-absolute-value rule to highlight the salient features of each image. The fused images of the 1st and 2nd scales are then blended using an average weighting rule to smooth the final fused image and achieve the optimal fusion effect.

The schematic diagram of the fusion model is depicted in Fig. 2.

Fig. 2 Frame diagram of image fusion

3.1 Local extreme maps-guided image filter

Classical filters are rooted in concepts from the Fourier transform. Filters separate useful signals from noise, improving the signal-to-noise ratio and the signal's resistance to interference; they can also suppress signal frequencies of no interest, improving analysis accuracy. For images that must be smoothed while preserving edges, bilateral filters are commonly used; their main idea is to construct a Gaussian kernel based on both color-intensity distance and spatial distance. However, bilateral filters tend to cause gradient reversal and produce artifacts near edges, and their computation time is long. Kaiming He et al. therefore proposed the concept of guided filters [25]. Figure 3 illustrates the distinct filtering outcomes of the bilateral and guided filters: the left column shows the filtering results of the bilateral filter, and the right column shows the filtering results of the guided filter.

Fig. 3 Comparison of filter results

Guided filters are essentially an optimization algorithm that constructs an energy function and then minimizes it using optimization strategies such as least squares and Markov random fields. The energy function is usually defined as follows:

$$ E = U + V $$
(1)

where U is the data term and V is the smoothing term.

Guided filters require two source images, namely the guided image and the input image to be processed. The function is defined as follows:

$$ O_{i} = a_{k} G_{i} + b_{k} ,\quad \forall i \in \omega_{k} $$
(2)

where \(O_{i}\) and \(G_{i}\) are the output and guided images, and \(a_{k}\) and \(b_{k}\) are constant linear coefficients in the \(M \times M\) window \(\omega_{k}\).

The energy function is defined as follows:

$$ E(a_{k} ,b_{k} ) = \sum\limits_{i \in \omega_{k} } {\left( {(a_{k} G_{i} + b_{k} - I_{i} )^{2} + \theta a_{k}^{2} } \right)} $$
(3)

where \(I_{i}\) is the input image and \(\theta\) is a regularization constant that prevents \(a_{k}\) from becoming too large.
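Minimizing Eq. (3) by least squares yields the well-known closed-form solution of He et al., which can be implemented with box filters. The following is a minimal single-channel sketch (Python/NumPy with SciPy; the default window size and \(\theta\) are illustrative values, not the settings used in this paper):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, G, M=7, theta=1e-4):
    """Single-channel guided filter: I is the input image, G the guide,
    M the window size, and theta the regularization constant of Eq. (3)."""
    I = I.astype(np.float64)
    G = G.astype(np.float64)
    box = lambda x: uniform_filter(x, size=M)          # local mean over an M x M window

    mean_I, mean_G = box(I), box(G)
    corr_GI, corr_GG = box(G * I), box(G * G)
    cov_GI = corr_GI - mean_G * mean_I                 # local covariance of guide and input
    var_G = corr_GG - mean_G * mean_G                  # local variance of the guide

    a = cov_GI / (var_G + theta)                       # closed-form minimizer of Eq. (3)
    b = mean_I - a * mean_G
    # average the coefficients of all windows covering each pixel, then apply Eq. (2)
    return box(a) * G + box(b)
```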

Yu Zhang et al. introduced a guided filter based on a locally extreme value map as the guided image for the input image, thereby significantly suppressing the salient features of the input image. First, the input image is filtered using the following formula:

$$ I_{F} = G(I,I_{\min } ,M) $$
(4)

In this equation, \(I_{F}\), \(I\), \(I_{\min }\), and \(M\) represent the filtered image, the input image, the local minimum map of the input image, and the window size, respectively. The calculation formula for \(I_{\min }\) is given by:

$$ I_{\min } = ime(I,s) $$
(5)

Here, \(ime\) denotes the morphological erosion operator, and s represents a disk-shaped structuring element.

Subsequently, the filtered image is further processed through a local maximum value mapping, with the calculation formula given by:

$$ I_{F} = G(I_{F} ,I_{\max } ,M) $$
(6)

Here, \(I_{F}\) and \(I_{\max }\) represent the filtered image and the local maximum map of \(I_{F}\). The calculation of \(I_{\max }\) is as follows:

$$ I_{\max } = imd(I_{F} ,s) $$
(7)

Here, \(imd\) denotes the morphological dilation operator.

The formula for the guided image filter based on local extreme maps is given by:

$$ I_{{{\text{out}}}} = LG(I,s,M) $$
(8)

Here, Iout is the filtered output image, and \(LG\) represents the guided image filter based on local extreme maps.

Prominent bright and dark features are removed from the input image to obtain a smooth image. Subsequently, significant features are obtained by subtracting the smoothed image from the input image, according to the following formulas:

$$ I_{b} = \max (I - I_{{{\text{out}}}} ,0) $$
(9)
$$ I_{d} = \min (I - I_{{{\text{out}}}} ,0) $$
(10)

Here, \(I_{b}\) and \(I_{d}\) denote the bright feature map and the dark feature map, and \(\max\) and \(\min\) denote the element-wise maximum and minimum, respectively.
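Combining Eqs. (4)-(10), a sketch of the local extreme maps-guided filter might look as follows (Python, using scikit-image for the morphological operators and reusing the guided_filter sketch from earlier in this section; the disk radius and window size are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from skimage.morphology import disk, erosion, dilation
# guided_filter: see the sketch following Eq. (3)

def lem_guided_filter(I, radius=5, M=7):
    """Local extreme maps-guided image filter LG(I, s, M) of Eq. (8)."""
    s = disk(radius)                          # disk-shaped structuring element
    I = I.astype(np.float64)

    I_min = erosion(I, s)                     # local minimum map, Eq. (5)
    I_F = guided_filter(I, I_min, M)          # suppress bright features, Eq. (4)

    I_max = dilation(I_F, s)                  # local maximum map, Eq. (7)
    I_out = guided_filter(I_F, I_max, M)      # suppress dark features, Eq. (6)
    return I_out

def split_features(I, I_out):
    """Bright and dark feature maps of Eqs. (9)-(10)."""
    diff = I.astype(np.float64) - I_out
    I_b = np.maximum(diff, 0.0)               # bright (positive) residual
    I_d = np.minimum(diff, 0.0)               # dark (negative) residual
    return I_b, I_d
```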

The results of filtering the experimental data of this paper with the local extreme maps-guided image filter are presented in Fig. 4, which shows the first to fourth filtered images; the left column shows the filtering results for Image 1, and the right column shows the filtering results for Image 2.

Fig. 4 The results of image filtering using local extreme maps-guided image filter

3.2 Fusion rule for low-frequency coefficients

The energy algorithm involves calculating the energy of an image or pixel region. In the context of an image, the grayscale value corresponds to the energy, with higher grayscale values indicating higher energy. Generally, energy calculation involves summing the squared grayscale values of each pixel within an image or region.

Conventional region energy algorithms typically select a window region and employ a fixed weight multiplied by the region's pixels to compute the energy of the central pixel. Iterating through the image matrix yields an energy matrix of equal size to the image matrix. However, traditional region energy algorithms employ a fixed weight matrix that cannot be adjusted based on varying region features. Consequently, these algorithms fail to adapt to diverse image characteristics, limiting the enhancement of fusion effects. This paper introduces an improved region energy algorithm that elevates the fusion effect of the region energy fusion strategy by constructing a weight matrix adaptable to each region's features.

Initially, gradient information for each region is computed to construct the weight matrix, as given by the following formula:

$$ AG(F) = \sqrt {(I(i + 1,j) - I(i,j))^{2} + (I(i,j + 1) - I(i,j))^{2} } $$
(11)

Subsequently, the spatial weight matrix for the region is created using the following formula:

$$ W_{s^{\prime}} = e^{{\left( { - \frac{{x^{2} + y^{2} }}{{2\varepsilon^{2} }}} \right)}} $$
(12)
$$ W_{s} = \frac{{W_{s^{\prime}} (i,j)}}{{\sum\limits_{(i,j) \in \omega } {W_{s^{\prime}} (i,j)} }} $$
(13)

The variables are defined as follows: \(\varepsilon\) is the standard deviation of the Gaussian kernel associated with the spatial weight matrix; x is an \(M \times M\) matrix whose rows each contain \(M\) values spanning [−1, 1] across the window; y is an \(M \times M\) matrix whose columns each contain \(M\) values spanning [−1, 1] across the window; \(W_{s^{\prime}}\) is the preliminary spatial weight matrix; and \(W_{s}\) is the normalized spatial weight matrix, which serves as the final spatial weight matrix.

Finally, the formula for the adaptive weight matrix is as follows:

$$ F_{1} = {\text{conv}}\left( {\left( {AG_{1} } \right)^{2} ,W_{s1} } \right) $$
(14)
$$ F_{2} = {\text{conv}}\left( {\left( {AG_{2} } \right)^{2} ,W_{s2} } \right) $$
(15)
$$ W = \frac{{F_{1} }}{{F_{1} + F_{2} }} $$
(16)

Here, \(F_{1}\) and \(F_{2}\) denote the results of convolving the spatial weight matrices with the squared gradient matrices of the two input images, and conv denotes the convolution operation.
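A compact sketch of Eqs. (11)-(16) is given below (Python/NumPy with SciPy), where A1 and A2 stand for the low-frequency sub-bands of the two base layers at one scale. The final line, which blends the two bands with the weight W, is an assumption about how W is applied, since the text above defines W itself rather than the blending step:

```python
import numpy as np
from scipy.ndimage import convolve

def gradient_map(I):
    """Per-pixel gradient magnitude AG of Eq. (11), using forward differences."""
    gx = np.zeros_like(I); gx[:-1, :] = I[1:, :] - I[:-1, :]
    gy = np.zeros_like(I); gy[:, :-1] = I[:, 1:] - I[:, :-1]
    return np.sqrt(gx ** 2 + gy ** 2)

def spatial_weights(M=7, eps=2.0):
    """Normalized Gaussian spatial weight matrix W_s of Eqs. (12)-(13)."""
    ax = np.linspace(-1.0, 1.0, M)
    x, y = np.meshgrid(ax, ax)
    W = np.exp(-(x ** 2 + y ** 2) / (2 * eps ** 2))
    return W / W.sum()

def fuse_low_frequency(A1, A2, M=7, eps=2.0):
    """Adaptive region-energy fusion of two low-frequency bands A1 and A2."""
    A1 = A1.astype(np.float64); A2 = A2.astype(np.float64)
    Ws = spatial_weights(M, eps)
    F1 = convolve(gradient_map(A1) ** 2, Ws, mode='reflect')   # Eq. (14)
    F2 = convolve(gradient_map(A2) ** 2, Ws, mode='reflect')   # Eq. (15)
    W = F1 / (F1 + F2 + 1e-12)                                  # Eq. (16)
    return W * A1 + (1.0 - W) * A2       # assumed weighted combination of the two bands
```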

Figure 5 compares the results of the low-frequency and high-frequency coefficient fusion rules proposed in this paper with those of the traditional region-based energy fusion rule. The left column shows the result images of the traditional fusion rule, and the right column shows the result images of the fusion rule proposed in this paper.

Fig. 5 Compare images with traditional fusion rules

3.3 Fusion rules for high-frequency coefficients

The pulse coupled neural network (PCNN) is an artificial neural network inspired by physiological stimuli, providing the ability to establish a highly adaptable physiological filter. PCNN models various aspects of the primate visual cortex, including pulse duration, inter-pulse duration, and neural interconnections. This network not only meets the filtering requirements of visual models but also generates essential connections and pulses for simulating state modulation and temporal synchronization. However, due to the considerable number of parameters in the PCNN model, the AGPCNN model is introduced. This model features a simplified structure with fewer parameters and employs Gaussian filters for distributing weights among neurons.

The AGPCNN model comprises five components: feedforward input, connectivity input, neuron internal state, binary output, and dynamic threshold. The respective computational formulas are as follows:

$$ V(i,j) = I(i,j) $$
(17)
$$ L_{n} (i,j) = \sum\limits_{M = - 1}^{1} {\sum\limits_{N = - 1}^{1} {G_{\sigma } (M + 2,N + 2)E_{n - 1} (i + M,j + N)} } $$
(18)
$$ Y_{n} (i,j) = V(i,j)(1 + \alpha (i,j)L_{n} (i,j)) $$
(19)
$$ E_{n} (i,j) = \left\{ \begin{gathered} 1,\quad {\text{if}}\;Y_{n} (i,j) > K_{n - 1} (i,j) \hfill \\ 0,\quad {\text{otherwise}} \hfill \\ \end{gathered} \right. $$
(20)
$$ K_{n} (i,j) = d_{\Theta } K_{n - 1} (i,j) + a_{\Theta } E_{n - 1} (i,j) $$
(21)

In these equations, (i, j) denotes the position of a neuron, n is the iteration count, \(G_{\sigma }\) is a Gaussian filter with standard deviation \(\sigma\) and a 3 × 3 kernel, \(\alpha\) is the adaptively estimated connection strength, \(d_{\Theta }\) is the decay constant, and \(a_{\Theta }\) is the normalization constant. The computational formula for \(\alpha\) is as follows:

$$ \alpha (i,j) = I_{s} (i,j) = \sqrt {I_{r} (i,j)^{2} + I_{l} (i,j)^{2} } $$
(22)

\(I_{s} (i,j)\) is the spatial frequency of the \(3 \times 3\) local window centered at neuron (i, j), \(I_{r} (i,j)\) is the row frequency, and \(I_{l} (i,j)\) is the column frequency. They are calculated as follows:

$$ I_{r} (i,j) = \sqrt {\frac{{\sum\limits_{x = - 1}^{1} {\sum\limits_{y = - 1}^{1} {\left( {I(i + x,j + y) - I(i + x,j + y - 1)} \right)^{2} } } }}{M \times N}} $$
(23)
$$ I_{l} (i,j) = \sqrt {\frac{{\sum\limits_{x = - 1}^{1} {\sum\limits_{y = - 1}^{1} {\left( {I(i + x,j + y) - I(i + x - 1,j + y)} \right)^{2} } } }}{M \times N}} $$
(24)

\(M \times N\) is the size of the regional window.

The calculation of pulse time for the AGPCNN model after n iterations is as follows:

$$ T_{n} (i,j) = T_{n - 1} (i,j) + E_{n} (i,j) $$
(25)

The AGPCNN model employed in this study replaces the previous Gaussian filter-based weight allocation for neighboring neurons with an allocation based on He initialization. This adjustment, which takes the properties of ReLU into account, effectively mitigates the problem of vanishing gradients and results in an enhanced fusion effect.
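The iteration of Eqs. (17)-(25) can be sketched as follows (Python/NumPy with SciPy). The sketch keeps the 3 × 3 Gaussian link kernel of Eq. (18) for clarity; the iteration count, decay constant, and normalization constant are illustrative values rather than the paper's settings, and the final rule that selects the coefficient with the larger firing time is an assumption about how the firing times are used:

```python
import numpy as np
from scipy.ndimage import uniform_filter, convolve

def spatial_frequency(I):
    """Adaptive link strength alpha: spatial frequency over 3 x 3 windows, Eqs. (22)-(24)."""
    dr = np.zeros_like(I); dr[:, 1:] = (I[:, 1:] - I[:, :-1]) ** 2    # row-direction differences
    dc = np.zeros_like(I); dc[1:, :] = (I[1:, :] - I[:-1, :]) ** 2    # column-direction differences
    I_r = np.sqrt(uniform_filter(dr, size=3))
    I_l = np.sqrt(uniform_filter(dc, size=3))
    return np.sqrt(I_r ** 2 + I_l ** 2)

def agpcnn_firing_times(coeff, n_iter=110, d_theta=0.7, a_theta=20.0, sigma=1.0):
    """Run the simplified AGPCNN on one high-frequency sub-band and return firing times T."""
    V = np.abs(coeff).astype(np.float64)          # feedforward input, Eq. (17)
    alpha = spatial_frequency(V)                  # adaptive connection strength, Eq. (22)

    ax = np.array([-1.0, 0.0, 1.0])               # 3 x 3 Gaussian link kernel, Eq. (18)
    xx, yy = np.meshgrid(ax, ax)
    G = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    G /= G.sum()

    E = np.zeros_like(V)                          # binary output
    K = np.ones_like(V)                           # dynamic threshold
    T = np.zeros_like(V)                          # accumulated firing times
    for _ in range(n_iter):
        L = convolve(E, G, mode='constant')       # linking input from neighboring firings
        Y = V * (1.0 + alpha * L)                 # internal state, Eq. (19)
        E_new = (Y > K).astype(np.float64)        # firing decision, Eq. (20)
        K = d_theta * K + a_theta * E             # threshold update with E_{n-1}, Eq. (21)
        E = E_new
        T += E                                    # firing-time accumulation, Eq. (25)
    return T

def fuse_high_frequency(H1, H2, **kwargs):
    """Keep, at each position, the coefficient whose neuron fires more often (assumed rule)."""
    T1 = agpcnn_firing_times(H1, **kwargs)
    T2 = agpcnn_firing_times(H2, **kwargs)
    return np.where(T1 >= T2, H1, H2)
```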

4 Experimental results and comparisons

This section presents the experimental comparative results between the algorithm model proposed in this paper and eight other existing algorithms.

4.1 Experimental setup

In order to validate the effectiveness of the proposed algorithm model, comparisons were made against eight other state-of-the-art algorithm models. The experimental data comprised four sets of remote sensing images, each consisting of a panchromatic image and a multispectral image. The experiments were conducted in the MATLAB R2022a environment using an NVIDIA GeForce RTX 4060 Laptop GPU. Refer to Fig. 6 for the data images utilized in this study. The eight compared algorithm models are as follows:

Fig. 6 Image of the experimental data

  • C1: Detail-Enhanced Multi-Scale Exposure Fusion in YUV Color Space (DEMEF) [9].

  • C2: Medical image fusion by adaptive Gaussian PCNN and improved Roberts operator (AGPRO) [13].

  • C3: NSCT-DCT based Fourier Analysis for Fusion of Multi-modal Images (NDFA) [14].

  • C4: Medical image fusion based on extended difference-of-Gaussians and edge-preserving (EDGEP) [15].

  • C5: Fusion of Multi-modal Images using Parametrically Optimized PCNN and DCT based Fourier Analysis (PODFA) [16].

  • C6: Local extreme map guided multi-modal brain image fusion (LEGFF) [19].

  • C7: Multi-modal image fusion via coupled feature learning (CCFL) [20].

  • C8: Multi-modal medical image fusion via multi-dictionary and truncated Huber filtering (MDHU) [21].

4.2 Experimental results

Figure 7 illustrates the fusion images obtained from the experiments. From top to bottom are the fusion results of the first to fourth experimental groups; from left to right are the results of C1 to C8 and the proposed algorithm, respectively. Tables 1, 2, 3 and 4 provide the evaluation parameter data; the best values for each evaluation parameter are underlined.

Fig. 7 Image of the result of the experimental fusion

Table 1 The first group of evaluation index data
Table 2 The second group of evaluation index data
Table 3 The third group of evaluation index data
Table 4 The fourth group of evaluation index data

From a subjective perspective, the fusion images of the first, third, and fourth groups of experiments reveal that C1, C4, C5, C6, C7, and the proposed fusion model exhibit superior color restoration, edge contour retention, and detail texture preservation. C2, C3, and C8, while maintaining good contour texture, suffer from severe color distortion and exhibit a significant color gap from the original multispectral images, resulting in poor fusion performance. The fused image results of the second group of experiments are similar to those of the first group: C1, C4, C5, C6, C7, and the proposed fusion model demonstrate commendable image fusion effects, while C2, C3, and C8 show subpar fusion results.

In the objective evaluation of fusion effects, this study utilizes eight fusion assessment parameters, namely mean squared error (MSE), color consistency index (CCI), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), distortion degree (DD), similarity measure (SM), correlation coefficient (CC), and mutual information (MI). MSE measures the expected value of the squared difference between estimated and true parameter values, indicating the degree of data variation; a smaller MSE implies a more accurate description of the experimental data. CCI gauges the perceptual color difference between two colors, with higher CCI values indicating smaller color differences between the fused and source images. PSNR is derived from the mean squared error between the fused and original images, with higher values signifying lower image distortion. SSIM quantifies the similarity between two images, where larger SSIM values indicate greater similarity between the fused and source images. DD assesses image distortion, with smaller values denoting less distortion in the fused images. SM measures image similarity, with larger values indicating better fusion effects. CC is a statistical indicator of the closeness of the relationship between variables, with larger values indicating a stronger correlation. MI measures the distance (similarity) between two distributions, with larger values indicating higher similarity between the fused and source images.
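As an illustration, a few of these indices can be computed as in the following sketch (Python/NumPy; the definitions follow their standard textbook forms and the MI histogram bin count is an arbitrary choice, so this is not the exact implementation used in the experiments):

```python
import numpy as np

def mse(fused, ref):
    """Mean squared error between the fused image and a reference image."""
    d = fused.astype(np.float64) - ref.astype(np.float64)
    return np.mean(d ** 2)

def psnr(fused, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB; larger values mean less distortion."""
    return 10.0 * np.log10(peak ** 2 / mse(fused, ref))

def correlation_coefficient(fused, ref):
    """Pearson correlation coefficient CC between two images."""
    f = fused.astype(np.float64).ravel()
    r = ref.astype(np.float64).ravel()
    return np.corrcoef(f, r)[0, 1]

def mutual_information(fused, ref, bins=64):
    """Mutual information MI estimated from a joint grayscale histogram."""
    hist, _, _ = np.histogram2d(fused.ravel(), ref.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                            # avoid log(0)
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
```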

Tables 1, 2, 3 and 4 demonstrate that the proposed fusion model algorithm holds a dominant position across all eight evaluation metrics in all four sets of experiments, ranking first against the eight compared algorithms in every case. The algorithm exhibits superior color retention in the fused images, coupled with lower spectral distortion, demonstrating outstanding image fusion performance and applicability to various image fusion scenarios.

5 Conclusion

The experimental results highlight the efficacy of the proposed fusion model algorithm in maintaining color accuracy and minimizing spectral losses within the domain of remote sensing image fusion. Additionally, it demonstrates a robust performance in preserving image edge contours and textures during the fusion process. Future research will focus on refining and enhancing the overall image fusion performance of this model.