1 Introduction

Due to the limited depth of field of optical lenses, imaging sensors cannot capture all objects at various distances with equal clarity; a scene cannot be brought into focus everywhere at once, i.e., only the objects at the focal plane appear sharp, while the others appear blurred [1]. An effective solution to this problem is multi-focus image fusion, which creates a single ‘all-in-focus’ image in which all the objects are in focus [2]. The fused image is more suitable for human or machine perception, as well as for computer processing tasks such as segmentation, feature extraction, and target recognition.

Multi-focus image fusion algorithms fall into two categories: spatial domain-based and transform domain-based algorithms [3]. Currently, transform domain-based multiscale image fusion has become the mainstream class of fusion methods [4]. The traditional discrete wavelet transform (DWT) has been used successfully in image fusion [5]. However, DWT lacks shift-invariance because of its underlying down-sampling process [6], which makes the fusion sensitive to misregistration. Unser [7] proposed a shift-invariant DWT (SIDWT) to overcome this shortcoming. However, the two-dimensional wavelet is isotropic and cannot effectively describe abrupt transitions such as line and curve singularities. In addition, the wavelet can capture information in only three directions [8].

To overcome the shortcomings of the wavelet, Do and Vetterli developed a true two-dimensional image representation method, namely the contourlet transform (CT) [9]. Compared with DWT, the CT not only provides multiscale analysis and localization but also multi-directionality and anisotropy. As a result, the CT can represent edges and other singularities along curves more efficiently [10]. However, the up- and down-sampling processes of CT cause a lack of shift-invariance and introduce pseudo-Gibbs phenomena into the fusion. In 2006, Da Cunha et al. [11] proposed an overcomplete transform, namely the nonsubsampled CT (NSCT). NSCT inherits the advantages of CT while also possessing shift-invariance and effectively suppressing pseudo-Gibbs phenomena [12]. Thus, the NSCT is more suitable for image fusion.

Currently, artificial neural networks have been used successfully in image processing. The pulse-coupled neural network (PCNN) was proposed by Eckhorn in 1990 to simulate the processing mechanism of the cat visual cortex, giving rise to a new type of neural network model [13]. Owing to its bionic mechanism and the pulse-synchronization behavior of its neurons, PCNN plays an important role in image fusion [14]. However, the optimal PCNN parameters differ from image to image, so fixed settings generalize poorly when applied to image fusion. In most PCNN-based fusion methods, the parameters are set to fixed values. Yet visual processing should respond more strongly to areas with salient characteristics than to generic areas, so the parameters of the PCNN neurons should reflect the importance of each pixel's characteristics in the image [15, 16]. Therefore, we need a way to set the parameters of a PCNN-based fusion method adaptively.

In addition, the selection of the fusion rules is a key problem in transform domain-based fusion algorithms. The conventional fusion rule, which selects high-frequency subbands by the maximum rule and low-frequency subbands by averaging, has disadvantages such as reducing the contrast of the fused image [4]. In this paper, an adaptive PCNN is proposed to act as the fusion rule after the source images are decomposed by NSCT. In order to better measure the importance of the source-image pixels to the fused image, we employ fuzzy logic combined with human visual perception characteristics to build an adaptive PCNN model, and we use the sum-modified Laplacian (SML) to motivate the PCNN neurons. The experimental results indicate that the proposed method obtains better subjective and objective results than a series of existing fusion methods.

Fig. 1 Contourlet decomposed schematic diagram

Fig. 2 Nonsubsampled contourlet decomposed schematic diagram

The rest of this paper is organized as follows. The related theories are briefly introduced in Sect. 2. The proposed image fusion method is described in Sect. 3. The experimental results and analysis are presented in Sect. 4, and the final conclusions are given in Sect. 5.

2 Preliminaries

2.1 Nonsubsampled contourlet transform

The CT process consists of two stages. First, the Laplacian pyramid (LP) is used to capture point singularities and to decompose the original image into low-frequency and high-frequency sub-images. Then, the directional filter bank (DFB) provides an efficient directional multiresolution image representation and divides the high-frequency subbands into directional subbands [9]. A contourlet decomposition schematic diagram is shown in Fig. 1. According to the sampling theorem, pseudo-Gibbs phenomena appear in the low- and high-frequency sub-images in the LP domain. The directional subbands obtained from the high-frequency sub-images by DFB filtering also exhibit pseudo-Gibbs phenomena. These phenomena weaken the directional selectivity of CT-based methods.

The NSCT is built on the CT and inherits its advantages. The whole NSCT implementation consists of two phases, multiscale decomposition and directional decomposition [11]. The main difference from CT is that NSCT is composed of two shift-invariant parts, i.e., nonsubsampled pyramid filter banks (NSPFB) and nonsubsampled directional filter banks (NSDFB). An NSCT decomposition schematic diagram is shown in Fig. 2. The sub-images produced by NSCT all have the same size as the source image, so it is easy to establish correspondences among sub-images of different images, which is beneficial for designing fusion rules. Additionally, NSCT-based fusion can effectively reduce the impact of misregistration on the results.
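For illustration, the following Python sketch shows the shift-invariant multiscale stage (the NSPFB idea) as an undecimated, à trous-style pyramid; the lowpass kernel, function names and border handling are assumptions for this sketch rather than the actual filters of [11], and the directional NSDFB stage is omitted.

```python
import numpy as np
from scipy.ndimage import convolve

def atrous_lowpass_kernel(level):
    """Hypothetical 2-D lowpass kernel with holes inserted per level,
    so no image down-sampling is needed -- the key to shift-invariance."""
    base = np.outer([1, 4, 6, 4, 1], [1, 4, 6, 4, 1]) / 256.0
    if level == 0:
        return base
    size = base.shape[0] + (base.shape[0] - 1) * (2 ** level - 1)
    k = np.zeros((size, size))
    k[::2 ** level, ::2 ** level] = base
    return k

def nspfb_decompose(img, levels=3):
    """Undecimated pyramid: every subband keeps the size of the input image."""
    highs, low = [], img.astype(float)
    for lev in range(levels):
        smoothed = convolve(low, atrous_lowpass_kernel(lev), mode='mirror')
        highs.append(low - smoothed)   # bandpass detail (to be split by the NSDFB)
        low = smoothed
    return low, highs                  # low-frequency subband + high-frequency subbands
```

Because no subband is down-sampled, all subbands stay aligned with the source image, which is exactly the property the fusion rules in Sect. 3 rely on.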

2.2 Pulse-coupled neural network

PCNN was proposed in the 1990s and has gradually been replacing traditional neural networks in some image processing tasks. PCNN does not require learning or training and can extract useful information from a complex background. It has become one of the most widely studied neural network models, although its theory still needs further development. In practical applications, a number of PCNN parameters need to be set, which increases the difficulty and complexity of its use. Consequently, simplified PCNN models are often adopted. In this paper, we use an improved PCNN [17]; the mathematical model of its neurons after q iterations is given below.

$$\begin{aligned} \left\{ {\begin{array}{l} F_{ij} (q)=I_{ij} \\ L_{ij} (q)=e^{-\alpha _L }L_{ij} (q-1)+V_L \sum _{m,n} {W_{ij,mn} Y_{mn} (q-1)} \\ U_{ij} (q)=F_{ij} (q)\left( 1+\beta L_{ij} (q)\right) \\ \theta _{ij} (q)=e^{-\alpha _\theta }\theta _{ij} (q-1)+V_\theta Y_{ij} (q-1) \\ Y_{ij} (q)=\left\{ {\begin{array}{ll} 1, &{} U_{ij} (q)>\theta _{ij} (q) \\ 0, &{} \hbox {otherwise} \\ \end{array}} \right. \\ \end{array}} \right. \end{aligned}$$
(1)

where q denotes the number of iterations and ij indexes a pixel position in the image matrix. \(I_{ij} \) is the external input stimulus signal, \(Y_{ij} \) is the pulse output of the neuron, and \(U_{ij} \) is its internal activity. \(F_{ij} \) and \(L_{ij} \) are the feeding input and the linking input, respectively. \(W_{ij,mn} \) is the synaptic gain strength, where the subscripts m and n range over the linking neighborhood of the PCNN. \(\beta \) is the linking strength, \(\theta _{ij} \) is the dynamic threshold, \(\alpha _L \) and \(\alpha _\theta \) are the decay constants, and \(V_L \) and \(V_\theta \) are the amplitude gains.
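A minimal NumPy sketch of one run of the simplified PCNN in Eq. (1) follows; it also accumulates the firing times used later in Eq. (8). The default parameter values and the initial threshold are placeholders, and \(\beta\) is passed in as a per-pixel map, as the proposed method requires (Sect. 3.2).

```python
import numpy as np
from scipy.ndimage import convolve

def pcnn_firing_times(stim, beta, W, alpha_L=0.01, alpha_theta=20.0,
                      V_L=1.0, V_theta=20.0, iterations=200):
    """Run the simplified PCNN of Eq. (1) and return the firing-time map T."""
    F = stim.astype(float)             # feeding input: external stimulus, Eq. (1)
    L = np.zeros_like(F)               # linking input
    theta = np.ones_like(F)            # dynamic threshold (initial value is an assumption)
    Y = np.zeros_like(F)               # pulse output
    T = np.zeros_like(F)               # accumulated firing times, Eq. (8)
    for _ in range(iterations):
        # linking input and threshold use the previous pulse output Y(q-1)
        L = np.exp(-alpha_L) * L + V_L * convolve(Y, W, mode='constant')
        theta = np.exp(-alpha_theta) * theta + V_theta * Y
        U = F * (1.0 + beta * L)       # internal activity
        Y = (U > theta).astype(float)  # fire when activity exceeds the threshold
        T += Y
    return T
```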

3 Proposed method

In this section, we describe the proposed fusion method based on NSCT and a fuzzy-adaptive PCNN in more detail. Suppose A and B are two registered source images. After k-level NSCT decomposition, the corresponding subbands of different scales and directions are obtained, i.e., A: \(\{L^{A},H_{k,l}^A \}\) and B: \(\{L^{B},H_{k,l}^B \}\), where \(L^{A}\), \(L^{B}\) are the low-frequency sub-images and \(H_{k,l}^A \), \(H_{k,l}^B \) are the high-frequency sub-images at level k and direction l.

3.1 SML motivated PCNN neurons

In most multiscale fusion algorithms based on PCNN, a single pixel is used directly to motivate a neuron in the multiscale decomposition domain. However, the human visual nervous system is highly sensitive to edge information and directional features [17], so a pure single-pixel stimulus for the PCNN is not sufficient. The SML reflects the edge transitions and clarity information of an image [18]. Therefore, compared with spatial frequency (SF), variance, energy of gradient and energy of Laplacian, SML is more suitable for motivating the PCNN neurons. The SML is defined as follows:

$$\begin{aligned} \hbox {SML}(i,j)=\sum _{m=-M_1 }^{M_1 } {\sum _{n=-N_1 }^{N_1 } {[\hbox {ML}(i+m,j+n)]} } \end{aligned}$$
(2)

for a window of size \((2M_1 +1)\times (2N_1 +1)\), where \(\hbox {ML}(i,j)\) is the modified Laplacian (ML), which is defined as:

$$\begin{aligned} \begin{array}{l} \hbox {ML}(i,j)=|2C(i,j)-C(i-\hbox {step},j)-C(i+\hbox {step},j)|+ \\ \;\;\;|2C(i,j)-C(i,j-\hbox {step})-C(i,j+\hbox {step})| \\ \end{array} \end{aligned}$$
(3)

where step is a variable spacing between coefficients; in this paper, step is always set to 1. \(C(i,j)\) denotes the value of the coefficient located at \((i,j)\).
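As an illustration, the sketch below computes ML and SML of Eqs. (2)–(3) for a coefficient matrix with NumPy/SciPy, assuming step = 1 and the 3 × 3 summation window used later in the experiments; the border handling is an assumption of the sketch.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sml(C, step=1, window=3):
    """Sum-modified Laplacian, Eqs. (2)-(3), computed at every coefficient."""
    C = C.astype(float)
    pad = np.pad(C, step, mode='edge')          # replicate borders
    center = pad[step:-step, step:-step]
    # Modified Laplacian, Eq. (3)
    ml = (np.abs(2 * center - pad[:-2 * step, step:-step] - pad[2 * step:, step:-step]) +
          np.abs(2 * center - pad[step:-step, :-2 * step] - pad[step:-step, 2 * step:]))
    # Window sum of Eq. (2): mean filter times the number of window elements
    return uniform_filter(ml, size=window, mode='reflect') * window * window
```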

Hence, in the proposed fusion method, the SML is used to motivate the PCNN neurons, while human visual perception characteristics are used to determine the fuzzy-adaptive linking strength.

3.2 Fuzzy-adaptive linking strength determination

As can be seen from references [19, 20], the linking strength \(\beta \) of the PCNN is related to the pixel characteristics of an image, and \(\beta \) should reflect the importance of a pixel relative to its surrounding pixels. A larger \(\beta \) means the PCNN neuron is captured and fires more easily and quickly. Meanwhile, the nonlinear characteristics of human vision determine whether a pixel is visually important relative to its surrounding pixels. However, uncertainty and subjectivity exist in this process of visual response.

Fuzzy logic can effectively address this problem and the uncertain contribution of each source-image pixel. By applying fuzzy if–then rules and membership functions to the image data set, the fuzzy logic approach can model and combine the images to enhance the contrast of the fused image [21]. Hence, we propose a novel fuzzy-logic-based method to determine the linking strength of the PCNN, which is set adaptively by measuring the importance of each coefficient relative to its surrounding coefficients.

Fig. 3 The framework of the proposed fusion algorithm

For the low-frequency subbands, human visual perception exhibits a nonlinear relationship between the contrast sensitivity threshold and the brightness of the background. A concept of visibility that can effectively measure image uniformity was introduced in [22]. Based on this concept, we define a local visibility, which is used to measure the clarity of the low-frequency subbands:

$$\begin{aligned}&L_\mathrm{VI}^x (i,j)\nonumber \\&\quad =\left\{ {\begin{array}{l} \frac{1}{M\times N}\sum _{i=1}^M {\sum _{j=1}^N {\frac{|L^{x}(i,j)-\tau ^{x}(i,j)|}{\tau ^{x}(i,j)^{1+\alpha }}\;\;\;\;\;\tau ^{x}(i,j)\ne 0} } \\ L^{x}(i,j)\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\hbox {otherwise} \\ \end{array}} \right. \nonumber \\ \end{aligned}$$
(4)

where \(M\times N\) is the window size, \(\tau ^{x}(i,j)\) denotes the local mean of the low-frequency coefficients around \((i,j)\) of image x, and \(\alpha \) is a visual constant ranging from 0.6 to 0.7. The larger the local visibility of a coefficient, the more important the coefficient is in the image. The fuzzy membership value \(\mu _\mathrm{L} \) corresponding to the local visibility of a low-frequency subband coefficient can thus be calculated as follows:

$$\begin{aligned} \mu _\mathrm{L} \left( L_\mathrm{VI}^x (i,j)\right) =\frac{1}{1+e^{-\left( L_\mathrm{VI}^x (i,j)-a_1 \right) }} \end{aligned}$$
(5)

where \(a_1 =\hbox {average}\left( L_\mathrm{VI}^x (i,j)\right) \).
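A hedged sketch of Eqs. (4)–(5) is given below. It interprets \(\tau\) as the local mean over the same sliding window, and the window size and visual constant \(\alpha\) are assumptions taken from the ranges stated in the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_visibility(L, window=3, alpha=0.65):
    """Local visibility of a low-frequency subband, Eq. (4)."""
    L = L.astype(float)
    tau = uniform_filter(L, size=window, mode='reflect')        # local mean coefficient
    safe_tau = np.where(tau != 0, tau, 1.0)                     # guard the tau == 0 branch
    vis = np.where(tau != 0,
                   np.abs(L - tau) / np.abs(safe_tau) ** (1 + alpha),
                   L)
    return uniform_filter(vis, size=window, mode='reflect')     # window average of Eq. (4)

def membership_low(L_vi):
    """Fuzzy membership of Eq. (5); a1 is the mean local visibility."""
    a1 = L_vi.mean()
    return 1.0 / (1.0 + np.exp(-(L_vi - a1)))
```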

For the high-frequency subbands, human vision is sensitive to changes in the local contrast of the image. Considering the advantage of SML in distinguishing clear from blurred image blocks, together with human visual perception characteristics, the local visual feature contrast is defined as:

$$\begin{aligned} L_\mathrm{VC}^{x,k,l} (i,j)=\left\{ {\begin{array}{l} \left( {\frac{1}{\vartheta _k^x (i,j)}} \right) ^{\alpha } \frac{\mathrm{SML}_{k,l}^x (i,j)}{\vartheta _k^x (i,j)} \;\;\;\;\;\vartheta _k^x (i,j)\ne 0 \\ \hbox {SML}_{k,l}^x (i,j)\;\;\;\;\;\;\;\;\;\;\;\;\;\; \;\;\;\;\;\;\;\;\;\hbox {otherwise} \\ \end{array}} \right. \end{aligned}$$
(6)

where \(\vartheta _k^x (i,j)\) is the local mean of the low-frequency coefficients around \((i,j)\) of image x at scale k, and \(\hbox {SML}_{k,l}^x (i,j)\) denotes the SML located at \((i,j)\) of image x at scale k and direction l, with \(x\in \{A,B\}\). The larger the local visual feature contrast of a coefficient, the more important the coefficient is in the image. Hence, the fuzzy membership value \(\mu _\mathrm{H} \) corresponding to the local visual feature contrast of a high-frequency subband coefficient is defined as:

$$\begin{aligned} \mu _\mathrm{H} \left( L_\mathrm{VC}^{x,k,l} (i,j)\right) =\frac{1}{1+e^{-\left( L_\mathrm{VC}^{x,k,l} (i,j)-a_2 \right) }} \end{aligned}$$
(7)

where \(a_2 =\hbox {average}\left( L_\mathrm{VC}^{x,k,l} (i,j)\right) \).

Therefore, the linking strengths \(\beta \) of the low-frequency and high-frequency subbands are given by \(\beta _\mathrm{L}^x (i,j)=\mu _\mathrm{L} \left( L_\mathrm{VI}^x (i,j)\right) \) and \(\beta _\mathrm{H}^{x,k,l} (i,j)=\mu _\mathrm{H} \left( L_\mathrm{VC}^{x,k,l} (i,j)\right) \), respectively, each representing the importance of the coefficient in the corresponding source image.
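Similarly, a minimal sketch of Eqs. (6)–(7) and of the resulting linking-strength maps is shown below; \(\vartheta\) is approximated by the local mean of the corresponding low-frequency subband, which is an assumption of the sketch, and the functions reuse the helpers defined above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_visual_contrast(sml_kl, low_k, window=3, alpha=0.65):
    """Local visual feature contrast of a high-frequency subband, Eq. (6)."""
    theta = uniform_filter(low_k.astype(float), size=window, mode='reflect')
    safe = np.where(theta != 0, theta, 1.0)                # guard the theta == 0 branch
    contrast = (1.0 / np.abs(safe)) ** alpha * sml_kl / safe
    return np.where(theta != 0, contrast, sml_kl)

def linking_strengths(L_vi, L_vc):
    """Linking-strength maps: beta_L from Eq. (5), beta_H from Eq. (7)."""
    beta_L = 1.0 / (1.0 + np.exp(-(L_vi - L_vi.mean())))
    beta_H = 1.0 / (1.0 + np.exp(-(L_vc - L_vc.mean())))
    return beta_L, beta_H
```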

3.3 Algorithm

The block diagram of the proposed fusion scheme is shown in Fig. 3. This method can be easily extended to more than two images. The fusion process consists of the following steps:

Step 1 Decompose the nth source image \(I_{n}\) using NSCT to obtain one low-frequency subband and a series of high-frequency subbands at each level k and direction l.

Step 2 Compute the SML of the subbands as the input to the PCNN using (2).

Step 3 Compute the linking strengths \(\beta _\mathrm{L}^n \) and \(\beta _\mathrm{H}^{n,k,l} \) of the low-frequency subbands and the high-frequency subbands, respectively, as described in Sect. 3.2.

Step 4 Use the SML to motivate the PCNN with the linking strengths from Step 3, generate the neuron pulses using (1), and compute the firing times as follows:

$$\begin{aligned} T_{i,j}^{n,k,l} (q)=T_{i,j}^{n,k,l} (q-1)+Y_{i,j}^{n,k,l} (q) \end{aligned}$$
(8)

Step 5 When the number of iterations reaches q, apply the maximum-firing-times rule to select the fused subband coefficients, defined as follows:

$$\begin{aligned}&D_n^{k,l} (i,j)\nonumber \\&\quad =\left\{ {\begin{array}{l} 1,\;\;\;\;\;\hbox {if}\;\;T_{i,j}^{n,k,l} =\max \left( T_{i,j}^{1,k,l} ,T_{i,j}^{2,k,l} ,...,T_{i,j}^{n,k,l} \right) \\ 0,\;\;\;\;\hbox {otherwise} \\ \end{array}} \right. \nonumber \\ \end{aligned}$$
(9)
$$\begin{aligned}&P_F^{k,l} (i,j)=\left\{ {\begin{array}{l} P_n^{k,l} (i,j),\;\;\;\;\;\;\hbox {if}\;D_n^{k,l} (i,j)=1 \\ 0,\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\hbox {otherwise} \\ \end{array}} \right. \end{aligned}$$
(10)

where \(P_F^{k,l} \) and \(P_n^{k,l} \) denote the coefficients of the fused image and of the nth source image, respectively.

Step 6 Use the inverse NSCT on the fused coefficients to obtain the final fused image.
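Putting the steps together, the following sketch shows the coefficient-selection logic of Steps 2–5 for a set of source images. `nsct_decompose` and `nsct_reconstruct` are placeholders standing in for any NSCT implementation (returning a low subband plus `highs[level][direction]` arrays), and the helper functions are the sketches given earlier; using the single final low subband for \(\vartheta\) at every scale is a simplifying assumption.

```python
import numpy as np

def fuse_subband(coeffs, smls, betas, W, q=200):
    """Steps 2-5 for one subband: PCNN firing times (Eq. 8) and
    maximum-firing-times selection (Eqs. 9-10)."""
    T = np.stack([pcnn_firing_times(s, b, W, iterations=q)
                  for s, b in zip(smls, betas)])
    winner = np.argmax(T, axis=0)                                  # Eq. (9)
    stacked = np.stack(coeffs)
    return np.take_along_axis(stacked, winner[None], axis=0)[0]   # Eq. (10)

def fuse_images(images, W, levels=3, q=200):
    # Step 1: NSCT decomposition (placeholder function).
    decomps = [nsct_decompose(img, levels) for img in images]
    lows = [d[0] for d in decomps]

    # Low-frequency fusion: SML stimulus, beta from local visibility (Eqs. 4-5).
    fused_low = fuse_subband(lows,
                             [sml(L) for L in lows],
                             [membership_low(local_visibility(L)) for L in lows],
                             W, q)

    # High-frequency fusion: beta from local visual feature contrast (Eqs. 6-7).
    fused_highs = []
    for k in range(levels):
        fused_dirs = []
        for l in range(len(decomps[0][1][k])):
            highs = [d[1][k][l] for d in decomps]
            smls = [sml(H) for H in highs]
            betas = []
            for s, low in zip(smls, lows):
                vc = local_visual_contrast(s, low)                 # theta from the low subband
                betas.append(1.0 / (1.0 + np.exp(-(vc - vc.mean()))))   # Eq. (7)
            fused_dirs.append(fuse_subband(highs, smls, betas, W, q))
        fused_highs.append(fused_dirs)

    # Step 6: inverse NSCT (placeholder function).
    return nsct_reconstruct(fused_low, fused_highs)
```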

4 The experimental results and analysis

In this section, we compare the performance of our technique with existing image fusion methods based on the gradient pyramid (GP), DWT, SIDWT, CT, NSCT and NSCT-PCNN. In addition, the proposed method is also compared with the method of reference [17], which uses SF to motivate the PCNN neurons (SF_PCNN). All methods use a three-level decomposition, and all NSCT-based methods use the same numbers of directional decompositions, namely 1, 2 and 8. The parameters of the PCNN are set as \(W=[0.707\;1\;0.707;\;1\;0\;1;\;0.707\;1\;0.707]\), \(\alpha _L =0.01\), \(\alpha _\theta =20\), \(m\times n=3\times 3\), and q = 200. The window size for computing SML, \(L_\mathrm{VI} \) and \(L_\mathrm{VC} \) was set to \(3\times 3\).
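For reference, these experimental settings can be written out directly; the snippet below simply restates the parameters from the text in NumPy form (the amplitude gains \(V_L\) and \(V_\theta\) are not specified in the text and are therefore omitted).

```python
import numpy as np

# PCNN parameters used in all experiments (Sect. 4).
W = np.array([[0.707, 1.0, 0.707],
              [1.0,   0.0, 1.0  ],
              [0.707, 1.0, 0.707]])    # 3 x 3 linking weights
alpha_L, alpha_theta = 0.01, 20.0      # decay constants
q = 200                                # number of PCNN iterations
window = 3                             # window size for SML, L_VI and L_VC
```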

Fig. 4 Referenced and blurred Lena images. a Referenced image; b blurred on the right; c blurred on the left

Fig. 5 The experimental results of fusing the simulated multi-focus images: a–h the fusion results of GP, DWT, SIDWT, CT, NSCT, NSCT+PCNN, NSCT+SF_PCNN and the proposed method, respectively; i–p the residual images between the eight fusion results and the source image in Fig. 4b

4.1 Experiments on simulated multi-focus image fusion

The first experiment is conducted on an extensive set of artificially generated multi-focus images. Figure 4a shows the reference Lena image, and Fig. 4b, c show two artificially blurred images obtained by Gaussian filtering. The fusion experiments are then carried out using the eight fusion methods mentioned above. The fusion results of the different methods and their corresponding residual images are depicted in Fig. 5. In the residual images, lower residual features mean that more detail from the focused region has been transferred to the fused image, i.e., the corresponding fusion method is better. From Fig. 5 we can observe that the residual image between the result of the proposed method and Fig. 4b is almost zero over the left region. In other words, the proposed method has extracted the most information from the clear (focused) region, the left side of Fig. 4b. From the fusion and residual results, we can also easily see that the GP method has the worst performance among the eight methods.

In order to objectively evaluate the performance of these algorithms, two evaluation criteria, the root mean square error (RMSE) [4] and the structural similarity measure (SSIM) [23], are employed in this paper. RMSE and SSIM measure the error and the similarity between a reference image and a fused image, respectively. The smaller the RMSE, the better the performance of the fusion algorithm; conversely, the larger the SSIM, the better the performance. Figure 6 shows the RMSE and SSIM values obtained between the fusion results of Fig. 5a–h and the reference image of Fig. 4a. As seen from Fig. 6, the RMSE of the proposed method is the minimum and its SSIM is the maximum, which means our method obtains a fused result closer to the reference image than the other seven fusion methods, i.e., the proposed method achieves the best fusion result.
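As an illustrative sketch, RMSE and SSIM between a reference image and a fused result can be computed as follows; SSIM is taken here from scikit-image, which is one common implementation of the measure in [23].

```python
import numpy as np
from skimage.metrics import structural_similarity

def rmse(reference, fused):
    """Root mean square error between the reference and the fused image."""
    diff = reference.astype(float) - fused.astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))

# SSIM; data_range must match the image bit depth (255 for 8-bit images).
# ssim_value = structural_similarity(reference, fused, data_range=255)
```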

Fig. 6 Comparison of RMSE and SSIM of different fusion methods on simulated multi-focus images (methods 1–8 are GP, DWT, SIDWT, CT, NSCT, NSCT+PCNN, NSCT+SF_PCNN and the proposed method, respectively)

Fig. 7 Standard test images

Table 1 Comparison on objective criteria of different methods in simulated multi-focus images

In addition to the above evaluation, Table 1 reports two other frequently used metrics, mutual information (MI) [24] and the edge-based similarity measure \(Q^{AB/F}\) [25], to assess the results obtained by these fusion methods. MI indicates how much information the fused image conveys about the source images; the greater the value of MI, the better the fusion effect. \(Q^{AB/F}\) measures how well the edges of the input images are transferred to the fused image; the closer its value is to one, the better the fused image. From Table 1, we can observe that the MI and \(Q^{AB/F}\) values of the proposed method are larger than those of the other seven fusion methods.
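Mutual information between one source image and the fused image can be estimated from the joint gray-level histogram, as in the sketch below; the fusion metric of [24] is then the sum of the MI values obtained for the two source images.

```python
import numpy as np

def mutual_information(img_a, img_f, bins=256):
    """MI between one source image and the fused image via the joint histogram."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_f.ravel(), bins=bins)
    pxy = joint / joint.sum()                       # joint probability
    px = pxy.sum(axis=1, keepdims=True)             # marginal of the source image
    py = pxy.sum(axis=0, keepdims=True)             # marginal of the fused image
    nz = pxy > 0                                    # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```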

Fig. 8 Comparison of RMSE and SSIM of different fusion methods on standard test images (methods 1–8 are GP, DWT, SIDWT, CT, NSCT, NSCT+PCNN, NSCT+SF_PCNN and the proposed method)

To further verify the effectiveness of the proposed method, tests were then performed on ten groups of synthetic multi-focus images, created by applying Gaussian blurring to the ten standard images shown in Fig. 7. The ten groups of simulated images are fused by the eight methods listed previously. Figure 8 shows the average RMSE and SSIM values of the different fusion methods on the standard test images. It is easy to see that in all ten sets of fusion results, the SSIM and RMSE values of the proposed method outperform those of the other seven methods. In addition, Table 2 shows the average MI and \(Q^{AB/F}\) values of the different fusion methods. Again, the MI and \(Q^{AB/F}\) values of the proposed method are the largest among the eight fusion methods. Therefore, the four optimal criteria values indicate that the fused image obtained by the proposed method retains much more focused information and achieves higher similarity and correlation with the source images.

Table 2 Comparison on objective criteria of different methods in standard test images
Fig. 9 Source images “disk” and the fusion results. a The first source image with focus on the right. b The second source image with focus on the left. c GP result. d DWT result. e SIDWT result. f CT result. g NSCT result. h NSCT-PCNN result. i NSCT-SF_PCNN result. j Result of the proposed algorithm. k–r Zoomed-in regions of the eight fused results marked by red rectangles. s–z The residual images between the eight fusion results and source image b

Table 3 Comparison on objective criteria of different methods in real multi-focus images

4.2 Experiments on real multi-focus image fusion

The second part of the experiment is conducted on three sets of commonly used real multi-focus images, namely pepsi, lab and disk, to further verify the validity of the proposed method. Each set includes two source images, one focused on the right and the other focused on the left. An example data set, the fusion results and the corresponding residual images between each fused result and the source image focused on the left (foreground) are shown in Fig. 9. From the residual images, it is easy to see that the proposed method has extracted much more information than the other seven methods in the left focused areas, and almost all of the useful information in the focused regions of the source images has been transferred to the fused image. Considering the “disk” images in Fig. 9, we can see that the GP-based fusion result has lower contrast, while the DWT- and CT-based fusion results show significant pseudo-Gibbs phenomena, as marked by the red rectangles. It is difficult to visually discriminate between the proposed method and the SIDWT- and other NSCT-based fusion results because all of these methods are shift-invariant. However, from the residual images in Fig. 9, we can easily see that the residue between the result of the proposed method and the source image is the smallest, as can be seen in the foreground of Fig. 9z.

In order to better evaluate these fusion methods on the three groups of real multi-focus images, quantitative assessment of the eight methods is needed. However, since no reference images are available in real applications, only MI and \(Q^{AB/F}\) are used to evaluate the performance. The quantitative results are given in Table 3. From Table 3, it is easy to see, as in Tables 1 and 2, that the proposed method performs best among the eight methods because it has the largest values of both MI and \(Q^{AB/F}\), which means the fused image generated by the proposed method contains more information from the source images and reflects their details more accurately.

5 Conclusions

In this paper, we propose a novel multi-focus image fusion technique based on the NSCT and a fuzzy-adaptive PCNN. After the source images to be fused are decomposed by NSCT, the SML is used to motivate the PCNN neurons, and the linking strength of the PCNN is determined automatically by calculating the fuzzy membership value of each pixel. The fuzzy logic is designed by considering human visual perception characteristics for the different frequency subbands, which yields the proposed fuzzy-adaptive PCNN model. Experimental results indicate that the proposed method outperforms several widely used fusion methods, showing that it is effective and promising. Although the NSCT and PCNN algorithms are time-consuming, we believe that with a more efficient implementation, for example in C++, the running time of the method can be significantly reduced in future work.