1 Introduction

The motivation for the study of multi-focus image fusion is to provide a practical solution to the limited adaptive focusing ability of an imaging device. Normal imaging devices generally contain converging (convex) lenses, which capture a bundle of light rays originating from a specific point on the object and converge them to a single point in the focal plane. Such lenses are unable to produce a homogeneously focused capture of an object or scene because of their limited range of focus, or Depth of Field (DoF). In optics, the range of focus is defined as the distance between the nearest and the farthest objects in a scene that appear acceptably sharp in an image. Objects lying within the focal range appear sharper and clearer than objects lying outside it. A focused capture of a scene contains more feature details and spatial information such as edges, color, contours, texture, and intensity. Achieving uniform focus throughout the image is therefore necessary for better human understanding and machine perception. This can be achieved by a multi-focus fusion algorithm, which takes two or more input images with different levels of focus and produces a more interpretable all-in-focus image. The focused image can be utilized in solving many image processing problems such as feature extraction, contour detection, object segmentation, and recognition. Image fusion is widely applicable in the fields of computer vision, geographic information systems, biomedical research, robotics, navigation, and surveillance systems [2].

2 Related work

Fusion has been a widely researched domain in the image processing literature, and it continues to be an active area of research. Image fusion algorithms are generally categorized by (a) the level of abstraction/information representation (pixel, feature, and decision), (b) the domain (spatial, spectral, and hybrid), and (c) the type of source images (multi-focus, multi-sensor, multi-exposure, and multispectral) [15]. Several algorithms for multi-focus image fusion have been introduced for both gray-scale and color images in the literature. Pixel-level fusion methods can be further divided into three groups: coefficient-based, window-based, and region-based. Coefficient-based methods follow a general three-step procedure: 1) application of an appropriate transform to the image pixels to obtain the domain-specific coefficients, 2) combination of the coefficients so obtained using suitably devised rules, and 3) reconstruction of the fused image by taking the inverse transform. Notable transforms used in the literature include the discrete cosine transform (DCT), discrete wavelet transform (DWT), dual-tree complex wavelet transform (DTCWT), log-Gabor transform, discrete cosine harmonic wavelet transform (DCHWT), curvelet, contourlet and ripplet [7] transforms, non-subsampled contourlet transform (NSCT), scale-invariant feature transform (SIFT), and non-subsampled shearlet transform [12]. The choice of transform depends on the type of source images to be fused. Transform domain methods modify the pixel values and require perfect inversion to the spatial domain for proper visualization. Pixel-based methods in the spatial domain directly manipulate the pixel intensities or integrate spatial features (both global and local) extracted from the source images to achieve a uniformly focused image. A few of the simplest approaches include simple or weighted averaging and selection of the maximum or minimum pixel. The evaluation of focus measures (e.g., spatial frequency, sum-modified-Laplacian, Tenengrad measure, energy of gradient, index of fuzziness, and moment-based measures) on a block of pixels to measure/rank the activity level of a pixel/block has been popular in this domain [10, 30]. Spatial-level image fusion is also performed using different (a) color spaces (HSI, RGB, LUV), (b) dimensionality reduction techniques (principal component analysis (PCA) [27], independent component analysis (ICA) [20]), and (c) specialized filters (fast filter [38], guided filter [13], edge-preserving filters such as the bilateral filter [25]). The methods discussed above are also applied in one or more combinations at multiple resolution levels (multi-resolution methods) to extract the coarse and fine features from the image, where the number of resolution levels largely determines the quality of the fused images.

The concepts of machine learning and deep learning [36] have also been utilized to develop image fusion algorithms, which use the pulse-coupled neural network (PCNN) [11], convolutional neural network (CNN) [31], multi-scale CNN [4], etc. However, these methods require the network to be trained with numerous training samples prior to testing [18]. Zhang et al. have presented a detailed review of multi-focus fusion techniques based on sparse representation [23, 40]. Sparse coding mechanisms, which simulate the behaviour of human vision, are also used to develop algorithms based on dictionaries and sub-dictionaries, such as convolutional sparse representation (CSR) [34, 37] and adaptive sparse representation (ASR) [1, 33]. However, computational complexity remains a major issue in these algorithms [5, 16].

Window/coefficient based methods at the pixel level often introduce intensity variations, blocking artifacts, blurring effects, and sensitivity to noise in the fused images, which motivated the introduction of region-based fusion approaches. In such techniques, the irregular semantic regions are segmented/extracted first, leading to the creation of a joint/separate segmentation map prior to fusion [19]. Morphology has been repeatedly used to distinguish between focused and defocused pixels. Li et al. proposed a matting-based fusion approach where the focused region is roughly obtained by morphological filtering, followed by image matting to distinguish the foreground from the background [14]. A novel algorithm to obtain a boundary between the focused and defocused regions within the image using a multi-scale morphological focus measure is proposed in [35]. Baohua et al. used a graph-based visual saliency (GBVS) algorithm followed by the morphological watershed transform to extract the focused regions [39]. Similar region partitioning strategies proposed for image fusion can be found in [6, 17]. In this paper, a novel region-based algorithm for multi-focus images is proposed where the focused region is obtained from the edges of the source images, based on the observation that the focused portion of an image contains more prominent and clear edge features. Hence, the edge images are used as a starting point to form a region for the fusion process. The efficacy of the proposed method has been evaluated qualitatively and quantitatively with appropriate fusion metrics and compared with the state of the art.

The contributions of the paper are as follows:

  • It is a region-based fusion method which uses edge features as a basis for focus/saliency detection.

  • The threshold values used in the edge detection procedure do not require any tuning.

  • The focused edges are completely separated from the defocused ones using morphological edge reconstruction, and are gradually converted into a region.

  • It produces significantly good results in less execution time and works for multiple-focus situations, as evident from the experimental results.

The rest of the paper is organized as follows: Section 3 consists of a detailed description of the proposed fusion algorithm. Section 4 presents the experimental results along with subjective and objective evaluations with some future directions. The concluding remarks are drawn in Section 5.

3 Proposed method

A multi-focus fusion algorithm should be able to isolate and extract the maximum amount of focused information from the source images so as to construct the resultant fused image by combining them. Accordingly, the proposed algorithm first separates the well-focused features from the ill-focused features (Fig. 1a and c). The constituent color images are first converted into YUV format using the weighted sum of the red, green, and blue channels, I = 0.299 ∗ R + 0.587 ∗ G + 0.114 ∗ B, prior to extracting the edge features [8]. The edge features of the constituent images are then extracted by employing the zero-crossing edge detector based on the Laplacian of Gaussian (LoG) given in (1). The focused edges in the component images are those which have a significant concentration of foreground pixels in the edge image, as shown in Fig. 1b and d.

$$ LoG(x,y)=\left[\frac{x^{2}+y^{2}-2\sigma^{2}}{\sigma^{4}}\right]e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} $$
(1)
Fig. 1

a Focus on Background; b Edge Image of (a); c Focus on Foreground; d Edge Image of (c)

For each of the source images, edge detection is performed using two edge operators of varying detection strength, i.e., (a) zero-crossing with LoG and (b) the Canny edge operator [3]. The edge output obtained from (a) is used in Section 3.1 as a starting edge map to detect the focused edge pixels. The edge image obtained from (b) is used for two different purposes: 1) as a reference image for correcting the discontinuities that may arise as a result of the block-wise operation (Section 3.3), and 2) in Section 3.3.1, in an XOR operation to find the number of edge pixels that lie in the focused region. The reason for choosing two separate edge detectors is as follows: being a weak edge detector, the zero-crossing operator detects only the edges belonging to the focused region of the image (Fig. 2b and c). On the other hand, the Canny operator performs optimal edge detection and produces extra edges belonging to the poorly focused regions, which may be connected with the edges from the focused regions (Fig. 2f). Such extra edges are undesirable here; to mitigate the problem, the upper and lower threshold values used in the Canny edge detection algorithm could be fine-tuned to control the amount of edges detected. However, this approach becomes time consuming and image dependent because the focused features in individual source images are present at different locations.
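To make this preprocessing concrete, the sketch below (in Python with OpenCV, whereas the paper's implementation is in MATLAB) converts a color image to its luminance channel and computes the two edge maps. The σ value and the Canny thresholds shown here are illustrative placeholders, not the parameters used in the experiments.

```python
import cv2
import numpy as np

def edge_maps(bgr):
    """Sketch: luminance conversion followed by the two edge detectors
    discussed above (zero-crossing of LoG, and Canny)."""
    # Y = 0.299 R + 0.587 G + 0.114 B (OpenCV loads images as BGR)
    y = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)

    # Laplacian of Gaussian: Gaussian smoothing followed by the Laplacian
    log = cv2.Laplacian(cv2.GaussianBlur(y, (0, 0), sigmaX=2.0), cv2.CV_64F)

    # Simplified zero-crossing test: mark a pixel if the LoG response
    # changes sign relative to its right or lower neighbour
    zc = np.zeros_like(log, dtype=np.uint8)
    zc[1:-1, 1:-1] = (
        (np.sign(log[1:-1, 1:-1]) != np.sign(log[2:, 1:-1])) |
        (np.sign(log[1:-1, 1:-1]) != np.sign(log[1:-1, 2:]))
    ).astype(np.uint8)

    # Canny edge map with placeholder thresholds
    canny = cv2.Canny(y.astype(np.uint8), 50, 150) // 255
    return zc, canny
```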

Fig. 2

a Y component (Focus on the Background); b Edge image using Zero-cross; c Local magnified region of (b); d Edge image using Canny; e Local magnified region of (d); f Extra edges captured from ill-focused portion (Yellow Box)

3.1 Block-wise region comparison

The binary edge images obtained from the zero-crossing edge operator separate the set of focused features from the defocused ones to a significant extent. The concentration of foreground (white) pixels in the focused regions is higher than in the out-of-focus regions. However, due to false alarms in the edge detection process, some of the non-focused pixels also contribute to the foreground, and these are subsequently removed by an additional block-based region comparison algorithm [29]. In this algorithm, we compare the corresponding non-overlapping blocks of n × n pixels of the candidate images and construct the focused edge images as:

$$ E_{FA}(m,n)= \begin{cases} 1,& \text{if}\ S_{A}>S_{B}\ \text{\ and}\ E_{RA}(m,n)=1 \\ 0,& \text{others} \end{cases} $$
(2)
$$ E_{FB}(m,n)= \begin{cases} 1,& \text{if}\ S_{A}<S_{B} \ \text{and}\ E_{RB}(m,n)=1\\ 0,& \text{others} \end{cases} $$
(3)

where, EFA and EFB are the edge pixels corresponding to focused edge features in the source images A and B respectively,

$$S_{A}=\sum\limits_{(m,n)\in W_{n\times n}}{E_{RA}}$$
$$S_{B}=\sum\limits_{(m,n)\in W_{n\times n}}{E_{RB}}$$

and SA and SB denote the total contributions of white pixels within an n × n block of the binary edge images ERA and ERB obtained using the zero-crossing operator.

$$W_{n \times n}=\text{window of $n \times n$ pixels}$$

This method eliminates most of the foreground pixels belonging to the out-of-focus edges of the image and retains the pixels belonging only to the focused area, as shown in Fig. 3b and d. Nonetheless, this process comes with two drawbacks, as mentioned below:

  • As shown in Fig. 3e (yellow box), in spite of the removal of most of the blurred edge pixels, a number of white pixels still remain in the image. The block-wise area comparison algorithm is unable to remove these white pixels because they are isolated in nature.

  • The algorithm discussed above makes a comparison between local regions of the two source images, and this comparison tends to introduce breaks and disconnections in the outer edges of the main structure in the focused portion, as shown in Fig. 3e (green box).

The above two drawbacks are removed by performing the following steps (Sections 3.2 and 3.3).
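For reference, a minimal sketch of the block-wise comparison of Section 3.1 ((2) and (3)) is given below; the binary edge maps are assumed to be NumPy arrays, and the block size n = 8 is an arbitrary placeholder rather than the value used in the experiments.

```python
import numpy as np

def blockwise_compare(era, erb, n=8):
    """Within each non-overlapping n x n block, keep the zero-cross edge
    pixels of whichever source image contributes more white pixels.
    era, erb are binary (0/1) edge maps of identical size."""
    efa = np.zeros_like(era)
    efb = np.zeros_like(erb)
    H, W = era.shape
    for i in range(0, H, n):
        for j in range(0, W, n):
            sa = era[i:i+n, j:j+n].sum()   # S_A for this block
            sb = erb[i:i+n, j:j+n].sum()   # S_B for this block
            if sa > sb:
                efa[i:i+n, j:j+n] = era[i:i+n, j:j+n]
            elif sb > sa:
                efb[i:i+n, j:j+n] = erb[i:i+n, j:j+n]
    return efa, efb
```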

Fig. 3

a, c Before region comparison; b, d After region comparison; e Isolated pixels (yellow box) and introduced breaks (green box)

3.2 Removal of isolated points

The isolated white foreground points against the dark background, discussed above for both source images, are removed by performing binary morphological dilation (4) followed by the selection of the largest connected component. The binary edge images obtained so far consist of edge pixels distributed in a disconnected manner, resulting in more than one connected component. Components that are not connected but exhibit coherence in terms of size and local orientation are likely to be parts of bigger components. The dilation operation aims to connect such coherent yet isolated components, provided they are in close proximity as assessed by the size of the structuring element used in the dilation. Binary morphological dilation thickens object boundaries and causes such regions to grow. It therefore reduces the number of connected components in the edge image by merging a significant number of isolated components. The number of connected components before and after dilation for some of the source images is shown in Table 1. The reduction in the number of components keeps the edge structure of the image intact. After dilation, the largest component is kept and all smaller components (which also include the isolated pixels) are discarded. For all the input images, this combined approach of dilation followed by largest-component selection removes all the isolated pixels, thereby leaving only the focused edge region. However, the binary dilation operation saturates after a certain number of iterations and fails to restore any wide gaps/breaks that exist in the outer boundary of the images. To recover the original edge structure of the image, morphological reconstruction is performed, as briefly discussed in the next section. This is where the Canny edge counterpart of the source image comes into play.

$$ \textbf{Dilation:} \quad I\oplus s =\{z \mid (\hat{s})_{z}\cap I \neq \emptyset\} $$
(4)

where, I and s are the image and structuring element respectively.
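A possible realization of this step, assuming scikit-image and a disk of radius 1 as the structuring element (the actual size used in the experiments is the one listed in Table 4), is sketched below.

```python
import numpy as np
from skimage.measure import label
from skimage.morphology import binary_dilation, disk

def largest_component_after_dilation(edge_map, radius=1):
    """Dilate the binary edge map to merge nearby coherent fragments, then
    keep only the largest connected component, discarding isolated pixels."""
    dilated = binary_dilation(edge_map.astype(bool), disk(radius))
    labels = label(dilated, connectivity=2)   # 8-connectivity
    if labels.max() == 0:                     # no foreground pixels at all
        return dilated
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                              # ignore the background label
    return labels == sizes.argmax()
```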

Table 1 Number of connected components: before and after dilation

3.3 Morphological edge reconstruction

Morphological reconstruction is a process in which an image known as the 'marker' image is repeatedly morphologically processed/modified based on the characteristics/features of another image called the 'mask' [26]. The reconstruction takes place on the basis of a specified connectivity. The marker and mask images should preferably be of the same size, and the number of foreground elements in the marker image should be less than or equal to that in the mask image. In this context, the respective dilated edge images (obtained in Section 3.2), formed after selecting the largest component, become the marker images, which are morphologically reconstructed with respect to the Canny edge image chosen as the mask. The Canny edge image provides strongly connected edges containing the maximum amount of edge information. This reconstruction procedure successfully restores the wide breaks introduced by the block-wise comparison. Figure 4 shows the reconstruction taking place for the source images mentioned above. Note that the number of edge pixels in the reconstructed image is always less than or equal to that in the corresponding Canny edge image. To simplify the process of region conversion, only one of the reconstructed images is considered for region conversion. Hence, the next task is to identify the reconstructed image that is structurally complete enough to be converted into a region. This is done with the help of a focus measure, as discussed in the next section.
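A compact sketch of this step using scikit-image's morphological reconstruction is shown below; it assumes binary NumPy arrays for the marker (the largest-component image of Section 3.2) and for the Canny mask.

```python
import numpy as np
from skimage.morphology import reconstruction

def reconstruct_edges(marker, canny_mask):
    """Regrow broken focused edges along the strongly connected Canny edges
    via morphological reconstruction by dilation."""
    # The marker must lie under the mask for reconstruction by dilation.
    marker = np.logical_and(marker, canny_mask)
    rec = reconstruction(marker.astype(np.uint8),
                         canny_mask.astype(np.uint8),
                         method='dilation')
    # The result never contains more edge pixels than the Canny mask itself.
    return rec > 0
```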

Fig. 4

a Foreground and b Background edge image with breaks (green box) and isolated pixels (yellow box); c Foreground and d Background edge images after reconstruction

3.3.1 Choosing the best reconstructed image

For selecting the best-reconstructed image, we use a focus measure based on morphological filtering, as used in [14]. It uses the bright and dark top-hat transforms to extract the high-frequency details of the image. The initial focus information map (Fig. 5b, b1) is generated by taking the maximum value of the two transforms at each pixel location using (5):

$$ {F_{I}}(x,y)=max[{T^{b}_{I}}(x,y),{T^{d}_{I}}(x,y)] $$
(5)

where,

$${T^{b}_{I}} (x,y)=\text{Bright top-hat transform of I(x,y)}$$
$${T^{d}_{I}} (x,y)=\text{Dark top-hat transform of I(x,y)}$$

The image obtained is binarized using the condition given by (6), which provides better visualization of the blurred and prominent pixels for further processing (Fig. 5c, c1).

$$ {R_{I}}(x,y)= \begin{cases} 1,& {F_{I_{1}}(x,y)>F_{I_{2}}(x,y)} \\ 0,& \text{otherwise} \end{cases} $$
(6)
Fig. 5

Process representing the selection of best reconstructed image: a, (a1) Color/Gray-scale image; b, (b1) Initial focus map; c, (c1) Binary thresholded focus map; d, (d1) Cleaned focus map; e, (e1) Rough focus region; f, (f1) Resultant XOR image; g, (g1) Reconstructed Image

The output from (6) provides a better demarcation of the focused features from the blurred ones. The images obtained from (6) are further cleaned by a sequential operation of morphological closing, \(I\bullet s = (I\oplus s)\ominus s\), followed by hole-filling, so that the focused regions are obtained in a more prominent form (Fig. 5e, e1). Now, an exclusive OR operation is performed between this binary image and the Canny edge image to count the number of edge pixels lying within the focused regions. To select the best reconstructed image obtained in Section 3.3, a simple rule is followed which uses the difference between the count of dark pixels against a bright background and vice-versa. Let Pd be the number of dark pixels within the white area of the XOR image (Fig. 5f, f1) and Pb the number of bright pixels in the reconstructed image (RFA, RFB). Let \(D_{I_{A}}\) and \(D_{I_{B}}\) be the absolute differences in the pixel counts, given by \(D_{I_{A}}=|P_{d}(I_{A})-P_{b}(I_{A})|\) and \(D_{I_{B}}=|P_{d}(I_{B})-P_{b}(I_{B})|\). Then, the best reconstructed edge image (IR) is given by,

$$ I_{R}= \begin{cases} R_{F_{A}}, & \text{if}\ D_{I_{A}}<D_{I_{B}}\\ R_{F_{B}}, & \text{if}\ D_{I_{B}}<D_{I_{A}} \end{cases} $$
(7)

The image with the smaller absolute difference in pixel count is considered to be the best-reconstructed image, because it indicates that more of the focused edge pixels have been recovered. The observations for some of the source images are presented in Table 2 below, which show that the near-focused (foreground) object is better reconstructed than the far-focused (background) one.
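The sketch below illustrates the focus measure of (5) and the selection rule of (7), assuming the pixel-count differences \(D_{I_{A}}\) and \(D_{I_{B}}\) have already been computed; the top-hat structuring element radius is a placeholder.

```python
import numpy as np
from skimage.morphology import white_tophat, black_tophat, disk

def focus_map(gray, radius=5):
    """Focus measure of (5): pixel-wise maximum of the bright (white) and
    dark (black) top-hat transforms.  The disk radius is a placeholder."""
    se = disk(radius)
    return np.maximum(white_tophat(gray, se), black_tophat(gray, se))

def best_reconstruction(rfa, rfb, d_a, d_b):
    """Selection rule of (7): keep the reconstructed edge image whose
    absolute pixel-count difference (D_IA or D_IB) is smaller."""
    return rfa if d_a < d_b else rfb
```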

Table 2 Difference in Pixel Count

3.4 Region conversion

The objects in the edge images are generally disconnected in nature, with more than one connected edge component. The best-reconstructed edge image selected using (7) is converted into an approximate binary region by carrying out an iterative step combining a morphological closing operation and hole filling. For binary images, the holes are defined as the set of regional minima that are not connected to the image border. The holes are filled by removing all minima not connected to the image border, using morphological reconstruction by erosion. The marker image is set to the maximum value except along its border, where the values of the original image are kept. The closing operation in a binary image enlarges the foreground regions while keeping the original boundary intact, and the hole filling operation speeds up the region conversion by reducing the number of iterations (Fig. 6). To fill in the gaps between the edges, a square matrix is chosen as the structuring element for the closing operation, the dimension of which changes with every iteration. The number of iterations required until saturation differs for each input edge image. The procedure is summarized in Algo. 2. It is observed that two different types of regions are formed:

  a) Type I: Regions touching the image boundary: For regions touching the borders, the borders are identified as left, right, top or bottom (Fig. 6a and b). The small gaps between the objects and the identified image borders, if any, are filled with bright pixels.

  b) Type II: Regions away from the image boundary: In this case, the region images obtained after execution of Algo. 2 are already suitable for the fusion procedure and do not require any further processing.

An extra post-processing step is carried out for all the rough region images (Type I or Type II) to detect and fill leftover holes, if any. Post-processing of the images: For certain region images, the hole filling scheme described above is not capable of removing all types of holes from the regions. For example, there may be holes in the regions which are connected to the background through a narrow constriction (Fig. 6(c5)). To fill such holes, we first dilate the image so as to close the narrow constrictions, which leaves the bigger holes isolated from the background. Subsequently, we detect such holes by computing the Euler number of the resulting regions.

  Detection of holes using the Euler number (Eu): The Euler number is defined as the difference between the number of connected components and the number of holes in a binary image. Since each region image obtained at this stage consists of a single component, an image containing holes will have a zero or negative Euler number. Hence, the Euler number is calculated after dilation to detect the presence of holes.

    $$ E_{u}= \begin{cases} 1, & \text{no hole is present}\\ \leq 0, & \text{holes are present} \end{cases} $$
    (8)

After hole detection and filling, the resulting image is eroded again with the same structuring element to bring the regions back to their original configuration, but with the holes removed (Fig. 6(c6)). The region image formed at this step acts as the initial focus information map for the fusion procedure.
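A rough sketch of the region conversion and post-processing described above is given below; the structuring element sizes, the iteration count, and the use of scikit-image/SciPy are assumptions rather than the paper's exact settings.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.measure import euler_number
from skimage.morphology import binary_closing, binary_dilation, binary_erosion

def _sq(k):
    """Square structuring element of side k."""
    return np.ones((k, k), dtype=bool)

def edges_to_region(edge_img, max_iter=10):
    """Iteratively close the edge image with a growing square element and
    fill holes, then use the Euler number to detect and remove holes that
    survive behind a narrow connection to the background."""
    region = edge_img.astype(bool)
    for k in range(1, max_iter + 1):
        region = binary_closing(region, _sq(2 * k + 1))
        region = ndi.binary_fill_holes(region)

    # Post-processing: dilate to seal narrow constrictions, check the Euler
    # number (components minus holes), fill, then erode back.
    dilated = binary_dilation(region, _sq(3))
    if euler_number(dilated) < 1:              # holes are still present
        dilated = ndi.binary_fill_holes(dilated)
    return binary_erosion(dilated, _sq(3))
```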

Fig. 6

Region conversion: a and b Type-I image; (a1-a4,b1-b4) Intermediate Images; (a4,b4) Partial region image; (a5,b5) Final region image after border detection and filling; c Type-II image; (c1-c5) Intermediate Images; (c5) Partial region image; (c6) Final region image after post-processing

3.5 Construction of decision map

A decision map (D) corresponding to the input images (color or gray-scale) needs to be constructed after the region formation described above.

Gray-scale images:

For gray-scale images, the decision map is simply the output from the previous step, i.e., \(D_{I_{G}}=R_{I_{F}}\).

Color images:

For color input images, the binary region image from the previous step is converted into a two-color decision map as per (9):

$$ D_{I_{C}}(x,y)= \begin{cases} C_{1}, & \text{if}\ R_{I}(x,y)=0\\ C_{2}, & \text{if}\ R_{I}(x,y)=1\\ \end{cases} $$
(9)

where C1 and C2 refer to two different colors.


3.6 Image fusion

The fused image is formed by utilizing the decision map obtained in the previous section for gray-scale as well as color images. Depending on the decision map, the pixels from the respective source images are selected to construct the final fused image (Fig. 7). The fusion process is executed by using the fusion rule as specified in (10) and (11).

Fig. 7

Diagram representing the fusion process using the decision map for (a) color image; (b) grayscale image

Gray-scale images:

$$ I_{F_{G}}(x,y)= \begin{cases} I_{1}(x,y), & \text{if}\ D_{I_{G}}(x,y)=0\\ I_{2}(x,y), & \text{if}\ D_{I_{G}}(x,y)=1\\ \end{cases} $$
(10)

where \(D_{I_{G}}\) is the binary decision map.

Color images:

$$ I_{F_{C}}(x,y)= \begin{cases} I_{1}(x,y), & \text{if}\ D_{I_{C}}(x,y)=C_{1}\\ I_{2}(x,y), & \text{if}\ D_{I_{C}}(x,y)=C_{2}\\ \end{cases} $$
(11)

where \(D_{I_{C}}\) is the colored decision map. The overall process flowchart of the fusion algorithm is given in Fig. 8.
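The fusion step itself reduces to a per-pixel selection, as sketched below for both gray-scale and color inputs; here the two decision-map "colors" of (9) are represented simply by the binary values 0 and 1.

```python
import numpy as np

def fuse(i1, i2, decision_map):
    """Fusion rule of (10)-(11): copy pixels from the first source where the
    decision map is 0 and from the second source where it is 1.  For color
    images the same map is applied to all three channels."""
    d = decision_map.astype(bool)
    if i1.ndim == 3:                 # color: broadcast the map over channels
        d = d[..., None]
    return np.where(d, i2, i1)
```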

Fig. 8

Flowchart representing the fusion process: a and b Color Image; (a1,b1) Y-Component; (a2,b2) Canny edge image; (a3,b3) Zero-cross edge image; (a4,b4) Block-wise compared image; (a5,b5) Dilated images; (a6,b6) Removed isolated points; (a7,b7) Reconstructed image; c Best reconstructed image; d Region image; e Color decision map; f Fused image

4 Experiment results and discussion

4.1 Parameters

The parameters associated with this experiment are: the block size (n), the standard deviation (σ1) of the Gaussian filter used in Canny and (σ2) of the LoG filter, the threshold values (t1, t2) and (t3) used in Canny and zero-cross edge detection respectively, and the structuring element (s) used for the morphological operations. The block size used in the region comparison to separate the focused edges is chosen by trial and error such that it produces visually superior results. The threshold values for edge detection are chosen heuristically by the edge detectors depending on the features present in the source images. All the threshold values lie between 0 and 1, some of which are presented in Table 3. The optimal value of σ in the Gaussian kernel depends on image factors such as the resolution of the image and the size of the objects in it. For all such image-dependent parameters, the default values are used without any tuning. For the binary morphological operations, a disk-shaped structuring element is chosen because of its isotropic property, which retains the image details without introducing block effects. Here, we have adopted the minimum value for the size of the structuring element. The values used for the above-mentioned parameters are provided in Table 4.

Table 3 Threshold values used in edge detection
Table 4 Values of the parameters used in the experiment

4.2 Comparison methods

The proposed method has been tested on 30 pairs of 2-D color and gray-scale multi-focus images, and the results are compared with other representative image fusion algorithms in terms of visual/subjective perception and objective evaluation. In the paper, we present the results for 16 pairs. The algorithms selected for comparison are CPU-executable pixel/region based fusion methods implemented in the gradient domain (GD) [21], adaptive block-based discrete wavelet transform (DWT-AB) [32], image matting (IM) [14], and Gaussian curvature filtering (GCF) [24]. Additionally, to establish the efficacy of the proposed method against the state of the art, it has also been compared with two learning-based methods, namely adaptive sparse representation (ASR) [33] and convolutional neural networks (CNN) [31]. In the GD based method, a 'weighted sum' is adopted to fuse the chrominance channels, while the fused luminance channel is obtained by a gradient reconstruction technique based on 'Haar' wavelets. DWT-AB employs the discrete wavelet transform with three levels of decomposition, in which the low-frequency coefficients are fused using an adaptive block method and the high-frequency components are combined using local wavelet energy. IM is a spatial method which applies an image matting technique to a roughly segmented result obtained by using the morphological bright and dark top-hat transforms as the focus measure. GCF initially uses a Gaussian curvature filter to obtain the salient (sharpest) regions from the source images, followed by a coarse fusion map based on a synthetic focusing degree criterion combining spatial frequency and local variance. ASR is a learning-based fusion approach using sub-dictionaries, whereas the CNN method uses convolutional neural networks for the same purpose. The proposed algorithm, as well as the comparison methods, are implemented in the Matlab programming language on a computer with a 2.4 GHz CPU and 4 GB RAM.

4.3 Objective evaluation

The quality of the fusion results can be quantitatively assessed using several fusion metrics, which may or may not take a reference (ground truth) image into account. In most cases, no-reference fusion metrics are adopted due to the unavailability of a perfect ground truth image. The concept of a perfectly fused image is subjective in nature because it is quite challenging to measure the 'perfectness' of a fused image. Moreover, the quality of the source images, mis-registration, and illumination defects directly impact the results as well. Also, the process of combining information from the source images might create additional effects such as contrast enhancement in the fused image, which is not desirable. So, for the objective evaluation of the fused results, fusion quality indices are extensively used by researchers. Here, the results of the proposed algorithm are numerically evaluated using two groups of quality metrics: (a) image quality metrics and (b) fusion quality metrics. Average pixel intensity (API), standard deviation (SD), and average gradient (AG) judge the quality of the fused image beyond fusion (e.g., enhancement of features) and are independent of the source images. On the other hand, feature mutual information (FMI), the gradient-based metric (\(Q_{AB}^{F}\)), Piella's edge-based metric (QE), and Zhao's metric (\(P_{blind}^{\prime }\)) measure the degree of fusion with respect to the source images. Each of the above metrics is briefly defined below. To maintain generality, the source images are denoted by 'A' and 'B', whereas 'F' denotes the fused image, all of dimension M × N.

  1. 1)

    Average pixel intensity (API): It serves as an index to measure the overall brightness in an image and is given by (12):

    $$ API= \frac{{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}F(i,j)}{MN} $$
    (12)

    where F(i,j) is the pixel intensity.

  2. 2)

    Standard Deviation (SD): It is defined as the square root of variance and measures the spread of the data from the mean value (13).

    $$ SD=\sqrt{\frac{{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}\left(F(i,j)-\bar{F}\right)^{2}}{MN}} $$
    (13)
  3. 3)

    Average Gradient (AG): It measures the degree of clarity and sharpness and is given by (14) (a short sketch computing API, SD, and AG appears after this list):

    $$ AG=\frac{1}{MN}{\sum}_{i=1}^{M}{\sum}_{j=1}^{N}\sqrt{\frac{(F(i,j)-F(i+1,j))^{2}+(F(i,j)-F(i,j+1))^{2}}{2}} $$
    (14)
  4. 4)

    Feature Mutual Information (FMI) [9]: It is a no-reference objective image fusion metric proposed by Haghighat et al., which measures the transfer of features from the source images to the fused image. It is based on mutual information, and the uniqueness of the algorithm lies in the choice of the gradient map as the information feature. A gradient map contains information about pixel neighborhoods, edge strength and direction, texture, contrast, and other region-based features. The authors have shown that a joint probability distribution function (JPDF) can be constructed from given marginal probability distribution functions (MPDFs). The normalized values of the gradient magnitude at each pixel location in the feature images are used as the marginal distributions. The amount of feature information transferred to F from A and B is individually measured by (15) and (16):

    $$ {I_{{FA}}=\sum\limits_{f,a}p_{{FA}}\log_{2}\frac{p_{{FA}}}{p_{{F}}.p_{{A}}}} $$
    (15)
    $$ I_{{FB}}=\sum\limits_{f,b}p_{{FB}}\log_{2}\frac{p_{{FB}}}{p_{{F}}.p_{{B}}} $$
    (16)

    The FMI metric is expressed as:

    $$ FMI^{AB}_{F}=I_{{FA}}+I_{{FB}} $$
    (17)

    The normalized FMI is formulated as:

    $$ FMI^{AB}_{F}= \frac{I_{{FA}}}{H_{{F}}+H_{{A}}}+ \frac{I_{{FB}}}{H_{{F}}+H_{{B}}} $$
    (18)

    where HA, HB and HF are histogram based entropies of the images A,B and F respectively. It lies in the interval of [0,1].

  5. 5)

    Gradient-Based fusion metric (\(Q_{AB}^{F}\)) [28]: Xydeas and Petrovic proposed a pixel-level fusion metric which measures the amount of transfer of edge information from source images to the fused image. It employs the Sobel edge detector and calculates the edge strength and orientation at each edge pixel. This metric is widely used in analyzing the edge strength and quality of fusion results. It is mathematically expressed in (19):

    $$ Q_{AB}^{F}=\frac{ {\sum}_{m=1}^{M}{\sum}_{n=1}^{N} Q_{m,n}^{AF}w_{m,n}^{AF}+ Q_{m,n}^{BF} w_{m,n}^{BF} } {{\sum}_{m=1}^{M}{\sum}_{n=1}^{N} w_{m,n}^{AF}+ w_{m,n}^{BF}} $$
    (19)

    where \(Q_{m,n}^{AF}\) and \(Q_{m,n}^{BF}\) are the edge preservation values weighted by \(w_{m,n}^{AF}\) and \( w_{m,n}^{BF}\) at coordinate (m,n), respectively. The value of this metric lies within [0,1], where a larger value indicates better performance.

  6. 6)

    Piella’s metric (QE) [22]: This metric proposed by Piella and Heijmans quantifies the transfer of salient information from the source images to the fused images. Three different indices, namely, fusion quality index (QO), weighted fusion quality index (QW), and edge dependent fusion quality index (QE) are evaluated separately. Keeping up with the context of this paper, only QE is evaluated for this algorithm because it measures the transfer of edges in the fused results. The mathematical representation of this metric is given in (20):

    $$ Q_{E}(A,B,F)=Q_{W}(A,B,F).Q_{W}(A',B',F')^{\alpha} $$
    (20)

    where α ∈ [0, 1] expresses the relative contribution of the edge images compared to the original images. The mathematical derivations of all the indices are elaborated in the original paper.

  7. 7)

    Zhao’s Metric (Pblind′) [41]: Zhao et al. have relied on image features based on phase congruency and principal moments to design a pixel level quantitative fusion metric. It is defined as the product of three separate correlation coefficients.

    $$ P'_{blind}=(P_{p})^{\alpha}(P_{M})^{\beta}(P_{m})^{\gamma} $$
    (21)

    where Pp, PM, and Pm are correlation coefficients based on phase congruency and on the maximum and minimum moments, respectively, and α, β, γ are the tunable parameters used in the algorithm. For local analysis, a block-based approach is adopted, where the final value of Pblind′ is obtained by averaging the values computed over all K blocks of the image.

    $$ P'_{blind}=\frac{1}{K}\sum\limits_{k=1}^{K}P'_{blind}(k) $$
    (22)

    The above mentioned metrics have been evaluated on sets of colored and gray-scale multi-focus image pairs as shown in Fig. 9. The values of API, SD, AG are provided in Table 6 and \(FMI^{AB}_{F}\), \(Q_{AB}^{F}\), QE and Pblind′ for all the images are presented in Table 5.
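As a reference, the three source-independent image quality metrics (12)-(14) can be computed as in the short sketch below, assuming a gray-scale fused image stored as a floating-point array.

```python
import numpy as np

def image_quality_metrics(f):
    """Compute API, SD, and AG of a gray-scale fused image f (float array)."""
    api = f.mean()                                   # average pixel intensity (12)
    sd = f.std()                                     # standard deviation (13)
    gx = f[1:, :-1] - f[:-1, :-1]                    # vertical differences
    gy = f[:-1, 1:] - f[:-1, :-1]                    # horizontal differences
    ag = np.sqrt((gx ** 2 + gy ** 2) / 2).mean()     # average gradient (14)
    return api, sd, ag
```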

Fig. 9

Sets of multi-focus images used in experiment

Table 5 Performance evaluation using fusion quality metrics
Table 6 Average values of image quality metrics

4.4 Subjective evaluation and fusion results

A good fusion algorithm should produce accurate and reliable results by simultaneously removing redundant information and integrating complementary information. Besides, it should not introduce or enhance extra features, artifacts, or inconsistencies which are not part of the source images. Below, we present a comparative discussion of the results produced by the proposed method and the other six methods. Some of the results are magnified separately to illustrate the quality of the results produced by the algorithms (see Figs. 10, 11, 12, 13 and 14). The quality of the source images determines the quality of the fusion results. All the input source images are assumed to be pre-registered.

Fig. 10

Source image ”Leg” and fusion results: a Source image with focus on the background; b Second source image with focus on foreground; c GD based result; d DWT-AB result; e IM based result; f ASR based result; g GCF based result; h CNN based result; i Result using proposed method; j Region image obtained by our method; (c1), (d1), (e1), (f1), (g1), (h1) and (i1) are the local magnified regions of c, d, e, f, g, h and i respectively

Fig. 11

Source image ”Seascape” and fusion results: Same order as in Fig. 10

Fig. 12

Source image ”Calendar” and fusion results: Same order as in Fig. 10

Fig. 13

Source image ”Jar” and fusion results: Same order as in Fig. 10

Fig. 14

Source image ”Clock” and fusion results: Same order as in Fig. 10

As evident from Fig. 10a and b, the features are not identically focused in the two source images. The results of the proposed fusion method, along with the other methods, for the first source pair are presented in Fig. 10(c-i). It may be observed that the fused image obtained by the GD based method (Fig. 10c) has highly increased contrast and brightness, which is an undesirable effect that modifies the intensity values taken from the source images. In addition, it creates a strong shadow around lighter objects within the scene (Fig. 10(c1)). In the DWT-AB based approach, there is unnecessary distortion and spreading of color components in the vicinity of objects: in Fig. 10d, the blue color gets unevenly spread around the text 'Nature', and in Fig. 10(d1), colors are introduced along the fold in the cloth within the leg which were not present earlier. Figure 10e presents the result of the IM based approach, which clearly shows a lack of clarity along the edges of the leg (Fig. 10(e1)). Figure 10f presents the result of the ASR based method, which also casts shadows around lighter objects (Fig. 10(f1)), similar to the GD based method. In the GCF based method, the details present in the stone leg are not captured in the fused image (Fig. 10g). The results obtained from CNN (Fig. 10h) and the proposed method (Fig. 10i) are quite comparable. The respective local magnified regions are presented in Fig. 10(h1) and (i1). The proposed method has produced better results with no noticeable distortion or artifacts in less time.

The second experiment has been performed on the 'Seascape' source images shown in Fig. 11a and b. There is a color disparity between the input images, the reasons for which could be many but are not known to us in particular. This has in fact affected the fusion results in all the methods except ours. The results from the GD, DWT-AB, and ASR based methods, shown in Fig. 11c, d and f, respectively, clearly show a gross change/distortion in color after the sea surface is brought into focus. This confirms that these fusion algorithms have taken color values from the foreground object in the far-focused image, which is not desirable. Moreover, in the magnified regions (Fig. 11(c1) and (d1)), the edges of the rock do not appear prominent. The magnified region of the IM based approach (Fig. 11(e1)) shows the presence of unnecessary pixels along the edges of the upper portion of the rock. The GCF based method has fused the blurred version of the sea from the near-focused image instead of capturing the details present in the far-focused source image (Fig. 11g). The result from the CNN based method lacks clarity along the curvature of the rock, as shown in Fig. 11(h, h1). The result of the proposed algorithm and its magnified region, presented in Fig. 11i and i1 respectively, outperform the other methods in terms of visual clarity. It perfectly combines the complementary features from the source images with adequate edge clarity and no color distortion, shadows, or blurriness.

In the 'Calendar' source pair, one of the source images is focused on the book with the blue background (Fig. 12a), while the other has its focus on the table calendar (Fig. 12b). The GD based result shows an increase in brightness as well as contrast, thereby creating strong shadows around the text (Fig. 12(c1)). The DWT-AB based result shows a slight spillover of blue color along the edges of the table calendar (Fig. 12(d1)). The result of the IM based technique shown in Fig. 12(e1) is satisfactory in terms of visual quality. The ASR result introduces shadows around the text 'IMAGE' in the book and at the top-left corner of the table calendar (Fig. 12(f1)). The GCF based method also produces artifacts over the same text (Fig. 12(g1)). Taking into account all the defects mentioned above, the CNN based method (Fig. 12(h, h1)) and the proposed algorithm (Fig. 12(i, i1)) have produced better results.

The multi-focus image pair presented in Fig. 13a and b consists of a blue jar focused in the foreground and an orange jar focused in the background, respectively. The results obtained by the different fusion methods are presented in Fig. 13c-i. For a detailed comparison, the magnified images are presented in Fig. 13(c1)-(i1). From the results, it is observed that the GD based method has intensified the color and brightness and introduced shadows (Fig. 13(c1)). The DWT-AB result is slightly distorted in terms of color around the text 'Flora' on the orange jar (Fig. 13(d1)). Likewise, the GCF method has produced block effects, which are clearly visible in Fig. 13(g, g1). The IM based result shown in Fig. 13(e1) has some white noise/dots spread over the bottom portion of the blue jar. This is certainly a drawback of the algorithm and could be attributed to an ill-behaved focus measure. Apart from the presence of shadows around the dark text/objects, the ASR based result is quite encouraging (Fig. 13(f1)). The results from the CNN algorithm presented in Fig. 13(h, h1) and the proposed method in Fig. 13(i, i1) are quite satisfactory. For gray-scale images, the visual comparison is demonstrated using only one of the six images used in this experiment. The gray-scale image 'Clock' consists of two clocks of different sizes where the focus is on the larger and the smaller clock, respectively (Fig. 14a and b). The observations regarding the results obtained from the various methods are quite similar to those for the color images. The results from the different fusion methods are provided in Fig. 14c-i. The GD based method suffers from the same problem as discussed previously (Fig. 14(c1)). In the DWT-AB based method, the edge of the smaller clock lacks sharpness in the region where it overlaps the larger clock (Fig. 14(d1)). The edges are comparatively better expressed in the IM based method, but the edge of the smaller clock slightly bends along the number '8' of the larger clock (Fig. 14(e1)), which is more prominent in the result obtained using ASR (Fig. 14(f1)). The GCF based result (Fig. 14(g1)) produces distortion, and the CNN based result (Fig. 14(h1)) lacks clarity along the edges. The result of the proposed method presented in Fig. 14(i1) shows its superiority in terms of the quality of the fused result.

The results on the rest of the color and gray-scale source image pairs are provided in Figs. 15 and 16, respectively. Summarizing the observations on the obtained results, we can say that the gradient domain (GD) based method enhances the brightness as well as the contrast for all the images, which is also established by the metric values in Table 6.

Fig. 15

Results for other source images: a using GD; b using DWT-AB; c using IM; d using ASR; e using GCF; f using CNN; g using proposed method

Fig. 16

Results for other gray-scale images using the proposed method

It also creates shadows around light objects on a dark background and vice-versa. The DWT-AB based method leads to incorrect scattering and spreading of color components from adjacent pixels. For certain color images, the focus measure used in the image matting (IM) based method fails to distinguish the focused and defocused pixels, thereby creating slight noise along the edges. The GCF based method introduces block effects and fails to distinguish between focused and defocused pixels. Again, the ASR based approach suffers from a similar problem, i.e., shadows, and the construction of dictionaries also increases the computational complexity. The results from the CNN approach and the proposed method are quite comparable in terms of fusion quality; however, training a deep neural network requires a significant amount of computation time and processing power (Table 7).

Table 7 Average execution time of algorithms (in seconds)

4.5 Results using other source images

The proposed approach has been tested on source images captured under (a) complex background environments, (b) different illumination conditions, and (c) different times of day, in order to perform a rigorous analysis of the algorithm. Figure 17 illustrates the results obtained after fusing source images that satisfy the above criteria.

Fig. 17

1st Row, 2nd Row: Image datasets with different (a1),(a2) complex background; (b1),(b2) artificial lighting; c night time; d poor lighting; 3rd Row: Results obtained by the proposed method

4.6 Fusion with multiple source images

Generally, to restore focus within a scene, we may need to deal with multiple partially focused source images. Though the algorithm has been demonstrated for the two-focus situation, it can be extended to process multiple source images by fusing them one by one. This is illustrated in Fig. 18, which involves three source images.
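A minimal sketch of this sequential strategy is given below, where `fuse_pair` stands for the full two-image pipeline described in Section 3 (it is a placeholder name, not a function defined in the paper).

```python
from functools import reduce

def fuse_many(images, fuse_pair):
    """Fuse the first pair, then fuse the running result with each
    remaining source image in turn."""
    return reduce(fuse_pair, images)
```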

Fig. 18

Results obtained by the proposed algorithm for multiple source images, a Focus on front seal; b Focus on middle seal; c Focus on distant scenery; d Fused result of (a) and (b); e Fused result of (c) and (d)

4.7 Results on noisy source images

To study the performance of the proposed algorithm in the presence of noise, experiments are performed on noisy image pairs generated by adding noise to the multi-focus source images. The degree of noise degradation is an important factor: beyond a certain level, it greatly reduces the quality and the focus information/content of the source images, producing results of little practical value. The source image pairs are corrupted with two types of noise, (a) salt-and-pepper and (b) Gaussian white noise, prior to the application of the proposed fusion algorithm. In the case of salt-and-pepper noise, the maximum noise density (d) that is acceptable to retain the focus content is 0.2, i.e., 20% of the pixels. For Gaussian white noise, keeping the variance constant, the acceptable mean (m) is found to be 0.50. Fig. 19 illustrates the results produced by the proposed algorithm in the presence of noise.
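For reproducibility, the noisy inputs can be generated as in the sketch below (scikit-image assumed); the noise levels are the limits reported above, and the images are assumed to be floats in [0, 1].

```python
from skimage.util import random_noise

def corrupt_pair(src_a, src_b):
    """Corrupt a source pair with the two noise types used in the experiments."""
    sp = [random_noise(im, mode='s&p', amount=0.2) for im in (src_a, src_b)]       # d = 0.2
    gn = [random_noise(im, mode='gaussian', mean=0.50) for im in (src_a, src_b)]   # m = 0.50
    return sp, gn
```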

Fig. 19

Results on noisy multi-focus source pairs, a and b Salt-and-pepper, d = 0.2; c and d Gaussian white noise, m = 0.50; (a1),(b1),(c1),(d1) Results produced by the proposed approach

4.8 Limitations of the algorithm

The proposed algorithm does not perform well when the source images chosen for fusion contain weak edges. The source image pair provided in Fig. 20 consists of a sea beach focused on the left and the right side, respectively. The proposed algorithm fails on such source images due to (a) the absence of strong and prominent edge content and (b) the absence of prominent structures/objects. This is illustrated by the respective edge images depicted alongside, where the sky, being a smooth region, does not produce any strong edges. Also, we cannot differentiate between edge pixels belonging to objects and pixels due to noise. As the edge detectors are unable to locate the focused edge pixels in the first place, features depending solely on the spatial intensity values of an image are not sufficient in such cases. In addition, the proposed algorithm cannot deal with multi-modal or multi-sensor image sets, because the features expressed in such images are only partly complementary and partly redundant.

Fig. 20

a Focus on the right; b Edge image of (a); c Focus on left; d Edge image of (c)

4.9 Future works

  • The edge detectors used in the algorithm are based on brightness gradients alone, which may give strong responses for irrelevant regions and weak responses for relevant ones, thus producing improper edge maps to start with. Moreover, the algorithm may not work well on all natural, real-life images. Deep learning methods can be used to devise edge detection models that respond directly to the relevant focused edges, which would automate the process to some extent.

  • For real-time application, we can explore the possibility of FPGA (Field Programmable Gate Arrays) implementation, which would further reduce the overall execution time.

5 Conclusion

This paper presents a multi-focus image fusion method which uses the edges of focused features in the source images as a basis for selecting the focused regions prior to constructing the fusion result. A block-wise region comparison followed by a morphological dilation operation is performed to further enhance the clarity of the focused edges. Morphological edge reconstruction is performed to restore any broken edges and maintain continuity. The best reconstructed edge image is chosen based on a focus measure and is then used to construct a binary region (initial decision map). For color images, the binary region is further converted into a colored decision map for ease of the fusion procedure. It is to be noted that the decision map is formed using the better reconstructed of the two source edge images. To form the fused image, the binary (or colored) decision map is used to combine pixels from the gray-scale (or color) source images. The method has been compared with other similar methods in terms of both quantitative and qualitative evaluation. It is observed that the visual quality of the outputs is superior to that of the results produced by the other methods, and the values of the fusion metrics obtained as part of the quantitative evaluation are as good as those of the other state-of-the-art methods.