1 Introduction

Automatic human authentication based on face biometrics has been studied extensively in computer vision with the knowledge of human physiology. Despite these studies, many problems in face authentication persist due to the inherent difficulty of extracting the face biometric [55]. A wide variety of problems, such as lighting and shadows, hamper unconstrained face authentication; these are known as illumination problems. In addition to the illumination problem, face authentication is equally challenging due to variations in facial expression, pose and disguise [3, 6, 30, 40, 42, 52]. The availability of several sensors has fuelled the interest of the research community in this field. Optical sensors, infrared (IR) sensors and 3D sensors are a few examples of sensors used in face authentication. These sensors extract different kinds of information, which are usually complementary in nature. For instance, an optical sensor captures the visible band whereas an IR sensor acquires information from the IR band of the electromagnetic spectrum. Face authentication using visible images in an uncontrolled environment is a challenging task. An IR sensor captures the heat energy emitted from the face region and generates a thermogram, which is more robust to illumination disparities. This virtue of IR sensors is exploited by researchers, who are now focusing on IR imaging based face authentication systems. An IR sensor operates on the part of the electromagnetic spectrum with wavelengths in the range of 8 to 14 micrometres [41]. Thus, even under varying illumination conditions, the use of IR images in face authentication has significant benefits over visible images. However, an IR face image is quite sensitive to changes in ambient temperature: the heat pattern emitted from the body fluctuates with its temperature, leading to inaccurate classifications. In such circumstances, face authentication using visible images works more effectively [39]. A few examples of visible and infrared (IR) face images from the UGC-JU [41], IRIS [12] and SCface [18] face databases are shown in Figs. 1, 2 and 3 respectively.

Fig. 1 The UGC-JU visible and their corresponding IR face images

Fig. 2 The IRIS visible and their corresponding IR face images

Fig. 3 The SCface visible and their corresponding IR face images

From these examples, one can easily conclude that neither modality alone conveys all the significant information in a single image. Therefore, multimodal fusion is required to gather all possible essential information in a fused image. Image fusion is the process of combining the complementary as well as the redundant information present in the source images in such a manner that the fused image carries more information for human or machine perception than any of the individual images. The redundant information increases the reliability and accuracy of a face authentication system, while the integration of complementary information enhances the understandability of the fused image [19, 24, 37]. Fusion techniques can be classified into data, feature, score, rank, and decision levels [22] based on the stage at which they operate. Data and feature level fusion are known as low-level fusion, whereas score, rank, and decision level fusion are called high-level fusion. Low-level fusion is preferred over high-level fusion for the following reasons. Firstly, information is lost at each stage of the succession from data to decision level in a biometric security system; hence, the earlier the fusion is performed, the richer the data carried to the subsequent stages. Secondly, most researchers use fusion at the lower levels for its simplicity: lower-level fusion does not affect the design of the classifiers, whereas the statistical dependences between classifiers adversely affect the performance of higher-level fusion.

Normally, Mallat or pyramidal wavelet transforms, for instance ‘Haar’ and ‘Daubechies’, are considered for multi-resolution image fusion. Recognition of fused face images using state-of-the-art methods is reviewed in Section 2. However, the Mallat wavelet transform (MWT) is not translation invariant, which may reduce the accuracy of a biometric face authentication system significantly. This motivates the choice of a translation-invariant, or stationary, wavelet transform called the À-trous wavelet transform (AWT) in this work. Translation invariance is accomplished by removing the down-samplers and up-samplers of the MWT and up-sampling the filter coefficients by a factor of \(2^{(l-1)}\) at the lth level of the AWT [45, 51]. AWT provides an output image after each level of decomposition that is the same size as the original image, so AWT is inherently redundant. In this study, a fusion rule is formulated using the approximation and wavelet coefficients of the AWT of the input face images along with the differential box counting (DBC) based fractal dimension (FD) method [38]. After obtaining the fused face images using the proposed fusion method, which is discussed in Section 4, the quality of the fused images is estimated quantitatively using existing fusion metrics such as mutual information [35], Sobel edge detection [56], spatial frequency [14] and the universal image quality index [54], which illustrate the usefulness of the proposed method over existing fusion methods. In order to authenticate humans, instead of using conventional machine learning algorithms to estimate the similarity between two face images, a novel similarity measure is proposed in this work. The proposed similarity measure relies on the maximum matching of a bipartite graph, which is formed using the superpixels of two fused face images, and the weight of each edge is defined by a cost function described in Section 4.

The rest of this paper is organized as follows. Section 2 presents a short literature review on fused face recognition. The different methods that help to define the fusion rule and the similarity measure between two fused face images are described in Section 3. The proposed fusion rule and similarity measure are illustrated in Section 4. The experimental results and discussion are presented in Section 5, together with the existing fusion metrics used to evaluate the proposed fusion algorithm against existing fusion algorithms and a comparative study of the proposed method with the state-of-the-art methods. Finally, the conclusion is drawn in Section 6.

2 Related work

The method proposed by S. G. Kong et al. [25] is one of the pioneering works in this area; it discussed different aspects of visible and IR images and presented a fusion method for combining images of these two modalities. Prior to image fusion, the images were registered using registration software. S. G. Kong et al. performed their experiments on the NIST/Equinox and UTK-IRIS databases and obtained recognition accuracies of 85% and 67% respectively. In [9], M. K. Bhowmik et al. presented a pixel-level fusion method using IR and visible face images by integrating 70% visible and 30% IR information at each pixel position. The fused images were then mapped into an eigenspace, which was fed into radial basis function and multi-layer perceptron based Artificial Neural Network (ANN) classifiers for recognition; the reported recognition accuracies are 96% and 95.07% respectively on the Object Tracking and Classification Beyond Visible Spectrum (OTCBVS) data set. A critical comment about this work is that there is no adequate reasoning behind integrating 70% visible and 30% IR face information to obtain the fused image. In [5], G. Bebis et al. introduced pixel-level and feature-level fusion techniques. The ‘Haar’ wavelet was employed on both IR and visible face images. A mask, consisting of the same number of pixels as the original image, was then created using a Genetic Algorithm (GA); this mask assisted in choosing the wavelet coefficients from both the IR and visible face images. In their second experiment, the face images of both modalities were mapped separately into an eigenspace to obtain features, and the GA was used to select the eigenfeatures from both images. All the experiments were performed on the Equinox face database. The maximum recognition rate obtained was about 97% when eyeglasses were not present in either the gallery or the probe set, both of which contained multiple illuminations. In [47], R. Singh et al. presented a combined image fusion and match score fusion of multi-modality face images. For the fusion of visible and IR face images, a 2v-granular SVM (2v-GSVM) was used. The 2v-GSVM employed multiple SVMs to learn both the local and the global characteristics of the multi-modality face images at different granularity levels and resolutions, and determined weighting factors that help to form the fused image. Local and global features were then extracted from the fused images using the 2D log-polar Gabor transform [48] and the local binary pattern [4, 7] respectively. The corresponding match scores were combined by the Dezert-Smarandache theory of fusion [47], which uses plausible and paradoxical reasoning. The presented method was tested on the Notre Dame and Equinox databases and was assessed against existing statistical learning and evidence theory based fusion methods. The verification accuracies are 95.85% and 94.80% using the 2D log-polar Gabor transform and local binary pattern respectively on the Notre Dame database, and 94.98% and 94.71% respectively on the Equinox database. A two-level hierarchical DWT based data-level image fusion of IR and visible face images was presented by R. Singh et al. in [46]. The 2D log-polar Gabor wavelet was then used to extract amplitude and phase features from the fused image, and an adaptive SVM learning algorithm intelligently chose either amplitude or phase features to produce a fused feature set. R. Singh et al.
conducted their experiments on the Equinox face database and showed that the integration of visible light and short-wave IR spectrum face images produced the best recognition rate, with an equal error rate of 2.86%. In [8], M. K. Bhowmik et al. presented comparative studies on the fusion of visual and IR images by the ‘Haar’ and ‘Daubechies’ wavelet transforms. The decomposition up to level 5 was performed on both types of images independently using ‘Haar’ and ‘Daubechies’. The wavelet coefficients of the fused image were formed by choosing the higher-magnitude coefficients from the approximation coefficients of the IR and visible face images and the smaller-magnitude coefficients from the detail coefficients of both modalities. The fused image was generated by applying the inverse transform to the fused coefficients. PCA was used to extract features from the fused images, and a multilayer perceptron (MLP) was adopted for classification. Experiments were performed on the IRIS dataset to validate the presented methods; the average recognition rates are 87% and 91.5% for the ‘Haar’ and ‘Daubechies’ wavelet based methods respectively. In [53], N. Wang et al. presented a complex fusion strategy at both pixel-level and feature-level together with different classification methods, namely two-dimensional PCA (2D PCA), two-dimensional LDA (2D LDA), two-directional two-dimensional PCA ((2D)2PCA), two-directional two-dimensional LDA ((2D)2LDA) and two-directional two-dimensional Fisher PCA ((2D)2FPCA). The NVIE visible and thermal-IR face database was used for their experiments. The maximum recognition rates obtained for the pixel-based complex algorithm using (2D)2LDA and the feature-based complex algorithm using (2D)2FPCA were 97.09% and 97.97% respectively. In [20], G. Hermosilla et al. presented a fusion method using a GA. The presented method merged the most relevant information from IR and visible face images with image descriptors; the GA searched for weights for the IR and visible face images in the form of a genetic code in order to maximize the recognition rate as an objective function. G. Hermosilla et al. [20] used the Equinox and PUCV-VTF databases for their experiments and reported recognition rates of 97% and 99% for these two databases respectively.

3 Background

This section discusses the methods, namely AWT, FD, the local ternary pattern (LTP), superpixels and maximum matching, that are used to design a fusion rule for fusing IR and visible face images and a similarity measure between two fused images for face authentication.

3.1 À-trous wavelet transform

Transform domain based methods for image fusion are gaining importance due to their good fusion performance [34]. However, in most cases, DWT-based methods of this type use pyramidal decomposition and are hence translation-variant. Dutilleux [13] introduced a translation-invariant wavelet transform known as the AWT, in which the down-sampling and up-sampling of the DWT are removed to achieve translation invariance. The name à trous is derived from the French word trous, meaning holes. AWT is also known as the stationary wavelet transform, undecimated DWT and redundant DWT [15]. It is a non-orthogonal, shift-invariant, symmetric, dyadic wavelet transform. At each level of decomposition, AWT decomposes an approximation image (AI) of an input image I into a coarser AI and a detail image (DI), where AI and DI contain the approximation/low-pass and detail/high-pass coefficients respectively. The image decomposition scheme used by AWT is inherently redundant [49]: the output images after each level of decomposition have exactly the same number of pixels as the input image. Hence, the AWT decomposition scheme can be depicted as a parallelepiped, as shown in Fig. 4. Moreover, the successive AIs have a coarser spatial resolution while going up through the resolution levels of the parallelepiped. The AI of I at level l, \(AI_{I}^{l}\), is computed using (1).

$$ AI_{I}^{l}(i, j)=AI_{I}^{(l-1)}(i, j)\otimes h_{l}, \quad \forall\, l=1, 2, \ldots, L, $$
(1)

where \(h_{l}\) is the B3-spline scaling function for level l, L is the maximum level of decomposition and ⊗ represents the convolution operator. In practice, the AI of I at level l is obtained by convolving the AI of I at level (l − 1) with the low-pass scaling function for level l. The 0th level AI of I is the image I itself, i.e. \(AI_{I}^{0}(i, j)=I(i, j)\). Moreover, the AI of I at the maximum level L, \(AI_{I}^{L}(i, j)\), is sometimes simply denoted as the AI of I, i.e. \(AI_{I}(i, j)\). The AI of I at each level of decomposition represents its low-frequency information at that level.

Fig. 4 Illustration of parallelepiped AWT

The B3-spline scaling function for level 1, \(h_{1}\), is obtained using (2) [16].

$$ h_{1}=\frac{1}{256}\left( \begin{array}{lllll} 1&4&6&4&1\\ 4&16&24&16&4\\ 6&24&36&24&6\\ 4&16&24&16&4\\ 1&4&6&4&1 \end{array}\right) $$
(2)

The scaling function for a subsequent level is obtained from the scaling function of the previous level by interlacing zeros between its rows and columns. For example, the scaling function for level 2, \(h_{2}\), is obtained from \(h_{1}\) by placing zeros between the rows and columns, as represented in (3).

$$ h_{2}=\frac{1}{256}\left( \begin{array}{lllllllll} 1&0&4&0&6&0&4&0&1\\ 0&0&0&0&0&0&0&0&0\\ 4&0&16&0&24&0&16&0&4\\ 0&0&0&0&0&0&0&0&0\\ 6&0&24&0&36&0&24&0&6\\ 0&0&0&0&0&0&0&0&0\\ 4&0&16&0&24&0&16&0&4\\ 0&0&0&0&0&0&0&0&0\\ 1&0&4&0&6&0&4&0&1 \end{array}\right) $$
(3)

The spatial information lost between two successive AIs of the parallelepiped is collected in a single DI [17], as shown in Fig. 4. The DI of I comprises its high-frequency information. The DI of I at level l, \(DI_{I}^{l}\), is computed as the difference between the AIs of I at levels (l − 1) and l, as represented by (4).

$$ DI_{I}^{l}(i, j)=AI_{I}^{(l-1)}(i, j)-AI_{I}^{l}(i, j), \quad \forall\, l=1, 2, \ldots, L $$
(4)

The AI of I at a level l can be reconstructed by adding the DIs of all subsequent decomposition levels to the AI of the last level L, as depicted in (5). Hence, the original input image, I, is restored by adding the DIs of all the decomposition levels to the AI of the maximum decomposition level L.

$$ A{I_{I}^{l}}(i, j)=A{I_{I}^{L}}(i, j)+\sum\limits_{k=1}^{L-l}DI_{I}^{l+k}(i, j) $$
(5)

AWT is used to devise a fusion rule for fusing IR and visible face images; the detailed discussion is given in Section 4. The approximation and wavelet coefficients after the second level of decomposition of the visible and IR face images of the person shown in Fig. 1 (second from the left) are shown in Fig. 5.
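For concreteness, the following is a minimal sketch of the à-trous decomposition and reconstruction described by (1)-(5), assuming NumPy and SciPy are available; the function names and the choice of boundary handling are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.ndimage import convolve

B3 = np.outer([1, 4, 6, 4, 1], [1, 4, 6, 4, 1]) / 256.0  # h1 in (2)

def atrous_decompose(image, levels):
    """Return (approximation AI^L, [detail images DI^1 .. DI^L])."""
    approx = image.astype(float)                 # AI^0 = I
    details = []
    kernel = B3
    for _ in range(levels):
        smoothed = convolve(approx, kernel, mode='reflect')  # AI^l, eq. (1)
        details.append(approx - smoothed)                    # DI^l, eq. (4)
        approx = smoothed
        # Upsample the kernel by interlacing zeros (the "holes") for the next level, eq. (3)
        up = np.zeros((2 * kernel.shape[0] - 1, 2 * kernel.shape[1] - 1))
        up[::2, ::2] = kernel
        kernel = up
    return approx, details

def atrous_reconstruct(approx, details):
    return approx + sum(details)                 # eq. (5) with l = 0
```

Because no decimation is performed, every approximation and detail image has the same size as the input, which is the redundancy property exploited by the fusion rule in Section 4.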

Fig. 5 AI and DI of the second-from-left images of Fig. 1 (T: IR, V: visible, A: AI, W: DI, and L: level of decomposition)

3.2 Fractal dimension using differential box counting method

FD is a measure of the roughness or irregularity, in the form of self-similar elements, present in an image. Mandelbrot [28] introduced the term fractal, which is derived from the Latin word fractus, denoting irregular or broken segments. Several approaches exist in the literature to calculate the FD of an image [23, 31, 32], but these methods are quite expensive [38]. Sarkar and Chaudhuri [38] presented an efficient approach, called DBC, to compute the FD of a gray-scale image. DBC is generally used to measure the texture of an image [29]. Consider an image of size M × M pixels as a 3-dimensional space where the x and y coordinates represent the length and breadth of the 2-dimensional image plane and the z coordinate denotes the height, i.e. the intensity value of the pixel located at (x, y). The xy-plane is partitioned into non-overlapping grids of size s × s pixels, where s is an integer varying from 2 to M/2. The gray-level of incomplete grids outside the image boundary is treated as zero. The scale of a grid of size s × s is r, where r = s/M. On each grid there is a column of boxes of size s × s × h, where h is the height of a box as shown in Fig. 6, (G/h) = (M/s) and G is the total number of gray-levels.

Fig. 6 The process of finding out the number of boxes (nr) by the DBC method. (Here, M = 25, s = 5 and nr is 3 for the grid)

Let the maximum and minimum gray-levels on the (i, j)th grid be gmax and gmin respectively. Suppose nr(i, j) is the number of boxes required to fill (i, j)th grid at scale r, which can be computed by (6).

$$ n_{r}(i,j)=\left\lceil\frac{g_{max}}{h}\right\rceil-\left\lceil\frac{g_{min}}{h}\right\rceil+1 $$
(6)

The total number of boxes, Nr, required to cover the whole image at scale r is calculated by (7).

$$ N_{r}=\sum\limits_{i, j}n_{r}(i,j) $$
(7)

The FD of the image is the slope of the line obtained by a least-squares linear fit after plotting the points whose x and y coordinates are log(1/r) and log(Nr) respectively. DBC also helps to construct the proposed fusion rule, as discussed further in Section 4.
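As an illustration, the sketch below estimates the FD of a square 8-bit gray-scale image by DBC following (6) and (7); for brevity it iterates only over complete grids, whereas the text treats incomplete grids outside the boundary as zero. The function name and the choice of box height are assumptions, not code from the paper.

```python
import numpy as np

def dbc_fractal_dimension(img, gray_levels=256):
    """Estimate the fractal dimension of a square gray-scale image by DBC."""
    M = img.shape[0]
    log_inv_r, log_Nr = [], []
    for s in range(2, M // 2 + 1):
        r = s / M
        h = gray_levels * s / M                  # box height, from G/h = M/s
        Nr = 0
        for i in range(0, M - s + 1, s):         # non-overlapping s x s grids
            for j in range(0, M - s + 1, s):
                g = img[i:i + s, j:j + s]
                # eq. (6): boxes spanning the min..max gray level of this grid
                Nr += int(np.ceil(g.max() / h) - np.ceil(g.min() / h) + 1)
        log_inv_r.append(np.log(1.0 / r))
        log_Nr.append(np.log(Nr))                # eq. (7)
    # FD is the slope of the least-squares fit of log(Nr) against log(1/r)
    slope, _ = np.polyfit(log_inv_r, log_Nr, 1)
    return float(slope)
```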

3.3 Local ternary pattern

The local binary pattern (LBP) is used by many researchers to extract facial features due to its computational efficiency and discriminative power. In order to improve the performance of single-resolution LBP, Chen et al. [10] used multi-resolution LBP to extract face features. In LBP, all the neighboring pixels are treated equally, although different neighboring pixels may contribute differently to the facial description. To address this issue, Lei et al. [26] proposed an optimal neighborhood sampling strategy that multiplies a pixel difference vector with an optimally learned soft sampling matrix to generate the facial feature vector. Tan et al. [50] proposed the local ternary pattern (LTP), an extension of LBP, to extract face features under different lighting conditions. It is normally used for texture extraction in uniform and near-uniform regions, and the extracted textures are treated as features. Unlike LBP, LTP creates a ternary pattern based on (8).

$$ s(I_{k}, I_{c})=\left\{\begin{array}{ll} -1, & \text{if } I_{k}<I_{c}-t \\ 0, & \text{if } I_{c}-t<I_{k}<I_{c}+t \\ 1, & \text{if } I_{k}>I_{c}+t, \end{array}\right. $$
(8)

where Ik are the neighboring pixels of a 3 × 3 window surrounding a centre pixel Ic, and t is a threshold value. The value of t is 7 and is inherited from [50]. Thus, k varies from 0 to 7 as there are 8 neighboring pixels around Ic. Figure 7 demonstrates the basic LTP process for a given 3 × 3 image patch. The value of a particular pixel of the upper pattern, Uk, and of the lower pattern, Lk, is 0 if the value of Ik in the original image patch lies between (Ic − t) and (Ic + t). Any value greater than (Ic + t) is assigned the value 1 in the upper pattern; similarly, any value less than (Ic − t) is assigned the value 1 in the lower pattern. The ternary pattern is then produced by combining the upper and lower patterns: where Lk is 1, the corresponding position of the ternary pattern is -1, and where Uk is 1, the corresponding position is 1. All other positions of the ternary pattern are filled with 0. The final binary codes of the upper and lower patterns and the ternary code are obtained by reading the bit pattern starting from the east location with respect to the centre (marked in red in Fig. 7) and then going around counter-clockwise.
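A minimal sketch of the ternary coding of (8) for a single 3 × 3 patch is given below, assuming the east-first, counter-clockwise reading order described above; the function name and the explicit neighbour ordering are illustrative assumptions.

```python
import numpy as np

def ltp_code(patch, t=7):
    """Return the 8-element ternary code (-1/0/1) of a 3x3 patch, east-first, counter-clockwise."""
    Ic = patch[1, 1]
    # Neighbour offsets: east, north-east, north, north-west, west, south-west, south, south-east
    offsets = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    code = []
    for r, c in offsets:
        Ik = patch[r, c]
        if Ik < Ic - t:
            code.append(-1)        # lower pattern bit
        elif Ik > Ic + t:
            code.append(1)         # upper pattern bit
        else:
            code.append(0)         # within the tolerance band
    return np.array(code)
```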

Fig. 7 Illustration of LTP operator

3.4 Simple linear iterative clustering superpixels

The concept of the superpixel was first coined by X. Ren et al. in 2003 [36]. A superpixel is a perceptually meaningful atomic region, i.e. a group of connected pixels with similar colors or gray-levels. Several approaches exist in the literature to generate the superpixels of an image; the simple linear iterative clustering (SLIC) method is adopted in this work because it is faster and more memory-efficient than the others [2]. Normally, SLIC takes an image as input for clustering along with a parameter K, which represents the desired number of approximately equally sized superpixels. The color and position of each pixel of the input image are represented in the CIELAB color space, [l a b]T, and in the spatial domain, [x y]T, respectively. A weighted distance measure then integrates color and spatial proximity and groups pixels using the k-means algorithm. A detailed discussion of the SLIC superpixel method is given in [1]. Figure 8 shows a visible and an IR face image from the UGC-JU face database and their segmentations into 100 SLIC superpixels.

Fig. 8 a Example IR and visible face images from the UGC-JU face database; b the face images segmented into 100 SLIC superpixels

3.5 Maximum bipartite matching

A bipartite graph is an undirected graph G = (V, E) whose vertices can be divided into two disjoint sets, X and Y, such that every edge connects a vertex in X to a vertex in Y [11]. A matching is a set of edges no two of which share a vertex; it pairs some or all of the vertices of X one-to-one with vertices of Y, the remaining vertices being unmatched. Figure 9 shows the notion of a matching in a bipartite graph. A complete matching is one in which every vertex of X is paired with a vertex of Y. A maximum weighted bipartite matching is a matching that maximizes the total weight of the selected edges such that no two selected edges share a common vertex. In this work, a new similarity measure between two fused face images is proposed based on the maximum bipartite matching algorithm; the detailed formulation is discussed in Section 4. The steps of the maximum bipartite matching algorithm are shown in Algorithm 1.

Fig. 9 A bipartite graph G = (V, E) with vertex partition. a A matching of size 2, marked by orange edges. b A maximum matching of size 3, marked by orange edges

Algorithm 1 Maximum bipartite matching
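Since Algorithm 1 is given only as a figure, the following is a minimal sketch of one common way to obtain a minimum-cost complete matching on a weighted bipartite graph, using the Hungarian algorithm as implemented in SciPy; it is an assumption about one possible implementation, not the paper's algorithm reproduced verbatim.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def complete_matching(cost):
    """cost[i, j] is the weight of the edge between vertex i in X and vertex j in Y.
    Returns the matched (i, j) pairs and the total cost of the complete matching."""
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm, minimizes the total cost
    return list(zip(rows.tolist(), cols.tolist())), float(cost[rows, cols].sum())

# Toy usage on a 3 x 3 bipartite graph
pairs, total = complete_matching(np.array([[4., 1., 3.],
                                           [2., 0., 5.],
                                           [3., 2., 2.]]))
```

Maximizing a total edge weight instead of minimizing a cost can be done by negating the cost matrix before calling the same routine.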

4 Proposed method

In this section, a pixel-level fusion method based on AWT and FD is proposed to fuse IR and visible face images for authentication, followed by a similarity measure between two fused face images based on superpixels and the maximum bipartite matching algorithm.

4.1 Fusion algorithm

In the image fusion scheme, the source images, namely the visible face Ivi and the IR face Ith, taken as the AI0s, are decomposed independently using (1) up to the desired level with the help of the scaling function h1 given by (2) to form the lower-order approximation coefficients of Ivi and Ith. The higher-order detail coefficients of both source images are obtained using (4). Both the approximation and the detail coefficients have the same size as the original image. The approximation coefficients provide the coarse or blurred form of an image, which contains the base information, whereas the detail coefficients contain the edges and corner points present in an image. Most researchers select the level of decomposition either randomly or experimentally. In this study, a formula, \(\log_{2}(\min(S, T))\), is used to determine the level of decomposition, where S and T are the numbers of rows and columns of an input image respectively. The fusion rule depends on the information present in the source images as well as the information required in the fused image. The objectives of the proposed fusion strategy are to preserve the temperature distributions of the IR face images along with the sharpness and contrast of the visible face images in the fused face images. The fusion rule for the approximation coefficients is devised in such a way that it preserves all the basic information from both source face images without loss of generality while retaining their sharpness and contrast. To realize these objectives, two fusion rules are proposed: one for combining the approximation coefficients and the other for the detail coefficients of the visible and IR face images. A pseudocode for the proposed fusion algorithm is given by Algorithm 2.

In order to merge the approximation coefficients of Ivi and Ith, a fusion rule is defined in line 20 of Algorithm 2, which forms the approximation coefficients of the fused face image by taking the average of the approximation coefficients of the two source face images. FD using the DBC method is used to formulate another fusion rule for combining the detail coefficients of Ivi and Ith up to the chosen level of decomposition, so that higher-frequency features such as the temperature distributions of the IR face images are preserved in the detail coefficients of the fused image. Normally, FD using the DBC method returns a single real number for an image, which is a measure of image roughness or texture; applying the DBC method directly to each level of the decomposed detail coefficients of Ivi and Ith would therefore not help the fusion process. So, each level of the decomposed detail coefficients of Ivi and Ith is considered separately, and a 3 × 3 grid is moved from the top-left corner to the bottom-right corner of the particular detail coefficients of both Ivi and Ith. Here, the DBC method is used to estimate the number of boxes required to represent the roughness of a particular grid. A simple if-else rule is then proposed to form a decision map based on the number of boxes of a grid for each level of the decomposed detail coefficients of Ivi and Ith. The decision map is a two-dimensional binary array of the same size as the input images. If the number of boxes of a particular grid for a sub-band of the detail coefficients of Ith is greater than that of its visible counterpart Ivi, then the corresponding position of the decision map for that sub-band is set to 1; otherwise 0 is stored.

Algorithm 2 Pseudocode of the proposed fusion algorithm

Each binary value of the decision map (DP) designates the type of image (IR or visible) contributing to the formation of the fused image. A consistency verification of the DP is performed to eliminate noise, which is characterized by the presence of an element value not consistent with the others in its neighborhood. To handle this situation, a 3 × 3 window is scrolled over the output obtained after executing line 16, and the value of the element coinciding with the centre of the window is set to that of the majority in the 3 × 3 neighborhood. After noise removal, if the element (j, k) of the DP contains 1, the coefficient of the corresponding decomposed detail coefficients of the fused image at this pixel location comes from \(I_{th}^{d}\); if the element value is 0, \(I_{vi}^{d}\) is considered (line 17). All the decomposed detail coefficients are then added together in line 18 to form the detail coefficients of the fused face image. Finally, the fused image is reconstructed by performing the inverse AWT using (5) in line 21. Figure 10 shows the fused face images obtained by Algorithm 2 for the face images of Figs. 1, 2 and 3.
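The sketch below puts the pieces together under stated assumptions: it reuses atrous_decompose/atrous_reconstruct and the DBC-style box count from the earlier sketches, uses a sliding 3 × 3 window so that the box-count map and the decision map have the same size as the image, and applies a 3 × 3 majority filter for the consistency verification. The helper names, window handling and majority rule are illustrative and may differ in detail from Algorithm 2.

```python
import numpy as np
from scipy.ndimage import generic_filter

def box_count_map(band, gray_levels=256):
    """Per-pixel DBC box count nr for a sliding 3 x 3 grid over one detail band."""
    h = gray_levels * 3.0 / max(band.shape)           # box height from G/h = M/s with s = 3
    nr = lambda w: np.ceil(w.max() / h) - np.ceil(w.min() / h) + 1
    return generic_filter(band, nr, size=3, mode='nearest')

def fuse(I_vi, I_th, levels):
    A_vi, D_vi = atrous_decompose(I_vi, levels)       # visible source
    A_th, D_th = atrous_decompose(I_th, levels)       # IR source
    A_fused = (A_vi + A_th) / 2.0                     # average rule for the approximations
    D_fused = []
    for d_vi, d_th in zip(D_vi, D_th):
        dp = (box_count_map(d_th) > box_count_map(d_vi)).astype(float)
        # Consistency verification: 3 x 3 majority vote suppresses isolated decisions
        dp = generic_filter(dp, lambda w: float(w.sum() > 4), size=3)
        D_fused.append(np.where(dp == 1, d_th, d_vi)) # pick the IR or visible detail coefficient
    return atrous_reconstruct(A_fused, D_fused)
```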

Fig. 10 Some examples of fused images. a Fused face images for the face images in Fig. 1. b Fused face images for the face images in Fig. 2. c Fused face images for the face images in Fig. 3

4.2 Similarity measure

The final step of a face authentication system is matching. Instead of considering conventional machine learning algorithms, a novel similarity measure is proposed that checks how close two fused face images are. The advantage of the proposed similarity measure is that it does not require a separate training set; it focuses on the one-to-one correspondence between a query image and the stored images. This similarity measurement can be cast as a maximum bipartite matching problem, which is discussed in detail in Section 3.5. The first step of the proposed measure is to generate the superpixels of the query image and of one stored image using the method presented in Section 3.4. The centroids of all the superpixels of the query image are stored in set X, whereas set Y consists of the centroids of all the superpixels of the stored image. These two disjoint sets form a bipartite graph, in which each superpixel centroid in X and Y is treated as a vertex. Edges are formed from every centroid vertex in X to every centroid vertex in Y; there is no edge between two vertices within X or within Y, since X and Y form a bipartite graph. Every edge is associated with a cost, which is estimated by combining spatial and gray proximity with the distance between LTP codes. The cost of an edge between the ith superpixel in an image p and the jth superpixel in an image q is calculated using (9).

$$ D_{p_{i}q_{j}}=\frac{m}{R}\times \underbrace{d_{p_{i}q_{j}}^{xy}}_{\text{spatial distance}}+\underbrace{d_{p_{i}q_{j}}^{g}}_{\text{gray proximity}}+\underbrace{d_{p_{i}q_{j}}^{ltp}}_{\text{LTP distance}}, $$
(9)

where the cost \(D_{p_{i}q_{j}}\) is the sum of the gray proximity and the LTP distance together with the spatial distance normalized by the grid interval R, which is obtained by (10),

$$ R=\frac{\sqrt{S\times T}}{K}, $$
(10)

where S and T are the numbers of rows and columns of a face image and K is the desired number of roughly equally-sized superpixels. The value of K is taken as 100 in this work, following the method proposed in [1]. The spatial distance, \(d_{p_{i}q_{j}}^{xy}\), between the ith superpixel in an image p and the jth superpixel in an image q is computed by (11),

$$ d_{p_{i}q_{j}}^{xy}=\sqrt{(x_{i}-x_{j})^{2}+(y_{i}-y_{j})^{2}} $$
(11)

where (xi, yi) and (xj, yj) are the spatial coordinates of the centroids of the ith superpixel in image p and the jth superpixel in image q respectively. The variable m in \(d_{p_{i}q_{j}}^{xy}\) regulates the compactness of a superpixel: the larger the value of m, the more compact the clusters and the greater the emphasis on spatial proximity. The range of m is [1, 20] and its value is set to 20 in this work. The gray proximity, \(d_{p_{i}q_{j}}^{g}\), between the same two superpixels is computed using (12) as the absolute difference between the intensity value gi of the centroid of the ith superpixel in image p and the intensity value gj of the centroid of the jth superpixel in image q.

$$ d_{p_{i}q_{j}}^{g}=|g_{i}-g_{j}| $$
(12)

In order to calculate \(d_{p_{i}q_{j}}^{ltp}\), the LTP codes of the ith superpixel in image p and the jth superpixel in image q are first computed using (8). Bit-wise matching of these two ternary codes is then performed using (13),

$$ d_{p_{i}q_{j}}^{ltp}=\sum\limits_{l=1}^{8}(C_{i}(l)-C_{j}(l))^{2}, $$
(13)

where Ci and Cj are the ternary codes of the ith superpixel in image p and the jth superpixel in image q respectively, and l is an index variable that varies from 1 to 8 since a ternary code contains 8 bits. The reasoning behind (13) is as follows: if a ternary bit changes from 1 to -1 or vice versa, then \(d_{p_{i}q_{j}}^{ltp}\) contributes a value of 4 to (9) as a high penalty, indicating that the two ternary codes differ strongly at that position. On the other hand, if a ternary bit changes from 0 to ± 1 or vice versa, \(d_{p_{i}q_{j}}^{ltp}\) contributes a comparatively low penalty of 1 to (9), because such a change might be due to noise. After the cost of every edge in the bipartite graph formed from images p and q has been computed, Algorithm 1 is used to find the maximum matching. A complete matching is obtained here since both images have the same number of superpixels, i.e. K = 100. After the complete matching is found, the similarity measure between the two images is estimated using (14) as the sum of the costs of all the edges forming the complete matching. Figure 11 illustrates the computation of the similarity between two images.

$$ D(p,q)=\sum\limits_{(i, j)\in \mathcal{M}}D_{p_{i}q_{j}}, \quad \text{where } \mathcal{M} \text{ is the set of matched superpixel pairs} $$
(14)
Fig. 11 Illustration of the proposed similarity measure
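A minimal sketch of the whole similarity computation under stated assumptions is given below: superpixels are generated with scikit-image's SLIC, the LTP codes come from the earlier ltp_code sketch, and the complete matching is obtained with the Hungarian algorithm. The parameter values K = 100 and m = 20 follow the text; everything else (helper names, centroid-patch extraction, SLIC settings) is illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from skimage.segmentation import slic

def superpixel_features(img, K=100):
    """Centroid position, centroid gray value and LTP code for each SLIC superpixel."""
    labels = slic(img, n_segments=K, channel_axis=None)    # gray-level SLIC (skimage >= 0.19)
    padded = np.pad(img, 1, mode='edge')
    feats = []
    for lab in np.unique(labels):
        ys, xs = np.nonzero(labels == lab)
        cy, cx = int(round(ys.mean())), int(round(xs.mean()))
        code = ltp_code(padded[cy:cy + 3, cx:cx + 3])       # 3 x 3 patch around the centroid
        feats.append((cx, cy, float(img[cy, cx]), code))
    return feats

def similarity(p, q, K=100, m=20):
    S, T = p.shape
    R = np.sqrt(S * T) / K                                   # grid interval, eq. (10)
    fp, fq = superpixel_features(p, K), superpixel_features(q, K)
    cost = np.zeros((len(fp), len(fq)))
    for i, (xi, yi, gi, ci) in enumerate(fp):
        for j, (xj, yj, gj, cj) in enumerate(fq):
            d_xy = np.hypot(xi - xj, yi - yj)                # eq. (11)
            d_g = abs(gi - gj)                               # eq. (12)
            d_ltp = np.sum((ci - cj) ** 2)                   # eq. (13)
            cost[i, j] = (m / R) * d_xy + d_g + d_ltp        # eq. (9)
    rows, cols = linear_sum_assignment(cost)                 # complete matching (Algorithm 1)
    return float(cost[rows, cols].sum())                     # eq. (14): smaller means more similar
```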

5 Experimental results and discussion

All the experiments have been conducted on the newly created UGC-JU face database [41], the IRIS benchmark face database [12] and the SCface face database [18] to evaluate the proposed system. For the UGC-JU database, two types of sensors, namely a FLIR 7 sensor and a Sony DSC-W350 digital sensor, were used to capture the IR and visible face images. Eighty-four volunteers gave their consent to be imaged. Thirty-nine different face images were captured per person with different pose changes (i.e. rotations about the x-axis, y-axis, and z-axis) and different facial expressions (i.e. sad, angry, happy, fearful, disgusted, neutral, and surprised). All the captured images were 24-bit color images with a resolution of 320 × 240 pixels. The IRIS face database is a benchmark face database consisting of visible and IR face images of 30 persons and also comprises various facial expressions, pose and illumination changes. The SCface face database consists of visible and IR images of 130 subjects captured by different surveillance cameras at various distances. Twelve frontal face images of thirty volunteers are used from the UGC-JU face database in this work. The frontal face images comprise variations in facial expression (UGC-JU face database) and in facial expression and illumination (IRIS face database and SCface face database). From the IRIS face database, 12 frontal face images are selected randomly for each of its 30 subjects, while the UGC-JU face database contains twelve frontal face images per subject. Similarly, thirty subjects with twelve frontal face images each are selected from the SCface face database, which contains twenty-three frontal images per person. Hence, for uniformity, 12 frontal face images of 30 persons are used from each of these databases in this work. Some of the visible images and their corresponding IR images from the three face databases are shown in Figs. 1, 2 and 3. Prior to the fusion process, some preprocessing steps are applied to keep only the face part by removing the rest of the body, since this work focuses on the human authentication problem. The pre-processing steps involve binarization of an image, finding the largest component as the face skin region [39], and scaling the face region to a size of 112 × 92 pixels [39]. The image fusion method described above is then applied to the scaled visible and IR face images to obtain the fused face image. Some of the fused face images obtained by combining the visible and IR face images of Figs. 1, 2 and 3 are shown in Fig. 10a, b and c respectively. The naked eye cannot always assess the quality improvement of the fused face images over the source images. Various fusion metrics are available in the literature to judge automatically the improvement of a fused image with respect to one or more source images, called the reference image(s). Some of the popular fusion metrics, the ratio of spatial frequency error (rSFe) [35, 57], normalized mutual information [27, 35], edge information [56], the universal image quality index [21, 33] and the extended frequency comparison index [44], are used to demonstrate the usefulness of the proposed fusion method. A detailed discussion of each of these fusion metrics is given in [44].
Tables 1, 2, 3, 4 and 5 show the quantitative results obtained using the rSFe, NMI, EI, UIQI and EFCI metrics respectively for the fused images obtained by combining the visible and IR face images of the UGC-JU, IRIS and SCface face databases. The minimum, average and maximum values are reported in each table, and the last row of each table shows the ideal value of the corresponding metric, which helps to judge how good the proposed fusion algorithm is. All the results demonstrate the usefulness of the proposed fusion algorithm, since the obtained values are close to the ideal values given in the last row of Tables 1, 2, 3, 4 and 5.

Table 1 The ratio of spatial frequency error [14, 57] performance on UGC-JU face database, IRIS face database and SCface database
Table 2 The normalized mutual information [27, 35] performance on UGC-JU face database, IRIS face database and SCface database
Table 3 The edge information [56] performance on UGC-JU face database, IRIS face database and SCface database
Table 4 The universal image quality index [21, 33] performance on UGC-JU face database, IRIS face database and SCface database
Table 5 The extended frequency comparison index [44] performance on UGC-JU face database, IRIS face database and SCface database

The previous experiment shows how effective the proposed fusion method is; the next question is whether the fused face images assist in the authentication process. First, one of the twelve fused face images of each person is chosen randomly and considered as the gallery image, and the rest of the fused face images are used as probe images. These gallery images act as representatives of their classes. Each probe fused face image is then selected in turn and its class membership is determined against the gallery images using the proposed similarity measure. This process continues until the probe images are exhausted. In this way, an authentication model is created. The performance of this model is then evaluated by accuracy, precision and recall, because evaluating a model is an important task in a biometric authentication system and delineates how good the predictions are. All these metrics rely on four terms, namely true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). These counts are stored in a two-dimensional matrix called the confusion matrix. Accuracy is one of the most intuitive performance metrics and is computed by (15).

$$ Accuracy=\frac{TP+TN}{TP+FN+FP+TN} $$
(15)

So, accuracy is the ratio of the correctly predicted observations (TP + TN) to the total observations (TP + FN + FP + TN). Accuracy is a sufficient measure only when the dataset is symmetric in nature, i.e. false negatives and false positives are almost equal, which is not always true in practice. In that case, even if accuracy is high, the other metrics have to be checked in order to evaluate the performance of an authentication model. Precision is the ratio of the correctly predicted positive observations to the total predicted positive observations and is found by (16). High precision is always desired.

$$ Precision=\frac{TP}{TP+FP} $$
(16)

Recall is the ratio of the correctly predicted positive observations to all observations in the actual class and is calculated using (17). The authentication model is considered good if the value of recall is greater than 0.5.

$$ Recall=\frac{TP}{TP+FN} $$
(17)
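For illustration, the following sketch shows one way the gallery/probe evaluation described above could be implemented on top of the similarity function sketched earlier; the per-class one-vs-rest counting of TP, TN, FP and FN and the helper names are assumptions rather than the paper's exact protocol.

```python
import numpy as np

def evaluate(gallery, probes):
    """gallery: {class_id: fused gallery image}; probes: list of (class_id, fused image).
    Returns overall accuracy and macro-averaged precision and recall."""
    classes = list(gallery)
    # Confusion matrix over classes: rows = true class, columns = predicted class
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for true_id, img in probes:
        # Predict the class of the most similar (smallest-D) gallery image, eq. (14)
        pred_id = min(classes, key=lambda c: similarity(img, gallery[c]))
        cm[classes.index(true_id), classes.index(pred_id)] += 1
    accuracy = np.trace(cm) / cm.sum()                         # eq. (15), aggregated over classes
    precisions, recalls = [], []
    for k in range(len(classes)):                              # one-vs-rest counts per class
        tp = cm[k, k]
        fp = cm[:, k].sum() - tp
        fn = cm[k, :].sum() - tp
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)  # eq. (16)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)     # eq. (17)
    return float(accuracy), float(np.mean(precisions)), float(np.mean(recalls))
```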

A comparative study is presented in this section between four state-of-the-art methods [5, 25, 47, 48] and the proposed method on the IRIS benchmark face database, the UGC-JU face database and the SCface face database separately. Thirty experiments have been performed, as 30 face images are present in each class. In each experiment, 12 fused face images are selected randomly and used as gallery images and the rest of the fused face images act as probe images. The proposed similarity measure is then used to estimate the class membership of each probe image. In each experiment, accuracy, precision and recall are calculated, and the averages of these metrics are shown in Table 6. For the state-of-the-art methods, only pixel-level fusion has been implemented, because this work focuses on pixel-level fusion.

Table 6 Performance comparison of the state-of-the-art methods on IRIS benchmark face database, UGC-JU face database and SCface database

From Table 6, it is clear that the proposed method outperforms the state-of-the-art methods on the IRIS benchmark face database, the UGC-JU face database and the SCface face database in terms of accuracy, precision and recall. Moreover, for the proposed method, in each experiment only one fused face image is misclassified for the UGC-JU face database, six fused face images are misclassified for the SCface face database, and no misclassification occurs for the IRIS face database. The obtained results demonstrate the usefulness of the proposed fusion scheme and similarity measure.

6 Conclusion

In this work, a fusion method has been presented for fusing visible and IR face images based on AWT and FD using the DBC method. The proposed fusion scheme preserves the temperature distribution of the IR face images in the fused face images along with the contrast information of the visible face images. The use of AWT is more beneficial than the Mallat wavelet transform because AWT is translation invariant, while FD measures image roughness in the form of image texture. A simple if-else rule has been defined to combine the useful information of the IR and visible face images in the fused face images, and some popular fusion metrics have been used to judge the usefulness of the fused face images quantitatively against the source visible and IR face images. A similarity measure has also been proposed based on superpixels and the maximum bipartite matching algorithm, with a distance measure introduced as the cost of each edge in the bipartite graph; a complete matching obtained from the bipartite graph is taken as the similarity between two fused face images. All the experiments have been performed on the IRIS benchmark face database, the UGC-JU face database and the SCface face database, and a comparative study has been carried out between four state-of-the-art methods and the proposed method. All the results show that the proposed face authentication system outperforms the four state-of-the-art methods in terms of the adopted performance metrics, namely accuracy, precision and recall.