1 Introduction

Biometric characteristics of a person are crucial for identification and verification, and face recognition is among the most appealing modalities for human identification. Unlike fingerprint recognition, face recognition is a non-invasive, passive and straightforward biometric solution, and it has been widely used in applications such as passports and driver licenses. Most recent face recognition approaches make use of visual images [1]. However, they are not accurate enough in uncontrolled environments [2].

Different imaging modalities, including infrared (IR) imaging sensors, can be used to implement face recognition models [2]. The main idea of thermal imaging is that each object emits infrared energy according to its temperature and characteristics, and this emission differs from that of other objects. Thus, each object has a different thermal signature. For a face, this signature is primarily derived from the pattern of the superficial blood vessels under the facial skin. The thermal image is unique for each person since the vein and tissue structure of each face is unique [3].

Recently, thermal face images have been used in face recognition. For example, in [4], an approach based on the Haar wavelet transform and LBP feature extraction methods, with principal component analysis (PCA) for dimensionality reduction, was proposed. In the experiments, the minimum distance and artificial neural network (ANN) classifiers achieved 94.11 and 92.15%, respectively, on the Terravic Facial IR dataset. Also, Seal et al. [5] proposed an approach that used the discrete wavelet transform (DWT) for feature extraction and dimensionality reduction. The experiments on their private database showed a recognition rate of 95%, whereas the Terravic Facial IR dataset yielded a recognition rate of 93%. Gaber et al. proposed a human thermal face recognition model that used the segmentation-based fractal texture analysis (SFTA) algorithm to extract texture features and then Random Linear Oracle ensembles to identify the human face after applying two different dimensionality reduction techniques, namely linear discriminant analysis (LDA) [6] and PCA [7]. The experimental results showed that the LDA-based approach was more efficient than the PCA-based one, and the best accuracy achieved was 94.12% on the Terravic Facial IR dataset [8].

Computational cost is one of the important factors in the success of any face recognition system. A superpixel method whose parameters are tuned by an optimization technique, such as the grey wolf optimization (GWO) algorithm, should reduce the computational cost because it reduces the enormous number of pixels to be processed. Superpixels can be generated by many methods such as quick-shift [9], which is controlled by the Ratio, Kernel Size and Distance parameters.

In this paper, a human thermal face recognition model is proposed. This model consists of four main steps. Firstly, the GWO algorithm is employed to find the optimal parameters of the quick-shift segmentation method, which is used for extracting superpixels of the thermal face. Secondly, the SFTA algorithm is used for extracting face features. Thirdly, rough set-based methods are utilized to select the most discriminative features. Fourthly, the AdaBoost classifier is employed to match the features of the training patterns and the unknown pattern. Thermal images from the Terravic Facial IR dataset were used to evaluate the proposed approach.

The remaining sections are organized as follows: Sect. 2 gives the theoretical background. Section 3 presents the proposed thermal face recognition model. Section 4 shows the experimental results. Finally, Sect. 5 presents the conclusions and future work.

2 Preliminaries

2.1 Quick-shift method

The quick-shift method is used for extracting superpixels from a thermal face image [10]. The characteristics of the superpixels depend on the ratio, kernel size and maximum distance parameters. The ratio sets the trade-off between spatial and intensity consistency, whereas the kernel size controls the scale at which the density is estimated. The last parameter represents the maximum distance between pixels. The quick-shift parameters should be optimized to produce a useful face extraction from thermal images. Hand segmentation of a few images can help to find parameter values that give a good segmentation result [9].

2.2 Grey wolf optimization

The grey wolf optimization (GWO) algorithm simulates the movements of wolves when they search for food and avoid their enemies. Grey wolves live in packs or groups. Each pack contains four different categories [11]. The alpha wolves (\(\alpha \)), or leaders, are responsible for making decisions in the pack. The beta wolves (\(\beta \)) help the alpha wolves in decision making and other activities in the pack. They are the best candidates to be the next alpha wolves. The delta wolves (\(\delta \)) have to submit to the \(\alpha \) and \(\beta \) wolves. The omega wolves (\(\omega \)) have to submit to all the other dominant wolves [11, 12].

Mathematically, in GWO algorithm, the fittest solution is known as alpha \((\alpha )\). Beta \((\beta )\) and delta \((\delta )\) are the second and third best solutions, respectively. The other solutions are supposed to be omega \((\omega )\). During the hunting process, grey wolves encircle the prey and the \(\alpha \), \(\beta \) and \(\delta \) wolves guide other wolves, while \(\omega \) wolves follow the three candidates as denoted in Eq. (1).

$$\begin{aligned} \overrightarrow{G}(t+1)=\overrightarrow{G}_p(t)-\overrightarrow{A}. \overrightarrow{D}, \;\; \overrightarrow{D}=|\overrightarrow{C}.\overrightarrow{G}_p(t)-\overrightarrow{G}(t)| \end{aligned}$$
(1)

where t is the current iteration, \(\overrightarrow{A}\) and \(\overrightarrow{C}\) are coefficient vectors, \(\overrightarrow{G}_p\) is the position of the prey, and \(\overrightarrow{G}\) is the position of a grey wolf. The vector \(\overrightarrow{A}\) is defined as \(\overrightarrow{A}=2\overrightarrow{a}.\overrightarrow{r_1}-\overrightarrow{a}\) and the vector \(\overrightarrow{C}\) is given by \(\overrightarrow{C}=2\overrightarrow{r_2}\), where the components of \(\overrightarrow{a}\) decrease linearly from 2 to 0 over the course of the iterations, and \(\overrightarrow{r_1}\) and \(\overrightarrow{r_2}\) are vectors with random values in [0,1]. Hence, \(\overrightarrow{a}\) is the control parameter of the GWO algorithm that governs the trade-off between exploration and exploitation [11]. Its values are calculated as \(\overrightarrow{a}=2-t.(2/{Max}_{iter})\), where \({Max}_{iter}\) is the maximum number of iterations allowed for the optimization. The best solutions, \(\alpha \), \(\beta \) and \(\delta \), guide the other search agents (including \(\omega \)) to change their positions as denoted in Eqs. (2, 3 and 4).

$$\begin{aligned}&\overrightarrow{D_i}=|\overrightarrow{C}.\overrightarrow{G_i}-\overrightarrow{G}| \; , i=\alpha , \; \beta \; \text {and} \; \delta \end{aligned}$$
(2)
$$\begin{aligned}&\overrightarrow{{G}'_i}=\overrightarrow{G_i}-\overrightarrow{A}.\overrightarrow{D_i},\; i=\alpha , \; \beta \; \text {and} \; \delta \end{aligned}$$
(3)
$$\begin{aligned}&\overrightarrow{G}(t+1)=(\overrightarrow{{G}'_\alpha }+\overrightarrow{{G}'_\beta }+\overrightarrow{{G}'_\delta })/3 \end{aligned}$$
(4)
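
The position update in Eqs. (1, 2, 3 and 4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `gwo_minimize`, its arguments, the clipping of positions to the search bounds, the choice to keep the three best positions seen so far as \(\alpha \), \(\beta \) and \(\delta \), and the toy objective `sphere` are all introduced here for the example. Each candidate position is taken as \(\overrightarrow{G_i}-\overrightarrow{A}.\overrightarrow{D_i}\), as in the standard GWO formulation. The sketch minimises its objective, so a similarity-type fitness would be passed with its sign flipped.

```python
import random


def gwo_minimize(fitness, lower, upper, n_wolves=10, max_iter=20, seed=0):
    """Minimise `fitness` over the box [lower, upper] with a basic GWO loop."""
    rng = random.Random(seed)
    dim = len(lower)

    def clip(value, d):
        # Assumption of this sketch: positions are kept inside the bounds.
        return min(max(value, lower[d]), upper[d])

    # Random initial positions of the pack.
    wolves = [[rng.uniform(lower[d], upper[d]) for d in range(dim)]
              for _ in range(n_wolves)]
    # alpha, beta and delta are the three best positions found so far.
    leaders = [list(w) for w in sorted(wolves, key=fitness)[:3]]

    for t in range(max_iter):
        a = 2.0 - t * (2.0 / max_iter)  # decreases linearly from 2 towards 0
        new_wolves = []
        for g in wolves:
            new_g = []
            for d in range(dim):
                candidates = []
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a     # A = 2 a r1 - a
                    C = 2.0 * rng.random()             # C = 2 r2
                    D = abs(C * leader[d] - g[d])      # Eq. (2)
                    candidates.append(leader[d] - A * D)  # Eq. (3), standard form
                new_g.append(clip(sum(candidates) / 3.0, d))  # Eq. (4)
            new_wolves.append(new_g)
        wolves = new_wolves
        leaders = [list(w) for w in sorted(leaders + wolves, key=fitness)[:3]]

    return leaders[0], fitness(leaders[0])


def sphere(position):
    """Toy objective (hypothetical): squared distance from the origin."""
    return sum(v * v for v in position)


best, score = gwo_minimize(sphere, [-10.0] * 3, [10.0] * 3)
print(best, score)
```

In the setting of this paper, each position would hold the three quick-shift parameters and `lower`/`upper` would hold their boundaries.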

2.3 Segmentation-based fractal texture analysis (SFTA)

SFTA is one of the methods used to extract features from grayscale images. It consists of two steps. First, an input grayscale image, I, is decomposed into a set of binary images using a multi-level thresholding algorithm, such as the Two-Threshold Binary Decomposition method. Second, three features, namely fractal dimension, mean, and size, are extracted from the region boundaries of each binary image [13].

In the first step, the input grayscale image (I) is decomposed into a set of binary images (\(I_\mathrm{bi}, i=1,2,\ldots ,n_t\)), where \(n_t\) represents the total number of thresholds or levels. The threshold values are computed using Otsu’s algorithm (more details about Otsu’s algorithm are in [14]). The input image is then decomposed into a set of binary images (\(I_\mathrm{b}\)) by applying two-threshold segmentation method. The goal of the second step is to extract features from the region’s boundary of the binary images that are calculated in the first step. The SFTA feature vector contains the fractal dimension that represents the complexity of the object’s boundary, mean and size, which are computed from the region’s boundary of each binary image. Hence, the length of the SFTA feature vector is proportional to the value of the threshold parameter, which is a user-defined parameter [13].
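
The decomposition step can be illustrated with a short Python sketch. It assumes the usual Two-Threshold Binary Decomposition layout, in which each threshold yields one binary image and each pair of contiguous thresholds (the last one paired with the maximum grey level) yields another, giving \(2n_t\) binary images and hence \(6n_t\) features when three features are taken from each. That count agrees with the six features reported for \(n_t=1\) in Sect. 4.2.3. The function name, the nested-list image representation and the threshold values are hypothetical, and the thresholds are passed in directly rather than computed with Otsu's algorithm.

```python
def two_threshold_decomposition(image, thresholds, max_level=255):
    """Return 2 * n_t binary images for n_t thresholds (assumed TTBD layout)."""
    ts = sorted(thresholds)
    binaries = []
    # One binary image per single threshold: pixel > t.
    for t in ts:
        binaries.append([[1 if p > t else 0 for p in row] for row in image])
    # One binary image per pair of contiguous thresholds: lower < pixel <= upper.
    for lower, upper in zip(ts, ts[1:] + [max_level]):
        binaries.append([[1 if lower < p <= upper else 0 for p in row]
                         for row in image])
    return binaries


# Toy 2 x 3 grayscale image and two illustrative thresholds (n_t = 2).
toy_image = [[10, 60, 200],
             [150, 151, 40]]
toy_binaries = two_threshold_decomposition(toy_image, [50, 150])
feature_length = 3 * len(toy_binaries)  # fractal dimension, mean and size per image
print(len(toy_binaries), feature_length)
```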

2.4 Rough set

In data analysis, the rough set method is used for calculating the dependencies between features. Let \(P,Q \subseteq A\), where A is the set of features, and suppose that Q depends totally on P (\(P\Rightarrow Q\)). This means that the features from Q are determined by the features from P. The degree of dependency \(k\;(0\le k\le 1)\) is given by \(k=\gamma (Q)=\frac{|POS_P(Q)|}{|U|}\), where |U| denotes the cardinality of the universe U, which is a non-empty finite set of objects, \(POS_P(Q)=\bigcup _{X\in U/Q} \underline{P}X\) is the positive region of the partition U / Q with respect to P, \(\underline{P}X=\{x\in U|[x]_P\subseteq X\}\) is the lower approximation of \(X\subseteq U\), and \(k=\gamma (Q)\) represents the dependency between the condition features and the decision feature. The value of k is (1) one when Q depends totally on P, (2) zero when Q does not depend on P and (3) \(0< k < 1\) when Q depends partially on P. The quality of approximation of classification is measured by the degree of dependency [15].

In rough set methods, the main goal is to find a minimal subset of features (R), i.e., a reduct, that achieves approximately the same classification performance as the original features (C). This can be achieved by finding the reduct with the smallest cardinality [15].
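
As a rough illustration of these definitions, the following Python sketch computes the degree of dependency of a decision feature on a set of condition features and then grows a feature subset greedily until it reaches the dependency of the full feature set. The greedy loop follows the general idea of quick reduct, but it is only a sketch: the function names, the data layout (one tuple of discrete feature values per object, with a separate list of class labels) and the toy table are assumptions, and continuous features would need to be discretized first.

```python
from collections import defaultdict


def partition(rows, features):
    """Group object indices into equivalence classes by their values on `features`."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[tuple(row[f] for f in features)].append(idx)
    return list(blocks.values())


def dependency(rows, labels, features):
    """Degree of dependency k = |POS| / |U| of the decision on `features`."""
    positive = 0
    for block in partition(rows, features):
        # A block belongs to the positive region if it lies in one decision class.
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def greedy_reduct(rows, labels):
    """Add the feature that raises the dependency most until the full value is reached."""
    all_features = list(range(len(rows[0])))
    target = dependency(rows, labels, all_features)
    reduct = []
    while dependency(rows, labels, reduct) < target:
        remaining = [f for f in all_features if f not in reduct]
        best = max(remaining, key=lambda f: dependency(rows, labels, reduct + [f]))
        reduct.append(best)
    return reduct


# Toy decision table (hypothetical): feature 0 alone determines the class.
toy_rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
toy_labels = ["a", "a", "b", "b"]
print(dependency(toy_rows, toy_labels, [1]), greedy_reduct(toy_rows, toy_labels))
```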

Fig. 1 Block diagram of the proposed segmentation method

3 Proposed thermal face recognition model

3.1 Segmentation phase

In this phase, a modified version of our method [9] was used to extract a human face from its thermal image. In this version, the segmentation method is based on superpixels (quick-shift) and the GWO algorithm. Firstly, the model selects a thermal face image \(I_i\), the ith input image out of the N images in a group, for \(i=1,2,3,\ldots ,N\). The GWO algorithm is then used to search for the best solutions, i.e., the best values of the quick-shift parameters (Ratio, KernelSize and MaxDist). The quick-shift method is then applied with these automatically determined parameters to produce the superpixels. The superpixels image is then thresholded using Otsu's method, so that each superpixels image is converted to a binary image \(I_\mathrm{b}\) based on the optimum threshold. Finally, we extract the pixel values from the corresponding original thermal image. Combining GWO and superpixels with automatic thresholding in this way gives the best results when extracting faces from thermal images. Figure 1 shows the steps of the segmentation phase. More details about this phase are given below.

3.1.1 Representation of position

The positions of all grey wolves were initialized randomly, where the position of each wolf represents the values of the quick-shift parameters, and the positions are changed iteratively until they approach the optimal solution. The lower boundaries of the Ratio, KernelSize and MaxDist parameters were 0.2, 2 and 4, respectively, while the upper boundaries were 0.8, 12 and 20, respectively. Otsu's thresholding method is then used to find the optimal threshold. This threshold is used to generate a binary image \(I_\mathrm{b}\) from the superpixels image. Finally, the relevant pixel values of the original thermal image are extracted or segmented (\(I_\mathrm{Seg}\)) by multiplying the original image by the binary image. After evaluating all grey wolves' positions, the first, second and third best positions are assigned to the \(\alpha \), \(\beta \) and \(\delta \) wolves, respectively. The other positions are assigned to the \(\omega \) wolves. The \(\alpha \), \(\beta \) and \(\delta \) wolves guide the other wolves as in Eqs. (2, 3 and 4). The positions of the wolves are changed iteratively until the stopping criteria are met.
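
The thresholding and masking steps described above can be sketched as follows. This is an illustrative Python sketch only: the superpixels image is assumed to be given as a grid in which every pixel carries a representative intensity of its superpixel, the face is assumed to be the brighter (warmer) region, and the function names and toy values are hypothetical. Otsu's threshold is computed directly from the histogram by maximising the between-class variance.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level that maximises the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]            # number of pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0          # number of pixels with value > t
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def segment(original, superpixel_image):
    """Binarise the superpixels image and use it to mask the original image."""
    flat = [p for row in superpixel_image for p in row]
    t = otsu_threshold(flat)
    binary = [[1 if p > t else 0 for p in row] for row in superpixel_image]
    segmented = [[p * b for p, b in zip(row, mask_row)]
                 for row, mask_row in zip(original, binary)]
    return binary, segmented


# Toy data (hypothetical): a 2 x 3 thermal image and its superpixels image.
toy_original = [[20, 30, 200], [25, 210, 220]]
toy_superpixels = [[25, 25, 210], [25, 210, 210]]
toy_binary, toy_segmented = segment(toy_original, toy_superpixels)
print(toy_binary, toy_segmented)
```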

3.1.2 Fitness function

The fitness function of this algorithm is defined as \(MaxS \left( I_\mathrm{Seg}, (B(I_\mathrm{Seg}) \times I) \right) \), where MaxS is the maximum similarity, and \(I_\mathrm{Seg} = B(S(I)) \times I \) is the segmented image, with S(I) denoting the superpixel image generated from an image I and B denoting the binary image generated from the superpixel image. The similarity is the ratio of the number of similar pixels to the total number of pixels, computed on the images generated from the population \(G_1, G_2,\ldots , G_N\).
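
One simple reading of this similarity measure is the fraction of pixel positions at which two equally sized images agree, with the best wolf being the one whose images give the largest value. The Python sketch below follows that reading. It is an assumption about how the ratio could be computed, not a statement of the exact implementation, and the function names and toy images are hypothetical.

```python
def pixel_similarity(img_a, img_b):
    """Fraction of pixel positions at which two equally sized images agree."""
    pairs = [(a, b)
             for row_a, row_b in zip(img_a, img_b)
             for a, b in zip(row_a, row_b)]
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def most_similar(candidates, reference):
    """Index of the candidate image with the maximum similarity to `reference`."""
    scores = [pixel_similarity(c, reference) for c in candidates]
    return scores.index(max(scores))


toy_reference = [[1, 2], [3, 4]]
toy_candidates = [[[0, 0], [3, 4]], [[1, 0], [3, 4]]]
print(pixel_similarity(toy_candidates[1], toy_reference),
      most_similar(toy_candidates, toy_reference))
```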

3.2 Feature extraction phase

SFTA was utilized in this phase for extracting features from all images, i.e., training and testing images. In the training phase, the features were represented by a feature matrix, while in the testing phase, they were represented by a vector.

3.3 Feature selection phase

In this phase, a subset of features was selected using rough set-based methods, with the aim of increasing the classification accuracy and reducing the classification time. To achieve this aim, the training data are used as an input to the rough set-based methods to find the minimal feature subset. In our proposed model, three different rough set-based methods are employed for feature selection: (1) quick reduct feature selection (QRFS) [15], (2) discernibility matrix-based feature selection (DMFS) [15] and (3) entropy-based feature selection (EBFS) [15].

3.4 Classification phase

The AdaBoost classifier was employed for classification in this phase. The aim of AdaBoost classifier is to combine the outputs of a number of simple classifiers or weak learners such as decision trees and neural networks. AdaBoost has two main parameters: (1) the number of iterations (T) and (2) the weights of the training patterns (w) that are initialized to be equal.

In AdaBoost, the simple classifiers are trained using the training patterns; this is called the training step. In this step, the parameters of AdaBoost are first initialized. For each iteration (t), some of the training patterns are selected based on the weights, \(w^t\), of these patterns to form a distribution (\(D_t\)). The selected patterns are then used to train the current simple classifier (\(C_t\)). The error rate \(\epsilon _t\) of \(C_t\) is then calculated as \(\epsilon _t=\sum _{j=1}^{N} w^t_j l_j^t\), where N represents the total number of training patterns, \(x_j\) is the jth pattern, and \(l^t_j=1\) if \(C_t\) misclassifies \(x_j\); otherwise, \(l^t_j=0\). If \(\epsilon _t\ge 0.5\), the weights are reinitialized to be equal. The weight of the current weak learner (\(\alpha _t\)) is then calculated as \(\alpha _t=\epsilon _t/(1-\epsilon _t)\). The weights of the training patterns are then updated to be used in the next iteration. In the testing step, to classify an unknown pattern, \(x_\mathrm{test}\), the outputs of all weak learners are aggregated using the weighted voting method to estimate the final decision [16].
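
The training and voting steps can be illustrated with a small Python sketch. Several choices here are assumptions rather than details given in the text: the weak learners are single-feature threshold rules (decision stumps), the weights are used directly in the error computation instead of resampling the patterns, correctly classified patterns have their weights multiplied by \(\alpha _t\) before renormalization, and each weak learner votes with weight \(\log (1/\alpha _t)\), as in the AdaBoost.M1 convention. The function names and toy data are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Pick the single-feature threshold rule with the lowest weighted error."""
    classes = sorted(set(y))
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for left in classes:
                for right in classes:
                    err = sum(wj for row, yj, wj in zip(X, y, w)
                              if (left if row[f] <= thr else right) != yj)
                    if best is None or err < best[0]:
                        best = (err, f, thr, left, right)
    return best


def stump_predict(stump, row):
    _, f, thr, left, right = stump
    return left if row[f] <= thr else right


def adaboost_train(X, y, T=3):
    """Train up to T weighted stumps; returns a list of (vote weight, stump)."""
    n = len(X)
    w = [1.0 / n] * n                       # equal initial weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)
        eps = stump[0]                      # weighted error rate
        if eps >= 0.5:
            w = [1.0 / n] * n               # reinitialize the weights
            continue
        eps = max(eps, 1e-10)               # avoid division by zero / log of zero
        alpha_t = eps / (1.0 - eps)
        ensemble.append((math.log(1.0 / alpha_t), stump))
        # Shrink the weights of correctly classified patterns, then renormalize.
        w = [wj * (alpha_t if stump_predict(stump, row) == yj else 1.0)
             for row, yj, wj in zip(X, y, w)]
        total = sum(w)
        w = [wj / total for wj in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Weighted vote over the weak learners; assumes a non-empty ensemble."""
    votes = {}
    for weight, stump in ensemble:
        label = stump_predict(stump, row)
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


# Toy one-feature data set (hypothetical).
toy_X = [[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]]
toy_y = ["a", "a", "a", "b", "b", "b"]
toy_model = adaboost_train(toy_X, toy_y, T=3)
print(adaboost_predict(toy_model, [2.5]), adaboost_predict(toy_model, [6.5]))
```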

In this phase, the selected features of the training samples were used to train the AdaBoost classifier. The class of an unknown image was determined using the weak learners that were trained in the training step. The weighted voting method is then used to calculate the weight of each class, and assign the class with the maximum weight to the unknown image.

4 Experimental results and discussion

4.1 Experimental setup

To evaluate the proposed model, the Terravic Facial infrared (IR) dataset [17] was used. The dataset contains 20 classes of grayscale images (\(360 \times 240\)), and each class represents a single person. Each person has a number of images with different variations (front, left, right; indoor/outdoor; glasses). This work used 18 classes, as the other two classes (the fifth and sixth classes) were corrupted. For a fair comparison, all experiments were conducted on a Core i5-2400 CPU @ 3.10 GHz PC with 4.00 GB of RAM. The model was implemented in MATLAB R2012a (7.14) under Windows 10.

4.2 Experimental scenarios

In this section, five experiments were conducted to test the proposed model. More details of each experiment are presented in the next sections.

4.2.1 Segmentation experiment

In this experiment, the proposed segmentation method was evaluated on indoor and outdoor thermal images. In the GWO algorithm, the number of search agents, n, was ten and the maximum number of iterations was 20. Figure 2 shows the results for the eighth class. As shown, the proposed segmentation method achieves reasonable results because the difference between the face area and the other objects, e.g., clothes, glasses and other surroundings, is evident. This experiment showed the robustness of the proposed segmentation method for indoor and outdoor thermal images.

4.2.2 Thermal face recognition using segmented/non-segmented images

The aim of this experiment was to evaluate our proposed model using segmented and non-segmented images. In this experiment, different values of the threshold parameter, \(n_t\), were used and the size of the AdaBoost ensemble was three. Moreover, only ten images from each class were used to train the model, while the rest of the images were used to test the model. This is because increasing the number of training images increases the computational and classification time. For example, if the number of training images were increased to 190 images per class, the number of features would be \(190\times 200\times 10=380000\), even when only ten features are extracted from each image, compared with only 20000 features when ten images are used. Thus, more classification time would be required, which is not suitable for real-time applications. Table 1 summarizes the results of this experiment.

Fig. 2 Samples of thermal images (a indoor and c outdoor) and their extracted/segmented faces (right)

Table 1 Accuracy (Acc.) and CPU time of the proposed model using segmented and non-segmented images

Three remarks can be drawn from Table 1. Firstly, the accuracy increased with the value of the threshold parameter until it reached a plateau (approximately 93%); beyond that point, the accuracy did not improve further, whereas the CPU time kept increasing without any noticeable gain in accuracy. Secondly, the proposed model achieved better accuracy using the segmented images than using the non-segmented images, and the best accuracy was obtained when the value of the threshold parameter was equal to or greater than six. Thirdly, the CPU time was proportional to the value of the threshold parameter.

To conclude, the segmented images achieved accuracy better than the non-segmented ones, and the best accuracy was obtained when \(n_t\ge 6\).

4.2.3 Feature selection experiment

Because the proposed model achieved higher accuracy with segmented images than with non-segmented images, the segmented images were used in this experiment. The aim of this experiment was to test whether applying rough set reduction methods could improve both the identification accuracy and the system performance. To achieve this aim, three well-known rough set methods (QRFS, DMFS and EBFS) were used to reduce the number of features.

Table 2 The number of selected features and reduction rate [# features (reduction rate)] of QRFS, EBFS and DMFS methods

Fig. 3 CPU time of the three rough set-based feature selection methods, i.e., QRFS, EBFS and DMFS

Fig. 4 Accuracy of the proposed model using the original and selected features

Fig. 5 A comparison between QRFS, EBFS, DMFS and the original features in terms of classification time

Table 2 summarizes the results of this experiment, and Fig. 3 shows the CPU time of the three feature selection methods. A comparison between the accuracy obtained using the original features (with no reduction) and the features selected by the QRFS, DMFS and EBFS methods is depicted in Fig. 4. A similar comparison of the required CPU time in the same cases is given in Fig. 5. From these two figures, it can be seen that the EBFS-based feature reduction method is the best in terms of accuracy and CPU time. Moreover, Table 2 shows that the three methods achieved high reduction rates while maintaining high accuracy, and that the reduction rate increased with the number of features. For example, when \(n_t=1\) the number of features was six and the reduction rate was 0%, whereas when \(n_t=10\) the number of features was 60 and the reduction rate ranged from 86.67 to 90%.

As shown in Fig. 3, the CPU time of the QRFS and EBFS feature selection methods was much lower than that of the DMFS method, whose complexity is \(O((N+log M)M^2)\), where N indicates the number of features and M is the number of samples. Therefore, the time required for calculating the discernibility matrix increased rapidly with the number of patterns in the dataset. In contrast, the complexities of EBFS and QRFS are \(O(NM^2)+O(M^3)\) and \(O(MN^2)\), respectively [15]. Hence, the computational time required by both the QRFS and EBFS methods is lower than that of DMFS.

Figure 4 shows that the rough set-based feature selection methods achieved accuracy relatively equal to the accuracy of the original features. Additionally, EBFS obtained the best accuracy. Regarding the computational time, Fig. 5 shows a significant difference between the classification time of the selected and original features. This is because the number of the selected features was much smaller than the number of original features.

To conclude, the rough set-based feature selection methods removed irrelevant features and, hence, reduced the classification time compared with the original features. Moreover, the features selected by the EBFS method obtained accuracy approximately equal to that of the original features.

Fig. 6 A comparison between QRFS, EBFS and DMFS methods in terms of classification accuracy using different ensemble sizes (L) and two different threshold values

Fig. 7 A comparison between QRFS, EBFS and DMFS methods in terms of classification time using different ensemble sizes (L) and two different threshold values

Fig. 8 Accuracy and CPU time of the proposed model using different number of training images

4.2.4 Ensemble size experiment

The aim of this experiment was to test whether the size of the AdaBoost ensemble affects (1) the accuracy of the classification and (2) the required CPU time. In this experiment, the proposed model was evaluated using four different sizes of the AdaBoost ensemble (\(L=3\), \(L=13\), \(L=23\) and \(L=33\)). The features selected by the rough set-based methods were used to train the AdaBoost classifier. In addition, two values of the threshold parameter were used (\(n_t=4\) and \(n_t=7\)). The results of this experiment are shown in Figs. 6 and 7.

From Figs. 6 and 7, two remarks can be made. Firstly, the accuracy increased with the ensemble size until it reached a plateau; beyond that point, the accuracy did not improve further. As shown in Fig. 6, the accuracy remained constant once the ensemble size was greater than or equal to 23. This is because a large number of weak learners may maintain a constant and small training error, which may lead to overfitting and a more complex model. Secondly, the CPU time also increased with the ensemble size. This is because increasing the number of weak learners increases the CPU time required to train the AdaBoost model.

4.2.5 Different numbers of training images

The aim of this experiment was to evaluate the influence of the number of training images on the proposed model. The number of training images ranged from 10 to 25, the value of the threshold parameter ranged from six to ten, the number of weak learners was 13, and the features selected by the EBFS method were used. The results obtained from this experiment are presented in Fig. 8.

As shown in Fig. 8a, the accuracy increased with the number of training images. This is because a small number of training samples makes the model more sensitive to small variations in the training samples, i.e., high variance. From the results in Fig. 8b, it is apparent that increasing the number of training images increased the CPU time.

Compared with related work that used the Terravic dataset, our proposed model achieved promising results (approximately 99%), while the models proposed in [4], [5] and [8] achieved 92.2–94.1, 93 and 94.1%, respectively. This result was obtained due to: (1) the proposed segmentation method, which extracts only the face and removes the background and any other noise, (2) the SFTA algorithm, which extracts discriminative features, (3) the rough set-based feature selection methods, which remove the irrelevant features and improve the classification accuracy and (4) the AdaBoost classifier, which increases the weight of critical samples and hence improves the classification performance.

5 Conclusions and future work

This paper proposed a face recognition model using thermal face images. The model has four phases: (1) face segmentation using both the quick-shift and GWO methods, (2) feature extraction using the SFTA method, (3) feature selection using different rough set-based methods, i.e., QRFS, DMFS and EBFS, and (4) classification/identification using the AdaBoost classifier. Several experiments were conducted to evaluate the proposed model (i) using segmented and non-segmented images; (ii) using the original and the selected features; (iii) using different sizes of the AdaBoost ensemble; and (iv) using different numbers of training images. The experimental results showed a competitive performance of the proposed model when the images were segmented with our proposed segmentation method. Using the segmented images, the accuracy ranged from 85 to 92%, while the accuracy for the non-segmented images ranged from 78 to 82%. This reflects how important the segmentation phase is for our model. Moreover, the EBFS method reduced the number of features (with a 90% reduction rate) and achieved accuracy better than the original features, and hence it reduces the classification time. Additionally, the EBFS method obtained better results than QRFS and DMFS. Our experiments also showed that the performance of the proposed model increases with the number of training images and the size of the AdaBoost ensemble. However, increasing the size of the AdaBoost ensemble increases the complexity of the model and may lead to overfitting. The best accuracy achieved was about 99%, obtained when the segmented images were used, the threshold parameter was 7, 25 images were used to train the model, and 23 weak learners were used in the AdaBoost classifier.

Several directions for future studies can be suggested. First, for higher-dimensional datasets, parallel algorithms can be employed to speed up the computation. Second, other optimization methods can be tried to explore the effectiveness of the proposed model for detecting objects in different thermal datasets such as the Terravic Weapon IR dataset and the Terravic Motion IR dataset.