1 Introduction

Thermal imaging, also referred to as long-wave infrared (LWIR) imaging, uses the thermal subband of the electromagnetic spectrum (7 to 14 µm) to acquire illumination-invariant images, allowing LWIR imaging sensors to operate even in absolute darkness. This sensor property makes LWIR imaging especially interesting for applications with high variance in lighting conditions. Currently, two main focus areas of thermal imaging research are illumination-invariant face recognition for surveillance purposes [2, 3, 14] and the extraction of vital parameters such as respiratory and heart rate (RR/HR) from facial images [10]. Furthermore, emotion detection [6, 23] and affective state analysis such as stress detection and polygraphy using thermal infrared images [19, 21] are emerging research areas. All of these applications require analyzing facial regions of interest (ROIs), which need to be defined first. In most of the published cases, ROIs are defined manually due to the lack of reliable face and facial landmark detection algorithms for thermal infrared images. In the other cases, ROI definition is performed automatically or semi-automatically, but always under the constraint that certain prerequisites are met, such as a full frontal view of the face or the face being restricted to a certain image area. Only a small number of publications [18, 22] use a state-of-the-art face detection algorithm; in all reviewed literature, the detection method was the Haar-based Viola-Jones face detection algorithm [25]. To the best of our knowledge, there has been no application of more recent face detection methods to thermal infrared images. Furthermore, there is a lack of literature comparing face detection methods developed especially for the thermal infrared spectrum with well-established approaches that have been developed for the visual spectrum.
We believe that the reason for the lack of relevant publications is that all current methods are based on automated classification using machine learning. Regardless of the actual classifier and feature descriptor used, image classification based on machine learning requires extensive amounts of manually annotated training data, which are not available for the thermal infrared domain. Therefore, no quantitative comparison of domain-specific versus machine learning-based face detection algorithms has been performed so far. Our work aims at addressing this issue by focusing on several aspects: First, we introduce an extensively annotated thermal face database that can be used to train face detection algorithms that were originally developed for the visual spectrum and have not been applied to infrared images due to the lack of sufficient training data. Second, we use the database to train a selection of face detection algorithms that have been proven to perform well on images in the visible spectrum. Finally, we thoroughly evaluate the performance of these algorithms and compare it to the results of reference algorithms developed especially for face detection in thermal infrared images.

The structure of the paper is as follows: Our database is described in Sect. 2.1, followed by an overview of the algorithms that we trained with it in Sect. 2.2 and the evaluation methodology in Sect. 2.3. The performed experiments and their results are described in Sect. 3. The paper closes with a discussion of the results (Sect. 4) and a conclusion (Sect. 5).

2 Materials and Methods

In this section we first describe the image database created to train and evaluate face detection algorithms. Subsequently, all evaluated face detectors are briefly introduced.

2.1 Thermal Face Database

Since most algorithms for facial landmark detection require a ground truth database for training, the first required step is the creation of a database with face and non-face images in the thermal infrared domain. Many current face detection and tracking algorithms developed for the visual spectrum are trained using large sets of in-the-wild images such as the LFPW [1] or HELEN [15] databases, which are acquired under unconstrained settings with a wide variety of head poses, lighting conditions and facial expressions. Until now, no corresponding database meeting these requirements has existed in the LWIR domain. Available thermal infrared datasets offer images with limited temperature contrast and spatial resolution and only a small set of pre-defined facial expressions and head poses. To overcome this shortcoming and to allow testing the applicability of established algorithms in the thermal infrared, we have created a database of thermal videos of currently 82 persons in which each participant performed both arbitrary and protocol-specified head motions and facial expressions. All our images were acquired with an InfraTec 820 HD microbolometer array camera at its native resolution of 1024\(\,\times \,\)768 pixels with a relative thermal resolution of 0.03 K. The videos were taken from a distance of 0.8 m to 1.0 m and all participants were placed in front of a thermally homogeneous backdrop. From the video sequences, 2935 single frames that cover a wide and realistic range of head positions and facial expressions have been selected. A set of 68 facial landmark points has been manually marked in each image (Fig. 1). The points selected for landmarking are consistent with the landmarks used in a number of current face databases in the visual spectrum including LFPW and HELEN, allowing an efficient training of algorithms developed for the visible spectrum on thermal infrared data.
Our fully annotated face database surpasses available thermal infrared face databases such as [26] or the databases listed in [11] in terms of spatial and thermal resolution as well as variety of head poses and number of annotated landmark points. The database will be made available to the public in the near future.

Fig. 1. A set of sample images from the thermal face database. The manually created landmark annotations are shown as markers.

2.2 Face Detection Algorithms

In this work, we compare five machine learning-based algorithms that have already been successfully applied to face detection in the visible spectrum with two approaches that have been developed especially for thermal infrared images. It should be noted that only one of the five classifiers (namely the Viola-Jones method) has been used in the research literature for thermal infrared face detection. To the best of our knowledge, this work is the first to evaluate the performance of the remaining four algorithms in LWIR. The algorithms transferred from the visible spectrum are:

Fig. 2. Possible Haar features for face detection.

  • The Haar cascade classifier (VJ) as presented by Viola and Jones [25]. This was the first robust face detection algorithm capable of delivering real-time performance even on handheld devices with limited computing power such as consumer-grade digital cameras. VJ detects faces by systematically analyzing subimages extracted using a multiscale sliding window and applying differently sized Haar feature detectors in a cascading manner. In this approach, the subimage is matched with the most relevant Haar features first (Fig. 2). If the feature response is positive, further analysis using more refined features is performed by following the cascade. Should the feature response be negative at any point, the subimage is rejected as containing no face. The algorithm returns a face detection only if the subimage passes all steps of the cascade. The detector itself is trained by providing it with a set of face and non-face images to which a variety of Haar descriptors is applied. The algorithm learns the most significant features that allow differentiating between face and non-face images and arranges the best performing feature combinations in a fast decision tree using a boosting algorithm such as AdaBoost or GentleBoost. Using boosting for feature selection and efficient coding techniques for feature computation, the algorithm is capable of performing face detection in photographs and videos in real time. As stated above, this is the only algorithm that has already been applied successfully to thermal infrared images by different authors such as [24] and, more recently, [4]. The implementation used in our work is based on the OpenCV [12] variant of the cascade classifier, a method that improves the original work by Viola and Jones by combining the Haar descriptors in the cascade into groups instead of applying them individually at each step.

  • The Haar cascade classifier (VJ-LBP) with local binary patterns (LBPs) as feature descriptor as presented by Liao et al. in [16]. This approach is similar to the first method; however, in VJ-LBP local binary patterns are used instead of Haar features. Since they only require integral computations, LBPs can be computed more efficiently than Haar features. Furthermore, using cascade classifiers with LBPs instead of Haar features has been reported to result in improved detection rates. As with the above method, we used the LBP cascade classifier available in OpenCV.

  • Histograms of Oriented Gradients (HOG), an image descriptor and object detection algorithm introduced by Dalal and Triggs [7] and currently one of the most widely used feature descriptors. HOG features are computed by analyzing image gradients and grouping them into local histograms. To use HOG for face detection, the feature vector is computed for a training set of face and non-face images and the results are used to train a classifier, usually a support vector machine (SVM), that learns how to distinguish the HOG feature representation of a face from background features. In our work, we used the implementation available in the dlib library [13] to train and test a HOG-based face detector.

  • The Deformable Parts Model (DPM), introduced by Felzenszwalb in [8]. This approach uses optimized HOG features for object description. For classification and detection, however, the method assumes that the sought objects are composed of different components (in our case face parts such as mouth or eyes) and that these components may be arranged differently in different images. Instead of assuming a fixed spatial configuration as in regular HOG, the method learns the feature representations of different object parts as well as their spatial distribution. The resulting detector can adapt to different spatial configurations induced by changes in head pose or camera position.

  • Pixel Intensity Comparisons Organized in Decision Trees (PICO), presented by Markuš et al. in [17]. This method uses binary pixel intensity comparisons for object detection. Similar to VJ, the algorithm performs face detection by scanning sliding subimages and applying a decision tree boosted by GentleBoost. Each decision is defined by direct comparison of the intensity values of two pixels in the current subimage. Similar to Haar feature selection in VJ, pixel locations chosen for comparison are optimized by applying the best discriminating combinations first. This method drastically increases the tree’s detection speed since feature computation can be omitted.
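
The cascade principle described for VJ above can be illustrated with a minimal sketch. It assumes a grayscale image given as a nested list of numbers, uses a single two-rectangle Haar feature computed from an integral image, and represents each cascade stage as a hypothetical (feature function, threshold) pair. It is not the OpenCV implementation used in this work, and a trained cascade would contain many boosted features per stage.

```python
def integral_image(img):
    """Return a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_vertical(ii, x, y, w, h):
    """Two-rectangle feature: upper half minus lower half (h should be even)."""
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, half)


def cascade_accepts(ii, stages):
    """Evaluate the stages in order and reject at the first failing stage.

    Each stage is a (feature_function, threshold) pair; both are
    illustrative placeholders, not values from a trained detector.
    """
    for feature, threshold in stages:
        if feature(ii) < threshold:
            return False  # early rejection: later stages are never evaluated
    return True
```

Because each rectangle sum needs only four table lookups, the cost of a feature is independent of its size, and most non-face subimages are discarded after the first few stages.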

We compare these approaches to two algorithms developed primarily for the detection of faces in thermal infrared images. The two algorithms were chosen as they both represent common approaches presented for thermal infrared face detection. Generally, many algorithms designed for thermal face detection apply thresholding for pre-segmentation of the image or feature localization. The aim of the thresholding is either a foreground-background segmentation under the assumption that the person is the dominant heat source in the image, or a threshold operation with a threshold close to the image’s maximum temperature to locate the hottest spots in the image. The idea behind hottest spot localization is that in many cases the inner corners of the eyes are the hottest regions of the face and therefore the eyes can be easily located with thresholding. One or both of these approaches form the basis of the two chosen methods as well as of current algorithms such as [5, 20] or [27]. The methods that we selected for implementation were:

  • Eye Corner Detection (ED), presented by Friedrich and Yeshurun in [9]. In this work, the initial step is a silhouette extraction followed by a facial feature detection. The original paper states that temperature-based feature detection may be performed on the areas around the eyes, with the eyes themselves being the warmest and the eyebrows being the coldest regions of the face. The authors of [9] tested different approaches and identified the eyebrows as the most reliably detectable face part on their dataset. We performed the same tests on our database and found that in our case, similar to the results of [20], the inner corners of the eyes can be located more reliably than the eyebrows since there tend to be several cold regions in the face. Therefore, we implemented the method as described in the paper, with the difference that our approach uses a feature better suited to our dataset (Fig. 3).

  • Projection Profile Analysis (PPA), an algorithm described by Reese et al. in [22]. This method performs a silhouette extraction by foreground-background thresholding and computes projection histograms of the resulting binary image (Fig. 4) on the horizontal and vertical image axes. Subsequently, the face is localized by defining extreme points in the profile curves extracted from the projection histograms. These points are then localized by analyzing the extracted profile curves and their first- and second-order derivatives. The approach assumes that the person is the dominant object in the image and that the image displays the whole head and the upper part of the torso. These prerequisites are met in our database.

Fig. 3. The Eye Corner Detection algorithm. Left: original input image. Center: image after thresholding. Right: sum of all pixels per row of the thresholded image. The vertical coordinate of the eye centers is the row with the largest sum.

Fig. 4. The Projection Profile Analysis (PPA) approach: the algorithm performs a foreground-background segmentation and subsequently defines the face’s bounding box by finding extrema in the derivatives of the profile curves. Note that for comparison with the ground truth we crop away the top 1/3 of the result to match the manual annotations better.
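
The thresholding and projection steps shown in Figs. 3 and 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated here: the image is a nested list of temperature values, `margin` is a hypothetical parameter that defines how close to the maximum temperature a pixel must be to count as hot, and the silhouette extraction and derivative analysis of the original methods are omitted.

```python
def hottest_row(image, margin=0.5):
    """Return the index of the row containing the most near-maximum pixels.

    Pixels within `margin` of the image maximum are set to 1, all others
    to 0; the row with the largest sum is returned (cf. Fig. 3, right).
    """
    max_value = max(max(row) for row in image)
    threshold = max_value - margin
    row_sums = [sum(1 for v in row if v >= threshold) for row in image]
    return max(range(len(row_sums)), key=row_sums.__getitem__)


def projection_profiles(image, threshold):
    """Binarize with a foreground threshold and project onto both axes.

    Returns (row_profile, column_profile), i.e. the number of foreground
    pixels per row and per column of the binary image.
    """
    binary = [[1 if v > threshold else 0 for v in row] for row in image]
    row_profile = [sum(r) for r in binary]
    column_profile = [sum(c) for c in zip(*binary)]
    return row_profile, column_profile
```

In a PPA-style detector, the extrema of these profiles and of their derivatives would then be used to delimit the head region.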

2.3 Evaluation Methodology

Depending on the algorithm, the results returned by the different methods are not directly comparable. The five machine learning-based methods all return a face bounding box that can be compared with the manually defined ground truth bounding box in the test images. However, the results computed by the two thermal infrared face detection methods need to be processed for a quantitative evaluation as they do not return a comparable bounding box directly. For ED, which returns eye corner coordinates, the returned coordinates \(x_1, y_1, x_2, y_2\) and their distance \(d = x_2 - x_1\) were used as the basis for the bounding box computation. The final coordinates for the upper left and lower right corner of the bounding box are then defined as \(x_{ul} = x_1 - 2d, y_{ul} = y_1 - d, x_{lr} = x_2 + 2d, y_{lr} = y_1 + 4d\). For PPA, the returned bounding box is refined by cropping off the top 1/3 of the algorithm result since the algorithm returns a bounding box for the whole head including face, forehead and hair.
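
The two post-processing rules above translate directly into code. The sketch below assumes image coordinates with the origin in the upper left corner and \(x_2 > x_1\); the function names are illustrative. Note that \(y_2\) does not appear in the bounding box formulas and is accepted only to mirror the detector output.

```python
def ed_bounding_box(x1, y1, x2, y2):
    """Derive a face bounding box from the two eye corner coordinates.

    Returns (x_ul, y_ul, x_lr, y_lr) following the definitions in the text.
    y2 is unused because the formulas rely on y1 only.
    """
    d = x2 - x1
    return (x1 - 2 * d, y1 - d, x2 + 2 * d, y1 + 4 * d)


def crop_top_third(box):
    """Remove the top third of a head bounding box, as done for PPA."""
    x_ul, y_ul, x_lr, y_lr = box
    height = y_lr - y_ul
    return (x_ul, y_ul + height / 3, x_lr, y_lr)
```

With these definitions the ED box is \(5d\) wide and \(5d\) high, extending \(d\) above and \(4d\) below the eye line.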

3 Experiments and Results

For algorithm training and evaluation using cross-validation, the 2935 images of the database were split into 10 subsets by sequentially assigning all images of an individual to the subset that currently contains the fewest images. This way, we achieved approximately equally sized subsets in which no subject’s images were split between different sets, thereby ensuring no overlap between training and test sets. We then performed algorithm training using leave-one-subset-out cross-validation, i.e. each image was used for testing once and for algorithm training in the other nine iterations.
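
The subject-wise split described above amounts to a greedy assignment, sketched below. The sketch assumes the image counts are available as a dictionary mapping a subject identifier to its number of images; the identifiers, the processing order of the subjects and the tie-breaking rule (lowest subset index) are assumptions, since the text does not specify them.

```python
def assign_subjects_to_folds(images_per_subject, n_folds=10):
    """Assign each subject to the fold that currently holds the fewest images.

    Returns (folds, sizes): the subject identifiers per fold and the
    resulting number of images per fold.
    """
    folds = [[] for _ in range(n_folds)]
    sizes = [0] * n_folds
    for subject, n_images in images_per_subject.items():
        target = sizes.index(min(sizes))  # ties go to the lowest index
        folds[target].append(subject)
        sizes[target] += n_images
    return folds, sizes


def leave_one_fold_out(folds):
    """Yield (train_subjects, test_subjects) for every fold in turn."""
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test
```

Since whole subjects are assigned, no person can appear in both the training and the test set of any iteration.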

The ground truth bounding box was defined by the bounding box of each image’s landmark annotations, expanded by 5% in each direction (Fig. 5a). We decided to enlarge the original bounding boxes since preliminary results showed that the feature descriptors used by our face detection algorithms perform better on images that contain some background area around the face. The error metric used for evaluation was the Intersection over Union (IoU), defined for the ground truth bounding box G and the bounding box D returned by the detector as the ratio of the area of their intersection to the area of their union (see also Fig. 5b):

$$ \mathrm {IoU} = \frac{\mathrm {G}\cap \mathrm {D}}{\mathrm {G} \cup \mathrm {D}}. $$

Fig. 5. Left: ground truth definition by expanding the bounding box of the landmarks by 5%. Right: a visualization of the definitions of the intersection \(\mathrm {I} = {\mathrm {G} \cap \mathrm {D}}\) (blue) and the union \(\mathrm {U} = {\mathrm {G} \cup \mathrm {D}}\) (white) of two results for IoU computation (Color figure online).

Commonly, a successful detection is defined as a result with IoU \(>0.5\). The results for the different algorithms are shown in Fig. 6. It can be seen that all learning-based approaches outperform the two domain-specific methods by a considerable margin.
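
For axis-aligned bounding boxes, the IoU and the success criterion can be computed as in the following sketch. Boxes are assumed to be given as (x_ul, y_ul, x_lr, y_lr) tuples; the function names and the handling of degenerate boxes (IoU of 0) are illustrative choices.

```python
def iou(box_g, box_d):
    """Intersection over Union of two axis-aligned boxes (x_ul, y_ul, x_lr, y_lr)."""
    gx1, gy1, gx2, gy2 = box_g
    dx1, dy1, dx2, dy2 = box_d
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0, min(gx2, dx2) - max(gx1, dx1))
    inter_h = max(0, min(gy2, dy2) - max(gy1, dy1))
    intersection = inter_w * inter_h
    union = (gx2 - gx1) * (gy2 - gy1) + (dx2 - dx1) * (dy2 - dy1) - intersection
    return intersection / union if union > 0 else 0.0


def is_successful_detection(box_g, box_d, threshold=0.5):
    """Apply the common criterion IoU > threshold."""
    return iou(box_g, box_d) > threshold
```

The union is obtained from the two box areas minus the intersection, so no explicit polygon operations are required.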

Fig. 6. Percentage of faces that are detected with a given IoU.

Table 1. Performance overview of the tested algorithms. Positive/negative rates, precision and recall are given for an IoU of 0.5. Detection and training times are given for an Intel i5-2500K CPU with 3.3 GHz.

4 Discussion

The results show that classifier-based detection methods that are commonly used for face detection in the visual spectrum can be applied to thermal infrared images if trained with a well-suited database. Each of these classifier-based algorithms has shown better detection performance than the domain-specific solutions that were implemented and tested as well. At the same time, the performance indicators given in Table 1 show that, depending on the given detection task, some methods are more suitable than others. ED shows the weakest performance. An inspection of the values returned by the detector reveals that the algorithm is extremely sensitive to changes in pose and facial expression. A slight tilt of the head or opening the mouth usually resulted in failed detections. PPA shows a significantly better performance, but still lags behind the other algorithms. VJ, the only classifier-based face detection algorithm that has reportedly been used for face detection in thermal infrared images before, as well as VJ-LBP and HOG show similarly good detection rates. DPM shows the best detection and false positive rates of all algorithms at the cost of having the longest detection time of the tested methods, making it the algorithm of choice for all cases except those with very strict time constraints. In cases where detection time is crucial, PICO is the method of choice since it is as fast as the two domain-specific algorithms while yielding better results and remaining robust to pose changes.

5 Conclusion and Future Work

In this paper, we compared the performance of face detection algorithms that were presented for the visual spectrum and applied to thermal infrared images with that of approaches developed especially for thermal infrared images. We have shown how machine learning-based face detectors can be trained to detect faces in thermal infrared images and that these algorithms outperform specialized approaches. We conclude that due to the better performance of trainable face detection algorithms there is generally no need to develop specialized face detection approaches for thermal infrared images. The only exceptions are highly controlled scenarios with high performance requirements in which the specialized approaches would perform better due to their lower computational cost.

Since we have shown that algorithms for face detection can be transferred from the visual to the thermal infrared spectrum with good results, we will analyze the applicability of other algorithms that have been developed for facial images in the visual spectrum to thermal images. Furthermore, we plan to evaluate the algorithms on images taken in less constrained environments to assess the robustness of the different approaches in more realistic settings.