
1 Introduction

Visible images refer to images formed by visible light (light with a wavelength between 390 nm and 780 nm). The photos taken by ordinary cameras are visible images. In recent years, target detection, especially face detection in visible images, has developed rapidly, and many effective algorithms are available. For example, R-CNN (Region-based Convolutional Neural Networks) [5] and its derivative algorithms are good choices, including Fast R-CNN [4], Faster R-CNN [14], Mask R-CNN [6] and R-FCN [3]. Different versions of YOLO (You Only Look Once) are also effective, including YOLOv1 [11], YOLOv2 [12], YOLOv3 [13], YOLOv4 [1] and PP-YOLO [10].

Because of COVID-19, intelligent thermal imagers have been widely used all over the world. To prevent interference from the ambient temperature, an intelligent thermal imager needs to narrow down the temperature measurement area. Given that the skin of the human face is exposed to the air, the human face is considered a suitable temperature measurement area. Therefore, an intelligent thermal imager usually needs to find the position of the human face in the infrared image to serve as the temperature measurement area [16]. Face detection in thermal infrared images has thus become an important issue.

In recent years, there have been some papers on face detection in thermal infrared images [7,8,9, 15]. Most previous papers used high-resolution single-person thermal infrared images with little environmental noise. However, in actual application scenarios of intelligent thermal imagers, thermal infrared images may have low resolution, contain many people, or contain much environmental noise. The facial features in such images may not be visible due to the low resolution, which means these methods may perform badly in real situations.

In addition, since intelligent thermal imagers usually require real-time temperature measurement, they usually do not run target detection algorithms directly on infrared images, because the processing speed is not fast enough. Instead, many intelligent thermal imagers copy the faces detected in visible images to the corresponding infrared images [16]. However, the results of this method are not very accurate. For example, Fig. 1 is a screenshot from the HY-2005B intelligent thermal imager produced by Wuhan Huazhong Numerical Control Co., Ltd. [17].

Fig. 1. A screenshot of an intelligent thermal imager

Figure 1 shows that faces detected in visible images are quite precise, while faces detected in infrared images have noticeable offsets. The main reason is that the two types of images are captured by two different cameras separated by a small distance (shown in Fig. 2), so the optical axes of the two cameras differ. This means that the same pixel coordinates in the infrared image and the visible image correspond to different positions in the scene.

Fig. 2. The infrared camera and the visible light camera

In conclusion, it is important to find a method that quickly and accurately detects human faces in thermal infrared images with the help of the corresponding visible images.

2 Principle of Image Calibration

Given that the distance between the two cameras is small, it can be assumed that the mapping between infrared images and visible images is a rigid transformation. In other words, the pixel \((x,y)\) in the visible image and the pixel \((x',y')\) in the infrared image satisfy the following relationship (shown in Eq. (1)).

$$\begin{aligned} \left\{ \begin{aligned} x'&= a_{11}x + a_{12}y + a_{13}\\ y'&= a_{21}x + a_{22}y + a_{23}\\ \end{aligned} \right. \end{aligned}$$
(1)

where \(a_{ij}(i =1,2;j = 1,2,3)\) are all constants.

In matrix form, the equation is as follows:

$$\begin{aligned} \begin{bmatrix} x'\\ y' \end{bmatrix} = \begin{bmatrix} a_{11} &{} a_{12} &{} a_{13}\\ a_{21} &{} a_{22} &{} a_{23} \end{bmatrix} \cdot \begin{bmatrix} x\\ y\\ 1 \end{bmatrix} = \mathbf{A} \cdot \begin{bmatrix} x\\ y\\ 1 \end{bmatrix} \end{aligned}$$
(2)

If the coordinates of three corresponding point pairs \((x_k,y_k)\) and \((x'_k,y'_k)\) \((k = 1,2,3)\) are known, then the parameters \(a_{ij}(i =1,2;j = 1,2,3)\) can be determined by the following equation:

$$\begin{aligned} \left\{ \begin{aligned} x_k'&= a_{11}x_k + a_{12}y_k + a_{13}&(k = 1,2,3)\\ y_k'&= a_{21}x_k + a_{22}y_k + a_{23}&(k = 1,2,3)\\ \end{aligned} \right. \end{aligned}$$
(3)

This method is simple and easy to implement. By using it, the position of a human face in the infrared image can be determined from the face box detected in the corresponding visible image.
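As a concrete illustration, the following is a minimal sketch (in Python with NumPy) of how the parameters in Eq. (3) can be solved from three point pairs and then applied as in Eq. (2); the calibration coordinates in the example are placeholders, not measured values.

```python
# A minimal sketch of the calibration in Eqs. (1)-(3); the coordinates below
# are placeholders, not measured data.
import numpy as np

def fit_affine(vis_pts, ir_pts):
    """Solve Eq. (3) for the 2x3 matrix A mapping visible pixels to infrared pixels."""
    vis_pts = np.asarray(vis_pts, dtype=float)   # shape (3, 2): (x_k, y_k)
    ir_pts = np.asarray(ir_pts, dtype=float)     # shape (3, 2): (x'_k, y'_k)
    # Homogeneous coordinates [x, y, 1] for each calibration point.
    M = np.hstack([vis_pts, np.ones((3, 1))])    # shape (3, 3)
    # Solve M @ [a_i1, a_i2, a_i3]^T = infrared column for each output coordinate.
    A = np.linalg.solve(M, ir_pts).T             # shape (2, 3)
    return A

def map_point(A, x, y):
    """Apply Eq. (2): map a visible-image pixel (x, y) into the infrared image."""
    return A @ np.array([x, y, 1.0])

# Placeholder calibration pairs (visible -> infrared), e.g. picked by hand.
vis = [(100, 80), (900, 120), (500, 700)]
ir = [(30, 25), (280, 35), (160, 180)]
A = fit_affine(vis, ir)
print(map_point(A, 512, 384))  # infrared coordinates of a visible-image pixel
```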

However, it should be noticed that this method ignores the distance between the object and the camera, which affects the calibration. For example, if all the calibration point pairs are obtained at a distance of 5 m from the camera, the calibration of an object at 2 m will have an offset. This paper analyzes the causes of this offset in detail and proposes ways to reduce it.

3 Optical Analysis for Calibration Offset

This section explains why the calibration offset occurs by performing an optical analysis.

3.1 Optical Model Derivation

The optical model of the intelligent thermal imager is shown in Fig. 3. \(\overline{BG}\) represents the optical axis of the visible camera, while \(\overline{CT}\) represents the optical axis of the infrared camera. \(\overline{FG}\) and \(\overline{ST}\) are the light screens of the visible camera and infrared camera respectively. Note that the light screens are on the same plane. The point P represents the object being photographed.

Fig. 3. The optical representation of an intelligent thermal imager

By geometric similarity,

$$\begin{aligned} \frac{FG}{BP} = \frac{AG}{AB} \end{aligned}$$
$$\begin{aligned} \frac{ST}{CP} = \frac{ET}{CE} \end{aligned}$$

Namely,

$$\begin{aligned} \frac{d'}{D+d} = \frac{f'}{H} \end{aligned}$$
(4)
$$\begin{aligned} \frac{d''}{d} = \frac{f''}{H+f'-f''} \end{aligned}$$
(5)

Since \(H\gg f',f''\) and \(f' \approx f''\), it follows that \(H\gg f'-f''\) and hence \(H+f'-f'' \approx H\). Thus, Eq. (5) can be simplified as

$$\begin{aligned} \frac{d''}{d} = \frac{f''}{H} \end{aligned}$$
(6)

By combining Eqs. (4) and (6), the following relationship can be obtained

$$\begin{aligned} d'' = d' \cdot \frac{f''}{f'} - \frac{Df''}{H} \end{aligned}$$
(7)

Let \( k = \frac{f''}{f'}\) and \(C = \frac{Df''}{H}\). Then Eq. (7) becomes

$$\begin{aligned} d'' = d' \cdot k - C \end{aligned}$$
(8)

Suppose that two calibration points are taken at a fixed distance \(H_0\). The distances between these two points and point C are \(d_1\) and \(d_2\) respectively; \(d_1'\) and \(d_1''\) correspond to \(d_1\), while \(d_2'\) and \(d_2''\) correspond to \(d_2\). Then, the following linear equations can be obtained

$$\begin{aligned} \left\{ \begin{aligned} d_1'' = d_1' \cdot k - C\\ d_2'' = d_2' \cdot k - C\\ \end{aligned} \right. \end{aligned}$$
(9)

By solving Eq. (9), k and C can be determined.
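The following is a minimal sketch of solving Eq. (9), assuming the two pairs \((d_1', d_1'')\) and \((d_2', d_2'')\) have been measured at \(H_0\); the numerical values used here are placeholders.

```python
# A minimal sketch of solving Eq. (9) for k and C from two calibration points
# measured at the fixed distance H0; the distances below are placeholder values.
import numpy as np

def fit_k_C(d1_prime, d1_pp, d2_prime, d2_pp):
    """Solve d'' = k * d' - C for k and C given two (d', d'') pairs."""
    M = np.array([[d1_prime, -1.0],
                  [d2_prime, -1.0]])
    rhs = np.array([d1_pp, d2_pp])
    k, C = np.linalg.solve(M, rhs)
    return k, C

k, C = fit_k_C(d1_prime=120.0, d1_pp=40.0, d2_prime=300.0, d2_pp=100.0)
print(k, C)  # d'' = k * d' - C for any point at the calibration distance H0
```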

The calibration method introduced in Sect. 2 treats the distance H between the object and the two cameras as a constant \(H_0\). In other words, this calibration method only uses the constants k and C for calibration, regardless of the actual distance between the object and the camera. However, H varies in practice, so C varies as well. That is why the calibration has an offset.

3.2 Analysis for Calibration Offset

Assume that the calibration is performed at \(H_0\). Then the calibration offset is \(\frac{Df''}{H} - \frac{Df''}{H_0}\). When \(H>H_0\) (that is, when the actual object is farther from the camera), \(\frac{Df''}{H} - \frac{Df''}{H_0}<0\) because D and \(f''\) remain unchanged. This means that \(d''\) will be smaller than it should be if we still use the k and C obtained at \(H_0\) for calibration. Similarly, when \(H<H_0\) (that is, when the actual object is closer to the camera), \(d''\) will be larger than it should be.

However, it should be noticed that the offset varies with \(\frac{1}{H}\), which changes much faster for small H than for large H. Hence, for the same deviation from \(H_0\), the calibration offset at \(H<H_0\) is much larger in magnitude than the calibration offset at \(H>H_0\). For example, suppose \(H_0 = 5\) meters. On the one hand, when \(H = 1\) meter, the calibration offset is

$$\begin{aligned} \frac{Df''}{H} - \frac{Df''}{H_0} = \frac{4}{5} Df'' \end{aligned}$$

On the other hand, when \(H = 9\) meters, the calibration offset is

$$\begin{aligned} \frac{Df''}{H} - \frac{Df''}{H_0} = - \frac{4}{45} Df'' \end{aligned}$$

In this case, the magnitude of the offset at \(H = 1\) meter, \(\frac{4}{5} Df''\), is nine times larger than that at \(H = 9\) meters, \(\frac{4}{45} Df''\).

3.3 Methods to Reduce Calibration Offset

This section proposes four methods. The first two try to reduce the calibration offset at the physical level, while the latter two rectify the calibration results from the perspective of image processing.

Method 1: Using the Exact \(\boldsymbol{H}\) Instead of \(\boldsymbol{H_0}\). This method can even eliminate the calibration offset entirely. However, it requires real-time measurement of the distance between each object and the camera, which needs specialized equipment and incurs a high cost. Therefore, this method is theoretically feasible but not practically affordable.

Method 2: Decreasing the Distance Between the Two Cameras. Since the calibration offset \(\frac{Df''}{H} - \frac{Df''}{H_0}\) is proportional to the distance D between the two cameras, decreasing D is another option. However, the ideal situation \( D = 0 \) cannot be realized physically because the visible camera and the infrared camera are two separate devices.

Method 3: Looking for the Largest Connected White Area. This method mainly has 4 steps (shown in Fig. 4). The largest connected white area is the precise face box as required.

Fig. 4. A flowchart of the procedure of Method 3

Take Fig. 5 as an example. The red rectangle in Fig. 5 shows the face box obtained by using the calibration method above.

Fig. 5. An example for Method 3 (Color figure online)

First, expand the red rectangle by half of its side length to obtain the green rectangle in Fig. 5.

Second, set all the pixels with a gray-scale value between 221 and 255 to 1 (i.e. white) and set the other pixels to 0 (i.e. black).

Next, traverse all the white pixels and check whether the four pixels above, below, left, and right are all 1 (i.e. white). Save the positions of the pixels that do not meet this condition and set them to 0 (i.e. black).

Finally, use the seed-filling algorithm or the two-pass algorithm, both of which are classical connected-component labeling algorithms, to find all the connected white regions. Then compute the area of each connected region and find the largest one, which is the blue rectangle in Fig. 5.
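The following is a minimal sketch of these four steps, assuming an OpenCV/NumPy environment; it uses connectedComponentsWithStats in place of a hand-written seed-filling or two-pass implementation, and the (x, y, width, height) box format is an assumption.

```python
# A minimal sketch of Method 3, assuming an OpenCV/NumPy environment and a face
# box in (x, y, w, h) pixel format copied from the visible image by calibration.
import cv2
import numpy as np

def method3(ir_gray, box):
    x, y, w, h = box
    # Step 1: expand the calibrated box by half of its side length on each side.
    x0 = max(x - w // 2, 0); y0 = max(y - h // 2, 0)
    x1 = min(x + w + w // 2, ir_gray.shape[1]); y1 = min(y + h + h // 2, ir_gray.shape[0])
    roi = ir_gray[y0:y1, x0:x1]
    # Step 2: binarize, keeping gray values in [221, 255] as white (1).
    binary = ((roi >= 221) & (roi <= 255)).astype(np.uint8)
    # Step 3: keep only white pixels whose 4-neighbours are all white (an erosion).
    kernel = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=np.uint8)
    eroded = cv2.erode(binary, kernel)
    # Step 4: label connected white regions and keep the largest one.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(eroded, connectivity=4)
    if n <= 1:
        return None  # no white region found; fall back to the calibrated box
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    bx, by, bw, bh = stats[largest, :4]
    return (x0 + bx, y0 + by, bw, bh)  # face box in infrared-image coordinates
```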

Method 4: Using Target Detection Algorithms. Method 3 uses classic image processing techniques to detect faces. Combining the calibration with more advanced algorithms is therefore another option, especially mature and effective target detection algorithms such as Faster R-CNN [14] or YOLOv3 [13].

The first step of Method 4 is the same as that of Method 3: expanding each face box from the center. However, Method 4 directly applies a target detection algorithm to the expanded area, while Method 3 uses three more steps to process the infrared image.
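The following is a minimal sketch of Method 4, where detect_faces stands for any trained infrared face detector (e.g. a YOLOv3 or Faster R-CNN model); its interface is a hypothetical placeholder, not the configuration used in the experiments.

```python
# A minimal sketch of Method 4, assuming `detect_faces` is any trained infrared
# face detector (e.g. YOLOv3 or Faster R-CNN); its interface is hypothetical.
def method4(ir_image, box, detect_faces):
    x, y, w, h = box
    # Expand the calibrated face box from the center, as in Method 3.
    x0 = max(x - w // 2, 0); y0 = max(y - h // 2, 0)
    x1 = min(x + w + w // 2, ir_image.shape[1]); y1 = min(y + h + h // 2, ir_image.shape[0])
    crop = ir_image[y0:y1, x0:x1]
    # Run the detector only on the expanded area and map boxes back to the full image.
    detections = detect_faces(crop)  # assumed to return a list of (x, y, w, h)
    return [(x0 + dx, y0 + dy, dw, dh) for (dx, dy, dw, dh) in detections]
```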

4 Experiments and Results

4.1 Experimental Setup

Data Collection. Since there are few pairs of infrared and visible images with clear human faces in public data sets, this paper uses a self-collected data set containing 944 pairs of infrared and visible images. Some pairs of images are shown in Fig. 6. All the infrared images are grayscale images of 256 × 320 pixels, while all the visible images are RGB images of 1024 × 1280 pixels. The faces in each image are labeled with rectangular boxes as ground truths. Labels are stored in XML files.

Fig. 6. A few examples of the data set
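As an illustration of how such labels can be read, the following is a minimal sketch that assumes a Pascal VOC-like XML layout; the tag names are assumptions, since the paper does not specify the exact annotation schema.

```python
# A minimal sketch of reading the rectangular ground-truth boxes, assuming the
# XML labels follow a Pascal VOC-like layout; the tag names are assumptions.
import xml.etree.ElementTree as ET

def load_boxes(xml_path):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        b = obj.find("bndbox")
        boxes.append((int(b.find("xmin").text), int(b.find("ymin").text),
                      int(b.find("xmax").text), int(b.find("ymax").text)))
    return boxes
```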

For data augmentation, this paper performs a horizontal flip for all image pairs. Besides, this paper standardizes all the images before training, evaluation, and final testing.

This paper shuffles all preprocessed image pairs and extracts 70% of them as the training set, 15% as the validation set, and the remaining 15% as the test set.
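The following is a minimal sketch of the horizontal-flip augmentation and the 70/15/15 split described above; the (image, boxes) pair structure is a placeholder, not the data pipeline actually used in the paper.

```python
# A minimal sketch of the augmentation and split described above; `pairs` is a
# list of (visible_image, infrared_image, boxes) tuples, a placeholder structure.
import random

def horizontal_flip(image, boxes):
    """Flip an image left-right and mirror the (xmin, ymin, xmax, ymax) boxes."""
    width = image.shape[1]
    flipped_boxes = [(width - xmax, ymin, width - xmin, ymax)
                     for (xmin, ymin, xmax, ymax) in boxes]
    return image[:, ::-1], flipped_boxes

def split_dataset(pairs, seed=0):
    """Shuffle and split into 70% training, 15% validation, 15% test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(0.7 * len(pairs))
    n_val = int(0.15 * len(pairs))
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])
```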

Performance Measurement. This paper mainly uses two metrics to measure the performance of the methods: average precision (AP) at an intersection-over-union (IoU) threshold of 0.5, and frames per second (FPS) in the following experimental environment.

Figure a. The experimental environment

For each method, the AP and FPS of the optimal model on the test set will be collected as its final performance.
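The IoU criterion behind AP at IoU = 0.5 can be illustrated by the following minimal sketch: a detection counts as correct only if its IoU with a ground-truth box is at least 0.5. This is a generic illustration, not the evaluation code used in the experiments.

```python
# A minimal sketch of the IoU criterion behind AP at IoU = 0.5: a detection is
# counted as correct when its IoU with a ground-truth box is at least 0.5.
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix0 = max(box_a[0], box_b[0]); iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2]); iy1 = min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333..., below the 0.5 threshold
```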

Optimization and Evaluation. This paper uses a trained PP-YOLO model to detect faces in visible images; its FPS is about 50 in the above environment [10].

For Method 4, this paper uses two algorithms, YOLOv3 [13] and Faster R-CNN [14], to detect faces in infrared images. The optimization and evaluation strategy is shown in Table 1, where LR is the abbreviation of learning rate.

Table 1. Optimization and evaluation strategy

This paper mainly compares the performance of the following five methods.

Figure b. The five compared methods

The optimization and evaluation strategy of PP-YOLO + YOLOv3 is the same as that of YOLOv3 only. Similarly, the optimization and evaluation strategy of PP-YOLO + Faster R-CNN is the same as that of Faster R-CNN only.

Random Noise. In order to compare the robustness of the methods, this paper conducts experiments in two situations: with random noise and without random noise.

The random noise is added by randomly scaling the brightness and contrast of the images. Brightness and contrast are each scaled with probability 0.5. When scaling is applied, a random value is drawn from [0.5, 1.5] as the scaling factor (\(s_b\) for brightness, \(s_c\) for contrast), and the image is distorted according to the following formulas.

$$\begin{aligned} \text {Brightness : } \mathbf {Image}(i,j) \Longrightarrow s_b \cdot \mathbf {Image}(i,j) \end{aligned}$$
$$\begin{aligned} \text {Contrast : } \mathbf {Image}(i,j) \Longrightarrow s_c \cdot \mathbf {Image}(i,j) + (1-s_c) \cdot \overline{\mathbf {Image}} \end{aligned}$$

where \(\mathbf {Image}(i,j)\) refers to each pixel and \(\overline{\mathbf {Image}}\) is the mean of the image. Namely,

$$\begin{aligned} \overline{\mathbf {Image}} = \frac{1}{M \cdot N} \sum _{i=1}^M \sum _{j=1}^N \mathbf {Image}(i,j) \end{aligned}$$
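The following is a minimal sketch of this noise model; applying the two scalings independently with probability 0.5 each, and clipping back to the 8-bit range, are interpretation choices not stated explicitly in the text.

```python
# A minimal sketch of the random noise described above: brightness and contrast
# are each scaled with probability 0.5, with factors drawn from [0.5, 1.5].
import numpy as np

def add_random_noise(image, rng=np.random.default_rng()):
    image = image.astype(np.float32)
    if rng.random() < 0.5:                       # brightness scaling
        s_b = rng.uniform(0.5, 1.5)
        image = s_b * image
    if rng.random() < 0.5:                       # contrast scaling about the mean
        s_c = rng.uniform(0.5, 1.5)
        image = s_c * image + (1.0 - s_c) * image.mean()
    # Clip back to the 8-bit range (an added assumption, not stated in the paper).
    return np.clip(image, 0, 255).astype(np.uint8)
```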

4.2 Results

The experimental results are shown in Table 2.

Table 2. The experimental results

Experiments show that Method 3 has a much higher FPS than the other methods, which means that Method 3 can quickly find the positions of human faces in infrared images. However, the AP of Method 3 is much worse than that of Method 4. That is mainly because Method 3 is less robust than Method 4. For example, if a person wears a mask, Method 3 tends to output only the part of the face without the mask, while the ground truth often includes the mask.

However, for an intelligent thermal imager, only the face area used for temperature measurement needs to be determined. Occluded parts of the face (such as the area covered by a mask) will not be used for temperature measurement. Therefore, although Method 3 cannot detect a complete face in some cases, it can still be used in this scenario.

Therefore, if high precision is required, Method 4 is better; if low latency is required, Method 3 is better.

5 Conclusion

This paper mainly proposes two methods to solve the problem of face detection in infrared images. One (called Method 3) uses image processing methods and the faces detected in visible images to determine the position of the face in the infrared image, while the other (called Method 4) applies target detection algorithms, including YOLOv3 and Faster R-CNN, to the infrared images.

To be specific, Method 3 mainly consists of 4 steps:

1. Expand each face box obtained by calibration from the center by half of its side length.
2. Binarize the expanded area, keeping pixels with gray-scale values between 221 and 255 as white.
3. Set to black the white pixels whose four neighbours (up, down, left, and right) are not all white.
4. Find the largest connected white region, which is the face box in the infrared image.

Method 4 shares the first step with Method 3 (expanding each face box from the center), but then directly applies a target detection algorithm to the expanded area instead of the three remaining image processing steps.

Experiments show that Method 3 can quickly find the positions of human faces in infrared images. However, Method 3 is less robust than Method 4, which is the main reason why its performance is worse. Therefore, if high precision is required, Method 4 is better; if low latency is required, Method 3 is better.

Besides, the methods proposed in this paper still have room for improvement in how visible and infrared images are associated. This paper uses the calibration method proposed in Sect. 2 to associate visible and infrared images. In the future, other association methods are also worth trying (such as Cross-Modal CNN [2]).