
1 Introduction

In the past year, the whole world has been affected by the COVID-19 pandemic. Under the influence of the epidemic, people choose to wear masks before they go out. Although wearing a mask is an efficient way to prevent infection in daily life, it also poses a challenge to face recognition. For example, when tickets are checked at a train station, there is a dilemma: wearing a mask hinders ticket inspection, yet it is risky for people to take off their masks and expose themselves to the virus. In particular, identifying people wearing masks in 3D scenes can be a great challenge [1, 2].

Traditionally, a 3D scene is constructed as follows. First, pictures of the scene are taken with a sensor; multiple photos are obtained, shooting from as many angles as conditions permit. Finally, these images are processed to obtain three-dimensional video images. In recent years, with the development of Internet technology, the efficiency of building such 3D visualizations has improved [3]. At the same time, 3D models have the advantage of displaying large amounts of information, which can be exploited to identify people wearing masks in 3D scenes.

The objectives of this study are (1) to construct a 3D scene and manually build a 3D model, (2) to fuse the 3D data with the 3D model to obtain a 3D visualization scene, and (3) to develop a system that recognizes occluded faces in the 3D scene [4]. In Sect. 16.2, we describe our platform, which was constructed in the school office. In Sect. 16.3, we present 2D face recognition and collect frames according to whether a mask is worn or not. In Sect. 16.4, we carry out 3D face recognition in the real scene and finally address occluded face recognition in the real 3D scene.

2 Construction of the Platform

The experimental platform comprises the following equipment: a five-way ceiling panoramic camera, two face capture cameras, one set of VR equipment, a computer workstation, laptops, an integrated temperature measurement prototype, and switch sockets. The installation is as follows:

  1. The ceiling panoramic camera is installed on the indoor roof, with the wiring run through grooves;

  2. The VR demonstration is set up in the entrance area, where it is easy to demonstrate;

  3. The face capture cameras are connected to the system but not fixed, for the convenience of subsequent development;

  4. The computer serves as our workstation and can be used normally after booting;

  5. The integrated temperature measurement prototype is connected to the system and measures temperature normally.

The indoor hardware can be described as follows:

  1. A complete temperature measurement development environment is set up indoors for further development;

  2. The ceiling design frees up more interior space and prevents a lot of accidental damage;

  3. An HTC Vive Pro 2.0 headset is used as the VR equipment, which allows users to move within a certain area and experience the VR effect;

  4. The HR-IPC2143 intelligent face recognition gun-type network camera is used as the face capture camera; it operates normally and provides high face recognition accuracy with low power consumption;

  5. A DELL 5540 mobile workstation is used as the computer workstation and operates normally;

  6. An eight-inch dual-vision temperature-measuring live face recognition machine is used.

3 Identifying People Wearing Masks

3.1 Collecting and Classifying Faces Data

We first experimented on a video to recognize whether people wear a mask or not, and saved the face images of every frame, as shown in Fig. 16.1.

Fig. 16.1

Original data: saved face images of every frame

Then, after recognition, faces with and without masks were automatically saved into separate sets. As shown in Fig. 16.2, some images, especially those with paper covering the face, were not saved well.

Fig. 16.2

Classified data: automatically saved images with and without masks

We measured the classification accuracy for 2D faces with and without masks, as shown in Table 16.1. The data are read from our prerecorded video, which contains 1321 frames; a brief sketch of this step follows Table 16.1.

Table 16.1 Accuracy of face data classification
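As a minimal sketch of this collection and classification step (the mask/no-mask classifier is supplied externally and is only a placeholder here, not the detector used in our system), frames of the prerecorded video can be sorted into two folders:

import os
from typing import Callable

import cv2  # OpenCV for video decoding and image I/O
import numpy as np


def split_frames(video_path: str,
                 has_mask: Callable[[np.ndarray], bool],
                 out_dir: str = "classified") -> int:
    """Save every frame of the prerecorded video into a mask / no_mask folder.

    `has_mask` is whichever mask/no-mask classifier is used; it takes a BGR
    frame and returns True if a mask is detected.
    """
    os.makedirs(os.path.join(out_dir, "mask"), exist_ok=True)
    os.makedirs(os.path.join(out_dir, "no_mask"), exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                        # end of the prerecorded video
            break
        label = "mask" if has_mask(frame) else "no_mask"
        cv2.imwrite(os.path.join(out_dir, label, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx                            # total number of frames processed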

3.2 2D-3D Face Recognition with Mask

First of all, the camera needs to be calibrated before positioning [4]; in this case, only the camera's intrinsic parameters and distortion parameters are needed. Two-dimensional coordinates are obtained through camera identification, and a world coordinate system is then defined. Once the three-dimensional coordinates of the target point are defined in that system, the pose of the camera can be computed [5], and from the relative position we obtain the coordinates of the target point with respect to the camera. The coordinates of the target point in the world coordinate system are then obtained by Euler angle transformation and TF transformation. Coordinate translation corresponds to matrix addition and subtraction, while coordinate rotation corresponds to matrix multiplication; the advantage of homogeneous coordinates is that, by adding one dimension, both operations are expressed in a single formula:

$$\left( {\begin{array}{*{20}c} x \\ y \\ z \\ \end{array} } \right) \sim \left[ {\begin{array}{*{20}c} {f_{x} } & 0 & {c_{x} } \\ 0 & {f_{y} } & {c_{y} } \\ 0 & 0 & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {r_{11} } & {r_{12} } & {r_{13} } & {t_{1} } \\ {r_{21} } & {r_{22} } & {r_{23} } & {t_{2} } \\ {r_{31} } & {r_{32} } & {r_{33} } & {t_{3} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} X \\ Y \\ Z \\ 1 \\ \end{array} } \right]$$
(16.1)

The above formula describes the coordinate transformation; \(\left[ {R|t} \right]\) is the augmented matrix combining rotation and translation.
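As a minimal numeric sketch of Eq. (16.1), the projection can be checked with assumed intrinsic values (fx, fy, cx, cy) and an assumed pose [R|t]; the numbers below are illustrative placeholders, not calibration results from our platform.

import numpy as np

# Assumed intrinsic matrix (focal lengths fx, fy and principal point cx, cy).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed extrinsics: rotation R and translation t of the world frame with
# respect to the camera, i.e. the augmented matrix [R|t] of Eq. (16.1).
R = np.eye(3)
t = np.array([[0.1], [0.0], [2.0]])
Rt = np.hstack([R, t])                    # 3x4 extrinsic matrix

X_world = np.array([[0.5], [0.2], [1.0], [1.0]])   # homogeneous world point

x = K @ Rt @ X_world                      # homogeneous image coordinates
u, v = (x[:2] / x[2]).ravel()             # divide by the third coordinate
print(f"pixel coordinates: ({u:.1f}, {v:.1f})")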

Euler angles specify both the order in which an object is rotated about the axes and the angle rotated about each axis. Many people tend to ignore the rotation order, which many books call a convention, i.e., the convention governing the order of Euler angle rotations [6]. The same angles (α, β, γ) applied in different rotation orders give different results: rotating first by α about the X-axis or first by β about the Y-axis leads to different final orientations. There are many Euler angle conventions, such as Z-X-Y, X-Y-Z, and X-Y-X, with many possible permutations and combinations.

Starting from \(x = r\cos \phi\), \(y = r\sin \phi\), \(x^{\prime } = r\cos (\theta + \phi )\), and \(y^{\prime } = r\sin (\theta + \phi )\), and expanding \(x^{\prime }\) and \(y^{\prime }\) in terms of x and y, we obtain the matrix form:

$$\left[ {\begin{array}{*{20}c} {x^{\prime } } \\ {y^{\prime } } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\cos \theta } & { - \sin \theta } \\ {\sin \theta } & {\cos \theta } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} x \\ y \\ \end{array} } \right]$$
(16.2)

Extending this to three dimensions gives the final forms below.

Rotating about the X-axis:

$$R_{x} (\theta ) = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 \\ 0 & {\cos \theta } & { - \sin \theta } \\ 0 & {\sin \theta } & {\cos \theta } \\ \end{array} } \right]$$
(16.3)

Rotating about the Y-axis:

$$R_{{\text{y}}} (\theta ) = \left[ {\begin{array}{*{20}c} {\cos \theta } & 0 & {\sin \theta } \\ 0 & 1 & 0 \\ { - \sin \theta } & 0 & {\cos \theta } \\ \end{array} } \right]$$
(16.4)

Rotating about the Z-axis:

$$R_{z} (\theta ) = \left[ {\begin{array}{*{20}c} {\cos \theta } & { - \sin \theta } & 0 \\ {\sin \theta } & {\cos \theta } & 0 \\ 0 & 0 & 1 \\ \end{array} } \right]$$
(16.5)

We compute the rotation matrix about each axis and multiply them in order to obtain the whole rotation matrix. Since rotation matrices are composed by left-multiplication, the combined rotation matrix [7] for the Z-Y-X Euler angle convention is \(R = R_{z} R_{y} R_{x}\). R is a 3 × 3 matrix, the rotation matrix of the entire coordinate transformation. The angles can then be recovered with inverse trigonometric functions:

$$\theta_{{\text{x}}} = \arctan \frac{{r_{32} }}{{r_{33} }},\,\theta_{{\text{y}}} = \arctan \frac{{ - r_{31} }}{{\sqrt {r^{2}_{32} + r^{2}_{33} } }},\theta_{{\text{z}}} = \arctan \frac{{r_{21} }}{{r_{11} }}$$
(16.6)
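The following sketch composes R = RzRyRx from Eqs. (16.3)-(16.5) and recovers the angles with Eq. (16.6), using atan2 in place of arctan to keep the correct quadrant; it is a minimal illustration, not the system's implementation.

import numpy as np

def euler_zyx_to_matrix(tx: float, ty: float, tz: float) -> np.ndarray:
    """Compose R = Rz(tz) @ Ry(ty) @ Rx(tx), Eqs. (16.3)-(16.5)."""
    cx, sx = np.cos(tx), np.sin(tx)
    cy, sy = np.cos(ty), np.sin(ty)
    cz, sz = np.cos(tz), np.sin(tz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def matrix_to_euler_zyx(R: np.ndarray):
    """Recover (theta_x, theta_y, theta_z) from R, Eq. (16.6)."""
    tx = np.arctan2(R[2, 1], R[2, 2])
    ty = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    tz = np.arctan2(R[1, 0], R[0, 0])
    return tx, ty, tz

# Round trip: compose a rotation and recover the same angles.
angles = (0.1, -0.4, 0.8)
print(matrix_to_euler_zyx(euler_zyx_to_matrix(*angles)))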

3.3 2D Alignment of the Fundamental Ratio

It is easy to verify that the essential matrix relating frames i and j of the scene, obtained from their relative rotation and translation, satisfies:

$$E_{i,j} \sim [t_{i,j} ]_{ \times } R_{i,j} \sim E_{c(i),c(j)}^{\prime }$$
(16.7)

where ∼ denotes equality up to a non-zero scale factor, \(\left[ \cdot \right]_{ \times }\) is the skew-symmetric matrix representing the cross product, and \(R_{i,j}\) is the rotation matrix. Therefore, when the camera calibration matrices \(K\) and \(K^{\prime }\) of sequences V and \(V^{\prime }\) are identical, the corresponding uncalibrated matrices \(F_{i,j}\) and \(F_{c(i),c(j)}^{\prime }\) are equal and can be used for video synchronization [8].

In the more common case, however, K and K′ are constant within each sequence but differ between the two sequences.

For a simplified camera model, with unit aspect ratio and zero skew, it can be verified that:

$$F^{2 \times 2} \sim \left[ {\begin{array}{*{20}c} {\epsilon_{1st} \,t_{i,j}^{s} r_{1}^{t} } & {\epsilon_{1st} \,t_{i,j}^{s} r_{2}^{t} } \\ {\epsilon_{2st} \,t_{i,j}^{s} r_{1}^{t} } & {\epsilon_{2st} \,t_{i,j}^{s} r_{2}^{t} } \\ \end{array} } \right]$$
(16.8)

where \(\epsilon_{rst}\), with r, s, t = 1, 2, 3, is the permutation tensor, \(r_{i}\) denotes the i-th column of \(R_{i,j}\), and \(t_{i,j}^{s}\) are the components of the camera translation. It is worth noting that \(F^{2 \times 2}\) is observable: its elements are entries of \(F_{i,j}\). Their ratios are independent of the internal parameters of the camera and reflect only the self-motion of the camera; in this article, we call them fundamental ratios [9]. Therefore, we can extract a parameter-independent four-dimensional feature \(V_{f}\):

$$V_{f} = {\text{sign}}(F_{11} )\,[F_{11} ,F_{12} ,F_{21} ,F_{22} ]/\left\| {F^{2 \times 2} } \right\|_{F}$$
(16.9)

where \(\left\| \cdot \right\|_{F}\) is the Frobenius norm. When two cameras are related by a similarity transformation matrix \(H_{s}\), it can be shown that they have the same motion trajectory, and under this premise their sequences can be aligned. The scale ambiguity \(H_{s}\) takes the form \(H_{s} = \left( {\begin{array}{*{20}c} {aI} & 0 \\ 0 & 1 \\ \end{array} } \right)\) for a scale factor a. The positions and poses of the two cameras are related up to this scale ambiguity, but the estimates are corrupted by noise, so we compute the camera pair corresponding to each position.

\(V_{f}\) has five degrees of freedom: three from the rotation \(R_{i,j}\) and two from the translation \(t_{i,j}\), which is defined only up to scale. In addition, for the same fundamental matrix there are four possible configurations of relative camera position and orientation. The proposed method is not applicable in some situations, such as pure translation, when the camera center is fixed (that is, there is no change in camera position), or when the fundamental ratios [10] are computed from a planar scene. However, as shown in this paper, similar camera self-motion produces the same \(V_{f}\), which can be used to synchronize video sequences.
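A hedged sketch of extracting the feature \(V_{f}\) of Eq. (16.9): the fundamental matrix is estimated here with OpenCV's RANSAC routine as a stand-in for MAPSAC, \(F^{2 \times 2}\) is read as the upper-left 2 × 2 block of F, and the point correspondences are assumed to come from SIFT matching.

import cv2
import numpy as np

def fundamental_ratio_feature(pts1: np.ndarray, pts2: np.ndarray) -> np.ndarray:
    """Estimate F from matched points and return V_f of Eq. (16.9).

    pts1, pts2: Nx2 arrays (N >= 8) of corresponding image points, e.g. from
    SIFT matching. RANSAC is used here in place of MAPSAC.
    """
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    if F is None:
        raise ValueError("fundamental matrix estimation failed")
    F22 = F[:2, :2]                              # observable 2x2 block of F
    v = np.sign(F22[0, 0]) * F22.ravel()         # [F11, F12, F21, F22]
    return v / np.linalg.norm(F22, "fro")        # Frobenius normalization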

In computing the fundamental ratios, SIFT features [11] are used to establish correspondences between frames, and the MAPSAC algorithm, which minimizes reprojection error, is used to estimate the planar homography for purely rotational motion and the fundamental matrix for general camera motion [12]. The epipole for pure translation along a straight line is computed by eigenvalue decomposition of the matrix. Frames i and l may be far apart, so when there are no direct correspondences from which to compute the fundamental matrix, it is obtained by chaining through an observation graph of views [13]: given the fundamental matrices within the view triplets \((i,j,k)\) and \((j,k,l)\), the matrix between the two frames is available.

Finally, to improve the robustness of the proposed method, we use a coarse-to-fine framework: the coarse-level synchronization captures global features, so errors in the frame correspondence computation do not propagate to the rest of the alignment path [14].

Alignment along the time axis uses the accumulated error:

$$E({{j}},{{j}}^{\prime } ) = {\text{dist}}({{j}},{{j}}^{\prime } ) + \min \{ E({{j}},{{j}}^{\prime } - 1),E({{j}} - 1,{{j}}^{\prime } - 1),E({{j}} - 1,{{j}}^{\prime } )\}$$
(16.10)

where dist(j, j′) is the distance between frames j and j′, computed as the mean square error between their features.

The synchronization, determined by the set of parameters \(c(j)\), is therefore computed as:

$${\text{c}}({\text{j}}) = \arg \mathop {\min }\limits_{c(1) \le \cdots \le c(N)} \sum\limits_{{{\text{j}} = 1}}^{N} {E({\text{j}},c({\text{j}}))}$$
(16.11)

where N is the number of input video frames; dynamic programming is then used to solve the optimization problem defined in this equation.
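A minimal dynamic-programming sketch of Eqs. (16.10) and (16.11), assuming dist(j, j′) is the mean square error between the per-frame features of the two sequences; the mapping c(j) is read off by backtracking through the accumulated-error table.

import numpy as np

def accumulated_error(feats1: np.ndarray, feats2: np.ndarray) -> np.ndarray:
    """Fill E(j, j') of Eq. (16.10) for two sequences of frame features.

    feats1: N x d features of sequence V, feats2: M x d features of V'.
    dist(j, j') is taken as the mean square error between feature vectors.
    """
    N, M = len(feats1), len(feats2)
    E = np.full((N, M), np.inf)
    for j in range(N):
        for k in range(M):
            d = np.mean((feats1[j] - feats2[k]) ** 2)
            if j == 0 and k == 0:
                E[j, k] = d
            else:
                prev = min(
                    E[j, k - 1] if k > 0 else np.inf,
                    E[j - 1, k - 1] if j > 0 and k > 0 else np.inf,
                    E[j - 1, k] if j > 0 else np.inf,
                )
                E[j, k] = d + prev
    return E

def synchronize(feats1, feats2):
    """Backtrack a monotone mapping c(j) approximating Eq. (16.11)."""
    E = accumulated_error(np.asarray(feats1), np.asarray(feats2))
    j, k = E.shape[0] - 1, int(np.argmin(E[-1]))   # best match for the last frame
    c = {j: k}
    while j > 0:
        moves = [(j - 1, k - 1), (j - 1, k), (j, k - 1)]
        j, k = min((m for m in moves if m[0] >= 0 and m[1] >= 0),
                   key=lambda m: E[m])
        c[j] = k
    return E, c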

4 3D Face Recognition in the Real Scene

4.1 Video Image Format Conversion

Video images obtained from the camera are usually in YUV format [15], but on the PC we only process images in RGB format, so the YUV images must be converted to RGB. YUV and RGB are two different color encoding schemes. In YUV, Y is the luminance (brightness), while the chrominance is carried by U and V, which specify the color of each pixel in the acquired image, describing its hue and saturation. Therefore, if a picture has only Y-channel data, it can still be displayed completely, but only in black and white. An image in YUV format can be converted to RGB with the following formula, as described in Keith Jack's book [16]:

$$\begin{gathered} {\text{B}} = 1.164({\text{Y}} - 16) + 2.018({\text{U}} - 128) \hfill \\ {\text{G}} = 1.164({\text{Y}} - 16) - 0.813({\text{V}} - 128) - 0.391({\text{U}} - 128) \hfill \\ {\text{R}} = 1.164({\text{Y}} - 16) + 1.596({\text{V}} - 128) \hfill \\ \end{gathered}$$
(16.12)

Note that in the above formula, the range for R, G, and B is [0, 255], the range for Y is [16, 235], and the range for U and V is [16, 240]. Results that fall outside these ranges are clipped.

This pixel-by-pixel conversion is the simplest and most direct way to convert YUV into RGB: by visiting each pixel of the image, the whole YUV image is converted into an RGB image.
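A vectorized sketch of Eq. (16.12), assuming the input has already been split into full-resolution Y, U and V planes (chroma upsampling for subsampled formats such as YUV420 would have to be done first); out-of-range results are clipped as noted above.

import numpy as np

def yuv_to_rgb(y: np.ndarray, u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Convert full-resolution Y, U, V planes to an RGB image per Eq. (16.12)."""
    y = y.astype(np.float32) - 16.0
    u = u.astype(np.float32) - 128.0
    v = v.astype(np.float32) - 128.0
    r = 1.164 * y + 1.596 * v
    g = 1.164 * y - 0.813 * v - 0.391 * u
    b = 1.164 * y + 2.018 * u
    rgb = np.stack([r, g, b], axis=-1)
    return np.clip(rgb, 0, 255).astype(np.uint8)   # truncate out-of-range values

For packed YUV inputs, OpenCV's cv2.cvtColor offers equivalent built-in conversions, which avoid the explicit per-pixel arithmetic above.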

4.2 The 3D-Visualized Face Recognition

A panoramic video is obtained through the panoramic camera; we then need to build a three-dimensional model and fuse the video with it, after which a three-dimensional visualization scene can be established for face recognition. SketchUp, a design tool oriented to the creation of 3D architectural design schemes, is used; the process is shown in Fig. 16.3. First, a floor plan of the room is drawn, which can be done with the Line tool. Then, the Push/Pull tool is used to build a preliminary three-dimensional model. The hand-built model from SketchUp Pro is added to the folder where the Holographic Camera software is located to perform the 3D fusion [17, 18]. Finally, the three-dimensional model is optimized to obtain the effect pictures below. To perform the 3D fusion, images must be obtained through a holographic camera, as shown in Fig. 16.3. By fusing the images obtained by the holographic camera with the 3D model, the image after virtual-real fusion is obtained [19]. Based on this three-dimensional fusion process, we can carry out face recognition in the three-dimensional visualization scene: first the face image is obtained, and then the 3D fusion is performed.

Fig. 16.3

The process for 3D-visualized face recognition

Camera coordinates are mapped to video image coordinates using a homogeneous representation; here we use a binocular camera, and the 2D-3D transformation has been given above [20], as shown in Figs. 16.3 and 16.4 (a triangulation sketch is given below Fig. 16.4).

Fig. 16.4

The actual experimental results
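As a hedged sketch of this 2D-3D step, assuming both cameras of the binocular pair are calibrated and their projection matrices P = K[R|t] have been obtained as in Sect. 3.2, matched 2D face points can be lifted to 3D by triangulation:

import cv2
import numpy as np

def triangulate(P1: np.ndarray, P2: np.ndarray,
                pts1: np.ndarray, pts2: np.ndarray) -> np.ndarray:
    """Triangulate matched face points from the two views of the binocular pair.

    P1, P2: 3x4 projection matrices K[R|t] of the two cameras.
    pts1, pts2: Nx2 arrays of corresponding 2D points (e.g. facial landmarks).
    Returns Nx3 points in the world coordinate system.
    """
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float64),
                                pts2.T.astype(np.float64))
    return (X_h[:3] / X_h[3]).T            # de-homogenize the 4xN result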

In this way, occluded face recognition in the 3D visualization scene is completed. The results above show that establishing a three-dimensional visualization scene has a positive effect on the recognition of occluded faces and can assist it. In this paper, we collected many faces, both with and without masks, into our face library, which helps us further achieve 3D face recognition.

5 Conclusions and Perspectives

This paper mainly proposes applying 3D technology to recognize people and using the 2D scene to collect faces. We will further evaluate the recognition accuracy against a standard face library. Through this simple attempt, we address the problem of occluded faces that are difficult to recognize. During the COVID-19 pandemic, compared with contact authentication such as fingerprints, contactless face recognition authentication has become an important tool, for example during the May 1st Conference. The most important element of face recognition is the biological information of the face, including the facial contour and the positions of the nose and mouth; the more feature information is available, the more accurate the face recognition results. However, during the epidemic, people wear masks in and out of public places, which greatly affects the accuracy of face recognition and of two-dimensional occluded face recognition. By combining 3D video face recognition technology and recognizing face images and video, the face images in the 2D scene are obtained and used to build the face library. With the help of more facial feature information, face recognition can be applied more easily in many scenes, and its accuracy can ultimately be greatly improved.