
1 Introduction

In the Fourth Industrial Revolution, informatics and data science provide substantial support for automated production [1]. These innovations have led to a multitude of accomplishments, such as intelligent monitoring systems, sophisticated transportation infrastructure, automated financial systems, and industrial assembly robot manipulators [2,3,4,5].

In keeping with this trend, we introduce an MTCNN-based face recognition model for intelligent mechatronic systems. By comparing pre-selected facial features from an image database with a person's face, the system can verify that person's identity. In earlier research, Local Binary Patterns (LBP) transformed the input image into a binary image, partitioned the face into blocks, and computed a per-block histogram density to produce the histogram feature [6]. However, feature extraction from the histogram can be influenced by external factors such as input image quality and illumination. The Dlib method, which combines HOG and SVM [7], has also been employed, but its accuracy can suffer when the face angle changes. In addition, the well-known FaceNet face recognition system [8] computes the distance between face vectors using the triplet loss function. However, the number of operations the computer must execute grows rapidly as the volume of input data and the overlap between facial features increase. To reduce this effect, we employ the ArcFace model, which computes the distance between face vectors as a deviation angle and introduces an additive angular margin m to separate the features of different identities. ArcFace develops and enhances FaceNet, yielding a feature separation that prevents misidentification when the stored image resembles a photo taken from a direct angle. Experimental results reached a frame rate of 14–16 FPS and an accuracy of approximately 96%. Face recognition for security is combined with finger gestures for home automation control, and all real-time monitoring data is shown on the IoT smart-home interface.

2 Methodology

This paper proposes a face recognition process, summarized in Fig. 1.

Fig. 1
Face and gesture recognition process: the input image is passed through a MobileNet-based detector, the detected face is cropped and embedded by ArcFace, the embedding is classified with KNN or SVM, and the recognized face enables gesture recognition

MTCNN, an algorithm for detecting faces and facial landmarks with high speed and precision, is utilized in this procedure. The MTCNN method consists of three neural networks (NN) representing three stages. The first stage employs a shallow CNN to rapidly generate candidate bounding boxes. The second stage refines the acquired bounding boxes with a more sophisticated CNN. The final stage uses a still more advanced CNN to refine the result and generate facial landmarks. ArcFace then takes each individual's face image as input and produces a vector of 512 numbers reflecting the most prominent facial traits; in machine learning this vector is called an embedding vector. Next, a classifier measures the distance between facial embeddings in order to distinguish between identities. Owing to their effectiveness in multi-class classification, Support Vector Machines (SVM) [9] and K-Nearest Neighbors (KNN) [10] are two of the most popular choices. Finally, once a face has been identified, the user operates the IoT system via hand gestures.
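
As a rough illustration of this pipeline, the Python sketch below wires detection, embedding, and classification together. It assumes the `mtcnn` package for detection and scikit-learn for the KNN classifier; `arcface_embed` is a hypothetical placeholder for whichever ArcFace implementation is used, not the authors' code.

```python
# Minimal sketch of the detection -> embedding -> classification pipeline.
import cv2
import numpy as np
from mtcnn import MTCNN
from sklearn.neighbors import KNeighborsClassifier

detector = MTCNN()

def crop_faces(image_bgr):
    """Detect faces with MTCNN and return the cropped face regions."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    crops = []
    for det in detector.detect_faces(rgb):  # each det has 'box' and 'keypoints'
        x, y, w, h = det["box"]
        crops.append(rgb[max(y, 0):y + h, max(x, 0):x + w])
    return crops

def arcface_embed(face_rgb):
    """Placeholder: run an ArcFace network and return a 512-d embedding."""
    raise NotImplementedError("plug in an ArcFace model here")

def train_classifier(known_faces, labels):
    """Fit a KNN classifier on embeddings of enrolled identities."""
    X = np.stack([arcface_embed(f) for f in known_faces])
    clf = KNeighborsClassifier(n_neighbors=3, metric="cosine")
    clf.fit(X, labels)
    return clf
```

An SVM could be substituted for the KNN classifier with a one-line change (`sklearn.svm.SVC`), which is the other option named above.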

2.1 MTCNN

The image is first rescaled to get an image pyramid that helps the model to detect faces of different sizes (Fig. 2).

Fig. 2
MTCNN architecture: three networks process the image pyramid at different scales; P-Net and R-Net perform face classification and bounding-box regression, while O-Net additionally performs landmark localization and head-pose estimation
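
As a small illustration of the rescaling step, the pyramid can be built by repeatedly downscaling the input until the shorter side would fall below P-Net's 12-pixel window. The scale factor 0.709 used below is the value common in public MTCNN implementations, an assumption rather than a figure from this paper.

```python
import cv2

def build_pyramid(image, min_face_size=20, factor=0.709, min_side=12):
    """Return progressively downscaled copies of the image for MTCNN's P-Net."""
    # Initial scale so that a face of min_face_size maps onto the 12-px window.
    scale = min_side / min_face_size
    pyramid = []
    h, w = image.shape[:2]
    while min(h, w) * scale >= min_side:
        resized = cv2.resize(image, (int(w * scale), int(h * scale)))
        pyramid.append((scale, resized))
        scale *= factor  # shrink by the pyramid factor each step
    return pyramid
```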

2.2 ArcFace

Deep Convolutional Neural Network (DCNN) models have become prevalent for the extraction of facial features due to their exceptional benefits. There are two primary techniques for building a classification model from facial feature vectors: the triplet loss function and the softmax loss function. The softmax loss, which combines the cross-entropy loss with the softmax activation function [12], is typically applied to face recognition [11]. Using the softmax function, however, causes the linear transformation matrix to grow with the number of classes being classified. The softmax loss function L1 is given by:

$$ L_{1} = - \frac{1}{N}\sum\limits_{i = 1}^{N} \log \frac{e^{W_{y_{i}}^{T} x_{i} + b_{y_{i}}}}{\sum\nolimits_{j = 1}^{n} e^{W_{j}^{T} x_{i} + b_{j}}} $$
(1)

where \(x_{i} \in R^{d}\) represents the deep feature of sample i, belonging to class \(y_{i}\). The embedding feature size d is set to 512. \(W_{j} \in R^{d}\) represents the jth column of the weight matrix \(W \in R^{d \times n}\) and \(b_{j}\) is the jth component of the bias \(b \in R^{n}\). The batch size and the number of classes are N and n, respectively.
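
For concreteness, Eq. 1 can be evaluated directly from the logits \(W^{T} x_{i} + b\); the NumPy sketch below is illustrative only, not the authors' implementation.

```python
import numpy as np

def softmax_loss(W, b, X, y):
    """Eq. 1: mean cross-entropy of the softmax over logits W^T x + b.
    W: (d, n) weights, b: (n,) bias, X: (N, d) features, y: (N,) labels."""
    logits = X @ W + b                                   # (N, n)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```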

Since embedded features are dispersed around the center of each class on the hypersphere, an additive angular margin penalty m is introduced between \(x_{i}\) and \(W_{y_{i}}\), improving both intra-class compactness and inter-class discrepancy. Because this additive angular margin penalty is equivalent to the geodesic distance margin penalty on the normalized hypersphere, the method is referred to as the ArcFace loss L2 (see Eq. 2).

$$ L_{2} = - \frac{1}{N}\sum\limits_{i = 1}^{N} \log \frac{e^{s\left( \cos \left( \theta_{y_{i}} + m \right) \right)}}{e^{s\left( \cos \left( \theta_{y_{i}} + m \right) \right)} + \sum\nolimits_{j = 1, j \ne y_{i}}^{n} e^{s \cos \theta_{j}}} $$
(2)
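
A matching sketch of Eq. 2 normalizes the features and class weights, recovers the target angle \(\theta_{y_{i}} = \arccos (W_{y_{i}}^{T} x_{i})\), adds the margin m to the target class only, and re-scales by s before the cross-entropy. The defaults s = 64 and m = 0.5 below come from the original ArcFace paper and are assumptions here.

```python
import numpy as np

def arcface_loss(W, X, y, s=64.0, m=0.5):
    """Eq. 2: additive angular margin loss.
    W: (d, n) class weights, X: (N, d) embeddings, y: (N,) labels."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)     # unit-norm columns
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm features
    cos = np.clip(Xn @ Wn, -1.0, 1.0)                     # (N, n) cosines
    theta = np.arccos(cos[np.arange(len(y)), y])          # target angles
    logits = s * cos
    logits[np.arange(len(y)), y] = s * np.cos(theta + m)  # margin on target only
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```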

Figure 3 shows the process of training a DCNN for face recognition with the ArcFace loss function.

Fig. 3
Procedure for training a DCNN for recognition with the ArcFace loss: normalized features and normalized weights yield the logit \(\cos \theta_{y_{i}}\); its arccos is taken, the additive angular margin penalty is applied, and the re-scaled logit is combined with the ground-truth one-hot vector in the cross-entropy loss

3 System Structure

3.1 Hardware and Software

Process and System Design

The face and hand gesture recognition system using Jetson Nano is capable of handling multiple video streams (see Fig. 4).

Fig. 4
Block diagram of the proposed hardware system: an IMX219-77 8 MP camera and an LCD screen attached to a Jetson Nano 2 GB performing face detection and hand-gesture recognition, with a Raspberry Pi 3 Model B driving the sensors and devices
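
On the Jetson Nano, a CSI camera such as the IMX219 is usually opened through a GStreamer pipeline rather than a bare device index. The sketch below follows that common pattern; the 1280 x 720 at 30 FPS settings are assumptions, not the authors' configuration.

```python
import cv2

# GStreamer pipeline for the CSI-connected IMX219 on a Jetson Nano.
GST = (
    "nvarguscamerasrc ! "
    "video/x-raw(memory:NVMM), width=1280, height=720, framerate=30/1 ! "
    "nvvidconv ! video/x-raw, format=BGRx ! "
    "videoconvert ! video/x-raw, format=BGR ! appsink"
)

cap = cv2.VideoCapture(GST, cv2.CAP_GSTREAMER)
ok, frame = cap.read()   # BGR frame ready for the face/gesture pipeline
cap.release()
```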

In this system, the camera performs real-time face recognition against the enrolled data set; if the face matches, the system allows the operator to use hand gestures to control the IoT system (see Fig. 5).

Fig. 5
The control interface of the IoT system: the user logs in through the credential and face-recognition login page and can then check the history data
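
The paper does not specify the hand-tracking library behind the gesture control. As one plausible realization, the sketch below counts extended fingers with MediaPipe Hands and maps the count to an IoT command; the command table and thresholds are purely illustrative.

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)

# Illustrative mapping from finger count to an IoT command (not from the paper).
COMMANDS = {1: "light_on", 2: "light_off", 3: "fan_on", 4: "fan_off"}

def count_fingers(frame_bgr):
    """Count extended fingers (thumb omitted for simplicity)."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    # A finger is "up" when its tip lies above its PIP joint in image coords.
    tips, pips = (8, 12, 16, 20), (6, 10, 14, 18)
    return sum(lm[t].y < lm[p].y for t, p in zip(tips, pips))

def frame_to_command(frame_bgr):
    n = count_fingers(frame_bgr)
    return COMMANDS.get(n)  # None when no hand is found or count is unmapped
```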

4 Experimental Results

On a training dataset with 5 frames per class, we evaluate the ArcFace-based face recognition model against other models such as Dlib and LBP. The ArcFace algorithm achieves the best performance on the Jetson Nano embedded computer, with an accuracy of 95–97% and a frame rate of up to 25 FPS. The results are shown in Table 1.

Table 1 Facial recognition test results for several models

The ArcFace model balances accuracy and speed in facial recognition. After the user's identity has been verified, hand-gesture recognition allows the user to control the sensor devices and lights in the system, as shown in Fig. 6.

Fig. 6
Control features with display of biological indicators: (a) the light is switched on and off through the data grid, alongside the user's photo and temperature, humidity, light, fan, and volume readings; (b) the fan is switched on and off through the data grid

5 Conclusions

Face recognition that improves safety and security has proven to be a formidable challenge for researchers. We applied the ArcFace model to face recognition and achieved generally favorable results. Real-time testing and evaluation with 30 distinct input faces demonstrate an inference rate of 16 FPS and an accuracy of roughly 96%. In addition, the gesture recognition function, which controls a set of six operations, has a 96% accuracy rate. The entire system was developed and deployed on the Jetson Nano, which yields the best results among the embedded computers compared (Raspberry Pi 3B+, etc.). Beyond facial recognition's security contribution to user authentication in the smart administration system, finger gestures are automatically identified, enabling automatic control and monitoring of IoT devices.