
1 Introduction

Research and applications of computer vision, computer graphics, image processing, and machine learning have brought many benefits in recent years [3,4,5,6]. They are applied in many fields such as 3D simulation, the game industry, medical diagnostics, and digital heritage. These techniques are also widely used in systems for security, access control, check-in and check-out, object tracking and monitoring, identity verification services, etc., and have proved useful and efficient. Research on face recognition systems has become popular along with the development of artificial intelligence and deep learning techniques [7, 8]. A facial recognition system is an application capable of matching a human face in a digital image or a camera video frame against a given image in a database. Machine learning and deep learning techniques are currently the most effective solutions for obtaining high accuracy while supporting real-time processing. Transfer learning and image augmentation are additional techniques in image processing and machine learning that can help improve the precision of face recognition systems.

Existing identification systems such as fingerprint recognition, palm recognition, and ID/password login are well developed and widely used in practice. However, these methods have limitations, such as forged fingerprint or palm images, or forgotten usernames and passwords at check-in. Therefore, the need for face recognition systems has increased in recent years because of their accuracy and their usefulness in both security and object monitoring systems.

In order to build a facial recognition system, the training dataset plays an important role in the implementation of the system. Numerous facial datasets have been published online so that anyone can download them for free testing. These datasets are very large, containing millions of images of various people. Therefore, researchers can skip the data collection phase and concentrate on training the model. Computer hardware has also become powerful enough to train such models. Wearing a face mask has become standard practice to prevent infection during the COVID-19 pandemic, and is required in public spaces and on public transportation. Therefore, a system that recognizes people wearing face masks is necessary for many organizations and companies.

The system ought to be real-time and automated. One suitable solution for these requirements is deep learning, which can recognize people wearing face masks automatically. Although numerous facial datasets are freely available from several research projects, a facial dataset with masks is not available at present. In this research, we apply image processing techniques to overlay a mask on each face in an existing facial dataset. The resulting dataset is then used as training data for our model. After reviewing the state-of-the-art methods, we propose a method for face recognition both with and without masks. Our contribution focuses on data creation and data augmentation. The improvements are reflected in the accuracy of face recognition with masks.

The remainder of the paper is structured as follows. Section 2 presents the state-of-the-art methods and several applications, tools, and techniques for building the application in practice. Section 3 describes our proposed method in detail, including the system design of the application. We present the implementation and obtained results in Sect. 4. Section 5 includes the discussion and evaluation. The last section presents our conclusion and future work.

2 Related Works

In this section, we explore several methods for face recognition and their applications in practice. The two key goals researchers want to achieve are accuracy and processing time. In order to improve the accuracy of image classification, Alex Krizhevsky et al. [9] presented AlexNet, a CNN architecture based on increasing the number of layers with the support of greater computational power. The advantage of this architecture is that the model can extract more features at each layer; these features are then combined to enrich the information about the object for the prediction step. Moreover, the ReLU activation function is used to speed up the training of the model compared to the sigmoid function. However, a limitation comes from the size (11\(\,\times \,\)11) of the first convolutional layer, which can be difficult to apply to smaller images in practice. Karen Simonyan et al. [10] proposed building templates using blocks. A VGG block includes three parts: a convolutional layer, an activation function, and a pooling layer. The order and number of blocks are optional, and a fully connected layer and a SoftMax layer are inserted at the end to make the prediction. Unlike AlexNet [9], this model uses 3\(\,\times \,\)3 filters, which reduces computation time and the number of parameters. In return, the depth of the CNN is increased, which leads to higher accuracy of the network model.
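For illustration, the following is a minimal Keras sketch of a VGG-style block as described above; it is not code from [10], and the filter counts, input size, and number of classes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_block(x, filters, num_convs=2):
    """One VGG-style block: 3x3 convolutions with ReLU, then 2x2 max pooling."""
    for _ in range(num_convs):
        x = layers.Conv2D(filters, kernel_size=3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

# Example: two blocks followed by a fully connected layer and SoftMax.
inputs = tf.keras.Input(shape=(64, 64, 3))           # assumed input size
x = vgg_block(inputs, filters=64)
x = vgg_block(x, filters=128)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)  # assumed number of classes
model = tf.keras.Model(inputs, outputs)
```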

In the traditional approach, neural network layers are stacked as deeply as possible to create a model that can learn complex rules. However, a very deep neural network can cause overfitting. Another problem is the vanishing gradient: the gradient becomes too small, which makes it impossible for early layers to learn during backpropagation. Furthermore, since the informative region changes from image to image, choosing the optimal kernel size is crucial. Szegedy et al. [11] suggested a new way of building neural networks: instead of becoming “deeper,” the network becomes “wider.” Different kernels operate on the same input, and all kernel outputs are concatenated along the channel dimension to form the block output. For example, a block includes four parallel paths. To extract information at different spatial scales, the first three paths use 1\(\,\times \,\)1, 3\(\,\times \,\)3, and 5\(\,\times \,\)5 convolutional layers, while a 3\(\,\times \,\)3 max pooling layer is used in the last path. All outputs must have the same spatial size across the four paths, which is a requirement for the concatenation step, so each layer must use appropriate padding. Because such a block is computationally expensive, the authors include an extra 1\(\,\times \,\)1 convolution before the 3\(\,\times \,\)3 and 5\(\,\times \,\)5 convolutions to reduce the number of input channels.
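A minimal sketch of such a four-path block in Keras is shown below; the filter counts passed to the function are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x, f1, f3_reduce, f3, f5_reduce, f5, f_pool):
    """Four parallel paths whose outputs are concatenated along the channel axis."""
    # Path 1: 1x1 convolution.
    p1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # Path 2: 1x1 channel reduction followed by a 3x3 convolution.
    p2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    p2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(p2)
    # Path 3: 1x1 channel reduction followed by a 5x5 convolution.
    p3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    p3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(p3)
    # Path 4: 3x3 max pooling followed by a 1x1 convolution.
    p4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    p4 = layers.Conv2D(f_pool, 1, padding="same", activation="relu")(p4)
    # "same" padding keeps the spatial size identical on every path,
    # which is required for the concatenation step.
    return layers.Concatenate(axis=-1)([p1, p2, p3, p4])
```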

Kaiming He et al. [12] presented another solution, the residual block, to mitigate vanishing/exploding gradients. Instead of making the network wider, this approach introduces a technique called the skip connection: it lets data skip several layers and connect directly to the output. The implementation of this block is simple: the input x and f(x) are added together to form the output of the block. This kind of architecture requires x and f(x) to have the same shape. If the desired output approximates the input (i.e. x + f(x) = x), the residual block is easy to learn, because all weights and biases of its layers simply have to be pushed toward 0.
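A minimal Keras sketch of a residual block as described above follows; it assumes the input already has the same number of channels as the block's convolutions so that the addition is valid.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """f(x) is two 3x3 convolutions; the skip connection adds the input back."""
    fx = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    fx = layers.Conv2D(filters, 3, padding="same")(fx)
    # x and f(x) must have the same shape for the element-wise addition.
    out = layers.Add()([x, fx])
    return layers.Activation("relu")(out)
```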

Christian Szegedy et al. [11] presented a method for improving the Inception deep learning model for image classification. Numerous variants of the Inception network have been established, and each later version improves on the previous one. In a follow-up project [13], the authors suggested a solution to improve accuracy while reducing computational complexity: instead of using large filters (e.g. 5\(\,\times \,\)5 or 7\(\,\times \,\)7), which are expensive to compute, they decompose them into smaller filters.

Considering spatial factorization further, the researchers can factorize an n\(\,\times \,\)n kernel into a combination of 1\(\,\times \,\)n and n\(\,\times \,\)1 kernels. With the same number of filters, the 1\(\,\times \,\)n and n\(\,\times \,\)1 kernels cost \((n + n)/n^2 = 2/n\) of an n\(\,\times \,\)n kernel, so this combination is \(1 - 2/n\) cheaper than the original kernel. For instance, a 3\(\,\times \,\)3 convolution is divided into a 1\(\,\times \,\)3 convolution followed by a 3\(\,\times \,\)1 convolution, which is \(1 - 2/3 \approx 33\%\) cheaper than a single 3\(\,\times \,\)3 convolution. In practice, they found that this factorization does not work well on early layers, but it produces very good results on grids of medium size. In order to reduce the grid size, Szegedy et al. [13] use a pooling layer and a convolutional layer in parallel and combine their outputs to produce the output of the block. From these improvements, several architectural models have been introduced: Inception-v1, Inception-v2, Inception-v3, and Inception-v4. The model in this research is based on Inception-ResNet-v1, introduced in [14]. It is a hybrid network inspired by the Inception model and the residual model; this combination increases the number of layers while keeping accuracy and performance.
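A minimal Keras sketch of this factorization follows; the filter count is an assumption for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def factorized_conv(x, filters, n=3):
    """Replace an n x n convolution with a 1 x n convolution followed by n x 1."""
    x = layers.Conv2D(filters, kernel_size=(1, n), padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, kernel_size=(n, 1), padding="same", activation="relu")(x)
    return x

# Per channel pair (ignoring biases): an n x n kernel has n*n weights,
# while 1 x n plus n x 1 has 2n weights, i.e. a 2/n fraction of the cost.
```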

Another approach, FaceNet, was introduced by Florian Schroff et al. [15]. The purpose is to represent faces in a Euclidean space, where distance can be used to compare similarity. The architecture of FaceNet is described as follows: a batch of images is fed into a deep architecture designed to turn images into vectors; these vectors are then normalized to unit length using L2 normalization; finally, the embeddings are trained with a triplet loss so that images of the same person lie close together and images of different people lie far apart. For masked-face recognition, Warot Moungsouy et al. [16, 17] proposed a method based on residual Inception networks. The authors introduced a masked-face dataset based on the Casia-WebFace dataset [1], consisting of 2,236,161 masked-face images. Both the Casia-WebFace dataset and the new dataset are then combined to train the model. The proposed method was based on FaceNet using the Inception-ResNet-v1 architecture. They tested several models, and the best one was the fine-tuned FaceNet with Inception Block A retrained on the new dataset. This model achieved 99.2% accuracy on the masked-face test dataset. They also found that adding masked-face images to the training data improves the accuracy of the model by 0.6%.
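To make the triplet loss idea concrete, the following is a short TensorFlow sketch (not the authors' code; the margin value is an assumption).

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull anchor-positive pairs together and push anchor-negative pairs apart.

    All inputs are batches of L2-normalized embeddings with shape (batch, 128).
    """
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```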

Besides, transfer learning is a form of machine learning in which a model built for one task is reused as the starting point for a second task and then modified. In deep learning, pre-trained models are used in computer vision and natural language processing tasks to develop neural network models for these problems. Transfer learning is very useful in deep learning because most real-world problems would otherwise require billions of labeled examples and complex models. It is an effective technique for optimization, saving time, and achieving better performance. Developers can use transfer learning to combine different applications into one and to quickly train new models for complex applications. Moreover, transfer learning is a good tool for improving the accuracy of computer vision models. At the end of this research work, we present a facial recognition application for an attendance system based on a deep learning model. We utilize transfer learning by taking three pre-trained convolutional neural networks and training them on our data, which contains 10 different classes with 20 facial images per class. The three networks showed very high performance in terms of prediction accuracy and reasonable training time. Therefore, face recognition based on deep learning can greatly improve recognition speed and accuracy. Last but not least, many existing APIs, libraries, tools, and techniques can help us implement our application, such as OpenCV, TensorFlow, and PyTorch.
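As a brief illustration of transfer learning in Keras, the sketch below freezes a pre-trained backbone and trains only a new classification head; the choice of ResNet50, the input size, and the class count are assumptions, not the exact networks used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Load a network pre-trained on ImageNet and drop its classification head.
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                       input_shape=(160, 160, 3), pooling="avg")
base.trainable = False  # freeze the pre-trained weights

# Add a new head for our own classes (10 people in this illustration).
inputs = tf.keras.Input(shape=(160, 160, 3))
x = base(inputs, training=False)
outputs = layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```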

3 Our Proposed Method

3.1 Overview

This section presents our proposed method. We create an effective model for recognizing faces both with and without masks. As mentioned in the Introduction, the method consists of three steps. In the first step, we obtain the datasets from the published sources Casia-WebFace [1] and LFW [2]; they contain face images without face masks. After pre-processing these data (i.e. noise removal), we use image processing techniques to overlay a mask on each face. In the second step, we use the new datasets to train our model, which is based on Inception-ResNet-v1. The last step is creating an application for testing our face recognition system. The details of each step are described in the following sections.

3.2 Pre-processing Data

Cleaning Data: The Casia-WebFace dataset [1], used for face verification and face identification tasks, contains 494,414 face images of 10,575 people. Each image has a size of 250\(\,\times \,\)250\(\,\times \,\)3 (width \(\,\times \,\) height \(\,\times \,\) channels) and is saved as a JPG file. To avoid noise in the images, an algorithm iterates through all images and detects the face coordinates, so that the face can be extracted from the background. For this step, the MTCNN model is used for face detection. This model consists of three networks, P-Net, R-Net, and O-Net, which together produce the coordinates of the face bounding box as a rectangle. Finally, we apply image processing to crop and resize each image to a uniform width and height, convert the color space to RGB, and save the image into a new folder (see Fig. 1). Algorithm 1 below summarizes this cleaning step, and a short code sketch follows it.

Fig. 1. Using MTCNN to clean data

Algorithm 1
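A sketch of this cleaning step with the mtcnn Python package is given below; the 160\(\,\times \,\)160 output size and the file paths are assumptions for illustration.

```python
import cv2
from mtcnn import MTCNN  # implements the P-Net, R-Net, O-Net cascade

detector = MTCNN()

def clean_image(src_path, dst_path, size=160):
    """Detect the largest face, crop it, resize it, and save it as an RGB image."""
    bgr = cv2.imread(src_path)
    if bgr is None:
        return False
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(rgb)
    if not faces:
        return False  # skip noisy images with no detectable face
    x, y, w, h = max(faces, key=lambda f: f["box"][2] * f["box"][3])["box"]
    x, y = max(x, 0), max(y, 0)
    face = cv2.resize(rgb[y:y + h, x:x + w], (size, size))
    cv2.imwrite(dst_path, cv2.cvtColor(face, cv2.COLOR_RGB2BGR))
    return True
```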

Creating Mask: After the cleaning process, a new dataset of faces is created; however, these faces do not have masks, so we now create a mask on each face image. First, a set of ten different mask images is collected from the internet. All of them are processed with a photo editing application to filter out the background, so that a mask can be selected at random and diverse images can be created when feeding them into the model. A threshold is used to separate the mask from the background and obtain a clean mask image. In order to determine the coordinate points on the face, a technique called facial landmark detection [18, 19] is used. This technique marks sixty-eight important points on the face, and we rely on these points to place the mask. Several strategies can be applied to place a mask on the face, such as using the points around the mouth, the nose, or the chin.

In this case, we map the mask onto the corresponding points of the face using a homography. Since each face has different features and pose, applying this algorithm gives a better result because the mask is rotated and resized according to the shape of the face. We want the mask to cover both cheeks and the nose, so the corresponding landmark points are 1 to 15 and 29. Figure 2 illustrates how the points on the face and on the mask are connected: if the coordinates of the face change, the coordinates of the mask change accordingly, so the mask fits the face more closely (see Fig. 2). Algorithm 2 below shows how a mask is placed on a face, and a code sketch of this step is given after it.

Fig. 2. Creating a mask for the face

Algorithm 2
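The sketch below illustrates the homography-based overlay with dlib's 68-point landmark model and OpenCV; the landmark model file, the hand-picked mask points, and the use of a transparent (BGRA) mask image are assumptions for illustration.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model is an external file distributed with dlib examples.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def wear_mask(face_bgr, mask_bgra, mask_src_pts):
    """Warp a transparent mask image onto a face using a homography.

    mask_src_pts: pixel coordinates on the mask image that correspond to
    facial landmarks 1-15 and 29 (jawline and nose bridge), chosen by hand.
    """
    rgb = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2RGB)
    rects = detector(rgb, 1)
    if not rects:
        return face_bgr
    shape = predictor(rgb, rects[0])
    # dlib landmarks are 0-indexed: 0-14 lie on the jawline, 28 on the nose bridge.
    dst_idx = list(range(0, 15)) + [28]
    dst_pts = np.float32([[shape.part(i).x, shape.part(i).y] for i in dst_idx])
    H, _ = cv2.findHomography(np.float32(mask_src_pts), dst_pts)
    h, w = face_bgr.shape[:2]
    warped = cv2.warpPerspective(mask_bgra, H, (w, h))
    alpha = warped[:, :, 3:4].astype(np.float32) / 255.0
    blended = face_bgr.astype(np.float32) * (1 - alpha) + warped[:, :, :3].astype(np.float32) * alpha
    return blended.astype(np.uint8)
```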

3.3 Creating Trained Model

In this section, we apply the FaceNet concept to create our model. As mentioned in [15], each picture is converted into a 128-dimensional embedding. In our very first model, the triplet loss function was used for training, but the result was poor: the loss value decreased only gradually and the model eventually collapsed. The problem comes from the batch size and the labels in each batch: the Casia-WebFace dataset has 10,575 labels and the batch size is 512, so there is only a small chance that images of the same person appear in one batch, while the triplet loss function requires triplets (anchor, positive, and negative). For this reason, we keep the FaceNet idea of converting an image into an embedding, but change other factors such as the loss function and the model architecture. After training, we remove the last layer (a dense layer with 10,575 units) and add an L2 normalizer to the model. The final embedding model is described in Fig. 3. We use Inception-ResNet-v1 [14] to train our model; it is a stack of Stem, Inception-ResNet-A, Reduction-A, Inception-ResNet-B, Reduction-B, and Inception-ResNet-C blocks. We add more Batch Normalization layers to normalize the input data and modify the last layers to obtain the 128-element embedding. Cross-entropy is chosen to train the model, so the last layer must encode the image into an n-dimensional vector, where n is the number of people in the dataset. Because of this requirement, the last dense layer has 10,575 units before the loss function, since there are 10,575 different persons in the dataset and each image belongs to only one person. However, we do not want to classify 10,575 people directly or retrain the model each time a new person is added. Therefore, we keep a dense layer with 128 units that learns the features of the face before the dense layer of 10,575 units (see Fig. 4); a code sketch of this head is given after Fig. 4.

Fig. 3. Stem block

Fig. 4. Our proposed architectural model
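The sketch below shows the training head and the inference-time embedding model described above in Keras; the backbone is a placeholder for the Inception-ResNet-v1 feature extractor, and the input size and pooling choice are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 10575   # persons in Casia-WebFace
EMBED_DIM = 128

def build_training_model(backbone):
    """backbone: a model returning convolutional feature maps (placeholder here)."""
    inputs = tf.keras.Input(shape=(160, 160, 3))
    x = backbone(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.BatchNormalization()(x)
    embedding = layers.Dense(EMBED_DIM, name="embedding")(x)
    # Classification head used only during training with cross-entropy.
    logits = layers.Dense(NUM_CLASSES, activation="softmax")(embedding)
    return tf.keras.Model(inputs, logits)

def build_embedding_model(training_model):
    """Drop the classifier and L2-normalize the 128-d embedding for inference."""
    emb = training_model.get_layer("embedding").output
    emb = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(emb)
    return tf.keras.Model(training_model.input, emb)
```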

Training Process: Although the dataset has 10,575 people, the number of images per person is not the same, which introduces noise. To solve this problem, a training set is built for each epoch in which every person has the same number of images: fifteen images are chosen from the “no mask” dataset and fifteen from the “mask” dataset. We apply augmentation techniques so that the model predicts better under different conditions (e.g. angle, lighting, size). Because the augmentation is random, we double the data to increase the amount of data in one epoch, which gives 634,500 (30\(\,\times \,\)10,575\(\,\times \,\)2) images in the training dataset. We combine several functions, such as changing the contrast, changing the brightness, flipping, and cropping, to generate the new images.
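A minimal sketch of such an augmentation pipeline with tf.image is shown below; the perturbation ranges and crop sizes are assumptions for illustration.

```python
import tensorflow as tf

def augment(image):
    """Randomly change contrast, brightness, orientation, and crop of one image.

    image: float32 tensor in [0, 1] with shape (H, W, 3).
    """
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_brightness(image, max_delta=0.15)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.resize(image, (180, 180))
    image = tf.image.random_crop(image, size=(160, 160, 3))
    return tf.clip_by_value(image, 0.0, 1.0)

# Doubling the data per epoch: repeat each sample twice, then augment at random.
# dataset = dataset.repeat(2).map(lambda img, label: (augment(img), label))
```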

Fig. 5. Our application for testing and evaluating

Prediction Process: This model is inspired by the FaceNet model: it learns to distinguish between different people and to group images of the same person into the same cluster. This approach is more efficient because we do not need to retrain the model each time a new image is added. In detail, we generate an embedding for each new person and save it. To recognize an image, the model converts the image into an embedding and uses a similarity function to compare it with the embeddings in the database. The result is the stored identity with the highest similarity.
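A sketch of this lookup step is shown below; the optional distance threshold is an assumption (the evaluation in Sect. 5 runs without a threshold).

```python
import numpy as np

def identify(query_emb, db_embs, db_labels, threshold=None):
    """Return the enrolled identity whose embedding is closest to the query.

    query_emb: (128,) L2-normalized embedding of the probe image.
    db_embs:   (N, 128) L2-normalized embeddings, one per enrolled person.
    """
    # For unit vectors, ranking by Euclidean distance matches ranking by cosine similarity.
    dists = np.linalg.norm(db_embs - query_emb, axis=1)
    best = int(np.argmin(dists))
    if threshold is not None and dists[best] > threshold:
        return "unknown", float(dists[best])
    return db_labels[best], float(dists[best])
```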

3.4 Building an Application

In this section, we build an application for testing our model. It serves as a real application to test and evaluate the model. The application runs on a single laptop (MacBook Pro, 8-core CPU, 14-core GPU) and its camera. The user interface of the application is shown in Fig. 5.

4 Implementation and Results

In this section, we implement our proposed method and the application for the face recognition system. The method is implemented in Python with TensorFlow, and the model is trained on Google Colab Pro. The application is then built on top of our model to show how it is used in the face recognition system. All functions for processing data, augmenting images, and training the model are implemented with the Python API, and Inception-ResNet-v1 is implemented using Keras. To place a mask on each face image, we use our script “FacialMaskDataSet.py”. The output of this step is presented in Fig. 6.

Fig. 6. Wearing a mask on each face image [1]

Face detection is performed using the existing source code in [20]. The detected face is then matched with the data in the database (see Fig. 6 and Fig. 7).

Fig. 7. Face recognition works well with both masked and non-masked faces

5 Discussion and Evaluation

In order to evaluate our model, we train and test it over many epochs. After running 49 epochs (see Fig. 8a), the loss value decreases dramatically in the first ten epochs; after that, it decreases slowly and is stable from epoch 35 to epoch 49.

Fig. 8. Evaluation of the stability and accuracy of our model

The goal is to evaluate how well our model handles faces both with and without masks. We combine the “no mask LFW” and “mask LFW” sets into one dataset. The obtained result is plotted in Fig. 8b: the accuracy increases after each epoch, and from epoch twenty onward it increases only slightly, staying around 90%.

To compare with an existing method, we use FaceNet-PyTorch [21], a Python library that includes several versions of the FaceNet model. In this research work, we use the Casia-WebFace version, which is trained on the same dataset that we used. The LFW dataset and its variants are reused in this experiment.

Our model is the model at epoch 49, the latest one. First, the two models convert the images from the “no mask LFW” dataset into database embeddings, with exactly one embedding per label. Then the two models predict on two test sets: LFW with masks and LFW without masks. In this test we do not use a threshold, in order to maximize the accuracy of the two models, so the number of “unknown” answers is zero (see Table 1).

Table 1. Comparison of the precision, recall, accuracy and F1 between our model and the existing methods

Compared to the other models, the accuracy of our model is better in the masked case and slightly lower in the unmasked case. In order to improve both cases (with and without masks), we can increase the number of training epochs until the accuracy meets our expectations.

6 Conclusion and Future Work

In this research work, we explored several methods for face recognition and their applications in practice. Methods based on machine learning techniques are very popular and suitable for most practical applications nowadays. We proposed a method based on Inception-ResNet-v1 to implement our model. The obtained results show the accuracy of our model (see Table 1): it is higher than that of the existing methods in the masked case, which fits the current context of the COVID-19 pandemic. Besides, we successfully built an application that will be used in a company for timekeeping. This application can help the company to check in, check out, and monitor its staff every day. In future work, this method can be extended and applied to other security systems, as mentioned in the Introduction.