
1 Introduction

Facial recognition is a biometric identification process that identifies a person from digital images of their face. It is widely used in fields such as security and robotics. Research on facial recognition dates back to the 1960s and is still ongoing; it has gained renewed popularity recently thanks to advances in computer hardware and artificial intelligence. Despite the achievements made in this area, it still faces major challenges caused by the variations that images of an individual's face can contain, such as variations in lighting, face pose, age, facial expressions, and face occlusions [1].

The main steps of a facial recognition system are face detection and the extraction and classification of facial features. Face detection is the process of locating a face within an image. Object detection algorithms such as the Viola-Jones algorithm [2] are used for face detection, and more recently deep learning based methods such as Faster R-CNN and YOLO have also been employed [3, 4]. The feature extraction and classification step consists of extracting the salient facial features that are used to distinguish between individuals. Three approaches to feature extraction can be used: global approaches, which consider the face image in its entirety to extract global features; local approaches, which extract local face features; and hybrid approaches, which combine the two [5].

Classical facial recognition methods use simple handcrafted features to describe the content of the image, and machine learning algorithms are then employed for classification. Recent years have seen great development in deep learning, which has spread to facial recognition. Convolutional neural networks (CNNs) are the most widely used type of neural network in facial recognition.

In this paper, a facial recognition method based on the detection of regions of interest and a convolutional neural network is proposed. The main focus of this method is to achieve high recognition rates on small data sets with a limited number of subjects.

The paper is organized as follows: in Sect. 2, state-of-the-art facial recognition methods are explored, including classical methods, a brief overview of CNNs, and deep face recognition. In Sect. 3, our proposed deep learning based facial recognition method is presented. The description of the data sets as well as the results and discussion of the experiments are presented in Sect. 4. The last section is dedicated to the conclusion.

2 Related Work

Much research has been conducted to improve the robustness of facial recognition methods. Classical methods were based on techniques such as edges and contours; Gabor filters [6] are one example, and they have been successfully applied to many image processing tasks including face recognition. The Local Binary Pattern (LBP) [7] is another method that can be used for facial recognition; it is a powerful texture descriptor that is invariant to changes in illumination. Other variants of the LBP have been proposed that achieve better recognition rates [8,9,10]. The Eigenface [11] and the Fisherface [12] are two other facial recognition methods based on dimensionality reduction algorithms: the Eigenface is based on PCA (Principal Component Analysis) and the Fisherface on LDA (Linear Discriminant Analysis).
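To make the LBP idea concrete, the following minimal sketch (our own illustration, not taken from [7]) computes the basic 8-neighbor LBP code of a pixel; because each neighbor is only compared against the center, the code is unchanged by monotonic illumination shifts:

```python
# Minimal sketch of the basic LBP operator: each pixel is encoded by
# thresholding its 8 neighbors against the center value.
import numpy as np

def lbp_code(patch3x3):
    """patch3x3: 3x3 grayscale neighborhood; returns the 8-bit LBP code."""
    center = patch3x3[1, 1]
    # clockwise neighbor order starting at the top-left pixel
    neighbors = [patch3x3[0, 0], patch3x3[0, 1], patch3x3[0, 2],
                 patch3x3[1, 2], patch3x3[2, 2], patch3x3[2, 1],
                 patch3x3[2, 0], patch3x3[1, 0]]
    return sum(int(n >= center) << i for i, n in enumerate(neighbors))

print(lbp_code(np.array([[6, 5, 2],
                         [7, 6, 1],
                         [9, 8, 7]])))  # thresholds against the center 6 -> 241
```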

Recently, deep learning based methods, particularly convolutional neural networks, have gained much popularity in various fields, including facial recognition. A CNN architecture is composed of different types of layers. The convolution layer, the core component of a CNN, extracts features from an image and returns feature maps; it combines convolution operations with activation functions. For each layer, convolutions are computed between the feature maps of the previous layer and a set of kernels whose weights are learned during training, and an activation function is then applied to the resulting feature maps. The convolution layer is usually followed by a pooling layer, which reduces the dimensionality of the feature maps and retains only the most important features: each feature map is divided into windows of the same size, and each window is down-sampled by outputting its maximum or average value and discarding all other values. The fully connected layers form a multilayer perceptron that takes a flattened vector from the previous layers and outputs a class for the input image; the number of output nodes of the last fully connected layer equals the number of classes [22]. In facial recognition, the convolution layers perform automatic feature extraction from the face images, while the fully connected layers perform the classification.
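As a concrete illustration of the pooling mechanics described above, here is a toy example of 2 × 2 max pooling (the values are arbitrary):

```python
# Toy 2x2 max pooling: each window keeps only its maximum value,
# halving each spatial dimension of the feature map.
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 2, 3, 4]])
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 2]
               #  [2 5]]
```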

DeepFace [13], the first CNN model proposed for facial recognition, was developed by Facebook AI Research in 2014; it is composed of nine layers containing more than 120 million parameters. It achieved an accuracy of 97.35% on the LFW face data set when trained on the SFC (Social Face Classification) data set, which contains over 4.4 million images. FaceNet [14] is another CNN model, developed in 2015; composed of 22 layers with more than 140 million parameters and trained on more than 200 million images, it achieved an impressive accuracy of 99.63% on the LFW data set. Another CNN model is VGG-Face [15], with 22 layers and more than 138 million parameters; it was trained on 2.6 million images and achieved an accuracy of 98.95% on the LFW data set. DeepID [16] is another popular CNN model for facial recognition, reaching a high accuracy of 97.45% on the LFW data set when trained on 0.2 million images; this model has more than 101 million parameters. These CNN models are complex, have high numbers of parameters, and were trained on massive face data sets. Work has also been done on shallow CNN models with small numbers of parameters, which have shown good results on fairly small data sets [17,18,19].

3 Proposed Method

Our method is composed of three main modules: the regions of interest extraction module, the convolutional neural network and finally the decision module (Fig. 1).

A point-of-interest detection algorithm [20] is used for two main reasons: first, these algorithms are generally very fast and efficient; second, a point of interest detected in a facial region of one image has a high probability of being detected in the same facial region in other images of the same person (Fig. 2). For each image, a maximum of 28 regions of size 32 × 32 pixels are extracted, and the minimum distance between the centers of any two regions is set to 20 pixels.
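A rough sketch of this module is shown below. The paper cites [20] for the detector; here we assume OpenCV's goodFeaturesToTrack with its Harris option (the conclusion names Harris corner detection), while the patch size, region count, and minimum spacing follow the text:

```python
# Sketch of the regions-of-interest extraction module. Assumes OpenCV's
# goodFeaturesToTrack with the Harris detector; parameters follow the
# text: at most 28 regions, 32x32 patches, centers at least 20 px apart.
import cv2

def extract_regions(image, max_regions=28, patch=32, min_dist=20):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    corners = cv2.goodFeaturesToTrack(
        gray, maxCorners=max_regions, qualityLevel=0.01,
        minDistance=min_dist, useHarrisDetector=True)
    if corners is None:
        return []
    half = patch // 2
    regions = []
    for x, y in corners.reshape(-1, 2).astype(int):
        # keep only points whose 32x32 neighborhood lies fully inside the image
        if half <= x <= gray.shape[1] - half and half <= y <= gray.shape[0] - half:
            regions.append(image[y - half:y + half, x - half:x + half])
    return regions  # list of 32x32 patches, one per point of interest
```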

Fig. 1. The architecture of the proposed method.

Fig. 2. Points of interest detected in one image tend to be detected in the same facial regions in other images of the same person; here, 5 of the 8 points of interest detected in both images appear in similar regions.

The regions of interest are passed to the CNN, which returns a class for each region. The choice of a CNN as classifier is due to its efficiency in image classification tasks, particularly in facial recognition [13,14,15,16,17,18,19]. On small databases, a shallow neural network may achieve slightly higher recognition rates than a large one [32]; for that reason, we chose a shallow CNN model composed of 10 layers: 4 blocks of convolution and pooling layers, followed by a fully connected layer of 512 nodes and finally a softmax classifier. All blocks except the first contain a batch normalization layer between the convolution and pooling layers; the batch normalization layers were used to prevent overfitting. We used filters of shape \(3\times 3\) for all the convolution layers; the first convolution layer employs 32 filters, the second and third contain 64 filters each, and the last convolution layer contains 128 filters, as shown in Fig. 3. The model was trained for 25 epochs using the Adam optimizer and categorical cross-entropy as the loss function. The number of layers and parameters, as well as the number of epochs, were chosen by grid search.

Fig. 3. The proposed CNN architecture.
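A minimal Keras sketch of this architecture follows. It reflects the description above (3 × 3 filters of 32/64/64/128, batch normalization between convolution and pooling in all blocks but the first, a 512-node fully connected layer, and a softmax output trained with Adam and categorical cross-entropy for 25 epochs); the 32 × 32 × 3 input shape matches the extracted regions, and details the text does not state (padding, pool size, activations) are our assumptions:

```python
# Hedged sketch of the proposed shallow CNN; padding, pool size, and
# activations are assumptions, not taken from the paper.
from tensorflow.keras import layers, models

def build_model(num_classes):
    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),            # one 32x32 region of interest
        # Block 1: no batch normalization, per the text
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        # Blocks 2-4: batch normalization between convolution and pooling
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),       # fully connected layer
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g. model = build_model(num_classes=50); model.fit(x, y, epochs=25)
```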

The CNN outputs for all the regions of an image are assembled into a "regions prediction vector", where each element corresponds to the class predicted by the CNN for one region. The decision module takes this vector and returns the class with the most occurrences, which becomes the predicted class of the original input image.
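The decision module thus reduces to a majority vote over the per-region predictions, as in the following sketch (the function name is ours):

```python
# Majority-vote decision module: the image's class is the most
# frequent class among its regions' predictions.
from collections import Counter

def decide(region_predictions):
    """region_predictions: list of class labels, one per region of interest."""
    return Counter(region_predictions).most_common(1)[0][0]

# Example: 5 of 8 regions vote for subject 3, so the image is labeled 3.
assert decide([3, 3, 7, 3, 12, 3, 7, 3]) == 3
```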

4 Experiments and Results

To evaluate our proposed method, we compared its recognition rate to those obtained with LBP, Eigenface, Fisherface, and a CNN model similar to the one used in our method, as well as to the results of recent works on the Georgia Tech Face Database and the AR Face Database.

Fig. 4. Sample images of two subjects (male and female) from the Georgia Tech Face Database.

The first data set is the Georgia Tech Face Database, which contains images of 50 people taken in sessions between 06/01/99 and 15/11/99 at different times at the Georgia Institute of Technology's Image and Signal Processing Center. Each individual in the database is represented by 15 color JPEG images, taken under different conditions such as variations in exposure, variations in brightness, and different facial expressions (as shown in Fig. 4). The average size of the faces in these images is 150 × 150 pixels. We used k-fold cross-validation to divide this data set into training and testing sets, with 80% of the images used for training and 20% for testing in each fold.
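An 80%/20% split under k-fold cross-validation corresponds to k = 5. A sketch of this protocol using scikit-learn (our tooling assumption; the paper does not name any library) is shown below:

```python
# 80/20 train/test under k-fold cross-validation implies k = 5.
# Hypothetical sketch; labels identify the 50 subjects, 15 images each.
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.repeat(np.arange(50), 15)
images = np.zeros((len(labels), 150, 150, 3))  # placeholder for real images

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(images, labels):
    # 12 of each subject's 15 images train the model, 3 are held out
    train_images, test_images = images[train_idx], images[test_idx]
    train_labels, test_labels = labels[train_idx], labels[test_idx]
    # ... train on the training fold, evaluate on the test fold,
    # then average the recognition rates over the 5 folds ...
```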

Fig. 5. (A) All 26 images of one individual; the images are divided into two sets: (B) the fully visible face images, used for training, and (C) the face images with occlusions, used for testing.

The second data set is the AR Face Database [21], which contains more than 4,000 color images of the faces of 126 people (70 men and 56 women). The images show frontal faces with different facial expressions, lighting conditions, and occlusions (sunglasses and scarves). In this work, we selected 100 subjects (50 men and 50 women), each with 26 images. This data set is used mainly to test the robustness of our method against occlusions. The training subset includes only the images of fully visible faces, 14 per subject (Fig. 5, B); the remaining images, showing individuals with sunglasses and scarves, are used for testing (Fig. 5, C).
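The 14 visible / 12 occluded counts per subject are consistent with the AR indexing in which, per subject, images 1-7 and 14-20 show the fully visible face while 8-13 and 21-26 are occluded. A hedged sketch of the split under that indexing assumption:

```python
# Hedged sketch of the AR train/test split used here. The 1-26 image
# indexing convention below is our assumption about the database layout.
VISIBLE = set(range(1, 8)) | set(range(14, 21))    # 14 training images
OCCLUDED = set(range(8, 14)) | set(range(21, 27))  # 12 testing images

def split_subject(images_by_index):
    """images_by_index: dict mapping the 1-26 image index to an image."""
    train = [img for i, img in images_by_index.items() if i in VISIBLE]
    test = [img for i, img in images_by_index.items() if i in OCCLUDED]
    return train, test
```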

Table 1. The average recognition rates of classical methods and our method on the Georgia Tech Face Database.
Table 2. The average recognition rates of recent works and our method on the Georgia Tech Face Database.

Table 1 shows the performance of the different methods as well as our own. We can notice that LBP achieved the lowest recognition rate. Eigenface performed slightly better than LBP and was in turn surpassed by Fisherface. The CNN achieved a better recognition rate than the previous methods, which demonstrates its robustness against the different variations in the face images. As shown in Table 2, our method also surpassed recent works on this data set.

Table 3. The recognition rates of classical methods and our method on the AR Face Database.
Table 4. The average recognition rates of recent works and our method on the AR Face Database.

The AR Face Database is the database on which our proposed method shows its potential and its robustness against occlusions. Table 3 and Table 4 present the recognition rates of classical methods and of recent works on this data set that used the same training/testing split as ours, together with the recognition rate of our method. We can observe that our method surpassed both the classical methods and the recent works, which confirms its effectiveness against face occlusions.

Fig. 6. Training and testing accuracy of the CNN of our proposed method on the AR Face Database.

In Fig. 6, we notice that the testing accuracy of the CNN (the classification of individual regions of interest) is very low (around 30%), yet the face image recognition achieves high rates. The low accuracy of the CNN is caused by the misclassification of the occluded regions of the faces: since these regions have low similarity to the regions used for training, their predictions are scattered inconsistently across classes, so they act as diffuse noise that the majority vote over all regions can override.
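A toy simulation (with entirely hypothetical numbers) illustrates why scattered per-region errors barely hurt the image-level majority vote:

```python
# Toy simulation: per-region accuracy of ~30% can still give near-perfect
# image-level accuracy when the wrong votes scatter across many classes.
import random
from collections import Counter

random.seed(0)
num_classes, regions_per_image, trials = 100, 28, 10_000
hits = 0
for _ in range(trials):
    true_class = random.randrange(num_classes)
    votes = [true_class if random.random() < 0.30      # ~30% of regions correct
             else random.randrange(num_classes)        # errors scatter uniformly
             for _ in range(regions_per_image)]
    hits += Counter(votes).most_common(1)[0][0] == true_class
print(f"image-level accuracy: {hits / trials:.3f}")    # close to 1.0
```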

5 Conclusion

The integration of deep learning into facial recognition has already been shown to improve the recognition rates of facial recognition systems. In this paper, we proposed a facial recognition method based on a shallow convolutional neural network and the Harris corner detection algorithm. It showed strong performance when tested on the Georgia Tech Face Database and the AR Face Database, and proved robust against pose variation, illumination variation, changes in facial expression and, in particular, face occlusion. In addition, we obtained better results than state-of-the-art methods. The proposed facial recognition approach can be useful for recognizing facial identity even under facial occlusion, and could be extended to larger databases.