
1 Introduction

In the past few years, face recognition has seen tremendous growth, both in recognition robustness and in the range of its applications. It is now widely used in domains such as biometric-based security tools and criminal identification systems, among many others. These applications have led researchers and developers to design face recognition systems that work reliably in unconstrained environments, as their usage is expected to grow exponentially in the forthcoming years [35].

The advancements in deep learning have significantly accelerated the growth and performance of face recognition. AlexNet [14], proposed by Krizhevsky et al., marked the breakthrough of Convolutional Neural Networks (CNNs) for image classification by winning the ImageNet Large Scale Visual Recognition Challenge in 2012 [19]. Since then, many CNN-based approaches have been introduced for face recognition, such as DeepFace [26], DeepID2 [24], FaceNet [20], SphereFace [17], and ArcFace [3]. The CNN-based approaches [5, 10, 11, 14, 29] have shown tremendous improvements in performance compared to hand-crafted features [1, 6,7,8, 13, 22]. This growth was accompanied by the development of large-scale face datasets for training and testing CNN-based models, most notably CASIA-WebFace [33], MS-Celeb-1M [9], Labeled Faces in the Wild (LFW) [12], and YouTube Faces (YTF) [31]. In this work, the CASIA-WebFace and MS-Celeb-1M datasets are used for training, while the LFW and YTF datasets are used for testing.

The trend of CNNs over time shows that deep architectures perform better than shallow networks, which motivated deeper architectures such as GoogleNet [25] and ResNet [10]. The residual network showed that simply deepening a plain model does not improve performance because such models are hard to optimize [10]. Thus, researchers also started exploring the role of loss functions in optimizing deep networks. The Cross-Entropy (i.e., Softmax) loss is very widely used for optimizing deep learning models. Recently, loss functions such as SphereFace (i.e., Angular-Softmax) [17] and ArcFace [3], designed specifically for the face recognition task, have shown very promising gains in performance. Other existing loss functions include the Marginal loss [4], Soft-margin softmax loss [15], Large-margin softmax loss [18], Additive margin softmax [27], Minimum margin loss [30], CosFace: large margin cosine loss [28], and AdaptiveFace: adaptive margin loss [16]. Moreover, in earlier work, we conducted a performance analysis of different loss functions and found that ArcFace outperforms the other losses [23].

A few attempts have also been made to exploit the difficulty of training data: the hardest positive and hardest negative pairs are computed using the margin sample mining loss by Xiao et al. [32]; an adaptive hard sample mining strategy is used by Chen et al. [2] to pick hard examples among the training image pairs; and an auxiliary embedding is used by Smirnov et al. [21] to pick hard examples in mini-batches. Note that these methods first identify the hard examples and then use them for training, whereas the proposed method inherently gives high priority to hard examples during training, based on the performance of the model in that iteration.

The main drawback of the above-mentioned loss functions is their inefficiency in modelling the hard examples that lead to mis-classification. The loss contributed by the large number of easy examples dominates the loss contributed by the small number of hard examples. This imbalance grows as training progresses: the number of hard examples decreases while the number of easy examples increases as the network learns over iterations. In this paper, we address this problem by giving more importance to hard examples through the loss function in each iteration. We propose the Hard-Mining loss, which amplifies the loss for hard samples (those already incurring high loss) and attenuates the loss for easy samples (those incurring low loss). As a result, the average loss contains a significant contribution from the hard examples.

This paper is structured as follows: Sect. 2 proposes the Hard-Mining loss and formulates existing losses in the Hard-Mining framework; Sect. 3 describes the experimental setup, including the architecture and the training and testing face datasets used; Sect. 4 presents the experimental results and comparisons; and finally, Sect. 5 concludes the paper with summarizing remarks.

2 Proposed Hard-Mining Loss

Loss functions are used in deep learning to judge the goodness of a model under its current parameters. Stochastic gradient descent (SGD) optimization is widely adopted to train Convolutional Neural Networks (CNNs). SGD computes the gradient of the loss function w.r.t. each parameter and uses it to update that parameter so that the loss decreases in the next iteration. Thus, the loss function both judges the performance of the designed architecture and guides the learning process. As discussed in the introduction, most existing losses are unable to efficiently penalize the mis-classifications caused by harder examples. In this paper, we propose the Hard-Mining loss, which increases the loss for harder examples and decreases the loss for easier examples, so that the average loss better represents the hard examples. A comparison between the Cross-Entropy loss and the proposed Hard-Mining loss as a function of the probability of being classified in the correct class is presented in Fig. 1. In this section, we first present the Cross-Entropy loss, then propose the idea of the Hard-Mining loss, and finally extend existing losses such as Cross-Entropy, Angular-Softmax, and ArcFace in the proposed Hard-Mining framework.

2.1 Cross-Entropy Loss

The Cross-Entropy (or Softmax) loss has been the predominant choice for training CNN models for the image classification task [10, 14]. Mathematically, the Cross-Entropy loss is given as

$$\begin{aligned} \mathcal {L}_{CE}=-\frac{1}{N}\sum _{i=1}^{N}\log \frac{e^{W^T_{y_i} x_i+b_{y_i}}}{\sum _{j=1}^{n}e^{W^T_j x_i+b_j}}, \end{aligned}$$
(1)

where W is the weight matrix, b is the bias term, \(x_i\) is the \(i^{th}\) training sample, \(y_i\) is the class label of the \(i^{th}\) training sample, N is the number of samples, n is the number of classes, and \(W_j\) and \(W_{y_i}\) are the \(j^{th}\) and \(y_i^{th}\) columns of W, respectively. The Cross-Entropy loss is used as the baseline by recent loss functions for face recognition such as Angular-Softmax and ArcFace. Hence, we also use the Cross-Entropy loss as a baseline along with the Angular-Softmax and ArcFace losses.
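For concreteness, a minimal PyTorch sketch of Eq. (1) follows. It is our illustration only: the feature dimension and random inputs are assumed values, and the number of classes is set to that of CASIA-WebFace purely as an example.

import torch
import torch.nn.functional as F

# Minimal sketch of Eq. (1): cross-entropy over the logits W^T x + b.
# Feature dimension (512) and random inputs are illustrative assumptions.
num_classes, feat_dim, batch = 10575, 512, 4
W = torch.randn(feat_dim, num_classes, requires_grad=True)   # weight matrix W
b = torch.zeros(num_classes, requires_grad=True)             # bias term b
x = torch.randn(batch, feat_dim)                             # features x_i
y = torch.randint(0, num_classes, (batch,))                  # labels y_i

logits = x @ W + b                        # W_j^T x_i + b_j for every class j
log_probs = F.log_softmax(logits, dim=1)  # log of the softmax in Eq. (1)
loss_ce = -log_probs[torch.arange(batch), y].mean()  # average over the N samples
# Equivalent to the built-in call F.cross_entropy(logits, y).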

The behavior of the Cross-Entropy loss w.r.t. the probability of being classified in the correct class for an example is plotted in Fig. 1. It can be observed that the Cross-Entropy loss follows a gradual downward slope and does not differentiate strongly between easy and hard examples. We believe that if the probability is greater than 0.5 then the loss should be close to its minimum, whereas if the probability is less than 0.5 then the loss should be on the higher side. This is our intuition for proposing the Hard-Mining loss described next.

Fig. 1. Loss value vs. likelihood (i.e., probability of the correct class) for the Cross-Entropy loss and the Hard-Mining loss. Note that the Hard-Mining loss is computed on the output of the Cross-Entropy loss.

2.2 Hard-Mining Loss

Motivated by the observation that the loss for harder examples should be larger, we propose the Hard-Mining loss. The proposed Hard-Mining loss increases the loss when the probability of the correct class is below roughly 0.5 and decreases the loss when it is above roughly 0.5. The Hard-Mining loss is defined as

$$\begin{aligned} \mathcal {L}_{{HM}}= \alpha \times \mathcal {L} \times \sigma (\beta \times \mathcal {L}) \end{aligned}$$
(2)

where \(\mathcal {L}\) is the loss produced by any other loss function such as Cross-Entropy, Angular-Softmax, etc., \(\alpha \) and \(\beta \) are hyperparameters, and \(\sigma \) is a parameterized sigmoid function given as

$$\begin{aligned} \sigma (x)= \frac{1}{1 + e^{-A(x-B)}} \end{aligned}$$
(3)

where A and B are the hyperparameters.

Note that the Hard-Mining operation is generic in nature, i.e., it can be used along with any existing loss function. In this paper, we use the Hard-Mining operation along with Cross-Entropy, Angular-Softmax, and ArcFace losses.
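To make the operation concrete, a minimal PyTorch sketch of Eqs. (2) and (3) is given below. This is our illustration rather than code released with the paper; the default hyper-parameter values are those reported later in Sect. 3.4.

import torch

def parametric_sigmoid(x, A=35.0, B=0.75):
    """Parameterized sigmoid of Eq. (3) with hyper-parameters A and B."""
    return 1.0 / (1.0 + torch.exp(-A * (x - B)))

def hard_mining(base_loss, alpha=1.5, beta=1.1, A=35.0, B=0.75):
    """Hard-Mining operation of Eq. (2): L_HM = alpha * L * sigma(beta * L).

    `base_loss` is the value produced by any existing loss function
    (Cross-Entropy, Angular-Softmax, ArcFace, ...); it may be a per-sample
    loss tensor or an already averaged scalar.
    """
    return alpha * base_loss * parametric_sigmoid(beta * base_loss, A=A, B=B)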

2.3 Hard-Mining Cross-Entropy Loss

As mentioned previously, the Hard-Mining concept is generic and can be used with existing losses. We first define the Hard-Mining loss with the Cross-Entropy loss. The Hard-Mining Cross-Entropy loss (\(\mathcal {L}_{HM\_CE}\)) is defined as

$$\begin{aligned} \mathcal {L}_{HM\_CE}= \alpha \times \mathcal {L}_{CE} \times \sigma (\beta \times \mathcal {L}_{CE}) \end{aligned}$$
(4)

where \(\alpha \) and \(\beta \) are the hyperparameters, \(\sigma \) is defined in (3), and \(\mathcal {L}_{CE}\) is the Cross-Entropy loss given in (1). Algorithm 1 shows the step-by-step procedure for computing the proposed Hard-Mining Cross-Entropy loss (\(\mathcal {L}_{HM\_CE}\)).

The behavior of the Hard-Mining operation on the Cross-Entropy loss is depicted in Fig. 1. Note that the values of the hyper-parameters \(\alpha \), \(\beta \), A, and B are set to 1.5, 1.1, 35, and 0.75, respectively. It can be seen that the Hard-Mining operation increases the loss for hard examples (i.e., those with correct-class probability below 0.5) and decreases the loss for easy examples (i.e., those with probability above 0.5). Our definition of hard/easy examples is relative to the probability of being classified in the correct class in a given iteration. Thus, examples that are hard at the start of training may become easy after some iterations.

Algorithm 1. Step-by-step computation of the proposed Hard-Mining Cross-Entropy loss (\(\mathcal {L}_{HM\_CE}\)).
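A minimal PyTorch sketch of the computation in Eq. (4) is given below. It is our illustration, not the paper's Algorithm 1, and it assumes the Hard-Mining transform is applied to each sample's loss before averaging, matching the per-example behaviour plotted in Fig. 1; applying the transform to the batch-averaged \(\mathcal {L}_{CE}\) is an equally valid reading of Eq. (4).

import torch
import torch.nn.functional as F

def hard_mining_cross_entropy(logits, targets,
                              alpha=1.5, beta=1.1, A=35.0, B=0.75):
    """Sketch of Eq. (4): Hard-Mining applied to the Cross-Entropy loss.

    Assumes the transform is applied to each sample's loss and then averaged.
    Default hyper-parameters follow Sect. 3.4.
    """
    ce = F.cross_entropy(logits, targets, reduction='none')   # per-sample L_CE
    gate = 1.0 / (1.0 + torch.exp(-A * (beta * ce - B)))      # sigma(beta * L_CE), Eq. (3)
    return (alpha * ce * gate).mean()                         # Eq. (4), averaged over the batch

# Usage (shapes are illustrative): logits of size [N, num_classes],
# integer targets of size [N]; loss = hard_mining_cross_entropy(logits, targets).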

Since the Cross-Entropy loss is a general-purpose loss used widely across machine learning problems, it is important to also study the Hard-Mining operation with loss functions specially designed for face recognition. We therefore consider two such loss functions (i.e., Angular-Softmax [17] and ArcFace [3]) in the proposed Hard-Mining framework.

2.4 Hard-Mining Angular-Softmax Loss

The Hard-Mining Angular-Softmax loss (\(\mathcal {L}_{HM\_AS}\)) is defined as follows:

$$\begin{aligned} \mathcal {L}_{HM\_AS}= \alpha \times \mathcal {L}_{AS} \times \sigma (\beta \times \mathcal {L}_{AS}) \end{aligned}$$
(5)

where \(\alpha \) and \(\beta \) are the hyper-parameters, \(\sigma \) is given in (3), and \(\mathcal {L}_{AS}\) is the Angular-Softmax loss defined in the SphereFace model [17] and given as

$$\begin{aligned} \mathcal {L}_{AS}=-\frac{1}{N}\sum _{i=1}^{N}\log \big ( \frac{e^{\Vert \varvec{x}_i\Vert \psi (\theta _{y_i,i})}}{e^{\Vert \varvec{x}_i\Vert \psi (\theta _{y_i,i})}+ \sum _{j\ne y_i}e^{\Vert \varvec{x}_i\Vert \cos (\theta _{j,i})}} \big ) \end{aligned}$$
(6)

where \(x_i\) is the \(i^{th}\) training sample, \(\psi (\theta _{y_i,i})=(-1)^k\cos (m\theta _{y_i,i})-2k\) for \( \theta _{y_i,i}\in [\frac{k\pi }{m},\frac{(k+1)\pi }{m}]\), \(k\in [0,m-1]\) and \(m\ge 1\) is an integer controlling the size of angular margin.
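For reference, the piecewise function \(\psi \) can be evaluated directly by selecting the interval index k containing \(\theta \). The small sketch below is our illustration; m = 4 is used only as an example value of the angular margin.

import math
import torch

def psi(theta, m=4):
    """psi(theta) = (-1)^k * cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m].

    `theta` is a tensor of angles in [0, pi]; m is the integer angular margin
    of the Angular-Softmax (SphereFace) loss. m = 4 is only an example value.
    """
    k = torch.clamp(torch.floor(m * theta / math.pi), max=m - 1)  # interval index k in [0, m-1]
    sign = 1.0 - 2.0 * (k % 2)            # (-1)^k without taking a power of a negative base
    return sign * torch.cos(m * theta) - 2.0 * k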

2.5 Hard-Mining ArcFace Loss

The ArcFace loss has been used in the recently developed ArcFace model for face recognition [3]. In a recent performance comparison study, ArcFace emerged as the best performing loss for face recognition [23]. The Hard-Mining ArcFace loss (\(\mathcal {L}_{HM\_AF}\)) is defined as

$$\begin{aligned} \mathcal {L}_{HM\_AF}= \alpha \times \mathcal {L}_{AF} \times \sigma (\beta \times \mathcal {L}_{AF}) \end{aligned}$$
(7)

where \(\alpha \) and \(\beta \) are the hyper-parameters, \(\sigma \) is given in (3), and \(\mathcal {L}_{AF}\) is the ArcFace loss [3] and given as

$$\begin{aligned} \mathcal {L}_{AF} =-\frac{1}{N}\sum _{i=1}^{N}\log \frac{e^{s \cdot (\cos (\theta _{y_i}+m))}}{e^{s \cdot (\cos (\theta _{y_i}+m))}+\sum _{j=1,j\ne y_i}^{n}e^{s \cdot \cos \theta _{j}}}, \end{aligned}$$
(8)

where s is the radius of the hypersphere, m is the additive angular margin penalty between \(x_i\) and \(W_{y_i}\), and the margin term \(\cos (\theta +m)\) makes the class separation more stringent.
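A common way to realize Eq. (8) is to L2-normalize both the features and the classifier weights, add the margin m to the target-class angle, and rescale by s before applying cross-entropy. The sketch below is our illustration of that pattern, not the authors' released code; s = 64 and m = 0.5 are typical values from [3].

import torch
import torch.nn.functional as F

def arcface_loss(features, weight, targets, s=64.0, m=0.5):
    """Sketch of the ArcFace loss in Eq. (8).

    features: [N, d] embeddings x_i; weight: [num_classes, d] classifier W;
    targets: [N] integer labels y_i. s = 64 and m = 0.5 are typical values from [3].
    """
    # cos(theta_j) from L2-normalized features and weights
    cosine = F.linear(F.normalize(features), F.normalize(weight)).clamp(-1.0, 1.0)
    theta = torch.acos(cosine)                                   # theta_j
    target_mask = F.one_hot(targets, num_classes=weight.size(0)).bool()
    # add the angular margin m only to the target-class angle theta_{y_i}
    logits = torch.where(target_mask, torch.cos(theta + m), cosine)
    return F.cross_entropy(s * logits, targets)                  # Eq. (8)

The Hard-Mining ArcFace loss of Eq. (7) can then be obtained by passing the per-sample version of this loss through the Hard-Mining operation sketched in Sect. 2.2.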

3 Experimental Setup

In this section, we discuss the CNN architecture and the training and testing datasets used for the experiments, along with other settings such as the optimizer, learning rate, and number of epochs.

3.1 CNN Architectures

Several CNN architectures have been developed for different computer vision tasks. The recent trend is to utilize the power of residual learning. The ResNet model uses residual blocks [10] and is very commonly used nowadays. In this paper, we use the ResNet architecture with a depth of 18 layers (i.e., ResNet18) for all the experiments.
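As a sketch of a typical setup (our assumption, not the exact network configuration of the paper), a ResNet18 backbone can be turned into a face-embedding network by replacing its classification head; the 512-dimensional embedding size is an illustrative choice.

import torch.nn as nn
from torchvision.models import resnet18   # requires torchvision >= 0.13 for the weights argument

# Illustrative only: a ResNet18 backbone producing face embeddings,
# followed by a classification layer over the training identities.
num_identities = 10575          # e.g., the number of CASIA-WebFace classes
embedding_dim = 512             # assumed embedding size

backbone = resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
classifier = nn.Linear(embedding_dim, num_identities)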

3.2 Training Datasets

In our experiments, we use two publicly available datasets, CASIA-WebFace [33] and MS-Celeb-1M [9], for training. CASIA-WebFace is one of the most widely adopted datasets for the face recognition task. It contains 494,414 colored face images belonging to 10,575 different individuals. The second dataset, MS-Celeb-1M, consists of 100,000 face identities with about 100 images per identity, leading to roughly 10M images scraped from public sources. Being a huge dataset, it contains a lot of noise and variation, which affects the performance of the trained model. Hence, we use a cleaned and refined subset of the dataset as per the cleaning list provided by the ArcFace [3] authors.

Table 1. Verification accuracies (%) using ResNet18 model over LFW and YTF face recognition testing datasets under different loss functions. The training is performed over CASIA-WebFace dataset.

3.3 Testing Datasets

We use the Labeled Faces in the Wild (LFW) [12] and YouTube Faces (YTF) [31] datasets for testing. The LFW dataset contains 13,233 images of 5,749 identities. The YTF dataset consists of 3,425 videos of 1,595 different people, with frames extracted using the provided metadata. Both datasets follow the standard face verification benchmark, which provides verification accuracies over the testing data. These accuracies are used as the performance measure in state-of-the-art face recognition works; hence, we also use verification accuracy as the performance measure in this paper.

3.4 Input Data and Network Settings

Following the recent trend [3, 17], we use MTCNN [34] to align the face images. The images are normalized by subtracting 127.5 from each pixel and then dividing by 128. The batch size is 64 and the initial learning rate is 0.01. The learning rate is multiplied by 0.1 at the \(8^{th}\), \(12^{th}\), and \(16^{th}\) epochs, and the model is trained for up to 20 epochs. Stochastic Gradient Descent with Momentum (SGDM) is used as the optimizer. The values of the hyper-parameters \(\alpha \), \(\beta \), A, and B are empirically set to 1.5, 1.1, 35, and 0.75, respectively, in this paper.
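A minimal sketch of these settings in PyTorch is given below. The `model`, `train_loader`, and `loss_fn` arguments are placeholders supplied by the caller, and the momentum value of 0.9 is our assumption; only the learning-rate schedule, normalization, and epoch count are taken from the text above.

import torch
from torch.optim.lr_scheduler import MultiStepLR

def preprocess(images):
    """Normalize aligned face images as described above: (pixel - 127.5) / 128."""
    return (images.float() - 127.5) / 128.0

def train(model, train_loader, loss_fn, epochs=20):
    """Training loop following the settings of Sect. 3.4.

    `model`, `train_loader` (batch size 64) and `loss_fn` (e.g., a Hard-Mining
    loss) are placeholders; the momentum value 0.9 is an assumption.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # SGDM
    scheduler = MultiStepLR(optimizer, milestones=[8, 12, 16], gamma=0.1)    # x0.1 at epochs 8, 12, 16
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(preprocess(images)), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()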

4 Experimental Results and Observations

In order to show the effect of the proposed Hard-Mining loss, face recognition experiments are conducted with the ResNet18 model. Three existing loss functions, namely Cross-Entropy, Angular-Softmax, and ArcFace, are used in the framework of the proposed Hard-Mining loss. Training is performed on the CASIA-WebFace and MS-Celeb-1M datasets and testing on the LFW and YTF datasets.

Table 2. Verification accuracies (%) using ResNet18 model over LFW and YTF face recognition testing datasets under different loss functions. The training is performed over MS-Celeb-1M dataset.

The verification accuracies obtained with the ResNet18 model trained on the CASIA-WebFace dataset and tested on the LFW and YTF datasets are reported in Table 1. It can be seen that the Hard-Mining Cross-Entropy, Hard-Mining Angular-Softmax, and Hard-Mining ArcFace losses improve over the Cross-Entropy, Angular-Softmax, and ArcFace losses, respectively, on both the LFW and YTF datasets.

The verification accuracies obtained with the ResNet18 model trained on the MS-Celeb-1M dataset and tested on the LFW and YTF datasets are reported in Table 2. It is noticed that, on the LFW dataset, the performance of the Hard-Mining based losses is either better than or comparable to the corresponding losses without the Hard-Mining operation. Moreover, the Hard-Mining operation is also well suited to the Cross-Entropy loss on the YTF dataset when training is performed on the MS-Celeb-1M dataset.

The experimental results suggest that increasing the loss for harder examples and decreasing it for easy examples in each iteration forces the network to also learn the characteristics of hard examples. Overall, the proposed Hard-Mining loss is well suited to the face recognition problem when combined with existing loss functions.

5 Conclusion

In this paper, the concept of a Hard-Mining loss is proposed, which increases the loss for mis-classified hard examples and decreases the loss for easy examples. By doing so, we force the network to learn the characteristics of hard examples. The proposed concept is generic in nature and can be used with any existing loss function. We have tested the proposed Hard-Mining loss with the Cross-Entropy, Angular-Softmax, and ArcFace losses. The experiments are performed with the ResNet18 model using the CASIA-WebFace and MS-Celeb-1M training datasets and the LFW and YTF testing datasets. It is observed from the experiments that the proposed Hard-Mining loss boosts the performance of the existing losses in most cases.