
1 Introduction

Over the past few years, the accuracy of face recognition has improved greatly due to the success of convolutional neural networks. Although many new loss functions have been proposed [1,2,3,4,5], the most commonly used one is still the softmax loss, which mainly optimizes the inter-class difference and gives the same weight to all samples. Although most training samples in face recognition are easy samples, there are still hard ones, and these hard samples may degrade the generalization performance of the model. Focal loss [6] was proposed for dense object detection: it down-weights the loss assigned to easy samples and focuses training on hard samples, in order to prevent the vast number of easy samples from overwhelming the model during training. Although it performs well for detection, it is difficult to apply to face recognition, because the number of training samples per subject is usually not large. Moreover, we think it is unreasonable to measure the difficulty of training samples by probability. One main difference between face recognition and detection is that the intra-class variation of one person is small (although there are still changes in pose, expression and illumination), so we can obtain the feature centers of each subject. We believe it is more reasonable to use the angle between a feature and its corresponding center, rather than the probability, to measure whether a sample is easy or hard. We also think that focusing training on hard samples may degrade the generalization performance of the model, so we give greater weight to easy samples instead. In this paper, we use the cosine distance between features and their corresponding centers as the weight and propose a new loss function called C-Softmax loss.

The advantages of C-Softmax loss are as follows. 1. It converges more easily than L-Softmax [4] and A-Softmax [5]. When the training data contains many subjects, the convergence of L-Softmax and A-Softmax is more difficult than that of the softmax loss, and thus they require a special learning strategy. The proposed loss is based on the softmax loss, so it converges easily. 2. It does not need a pre-trained model. Both COCO loss [7] and NormFace [8] start from a pre-trained model and fine-tune it with their loss. We use the softmax loss in the first few epochs to obtain rough centers, which should not be regarded as a pre-trained model, because the total number of training epochs remains unchanged and the performance of the model is still poor at that point. 3. It does not need a pair-selection procedure like triplet loss [2] and contrastive loss [3].

Although C-Softmax loss has many advantages, it still faces some problems. One main problem is that it has to maintain feature centers like center loss [1]; we update the feature centers in the same way center loss does. Another problem is that we have to train the model with the softmax loss in the first few epochs and reduce the number of epochs trained with C-Softmax loss accordingly, so as to keep the total number of training epochs unchanged.

2 Related Work

Given an input image \( x_i \) with label \( y_i \), the original softmax loss function is defined as:

$$ L_{s} = - \frac{1}{m}\sum\limits_{i = 1}^{m} \log \frac{e^{W_{y_{i}}^{T} f(x_{i}) + b_{y_{i}}}}{\sum\nolimits_{j = 1}^{n} e^{W_{j}^{T} f(x_{i}) + b_{j}}} $$
(1)

where m is the batch size, n is the number of training classes, \( f(x_i) \) is the feature, \( {\mathbf{W}} \in R^{n \times d} \) and \( {\mathbf{b}} \in R^{n} \) are the weight and bias of the fully-connected layer before the softmax loss, \( W_j \) is the j-th row of \( {\mathbf{W}} \), and d is the feature dimension.
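For concreteness, Eq. (1) is simply a fully-connected classification layer followed by cross-entropy. A minimal PyTorch sketch is given below; the batch size, feature dimension and class count are illustrative values, not ones fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

m, d, n = 256, 512, 10575        # batch size, feature dimension, number of classes (illustrative)
fc = nn.Linear(d, n)             # holds W in R^{n x d} (rows W_j) and b in R^n from Eq. (1)

features = torch.randn(m, d)             # f(x_i) for one mini-batch
labels = torch.randint(0, n, (m,))       # y_i

logits = fc(features)                    # W_j^T f(x_i) + b_j for every class j
Ls = F.cross_entropy(logits, labels)     # -1/m * sum_i log(softmax prob. of y_i), i.e. Eq. (1)
```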

Focal loss [6] was proposed for dense object detection to handle the extreme imbalance between foreground and background classes. The \( \alpha \)-balanced variant of the focal loss is defined as:

$$ FL(p_{t}) = - \alpha_{t} (1 - p_{t})^{\gamma} \log (p_{t}) $$
(2)
$$ p_{t} = \begin{cases} p_{i} & \text{if } y_{i} = 1 \\ 1 - p_{i} & \text{otherwise} \end{cases} $$
(3)
$$ \alpha_{t} = \begin{cases} \alpha & \text{if } y_{i} = 1 \\ 1 - \alpha & \text{otherwise} \end{cases} $$
(4)

where \( \alpha \in [0, 1] \) is a weighting factor, \( p_i \in [0, 1] \) is the model's estimated probability for the class with label \( y_i = 1 \), and \( \gamma \) is the focusing parameter, which is set to 2 in [6].
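A minimal sketch of Eqs. (2)–(4) for binary targets follows. The default α = 0.25 is an assumption (a common choice in [6]) rather than a value fixed here, and the small constant inside the log is added only for numerical safety.

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """alpha-balanced focal loss of Eqs. (2)-(4) for binary labels.

    p: predicted probability of the positive class, shape (N,), values in [0, 1]
    y: ground-truth labels, shape (N,), 1 for positive and 0 for negative
    """
    p_t = torch.where(y == 1, p, 1 - p)                               # Eq. (3)
    alpha_t = torch.where(y == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))              # Eq. (4)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t + 1e-12)).mean()  # Eq. (2), averaged
```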

Focal loss can be applied to face recognition, but its performance is worse than that of the softmax loss. We think the reason is that probability is an unreasonable measure of the difficulty of samples, and that focusing training on hard samples may degrade performance. Inspired by focal loss, we modify the softmax loss and propose the weighted Softmax loss via Cosine Distance (C-Softmax) to train deep models for face recognition.

3 Proposed C-Softmax Loss

Given two vectors \( f \in R^{d} \) and \( c \in R^{d} \), the cosine distance between them is:

$$ d = \frac{f c^{T}}{\left\| f \right\|_{2} \left\| c \right\|_{2}} $$
(5)

The range of the cosine distance is [−1, 1]; the greater the distance, the more similar the two vectors are. The proposed C-Softmax loss is defined as:

$$ CS_{i} = - w_{i}^{\gamma} \log (p_{i}) $$
(6)

where \( w_i \) is the modified cosine distance between the current feature \( f_i \) and its corresponding center \( c_i \), and \( p_i \) is the softmax probability of the ground-truth class \( y_i \). \( \gamma \) is fixed to 2, so there is no hyper-parameter in C-Softmax loss. Since the cosine distance becomes negative when the angle between a feature and its corresponding center exceeds 90°, \( w_i \) is defined as follows to keep the weight positive and monotonically non-decreasing in d:

$$ w_{i} = \begin{cases} d & \text{if } d \ge 10^{-6} \\ 10^{-6} & \text{otherwise} \end{cases} $$
(7)

We do not use an \( \alpha \)-balanced variant of C-Softmax loss, in order to keep it concise. If all the weights were 1, C-Softmax loss would reduce to the softmax loss; if the weights of hard examples were greater than those of easy ones, C-Softmax loss would behave more like focal loss.
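Putting Eqs. (5)–(7) together, a minimal PyTorch sketch of C-Softmax loss is shown below. The class centers are assumed to be maintained externally (updated as in center loss, see Sect. 4.1); whether gradients should flow through the weight \( w_i \) is not specified in the text, so the sketch treats it as a constant modulating factor.

```python
import torch
import torch.nn.functional as F

def c_softmax_loss(logits, features, labels, centers, gamma=2.0):
    """C-Softmax loss of Eq. (6) with the weight of Eqs. (5) and (7).

    logits:   output of the classification layer, shape (m, n)
    features: f_i from the FC1 layer, shape (m, d)
    labels:   ground-truth classes y_i, shape (m,)
    centers:  feature centers, shape (n, d), maintained as in center loss [1]
    """
    log_p = F.log_softmax(logits, dim=1)
    log_p_i = log_p.gather(1, labels.view(-1, 1)).squeeze(1)     # log(p_i) of the ground-truth class

    d = F.cosine_similarity(features, centers[labels], dim=1)    # Eq. (5)
    w = torch.clamp(d, min=1e-6)                                 # Eq. (7)

    # detach: w_i acts purely as a per-sample weight (an assumption; the text does not
    # state whether gradients are propagated through the cosine distance)
    return (-(w.detach() ** gamma) * log_p_i).mean()             # Eq. (6), averaged over the batch
```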

4 Results and Analysis

4.1 Experiment Details

Experiment Settings:

We implement the proposed loss in the PyTorch [11] framework. Face landmarks are detected by MTCNN [12], and the aligned face images are of size 112 × 96. The weight decay is 5e−4, the batch size is 256, and we use stochastic gradient descent to train the model. The learning rate begins at 0.1, is divided by 10 at epochs 11, 16 and 19, and training finishes at epoch 20. There are three ways to obtain the centers: (1) initialize the centers randomly and train the model with C-Softmax from the beginning; (2) fine-tune a pre-trained model and its corresponding centers with C-Softmax loss; (3) train the model with the softmax loss for a few epochs and with C-Softmax for the remaining epochs. For the first way, the centers cannot be 0, because the cosine distance between the zero vector and any vector is 0, which would make the C-Softmax loss always 0; and when the centers are initialized improperly (the cosine distance between the features and their centers is negative), the performance of C-Softmax loss is poor. The second way gives the best performance, but it consumes twice as much time (training with softmax and then fine-tuning with C-Softmax). We therefore choose the third way. Since the total number of training epochs is fixed, the feature centers become more stable as the number of epochs trained with the softmax loss increases and the number trained with C-Softmax decreases; we found the performance is best when the model is trained with softmax for 3 epochs. So we set all centers to 0 at the beginning, train the model with the softmax loss for 3 epochs while updating the centers as in center loss, train with C-Softmax loss from epoch 4, and finish training at epoch 20.
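As an illustration of this schedule, the following sketch outlines the per-epoch settings and a mini-batch center update in the style of center loss [1]; the center learning rate `alpha_c` is an assumed value, since the text only states that centers are updated like center loss.

```python
import torch

@torch.no_grad()
def update_centers(centers, features, labels, alpha_c=0.5):
    """Mini-batch center update in the style of center loss [1].

    alpha_c is the center learning rate (assumed value; not specified in the text).
    """
    for j in labels.unique():
        mask = labels == j
        # move the center of class j towards the mean of its features in this mini-batch
        delta = (centers[j] - features[mask]).sum(dim=0) / (1.0 + mask.sum())
        centers[j] -= alpha_c * delta


def epoch_settings(epoch):
    """Settings described above: softmax loss for epochs 1-3, C-Softmax from epoch 4;
    learning rate starts at 0.1 and is divided by 10 at epochs 11, 16 and 19."""
    use_c_softmax = epoch >= 4
    lr = 0.1 / (10 ** sum(epoch >= e for e in (11, 16, 19)))
    return use_c_softmax, lr
```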

Network Structure:

We compare the performance of different loss functions with four network structures. Model-A is the same as in [5]. Model-B adds a Batch Normalization (BN) [13] layer after the FC1 layer. Model-C has a BN layer after each convolution layer and after the FC1 layer. Model-D uses RReLU [14] instead of PReLU [15] as the activation function, and has a BN layer after each convolution layer and after the FC1 layer.

Training:

We use CASIA-WebFace [9] to train our CNN models. CASIA-WebFace contains 494414 face images belonging to 10575 different individuals. In [16], 17 overlapping identities were reported between CASIA-WebFace and LFW [10], and 42 overlapping identities between CASIA-WebFace and MegaFace [17] set 1. We checked their result and found that 3 of the reported identities were mismatched; meanwhile, we found another 5 overlapping identities, so there are 19 overlapping identities in total between CASIA-WebFace and LFW. We removed all these 61 identities and use the remaining 447020 images from 10541 identities to train the model.

Evaluation:

We extract features from the output of the FC1 layer; if there is a BN layer after the FC1 layer, we use the output of the BN layer as the features instead. Features are extracted from the original image and its horizontally flipped counterpart, and then merged by element-wise mean to form the representation. The feature dimension is 512. We use LFW [10] and MegaFace [17] set 1 for evaluation and follow the unrestricted with labeled outside data protocol [18] on both datasets. We also evaluate performance under the BLUFR protocols [19], which are more challenging and general for LFW because they utilize all 13233 images, while the standard evaluation protocol only uses 6000 image pairs.
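A minimal sketch of this representation step is given below; `model` is a placeholder that is assumed to return the FC1 (or following BN) output for an aligned 112 × 96 face image.

```python
import torch

@torch.no_grad()
def extract_representation(model, image):
    """512-D representation: element-wise mean of the features of the image and its horizontal flip.

    image: aligned face tensor of shape (1, 3, 112, 96)
    """
    flipped = torch.flip(image, dims=[3])   # flip along the width axis
    return (model(image) + model(flipped)) / 2
```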

4.2 Experiment Results

Columns 3 to 5 of Table 1 show the performance of different network structures trained with A-Softmax loss [5], softmax loss, center loss [1], focal loss [6] and the proposed C-Softmax loss. We can see that the performance of A-Softmax with model-A and model-B is good, but when a BN layer is added after the convolution layers, DIR@FAR = 1% drops from 82.03% to 75.99%. Although it rises to 80.61% when RReLU is used (model-D), the performance is still lower than that of the original model.

Table 1. Performance (%) comparison of different loss functions with different structures on the LFW and MegaFace datasets.

When a BN layer is added after the FC1 layer (changing model-A to model-B) and the models are trained with softmax loss, focal loss and center loss, DIR@FAR = 1% increases greatly. The performance is further improved when a BN layer is added after each convolution layer (model-C). When PReLU is replaced with RReLU, the performance of all three losses decreases (model-D). Although focal loss outperforms softmax loss in dense object detection [6], its performance is worse than that of softmax loss in face recognition.

Although the performance of C-Softmax loss with model-A is not very good, it works quite well with the other three model structures. DIR@FAR = 1% increases to 86.17% with model-D, which outperforms model-B trained with A-Softmax loss (82.03%). Meanwhile, C-Softmax loss clearly outperforms both focal loss and softmax loss when trained with the same model (except model-A). The improvement comes not only from using the cosine distance instead of probability to measure whether a sample is easy or hard, but also from giving greater weight to easy samples than to hard ones. We ignore some difficult samples, yet the generalization performance of the model improves. If the proportion of hard samples in the training dataset is low, focusing on training them may degrade the generalization performance of the model, as happens when focal loss is used in face recognition; otherwise, we should give greater weight to hard samples and focus on training them, as when focal loss is used in object detection [6].

As analyzed in [13], without BN layers the distributions of features trained with the softmax loss change significantly over time, in both mean and variance, and the features are not necessarily discriminative [5]. In contrast, A-Softmax can learn discriminative features [5]. Focal loss and C-Softmax loss are both based on the softmax loss, so their features are not as discriminative as those of A-Softmax loss; this is why the performance of model-A trained with softmax loss, focal loss and C-Softmax loss is poor. A BN layer makes the distribution of the features more stable as training progresses and reduces internal covariate shift [13], so the performance of models trained with softmax loss, focal loss and C-Softmax loss improves greatly when BN layers are added, and the features become discriminative. In this case, however, BN layers may hurt the discriminative power of A-Softmax loss.

From the above analysis, we can also see that no single loss function works well with all structures. A-Softmax is more suitable for models without BN layers after the convolution layers, while the others are more suitable for models with them. A-Softmax and C-Softmax are more suitable for models with RReLU, while the others are more suitable for models with PReLU. We should therefore train each model with the most suitable loss function to obtain the best performance.

Table 2 lists the accuracy of different methods on LFW. Some methods use their own datasets, such as FaceNet [2]; some are trained on MS-Celeb-1M [20], such as SeqFace [21] and ArcFace [22]; some are trained on CASIA-WebFace [9], such as LGM [23] and NormFace [8]. We make the following observations. First, the performance of methods trained on large datasets (more than 1M images) is quite good. Second, performance is further improved with more layers: SeqFace [21], ArcFace [22] and Ring Loss [24] all use 64 or more layers, and their accuracies are very high. Third, the performance of the proposed method is equal to or better than that of LGM [23], NormFace [8] and AM-Softmax [16] when trained on the same dataset (strictly speaking, we use the fewest training images). Generally speaking, we obtain state-of-the-art performance with the fewest training images.

Table 2. Detailed information and verification accuracy (%) of different methods on LFW

The last two columns of Table 1 show the rank-1 identification accuracy with 1M distractors and the verification TAR at \( 10^{-6} \) FAR for the various loss functions on MegaFace set 1. C-Softmax outperforms the other loss functions and achieves the best result when trained with the most suitable model.

To make our experiments more convincing, we also trained a simplified Inception V3 [27], DenseNet [28] and ResNeXt [29] with softmax loss, center loss [1], focal loss [6] and C-Softmax loss. The depth of Inception V3 is 37. The depth of ResNeXt is 29, with cardinality = 32 and bottleneck width = 4d. The depth of DenseNet is 21, with growth rate = 32 and 4 dense blocks of 2 layers each. Table 3 lists the results. C-Softmax loss outperforms the other loss functions and achieves the best result with all these models.

Table 3. Performance (%) on the LFW dataset with other well-known models

5 Conclusion

Inspired by focal loss, we propose a new loss function called C-Softmax loss in this paper. First, we use the cosine distance between the features and their corresponding centers to measure whether a sample is easy or hard, and add it as a modulating factor to the softmax loss. Second, we give greater weight to easy samples than to hard samples during training. There is no hyper-parameter in the proposed loss. The results show that the proposed loss function provides a significant and consistent boost over softmax loss and focal loss, and can be used to train other well-known models such as ResNet, ResNeXt, DenseNet and Inception V3.