1 Introduction

Recently, deep CNNs (Convolutional Neural Networks) have boosted FR (Face Recognition) performance to an unprecedented level. This progress mainly benefits from large-scale training data (Deng et al. 2009; Russakovsky et al. 2015) and advanced network architectures (Krizhevsky et al. 2012; Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2016). In contrast to conventional approaches (Guillaumin et al. 2009; Cao et al. 2010; Yin et al. 2011; Huang et al. 2012a), deep face recognition (Huang et al. 2012b; Cai et al. 2012; Sun et al. 2013; Liu et al. 2016; Lu et al. 2015; Wang et al. 2018; Ding and Tao 2018; Ranjan et al. 2017) typically achieves better performance.

Since Sun et al. (2014a) and Taigman et al. (2014) reported their work on face recognition via feature learning, most related work has focused on how to learn effective features from the network. The desired features are expected to be intra-class invariant and inter-class discriminative.

However, faces of the same identity can look very different under changes in pose, illumination, expression, age, and occlusion, which causes the intrinsically large intra-class variations and high inter-class similarity that faces exhibit. Therefore, reducing the intra-class variations while enlarging the inter-class differences is a central topic in face recognition. Work toward better face recognition performance falls into two main lines: network architecture design (Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2016) (e.g. VGGNet, GoogLeNet and ResNet) and loss function design (Sun et al. 2014b; Schroff et al. 2015; Wen et al. 2016b).

Constructing a highly effective loss function for discriminative feature learning in CNNs is non-trivial. The softmax loss directly addresses classification problems, but it only encourages separability of features, and the resulting features are not sufficiently discriminative for face recognition. As alternative approaches, the contrastive loss (Sun et al. 2014b; Hadsell et al. 2006) and the triplet loss (Schroff et al. 2015) construct loss functions over image pairs and triplets, respectively. The DeepID networks (Sun et al. 2014b) were trained using a combination of classification and verification losses, with the contrastive loss as the verification loss and the softmax loss as the classification loss. However, the resulting training pipeline is complicated for users to implement.

The triplet loss (Schroff et al. 2015) constructs a loss over triplets, each consisting of an anchor, a positive sample with the same label as the anchor, and a negative sample with a different label. However, non-discriminative triplets may be selected, in which the positive-pair distance is already much smaller than the negative-pair distance, so a network trained with the triplet loss may suffer from slow convergence and instability. Carefully selecting the image triplets can partially alleviate this problem, but it significantly increases the computational complexity and makes the training procedure inconvenient.

In this paper, a joint loss function based on the triplet loss is proposed, which combines a hard sample triplet (HST) loss that selects triplets carefully and an absolute constraint triplet (ACT) loss that constrains the maximum intra-class distance to be smaller than any inter-class distance. The loss inherits the advantage of the triplet loss, namely separating the positive pair from the negative by a distance margin in an embedding space. Meanwhile, it addresses the weakness of the triplet loss by imposing an absolute constraint based on the criterion that every intra-class distance should be smaller than any inter-class distance, as verified by the experiments.

Figure 1 shows the framework of the proposed algorithm. Input data are fed into the CNN, and the distance matrix of the features extracted by the CNN is computed. The features are \(\ell _{2}\)-normalized and then passed to the proposed loss, in which HST is employed to select the triplets carefully and ACT is employed to further enhance the discriminative power of the learned face representations. The maximum intra-class distance and the minimum inter-class distance of each triplet are sent to the loss function.

Fig. 1 Illustration of the proposed algorithm training with the joint loss

The main contributions of this paper are summarized as follows:

  • A joint loss function consisting of the HST and ACT losses is proposed to satisfy the requirement that the maximum intra-class distance of the deep features is smaller than any inter-class distance. Under the supervision of this loss, highly discriminative features can be obtained for robust face recognition, as supported by our experimental results.

  • The proposed loss function is easy to implement in CNNs, and the CNN models can be directly optimized by standard SGD (Stochastic Gradient Descent).

  • Comparable performance of the new approach is verified on the Labeled Faces in the Wild (LFW) (Huang et al. 2007) and YouTube Faces (YTF) (Wolf et al. 2011) datasets.

2 Related work

There is a vast corpus of work on face verification and identification. Face recognition via deep learning has achieved a series of breakthroughs in recent years (Taigman et al. 2014; Sun et al. 2014b; Schroff et al. 2015; Parkhi et al. 2015a; Yin et al. 2017; Wen et al. 2018). Sun et al. (2014a) addressed open-set FR using CNNs supervised by the softmax loss, which essentially treats open-set FR as a multi-class classification problem. Since then, most work has focused on how to learn discriminative features from deep neural networks to improve verification/identification performance.

Taigman et al. (2014) treated the FR problem using the same loss function as Sun et al. (2014a), and also proposed a multi-stage approach that aligns faces to a general 3D shape model. The authors also experimented with a so-called Siamese network, in which the \(\ell _{1}\)-distance between two face features is directly optimized for the face verification problem.

Schroff et al. (2015) used the triplet loss to learn a unified face embedding. Training on nearly 200 million face images, they achieved state-of-the-art FR accuracy at the time.

Sun et al. (2014a, b) proposed a compact and therefore relatively cheap-to-compute network. Both PCA and a joint Bayesian model, which effectively correspond to a linear transform in the embedding space, were employed. The networks were trained using a combination of classification (softmax) and verification (contrastive) losses. The main difference between the contrastive loss and the triplet loss is that the contrastive loss compares only pairs of images, whereas the triplet loss encourages a relative distance constraint. As can be seen, the most widely used loss functions for deep metric learning are the contrastive loss and the triplet loss, both of which generally impose a Euclidean margin on the features.

Inspired by linear discriminant analysis, Wen et al. (2016a) proposed the center loss for CNNs and obtained promising performance. There are also loss functions that improve on the softmax loss: Liu et al. (2016, 2017a) mapped the features into an angular space to obtain discriminative features and achieved excellent performance on face recognition. In this paper, an improved loss function based on the triplet loss is proposed, which achieves comparable face recognition results with only a single model.

3 The joint loss

In this section, we elaborate the proposed approach. A brief review of the triplet loss and an introduction to the hard sample triplet (HST) loss are given in Sect. 3.1. In Sect. 3.2, the ACT loss is presented in detail.

3.1 Brief review of the triplet loss and HST loss

Schroff et al. (2015) proposed to employ the triplet loss to train CNNs for face recognition. The representation of a face image \(\mathbf{x }\) is \(\ell _{2}\)-normalized as the input of the triplet loss, and the \(\ell _{2}\)-normalized face representation is denoted as \(f(\mathbf{x })\). The representation of an anchor image \(f(\mathbf{x }^a)\) of a specific subject is expected to be closer to that of a positive image \(f(\mathbf{x }^p)\) with the same label than to that of a negative image \(f(\mathbf{x }^n)\) with a different label. The three features \((f(\mathbf{x }^a),f(\mathbf{x }^p),f(\mathbf{x }^n))\) compose a triplet, which is expected to satisfy the following constraint:

$$\begin{aligned} \Vert f(\mathbf{x }^a)-f(\mathbf{x }^p)\Vert _{2}^{2}+\beta <\Vert f(\mathbf{x }^a)-f(\mathbf{x }^n)\Vert _{2}^{2} \end{aligned}$$
(1)

where \(\beta \) is the margin enforced between the positive pair \((f(\mathbf{x }^a),f(\mathbf{x }^p))\) and the negative pair \((f(\mathbf{x }^a),f(\mathbf{x }^n))\). The triplet loss function is formulated as:

$$\begin{aligned} \mathbf{L }_{triplet}(f)=\frac{1}{2N}\sum _{i=1}^{N}\left[ \Vert f(\mathbf{x }_{i}^{a})-f(\mathbf{x }_{i}^{p})\Vert _{2}^{2}-\Vert f(\mathbf{x }_{i}^{a})-f(\mathbf{x }_{i}^{n})\Vert _{2}^{2}+\beta \right] _{+} \end{aligned}$$
(2)

where N is the number of triplets in a batch, and \((f(\mathbf{x }_{i}^{a}),f(\mathbf{x }_{i}^{p}),f(\mathbf{x }_{i}^{n}))\) stands for the i-th triplet. The loss is illustrated in Fig. 2a. However, non-discriminative samples may result in slow convergence and instability of the network, and the generalization ability of a model learned with the triplet loss may be poor.
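To make Eq. (2) concrete, the following is a minimal sketch of the triplet loss, written in PyTorch for illustration only (the paper's experiments use Caffe); the function name, the margin value, and the assumption that the inputs are already \(\ell _{2}\)-normalized feature batches of shape (N, d) are ours, not the authors'.

```python
# Minimal triplet-loss sketch following Eq. (2); illustrative, not the
# authors' Caffe implementation. anchor, positive, negative are assumed
# to be l2-normalized feature tensors of shape (N, d).
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    pos_dist = (anchor - positive).pow(2).sum(dim=1)  # ||f(x^a)-f(x^p)||_2^2
    neg_dist = (anchor - negative).pow(2).sum(dim=1)  # ||f(x^a)-f(x^n)||_2^2
    # Hinge on violations of Eq. (1), with the 1/(2N) averaging of Eq. (2).
    return 0.5 * torch.clamp(pos_dist - neg_dist + margin, min=0).mean()
```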

Fig. 2 Illustration of the triplet, HST and ACT losses. a The triplet loss, where the three dots with different colors represent a triplet. b The HST loss, where the maximum intra-class distance and the minimum inter-class distance are sent to the loss function. c The ACT loss, where any inter-class distance is required to be larger than the maximum intra-class distance (Color figure online)

This problem can be alleviated by using an alternative loss function called the hard sample triplet (HST) loss (Hermans et al. 2017), which was originally proposed for person re-identification. Specifically, a batch contains S subjects with different labels and N images for each subject, so the batch size is \(S\times N\). We denote the batch set by \(\chi =\{(f(\mathbf{x }_i),y_i)\}_{i=1}^{S\times N}\), where \(f(\mathbf{x }_{i})\in {\mathbb {R}}^{d}\) is the feature vector extracted from the i-th image with label \(y_{i}\), and we denote the distance between any two feature vectors \(f(\mathbf{x }_{i})\) and \(f(\mathbf{x }_{j})\) by \(d(f(\mathbf{x }_{i}),f(\mathbf{x }_{j}))\). For each positive sample pair \((f(\mathbf{x }_{i}),f(\mathbf{x }_{j}))\) with \(y_{i}=y_{j}\), we calculate the distance, forming an intra-class distance set from which the maximum intra-class distance is selected. Similarly, from the negative sample pairs \((f(\mathbf{x }_{i}),f(\mathbf{x }_{k}))\) with \(y_{i}\ne y_{k}\), we obtain the minimum inter-class distance, which is sent to the loss function together with the maximum intra-class distance. The HST loss function is formulated as:

$$\begin{aligned} \mathbf{L }_{HST}=\frac{1}{S\times N}\sum _{\mathbf{x }_{i}}\left[ \max \limits _{\mathbf{x }_{j},y_{i}= y_{j}}d(f(\mathbf{x }_{i}),f(\mathbf{x }_{j}))-\min \limits _{\mathbf{x }_{k},y_{i}\ne y_{k}}d(f(\mathbf{x }_{i}),f(\mathbf{x }_{k}))+\alpha \right] _{+} \end{aligned}$$
(3)

where \(\alpha \) is the margin enforced between the maximum intra-class distance and the minimum inter-class distance. An illustration of HST is shown in Fig. 2b. With the help of the selected "hard samples", the HST loss partially alleviates the slow convergence and instability that occur in the conventional triplet network.
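The following sketches the HST (batch-hard) loss of Eq. (3), again in PyTorch for illustration; `feats` is assumed to hold the \(S\times N\) \(\ell _{2}\)-normalized features of a batch with integer `labels`, d is taken to be the Euclidean distance, and the \(\alpha \) value follows the text.

```python
# Batch-hard (HST) loss sketch following Eq. (3); names are illustrative.
# feats: (S*N, d) l2-normalized features; labels: (S*N,) identity labels.
import torch

def hst_loss(feats, labels, alpha=0.4):
    dist = torch.cdist(feats, feats)                   # all pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # True for intra-class pairs
    # Hardest positive per anchor: the maximum intra-class distance.
    hardest_pos = (dist * same.float()).max(dim=1).values
    # Hardest negative per anchor: the minimum inter-class distance
    # (intra-class entries, including the diagonal, are masked with +inf).
    inf_mask = torch.full_like(dist, float('inf'))
    hardest_neg = torch.where(same, inf_mask, dist).min(dim=1).values
    return torch.clamp(hardest_pos - hardest_neg + alpha, min=0).mean()
```

Note that an anchor's distance to itself is zero, so the diagonal does not affect the hardest-positive maximum.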

3.2 The ACT loss

In this section, we present the proposed ACT loss and compare it with the triplet loss and the HST loss.

3.2.1 Problem analysis

For the triplet loss, the optimization is driven by the selected triplets, but the distributions of the inter- and intra-class distances are not explicitly constrained. Randomly selected triplets without such constraints may make the network unstable and hard to converge, and it then becomes difficult to find an ideal threshold for face verification.

A constraint is imposed on the triplet samples in the HST loss, in which the maximum intra-class distance and the minimum inter-class distance are sent to the loss function. As illustrated in Fig. 2b, the selected "hard samples" help pull samples of the same identity closer while pushing samples of different identities apart. However, HST considers the inter- and intra-class distances only relative to a particular anchor when constructing a triplet; it neglects the distances between the other negative pairs. Therefore, some negative-pair distances may fall within the range of positive-pair distances because different classes have different distance distributions, as illustrated in Fig. 3, where \(\mathbf{A }_{i}\mathbf{B }_{j}\) denotes the distance between the i-th sample of class A and the j-th sample of class B.

Fig. 3 Illustration of the distance distributions of positive pairs and negative pairs. The green dots denote intra-class distances, and dots of other colors denote inter-class distances. The red dot marks a case in which a negative-pair distance falls within the range of positive-pair distances (Color figure online)

3.2.2 The ACT loss

Different from the HST loss, the ACT loss imposes an absolute constraint that the maximum intra-class distance is smaller than any inter-class distance. Formally, for each batch, we aim to minimize the following loss:

$$\begin{aligned} \sum _{\mathbf{x }_{i}}\left[ \max \limits _{\mathbf{x }_{j},y_{i}=y_{j}}d(f(\mathbf{x }_{i}),f(\mathbf{x }_{j}))-\min \limits _{\mathbf{x }_{m},\mathbf{x }_{n},y_{m}\ne y_{n}}d(f(\mathbf{x }_{m}),f(\mathbf{x }_{n}))+\beta \right] _{+} \end{aligned}$$
(4)

where \(\beta \) is a slack parameter. As illustrated in Fig. 2c, with the absolute constraint, the ACT loss pushes any two negatives apart while pulling the positives together. Therefore, compared with the HST loss, it further enhances the discriminative power of the learned features.
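For comparison, the following sketches the ACT loss of Eq. (4) under the same assumptions as the HST sketch above; the only change is that the subtracted term is the global minimum inter-class distance over the whole batch rather than the per-anchor one, and the \(\beta \) value follows the text.

```python
# ACT loss sketch following Eq. (4) (with the batch averaging used later
# in Eq. (7)); names are illustrative.
import torch

def act_loss(feats, labels, beta=1.2):
    dist = torch.cdist(feats, feats)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    # Per-anchor maximum intra-class distance, as in HST.
    hardest_pos = (dist * same.float()).max(dim=1).values
    # Global minimum inter-class distance over all pairs in the batch,
    # so every intra-class distance is pushed below *any* inter-class one.
    inf_mask = torch.full_like(dist, float('inf'))
    min_inter = torch.where(same, inf_mask, dist).min()
    return torch.clamp(hardest_pos - min_inter + beta, min=0).mean()
```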

3.3 The joint loss

To retain the advantage of HST, which prevents CNNs from slow convergence and instability, we keep the HST loss in the algorithm. The final loss therefore consists of two parts: the first part is the HST loss, and the second part is the ACT loss, which pushes the maximum intra-class distance below any inter-class distance. The final loss function is formulated as follows:

$$\begin{aligned} \mathbf{L }=\mu \mathbf{L }_{HST}(\theta )+(1-\mu )\mathbf{L }_{ACT}(\theta ) \end{aligned}$$
(5)

where

$$\begin{aligned} \mathbf{L }_{HST}= & {} \frac{1}{S\times N}\sum _{\mathbf{x }_{i}}\left[ \max \limits _{\mathbf{x }_{j},y_{i}= y_{j}}d(f(\mathbf{x }_{i}),f(\mathbf{x }_{j}))-\min \limits _{\mathbf{x }_{k},y_{i}\ne y_{k}}d(f(\mathbf{x }_{i}),f(\mathbf{x }_{k}))+\alpha \right] _{+} \end{aligned}$$
(6)
$$\begin{aligned} \mathbf{L }_{ACT}= & {} \frac{1}{S\times N}\sum _{\mathbf{x }_{i}}\left[ \max \limits _{\mathbf{x }_{j},y_{i}=y_{j}}d(f(\mathbf{x }_{i}),f(\mathbf{x }_{j}))-\min \limits _{\mathbf{x }_{m},\mathbf{x }_{n},y_{m}\ne y_{n}}d(f(\mathbf{x }_{m}),f(\mathbf{x }_{n}))+\beta \right] _{+} \end{aligned}$$
(7)

where \(d(f(\mathbf{x }_{m}),f(\mathbf{x }_{n}))\) with \(y_{m}\ne y_{n}\) is the distance between any two images with different labels in a batch, and \(\mu \) is the parameter balancing the HST loss and the ACT loss. The value of \(\mu \) is chosen in the same way as in Cheng et al. (2016) and is set to 0.6. \(\alpha \) and \(\beta \) are the margins enforcing the HST loss and the absolute constraint, respectively. We traverse the margin parameter \(\alpha \) over {0.1, 0.2, 0.3, 0.4, 0.5} and choose 0.4. Similarly, we traverse \(\beta \) over {0.8, 1.0, 1.2, 1.4, 1.6} and choose 1.2. Equation 5 is optimized using standard stochastic gradient descent with momentum (Jia et al. 2014).
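Putting the two terms together, a sketch of the joint loss in Eq. (5), with the hyperparameters reported above (\(\mu =0.6\), \(\alpha =0.4\), \(\beta =1.2\)) and reusing the hst_loss and act_loss sketches:

```python
# Joint loss sketch following Eq. (5), reusing the hst_loss and act_loss
# sketches above; mu, alpha, beta follow the values reported in the text.
def joint_loss(feats, labels, mu=0.6, alpha=0.4, beta=1.2):
    return mu * hst_loss(feats, labels, alpha) + (1 - mu) * act_loss(feats, labels, beta)
```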

Algorithm 1 summarizes the learning details in the CNNs with the joint supervision, where \(\eta (t)\) is the learning rate; it starts from 0.01 and is divided by 10 every 10,000 iterations.

Algorithm 1 Learning in the CNNs with the joint supervision
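Since Algorithm 1 is reproduced only as a figure, the following is a plausible reconstruction of the training loop it summarizes, based solely on the surrounding text; the model, the batch sampler, and the momentum value of 0.9 are assumptions, not details taken from the original algorithm.

```python
# Plausible training-loop sketch for Algorithm 1, under stated
# assumptions: SGD with momentum, learning rate eta(t) starting at 0.01
# and divided by 10 every 10,000 iterations, joint loss from Eq. (5).
import torch
import torch.nn.functional as F

def train(model, sample_batch, max_iter=50000):
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10000, gamma=0.1)
    for t in range(max_iter):
        images, labels = sample_batch()            # S identities x N images each
        feats = F.normalize(model(images), dim=1)  # l2-normalize the features
        loss = joint_loss(feats, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()                               # eta(t) schedule
```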

4 Experiment

In Sect. 4.1, we introduce the datasets used in the experiments. The necessary implementation details are given in Sect. 4.2. In Sects. 4.3 and 4.4, extensive experiments are conducted on several public-domain face datasets to verify the effectiveness of the proposed approach.

4.1 Introduction of the datasets

CASIA-WebFace The CASIA-WebFace dataset (Yi et al. 2014) is used as the training data in our experiments. It contains 494,414 images of 10,575 subjects, collected semi-automatically from the Internet under unconstrained conditions.

LFW The LFW (Labeled Faces in the Wild) dataset (Huang et al. 2007) contains 13,233 web-collected images of 5749 different identities, with large variations in pose, expression, and illumination. We follow the standard protocol of unrestricted, labeled outside data.

YTF The YTF (YouTube Faces) dataset (Wolf et al. 2011) consists of 3425 videos of 1595 different people, with an average of 2.15 videos per person. The clip durations vary from 48 to 6070 frames, with an average length of 181.3 frames.

IJB-A The IJB-A (IARPA Janus Benchmark A) dataset includes 5396 images and 20,412 video frames of 500 subjects and is challenging due to its uncontrolled pose variations. Different from the previous datasets, IJB-A defines face template matching, where each template contains a varying number of images. It consists of 10 folders, each of which is a different partition of the full set.

4.2 Implementation details

Preprocessing All faces and their landmarks are detected by the recently proposed algorithm of Zhang et al. (2016). Five landmarks (two eyes, the nose, and two mouth corners) are used for similarity transformation. Finally, the faces are cropped to \(112\times 96\) RGB images, and each pixel is normalized by subtracting 127.5 and then dividing by 128.
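As a small illustration of the pixel normalization just described (detection and alignment are omitted), assuming the face crop is available as a NumPy uint8 array:

```python
# Pixel normalization sketch for an aligned 112x96 RGB face crop;
# detection/alignment (Zhang et al. 2016) is omitted here.
import numpy as np

def normalize_face(img_uint8):
    # img_uint8: array of shape (112, 96, 3), dtype uint8
    return (img_uint8.astype(np.float32) - 127.5) / 128.0
```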

Detailed settings in CNNs Caffe (Jia et al. 2014) is used to implement the ACT loss and the CNNs. The ResNet-50 network is used in the experiments. For fair comparison, we train four models under the supervision of the softmax loss, the triplet loss, the HST loss, and the ACT loss, respectively (the latter three all use softmax for network initialization). The models are trained with a batch size of 128 on 3 parallel GPUs (1080 Ti). For the softmax loss model, the learning rate starts from 0.01 and is divided by 10 every 10,000 iterations. The other three models are observed to converge more slowly, so their maximum number of iterations is set to 50,000.

Detailed settings in testing The LFW, YTF, and IJB-A datasets are used to evaluate the proposed algorithm, and we follow the protocol of each dataset. For LFW, there are 6000 testing pairs in the standard protocol, of which 3000 are matched pairs and the rest are mismatched pairs. The YTF dataset contains 10 folders of 500 video pairs; we follow the standard verification protocol and report the average accuracy over the splits with cross-validation in Table 3. The deep features extracted by the network are concatenated as the representation, and the score is computed as the Euclidean distance between two features. Note that we use only a single model for all testing.
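The verification step described above can be sketched as follows; the threshold value is an illustrative assumption and would in practice be chosen on held-out splits.

```python
# Verification-scoring sketch: l2-normalize the two deep features and
# compare their Euclidean distance against a threshold. The threshold
# value here is a placeholder assumption, not a reported number.
import numpy as np

def same_identity(feat_a, feat_b, threshold=1.1):
    feat_a = feat_a / np.linalg.norm(feat_a)
    feat_b = feat_b / np.linalg.norm(feat_b)
    score = np.linalg.norm(feat_a - feat_b)  # Euclidean distance as the score
    return score < threshold                 # True if predicted same identity
```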

Fig. 4 a Examples from the LFW dataset. b Examples from the YTF dataset. c Examples from the IJB-A dataset

4.3 Effectiveness of the HST loss

In this part, the HST loss is evaluated on two well-known face recognition benchmarks collected under unconstrained environments, namely the LFW and YTF datasets. Some examples from the datasets are shown in Fig. 4.

The following observations can be made from the results in Table 1. The HST loss beats both the baseline (supervised by the softmax loss) and the triplet loss: the accuracy improves from 96.41% on LFW and 85.44% on YTF (softmax) to 97.25% and 87.80% (triplet), and further to 98.42% and 89.86% (HST). This shows that the HST loss can learn more discriminative features than the softmax loss and the triplet loss. Figure 5 shows the ROC curves on the LFW and YTF datasets, which verify the effectiveness of the HST loss function.

4.4 Effectiveness of the ACT loss

The verification rates at 1% FAR (False Accept Rate) of the ACT loss, the softmax loss, the triplet loss, and the HST loss on the LFW and YTF datasets are compared in Table 1.

From Table 1, it is observed that the performance of the ACT loss on the LFW and YTF datasets is better than that of the softmax loss, the triplet loss, and the HST loss. This indicates that the features learned by the ACT loss are more discriminative than those learned by the other three losses and verifies the effectiveness of the ACT loss. Figure 5 shows the ROC curves of the four losses on the LFW and YTF datasets, which further confirm the effectiveness of the ACT loss.

Table 1 The verification rates (%) at 1% FAR (false accept rate) of the ACT loss, the softmax loss, the triplet loss, and the HST loss on the LFW and YTF datasets
Fig. 5 ROC curves of the different losses on the LFW dataset (left) and the YTF dataset (right)

4.5 Comparison with other algorithms with different loss functions

The performance of the features learned by the proposed algorithm is verified on the LFW, YTF, and IJB-A datasets, and the results are shown in Tables 2, 3 and 4, respectively.

Table 2 The verification rates of different algorithms with different loss functions on the LFW dataset
Table 3 The verification rates of different algorithms on the YTF dataset
Table 4 The verification rates of different algorithms on the IJB-A dataset
Fig. 6 The results of different methods on the LFW (left) and YTF (right) datasets

From the results in Tables 2 and 3, one can observe that our algorithm achieves comparable results. This shows that the ACT loss can enhance the discriminative power of deeply learned features, demonstrating its effectiveness. It is worth mentioning that only a single CNN model is used in our experiments, and it is easy to implement. In addition, the ROC curves of the different methods on the LFW and YTF datasets are shown in Fig. 6, which supports the same conclusion.

5 Conclusion

In this paper, a new loss function called the ACT loss is proposed. By adding an absolute constraint to the HST loss, the joint loss function alleviates the face verification difficulty caused by the non-uniform inter-class distance distributions of different identities. The effectiveness of the proposed method is verified on the LFW and YTF datasets. It is worth mentioning that, with only a single model, the method achieves comparable results in the experiments. In addition, this work is easy to transfer to video-based face recognition, so our future work may involve recognition from videos.