1 Introduction

Recently, deep learning has transformed computer vision by significantly improving performance in many applications. A variety of vision tasks, such as image classification [1] and object detection [2], have benefited from the robust and discriminative representations learnt by CNN models. For face recognition, the methods in [3,4,5, 9] far surpass even excellent traditional hand-crafted features and classifiers [10, 11]. The accuracy on the LFW [12] benchmark has been improved from 97% [13] to 99% [3, 14, 15]. A general face recognition framework consists of two steps. First, a deep CNN model supervised by a multiclass loss is trained to extract a feature vector of relatively high dimension. Then, the feature is combined with PCA [16], a Bayesian model [3,4,5] or metric learning [14, 15] to obtain a more efficient low-dimensional representation that distinguishes faces of different identities. Meanwhile, a huge amount of labeled face data is another important factor for performance, because deep learning is a data-driven approach. The amount of training data in these methods ranges from 100K up to 260M. Unfortunately, most of these data are Western faces, and some of the datasets are not public. Therefore, how to use them appropriately is a difficult problem.

In this paper, we introduce our multiple-step model training method for recognizing Asian faces in real scenes. In step 1, we train a baseline model on a global large-scale dataset, which mainly enhances the generalization ability of the model. In step 2, we improve the specificity of the model by using data whose distribution is closer to that of the real scene. In the last step, metric learning is used to make the model more discriminative and expressive. Meanwhile, strategies including data cleaning, data augmentation and data balancing are used during training to improve the overall performance. The experiments show how each step influences the performance. Moreover, we demonstrate the feasibility of using face verification techniques in the real world.

The rest of this paper is organized as follows: In Sect. 2, we introduce our work for the three steps mentioned above in detail. Some experiments are presented and analyzed in Sect. 3. Finally, we draw a conclusion in Sect. 4 with a brief summary.

2 Method

Our method targets the training of face recognition models for Asian faces in real application scenes. CNN models are data-driven, and collecting enough Asian face data to train a strong model from scratch is very hard. However, many Western face databases are public, so we can use these resources for model training and transfer part of what is learned to Asian faces. We therefore propose three steps to train the CNN model: (a) pre-train a baseline model on global large-scale face data covering different ages and nationalities; (b) fine-tune the model on a smaller set of Asian face images; (c) learn a metric embedding on face data from the real scene. The training framework is shown in Fig. 1. The details of each step are presented in the following subsections.

Fig. 1. Training frame

2.1 Network Architecture

Before describing the three training steps in detail, we first introduce the architecture of the CNN used in this paper, which is shown in Fig. 2. It closely follows the residual network [1], because identity mapping through shortcut connections addresses the performance degradation problem that appears as network depth increases. Compared with the VGG network in [14], it is also faster: it needs 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs) [14].
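The identity shortcut mentioned above can be illustrated with a minimal sketch. It is not the network of Fig. 2: plain Python lists stand in for feature maps, and `residual_fn` is a hypothetical placeholder for the stacked convolution layers of one block.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Return relu(F(x) + x), where F is the learned residual mapping.

    `residual_fn` (hypothetical) stands in for the convolution layers of the
    block; it must return a list as long as `x` so the shortcut can be added.
    """
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut needs matching shapes")
    return relu([f + xi for f, xi in zip(fx, x)])


def zero_branch(x):
    """A residual branch that has learned nothing: F(x) = 0."""
    return [0.0] * len(x)


# With F(x) = 0 the block passes a non-negative input through unchanged,
# which is why stacking more blocks need not degrade the representation.
features = [0.5, 1.0, 2.0]
unchanged = residual_block(features, zero_branch)
```

Because the block only has to learn the residual F(x) rather than the full mapping, an added block can fall back to the identity, which is the property the text relies on.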

Fig. 2. Architectures of residual net

2.2 Data Preparation

To achieve higher accuracy, the training datasets for CNNs keep growing (Table 1). Several face datasets have been published, such as CASIA-WebFace [9], CelebFaces+ [3], the VGG face dataset [14] and MS-Celeb-1M [17]. As shown in Table 2, the published face databases are becoming larger and larger.

Table 1. Some common face training datasets
Table 2. Test result on LFW

As mentioned at the beginning of Sect. 2, our three-step training process needs different kinds of datasets for different training purposes. First, we need a wide range of face data to meet the general recognition need; we then narrow the scope of the face data gradually in the following steps as the application scene becomes more specific. Therefore, we prepare three different kinds of datasets.

For step 1, we choose the public MS-Celeb-1M dataset for its considerably wide distribution. It covers about 1 million celebrities from around the world, and each identity may contain faces at different ages. This means that the CNN model can learn more extensive knowledge from a large number of faces. However, such large-scale datasets contain many noisy labels because they are collected automatically from the internet. Therefore, how to learn a CNN model from large-scale face data with massive label noise is a difficult problem. We explain the details in the Data Cleaning section.

For step 2, we select a relatively small and specific but clean dataset for training. The advantage is that the model retains the characteristics learned from the large-scale data, which compensates for the lack of data in the real application scene, while becoming more specific to the real situation. Here we mainly target Asian faces, so we use an Asian celebrities dataset, private for the moment, with about 10K identities and 500K face images. It is similar to the CASIA-WebFace dataset in terms of the number of identities and faces. At this stage, data augmentation is heavily needed to increase the diversity of the training data; it is described later in this section.

For step 3, metric learning is a method commonly combined with CNN classification models. Its main role is to make face features more discriminative and easier to separate; examples are the contrastive loss [3], the triplet loss [15] and the center loss [8]. In the target face recognition tasks in real application environments, such as security checks at stations and bank account opening, it is easier to collect image pairs of one identity, with only two or three images per identity. It is therefore very difficult to train a face classification model on such a dataset because of the lack of samples per class, whereas metric learning with the triplet loss is well suited to it.

Data Cleaning

Label noise is an important issue in machine learning as datasets become large-scale. Many methods [14, 17,18,19] are devoted to dealing with noisy labels, each with its own strengths in its application. Our data cleaning scheme is similar in spirit to that of [14, 19].

First, we train a baseline model on a clean dataset such as CASIA-WebFace or VGG-Face. Second, we use the trained model to predict on the MS-Celeb-1M dataset, select the top 50 images of each identity as positive training samples, and collect the top 50 images of all other identities as negative training samples. Third, a linear SVM is trained for each identity using the Fisher Vector Faces descriptor [15, 20] to rank the images of that identity. To favor high precision in the positive predictions, we choose a threshold N that determines how many samples are retained for each identity. Finally, the cleaned dataset retains 6,193,218 face images of 99,891 identities.
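The final ranking-and-retention step can be sketched as follows. This is a minimal illustration only: it assumes the per-identity classifier scores are already available, and the function name, file names and score values are hypothetical.

```python
def keep_top_ranked(scores_by_identity, keep_n):
    """Keep the `keep_n` highest-scoring images of every identity.

    scores_by_identity: dict mapping identity -> list of (image_id, score),
    where a higher score means the per-identity classifier is more confident
    that the image really shows this identity.
    """
    cleaned = {}
    for identity, scored in scores_by_identity.items():
        # Rank the images of this identity from most to least confident.
        ranked = sorted(scored, key=lambda item: item[1], reverse=True)
        cleaned[identity] = [image_id for image_id, _ in ranked[:keep_n]]
    return cleaned


# Hypothetical classifier scores for two identities.
scores = {
    "id_0": [("a.jpg", 0.9), ("b.jpg", -0.4), ("c.jpg", 0.7)],
    "id_1": [("d.jpg", 0.2), ("e.jpg", 1.3)],
}
cleaned = keep_top_ranked(scores, keep_n=2)
```

A smaller `keep_n` discards more low-ranked images, which trades dataset size for label precision.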

Data Preprocessing and Augmentation

Before training the CNN model, all faces are detected by a face detector and resized to a fixed size. No facial landmark alignment is used in our method, because face poses vary in the real scene and the network should learn to handle them automatically.

To enhance the learning ability of the model, we enrich the training data with random mirroring, random rotation, random noise and random color casting, as in [20]. Each pixel is normalized to [−1, 1] by a subtraction and a division. In addition, we randomly crop multiple patches from each training sample to adapt to different face poses and angles.
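A minimal sketch of three of these operations on a toy grayscale image is given below. The constants are assumptions: the text only states that a subtraction and a division are used, so 127.5 is chosen here because it maps 8-bit values exactly onto [−1, 1]. The crop size and the flip probability of 0.5 are likewise illustrative.

```python
import random


def normalize(image):
    """Map 8-bit pixel values in [0, 255] to [-1, 1] (assumed constants)."""
    return [[(p - 127.5) / 127.5 for p in row] for row in image]


def random_mirror(image, rng):
    """Flip the image horizontally with probability 0.5."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return image


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


rng = random.Random(0)  # seeded so the sketch is repeatable
image = [[0, 64, 128, 255] for _ in range(4)]  # toy 4x4 grayscale image
patch = random_crop(random_mirror(normalize(image), rng), 3, 3, rng)
```

Calling `random_crop` several times on the same image yields the multiple patches per sample mentioned in the text.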

2.3 Implementation Details

Baseline Training

For the first training step, we use the cleaned MS-Celeb-1M dataset, which contains 6,193,118 images of 99,891 identities, to train a baseline model. The baseline model is based on the residual network shown in Fig. 2. We employ the 50-layer configuration [1]: although deeper networks often bring better performance, our server memory is limited and a larger network would take too much time to train.

We use Caffe to train the proposed deep architecture on four Titan X GPUs with a batch size of 80. The learning rate is initially set to 0.01 and multiplied by 0.1 at 140,000 iterations; training ends at 260,000 iterations. The momentum is set to 0.9 and the weight decay to 0.0005. The final model reaches 96.8% accuracy on the validation set.

Fine-Tune

For step 2, we fine-tune the pre-trained model on a private dataset that we call Asian-Celeb, collected from Asian celebrities, with about 10K identities and 500K images. This dataset has a distribution similar to that of our target faces and is relatively small, specific and clean, which makes it well suited for fine-tuning.

All training data are preprocessed with data augmentation. We freeze the parameters of the first three convolutional blocks. The learning rate is initially set to 0.01 and then decreased from 1e−2 to 1e−4 with a step policy that multiplies it by 0.1 at 80,000 and 160,000 iterations, with a batch size of 80. The momentum and the weight decay are the same as in the baseline training.
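The step-wise learning rate schedules of the baseline and fine-tuning stages can be written as a small function. This is a sketch under the assumption that the rate changes at the first iteration that reaches each step value; `multistep_lr` is an illustrative name, not a Caffe function.

```python
def multistep_lr(iteration, base_lr, step_iters, gamma=0.1):
    """Learning rate after multiplying by `gamma` at every step already reached."""
    passed = sum(1 for step in step_iters if iteration >= step)
    return base_lr * gamma ** passed


# Fine-tuning schedule from the text: 1e-2, then 1e-3, then 1e-4.
for it in (0, 80_000, 160_000):
    print(it, multistep_lr(it, base_lr=0.01, step_iters=(80_000, 160_000)))

# Baseline schedule from the text: a single drop at 140,000 iterations.
baseline_final_lr = multistep_lr(259_999, base_lr=0.01, step_iters=(140_000,))
```

The printed values may show small floating-point rounding, which does not affect training in practice.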

Metric Learning

Metric learning is a very effective means of improving the accuracy of the model. The model trained above can be seen as a feature extractor; to use it for face verification or recognition, we need a distance, such as the cosine or Euclidean distance, to measure the similarity of faces.
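The two measures named above can be sketched on plain feature vectors as follows. The function names and the decision threshold are illustrative assumptions; in practice the threshold would be tuned on a validation set.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (higher = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors (lower = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def same_identity(u, v, threshold):
    """Accept the pair when its cosine similarity reaches `threshold`."""
    return cosine_similarity(u, v) >= threshold
```

For L2-normalized features the two measures give the same ranking of pairs, so the choice mainly affects how the threshold is expressed.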

After the training above, the network shown in Fig. 2 extracts a 2048-dimensional feature to represent an image. This feature is highly representative but not efficient enough. Metric learning with a triplet loss [14] aims to shorten the Euclidean distance between samples of the same identity and to enlarge it between samples of different identities, while lowering the dimension at the same time. Finally, a 512-dimensional feature is used to represent a face, which also reduces the number of parameters of the model. The implementation details are shown in Fig. 3.

Fig. 3. Triplet implementation detail

One thorny problem of the triplet loss is how to select triplets so that training converges fast. A triplet (a, p, n) contains an anchor image a, a positive image \( p \ne a \) of the anchor's identity, and a negative image n of a different identity. Here the negatives selected for the triplets are those that violate the triplet loss margin, but not the maximally violating ones. The margin in this paper is set to 0.5, and the learning rate is set to 0.005 and kept fixed throughout training.
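The loss and the negative selection rule can be sketched as below. Several points are assumptions rather than details from the text: the squared Euclidean distance is used inside the hinge, the single hardest violating negative is skipped whenever another violating one exists, and the final choice among the remaining candidates is random. All function names are illustrative.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss max(0, d(a,p)^2 - d(a,n)^2 + margin)."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


def pick_negative(anchor, positive, candidates, margin=0.5, rng=random):
    """Pick a margin-violating negative, skipping the hardest one if possible.

    Returns None when no candidate violates the margin, because such
    triplets contribute zero loss and no gradient.
    """
    violating = [n for n in candidates
                 if triplet_loss(anchor, positive, n, margin) > 0.0]
    if not violating:
        return None
    if len(violating) > 1:
        # The hardest negative is the violating one closest to the anchor.
        hardest = min(violating, key=lambda n: squared_distance(anchor, n))
        violating.remove(hardest)
    return rng.choice(violating)
```

Skipping the hardest negatives is a common way to reduce the influence of mislabeled or extremely difficult samples, which matters here because the real-scene data contain a few wrong pairs.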

3 Experiments

We evaluate our method on two datasets. One is the commonly used LFW dataset, on which only the first-step model is tested, since the last two training steps target Asian faces with a different data distribution. The other is a dataset we built ourselves, which contains 50,000 identities, each with two or more samples from the real application scene. Most of them are Asian faces; there are some wrong face pairs, but very few relative to the whole dataset. We test the model from each step on this dataset to illustrate the effect of each step in our method. The test results are shown in Tables 2 and 3.

Table 3. Test result on real scene

For face verification, we use the accuracy at the equal error rate (EER) and the verification rate at an extremely low false acceptance rate, VR@FAR = 0, which is a more practical criterion for evaluating the model. For face recognition, we test the Rank-1 detection and identification rate (DIR), that is, the fraction of genuine probes matched at Rank-1 at a 1% false acceptance rate. Our first-step model achieves results on LFW comparable to those of other methods. Moreover, the verification rate at VR@FAR = 0 and the identification rate at a low false acceptance rate are even more challenging criteria, yet on them our model outperforms the published methods.
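The two verification criteria can be computed from similarity scores of genuine (same identity) and impostor (different identity) pairs. The sketch below is an approximation under stated assumptions: a pair is accepted when its score reaches the threshold, only observed scores are tried as thresholds, and "EER accuracy" is taken to mean one minus the EER. The score lists are hypothetical.

```python
def far_and_frr(genuine, impostor, threshold):
    """False acceptance and false rejection rates at one threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate EER: the operating point where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_and_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]


def vr_at_far_zero(genuine, impostor):
    """Fraction of genuine pairs scoring above every impostor pair."""
    highest_impostor = max(impostor)
    return sum(s > highest_impostor for s in genuine) / len(genuine)


# Hypothetical similarity scores.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.5, 0.3, 0.2, 0.1]
eer_accuracy = 1.0 - equal_error_rate(genuine_scores, impostor_scores)
vr = vr_at_far_zero(genuine_scores, impostor_scores)
```

VR@FAR = 0 depends on the single highest impostor score, so one mislabeled pair in the test set can lower it sharply; this is one reason it is a stricter criterion than EER accuracy.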

On the other test set, of Asian faces, we use only EER accuracy to explain the importance of each step. There are six combinations of training steps, and we label the model from each combination A, B, C, and so on. First, the accuracy of model C reaches 93.56%, which is better than that of model A or B. This suggests that training on only a small dataset of Asian faces is far from enough. Although the MS-Celeb dataset has a different distribution from the target, it is large enough to learn more details from, which makes up for the inadequacy of model B. Of course, this applies only when no large-scale training dataset with the same distribution as the test data is available. Adding step 3, metric learning, improves the accuracy by about 3–4 percentage points, as with models D and E. This indicates that the metric learning step is a really effective means for face recognition. When all the training steps are combined, the accuracy reaches 97.78%, the best result among all combinations.

4 Conclusion

In this paper, we proposed a multiple-step model training method that is flexible and effective for face recognition. We applied data cleaning and data augmentation to the training data and achieved results comparable to the state of the art on LFW. We also achieved good performance on the demanding real application scene after the subsequent steps. We believe the method can be applied well in practice. However, this paper only provides an effective training strategy for data from different application scenes and does not address model construction. In the future, we will work on model compression and on reducing computation time to improve application efficiency.