Abstract
Computer vision based on deep learning has developed rapidly in recent years, and face recognition, an important branch of the field, has made great progress. The state of the art has achieved 99.77% [1] pair-wise verification accuracy on the LFW dataset. However, face data in real application environments, such as security checks at stations and bank account opening, is much more complex than LFW because of occlusion, pose variation, uneven illumination, differing resolutions, and so on. In addition, the LFW dataset contains mostly Western faces and few faces from other regions. Since faces from different regions do not follow a consistent distribution, existing methods often fail to achieve high recognition accuracy in practice. In this paper, targeting Asian faces, we propose a multiple-step CNN training method for real-scene face recognition when large amounts of appropriate data are not available. Each step of the training process plays an important role. Step 1 mainly enhances the generalization ability of the model by using a large-scale dataset drawn from diverse sources. Step 2 improves the specificity of the model by using a smaller dataset whose distribution is closer to that of the real scene. In the final step, metric learning makes the model more discriminative and expressive. In addition, strategies including data cleaning, data augmentation, and data balancing are used to improve overall performance. Experiments show that this method achieves high performance for face recognition in real application scenes.
1 Introduction
Recently, deep learning has significantly improved the state of the art in many computer vision applications. A variety of vision tasks, such as image classification [1] and object detection [2], have benefited from the robust and discriminative representations learnt by CNN models. For face recognition, the methods in [3,4,5, 9] far surpass traditional hand-crafted features and classifiers [10, 11]. The accuracy on the LFW [12] benchmark has improved from 97% [13] to 99% [3, 14, 15]. A general face recognition framework consists of two steps. First, a deep CNN model supervised by a multiclass loss is trained to extract a feature vector of relatively high dimension. Then, the feature is combined with PCA [16], Bayesian methods [3,4,5], or metric learning [14, 15] to obtain a more efficient low-dimensional representation that distinguishes faces of different identities. Meanwhile, because deep learning is a data-driven approach, a huge amount of labeled face data is another important factor in performance. The amount of training data in these methods ranges from 100K to 260M. Unfortunately, most of this data consists of Western faces, and some of it is not public. How to use such data appropriately is therefore a difficult problem.
In this paper, we introduce our multiple-step model training method for real-scene recognition of Asian faces. In step 1, we train a baseline model on a global large-scale dataset, which mainly enhances the generalization ability of the model. In step 2, we improve the specificity of the model by using data whose distribution is closer to that of the real scene. In the last step, metric learning is used to make the model more discriminative and expressive. In addition, strategies including data cleaning, data augmentation, and data balancing are used during training to improve overall performance. The experiments show how each step influences performance. Moreover, we demonstrate the feasibility of using face verification techniques in the real world.
The rest of this paper is organized as follows. In Sect. 2, we describe the three steps mentioned above in detail. Experiments are presented and analyzed in Sect. 3. Finally, Sect. 4 concludes with a brief summary.
2 Method
Our method targets the training of face recognition models for Asian faces in real application scenes. CNN models are data-driven, and collecting enough Asian face data to train a strong model from scratch is extremely difficult. However, many Western face databases are public, so we can use these resources for model training and transfer part of what is learned to Asian faces. We therefore propose three steps to train the CNN model: (a) pre-train a baseline model on global large-scale face data covering different ages and nationalities; (b) fine-tune the model on a smaller set of Asian face images; (c) learn a metric embedding on real-scene face data. The training framework is shown in Fig. 1. The details of each step are presented in the following subsections.
2.1 Network Architecture
Before describing the three training steps in detail, we first introduce the CNN architecture used in this paper, shown in Fig. 2. It closely follows the residual network [1], which addresses the performance degradation that appears as depth increases by using identity mappings through shortcut connections. Compared with the VGG network in [14], it is also faster, requiring 3.6 billion FLOPs (multiply-adds), only 18% of the 19.6 billion FLOPs of VGG-19 [14].
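As a brief illustration of the identity shortcut, a residual block in the sense of [1] can be written as follows, where \( \mathbf{x} \) is the block input and \( \mathcal{F} \) is the residual mapping learned by the stacked layers (the symbols are introduced here for illustration only):

\[ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} \]

Taking the FLOP counts quoted above at face value, the stated ratio follows from

\[ \frac{3.6 \times 10^{9}}{19.6 \times 10^{9}} \approx 0.184 \approx 18\% \]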
2.2 Data Preparation
To achieve the highest accuracy, the training datasets for CNNs keep growing (Table 1). Several face datasets have been published, such as CASIA-WebFace [9], CelebFaces+ [3], the VGG face dataset [14], and MS-Celeb-1M [17]. As shown in Table 2, the published face databases are becoming larger and larger.
As mentioned at the beginning of Sect. 2, our three-step training process needs different kinds of datasets to serve different training purposes. We first need a wide range of face data to meet the need for general recognition, and then narrow the scope of the face data gradually in the following steps as the application scene becomes more specific. We therefore prepare three different datasets.
For step 1, we choose the public MS-Celeb-1M dataset for its considerably wide distribution. It covers about 1 million celebrities from around the world, and each identity may contain faces at different ages. This means that the CNN model can learn more extensive knowledge from a large number of faces. However, such large-scale datasets contain many noisy labels, mainly because they are collected automatically from the internet. How to learn a CNN model from large-scale face data with many noisy labels is therefore a difficult problem. We explain the details in the Data Cleaning section.
For step 2, we select a relatively small and specific but clean dataset for training. The advantage is that the model retains the characteristics learned from the original large-scale data, which compensates for the lack of data from the real application scene, while becoming more specific to the real situation. Since we mainly target Asian faces, we use an Asian celebrities dataset, currently private, with about 10K identities and 500K face images. It is similar to the CASIA-WebFace dataset in the number of identities and faces. At this stage, data augmentation is heavily needed to increase the diversity of the training data, as described in the Data Preprocessing and Augmentation section.
For step 3, metric learning is commonly used in combination with a CNN classification model. Its main role is to make face features more discriminative and easier to distinguish, using losses such as contrastive loss [3], triplet loss [15], and center loss [8]. For our ultimate task of face recognition in real application environments, such as security checks at stations and bank account opening, it is easiest to collect pairs of images of the same identity, with only two or three images per identity. Training a face classification model on such a dataset is very difficult because there are too few samples per identity, but the triplet loss is well suited to metric learning on it.
Data Cleaning
Noisy labels are an important issue in machine learning as datasets become large-scale. Many methods [14, 17,18,19] have been devoted to the noisy label problem, each with its own strengths in its application. Our data cleaning scheme is similar in spirit to those of [14, 19].
First, we train a baseline model on a clean dataset such as CASIA-WebFace or VGG-Face. Second, we apply the trained model to the MS-Celeb-1M dataset, select the top 50 images of each identity as positive training samples, and collect the top 50 images of all other identities as negative training samples. Third, a linear SVM is trained for each identity using the Fisher Vector Faces descriptor [15, 20] and used to rank the images of that identity. To favor high precision in the positive predictions, we choose a threshold N that determines how many samples are retained for each identity. After cleaning, 6,193,218 face images of 99,891 identities remained.
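The final ranking-and-thresholding stage of this scheme can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the per-identity SVM scores have already been computed, and the function name `keep_top_n`, the tuple layout, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def keep_top_n(scored_images, n):
    """Keep the n highest-scoring images of each identity.

    scored_images: iterable of (identity, image_id, score) tuples, where a
    higher score means the per-identity classifier is more confident that
    the image really shows that identity (assumed to be precomputed).
    """
    by_identity = defaultdict(list)
    for identity, image_id, score in scored_images:
        by_identity[identity].append((score, image_id))

    cleaned = {}
    for identity, items in by_identity.items():
        # Sort by descending score; ties are broken by image id so the
        # result is deterministic.
        items.sort(key=lambda item: (-item[0], item[1]))
        cleaned[identity] = [image_id for _, image_id in items[:n]]
    return cleaned


# Toy example with made-up scores (hypothetical data, not from the paper).
toy = [
    ("id_a", "a1.jpg", 0.91), ("id_a", "a2.jpg", 0.15), ("id_a", "a3.jpg", 0.78),
    ("id_b", "b1.jpg", 0.60), ("id_b", "b2.jpg", 0.88),
]
cleaned = keep_top_n(toy, n=2)
```

A larger N keeps more data at the cost of more label noise, so in practice N would be tuned on a held-out, manually checked subset.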
Data Preprocessing and Augmentation
Before training the CNN model, all face images are detected by a face detector and resized to a fixed size. No facial landmark alignment is used in our method, because face poses vary in real scenes and the network should learn the features automatically.
To enhance the learning ability of the model, we enrich the training data with random mirroring, random rotation, random noise, and random color casting, as in [20]. Each pixel is normalized to [−1, 1] by a subtraction and a division. In addition, we randomly crop multiple patches from the training samples to adapt to different face poses and angles.
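A minimal sketch of three of these operations (mirroring, cropping, and pixel normalization) is given below. It is an illustration under stated assumptions: images are single-channel nested lists of 8-bit intensities, the normalization constants 127.5 and 127.5 are one common choice that maps [0, 255] to [−1, 1] and are not stated in the paper, and all function names are hypothetical.

```python
import random


def normalize_pixel(value, mean=127.5, scale=127.5):
    """Map an 8-bit intensity in [0, 255] to [-1, 1] (assumed constants)."""
    return (value - mean) / scale


def random_mirror(image, rng):
    """Flip the image left-to-right with probability 0.5."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return image


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w patch at a uniformly random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng):
    """Mirror, crop, then normalize one training image."""
    patch = random_crop(random_mirror(image, rng), crop_h, crop_w, rng)
    return [[normalize_pixel(v) for v in row] for row in patch]


rng = random.Random(0)  # fixed seed so the example is repeatable
image = [[(16 * r + c) % 256 for c in range(8)] for r in range(8)]
patch = augment(image, crop_h=6, crop_w=6, rng=rng)
```

Calling `augment` several times on the same image yields the multiple patches mentioned above; rotation, noise, and color casting would be added as further steps in the same chain.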
2.3 Implementation Details
Baseline Training
In the first step, we train a baseline model on the cleaned MS-Celeb-1M dataset, which contains 6,193,118 images of 99,891 identities. The baseline model is based on the residual network shown in Fig. 2. Although deeper networks often perform better, we employ the 50-layer configuration [1] because our server memory is limited and larger networks would take too long to train.
We use Caffe to train the model on four Titan X GPUs with a batch size of 80. The learning rate is initially set to 0.01 and multiplied by 0.1 at 140,000 iterations, and training ends at 260,000 iterations. The momentum is set to 0.9 and the weight decay to 0.0005. The final model reaches 96.8% accuracy on the validation set.
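To make the role of these hyperparameters concrete, the sketch below applies one common formulation of SGD with momentum and L2 weight decay to a single scalar weight. It is an illustration only, not the solver code used in the paper; the function name `sgd_momentum_step` and the gradient value 0.5 are hypothetical.

```python
def sgd_momentum_step(weight, grad, velocity,
                      lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay on a scalar weight."""
    # L2 weight decay adds weight_decay * weight to the loss gradient.
    grad = grad + weight_decay * weight
    # The velocity accumulates a decaying sum of past gradient steps.
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


# Two consecutive steps with the same (made-up) loss gradient.
w0, v0 = 1.0, 0.0
w1, v1 = sgd_momentum_step(w0, grad=0.5, velocity=v0)
w2, v2 = sgd_momentum_step(w1, grad=0.5, velocity=v1)
```

With a constant gradient, the second step is larger than the first because the momentum term carries over 90% of the previous update.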
Fine-Tune
For step 2, we fine-tune the pre-trained model on a private dataset that we call Asian-Celeb, which contains about 10K identities and 500K images of Asian celebrities. This dataset has a distribution similar to that of our target faces and is relatively small, specific, and clean, which makes it well suited to fine-tuning.
All training data is preprocessed with data augmentation. We freeze the parameters of the first three convolutional blocks. The learning rate is initially set to 0.01 and then decreased from 1e−2 to 1e−4 with a step policy, multiplying by 0.1 at 80,000 and 160,000 iterations, with a batch size of 80. The momentum and weight decay are the same as in baseline training.
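The step-wise learning rate policy can be sketched as a small function. This is a minimal illustration, assuming the rate drops as soon as the iteration counter reaches a milestone; the name `multistep_lr` and its parameters are hypothetical and not taken from any solver configuration in the paper.

```python
def multistep_lr(iteration, base_lr, milestones, gamma=0.1):
    """Learning rate after multiplying by gamma at each milestone reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops


# Fine-tuning schedule: 1e-2 -> 1e-3 at 80,000 -> 1e-4 at 160,000 iterations.
fine_tune = [multistep_lr(i, 0.01, (80_000, 160_000))
             for i in (0, 80_000, 160_000)]

# Baseline schedule: a single drop at 140,000, training ends at 260,000.
baseline = [multistep_lr(i, 0.01, (140_000,))
            for i in (0, 140_000, 259_999)]
```

The same function covers both training stages; only the milestone tuple changes.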
Metric Learning
Metric learning is a very effective means of improving model accuracy. The model trained above can be seen as a feature extractor; to use it for face verification or recognition, we need to define a distance, such as cosine or Euclidean distance, to measure the similarity of faces.
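The two measures named above can be sketched in a few lines. This is a minimal illustration that treats features as plain lists of floats and assumes non-zero vectors; the function names and the decision threshold are hypothetical, and in practice the threshold would be chosen on a validation set.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_identity(u, v, threshold):
    """Verification decision: accept the pair if similarity is high enough."""
    return cosine_similarity(u, v) >= threshold


# Two toy 2-D "features" (real features would have hundreds of dimensions).
u, v = [3.0, 4.0], [4.0, 3.0]
cos_uv = cosine_similarity(u, v)
dist_uv = euclidean_distance(u, v)
```

For features normalized to unit length, the two measures rank pairs identically, since the squared Euclidean distance then equals 2 minus twice the cosine similarity.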
After the training above, we can extract a 2048-dimensional feature to represent an image (Fig. 2). This feature is highly representative but not efficient enough. Metric learning with a triplet loss [14] aims to shorten the Euclidean distance between samples of the same identity and enlarge it between samples of different identities, while lowering the dimension at the same time. Finally, a 512-dimensional feature is used to represent a face, which also reduces the number of model parameters. The implementation details are shown in Fig. 3.
One thorny problem of the triplet loss is how to select triplets so that training converges quickly. A triplet (a, p, n) contains an anchor image a, a positive image \( p \ne a \) of the anchor's identity, and a negative image n of a different identity. Here we select negative samples that violate the triplet loss margin, but not the ones that violate it most severely. The margin in this paper is set to 0.5, and the learning rate is set to 0.005 and kept fixed throughout training.
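One reading of this selection rule is the "semi-hard" negative mining of [15]: keep negatives that are farther from the anchor than the positive, yet still inside the margin. The sketch below follows that interpretation, which is an assumption rather than a confirmed detail of this paper, as is the use of squared Euclidean distance; all function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the negative is margin farther away."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def select_semi_hard_negatives(anchor, positive, candidates, margin=0.5):
    """Keep negatives with d_ap < d_an < d_ap + margin.

    Such negatives still produce a positive loss (they violate the margin)
    but are not the hardest ones, which lie closer than the positive.
    """
    d_ap = squared_distance(anchor, positive)
    return [n for n in candidates
            if d_ap < squared_distance(anchor, n) < d_ap + margin]


anchor, positive = [0.0, 0.0], [0.6, 0.0]
candidates = [[0.5, 0.0],   # closer than the positive: hardest, skipped
              [0.8, 0.0],   # inside the margin: selected
              [1.0, 0.0]]   # beyond the margin: zero loss, skipped
selected = select_semi_hard_negatives(anchor, positive, candidates)
```

Skipping the hardest negatives is a common way to keep mislabeled or extreme samples from dominating the early updates.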
3 Experiments
We evaluate our method on two datasets. The first is the commonly used LFW dataset, on which only the step 1 model is tested, since the last two training steps target Asian faces with a different data distribution. The second was built by ourselves and contains 50,000 identities, each with two or more samples from the real application scene. Most of them are Asian faces, and there are some wrongly labeled face pairs, though very few relative to the whole dataset. We test the model from each step on this dataset to illustrate the effect of each step in our method. Test results are shown in Tables 2 and 3.
For face verification, we use the equal error rate (EER) accuracy and the verification rate at an extremely low false acceptance rate, VR@FAR = 0, which is a more practical criterion for evaluating a model. For face recognition, we test the Rank-1 detection and identification rate (DIR), the fraction of genuine probes matched at Rank-1 at a 1% false acceptance rate. Our step 1 model achieves results on LFW comparable with those of other methods. Moreover, the verification rate at VR@FAR = 0 and the identification rate at a low false acceptance rate are even more challenging criteria, yet on them our model outperforms the published methods.
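The two verification criteria can be sketched from lists of similarity scores for genuine (same-identity) and impostor (different-identity) pairs. This is a minimal illustration: it assumes that higher scores mean more similar faces and that "EER accuracy" means one minus the equal error rate; the function names and toy scores are hypothetical.

```python
def vr_at_far_zero(genuine, impostor):
    """Fraction of genuine pairs scoring above every impostor pair."""
    threshold = max(impostor)
    return sum(1 for s in genuine if s > threshold) / len(genuine)


def equal_error_rate(genuine, impostor):
    """Approximate the EER by scanning every observed score as a threshold."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        # Pairs with score >= t are accepted as the same identity.
        far = sum(1 for s in impostor if s >= t) / len(impostor)
        frr = sum(1 for s in genuine if s < t) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
vr = vr_at_far_zero(genuine, impostor)
eer = equal_error_rate(genuine, impostor)
eer_accuracy = 1.0 - eer
```

On large test sets, VR@FAR = 0 is determined by the single highest impostor score, which is why it is a much stricter criterion than the EER.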
On the other test set of Asian faces, we use only EER accuracy to show the importance of each step. There are six combinations of training steps, and we label the model from each combination A, B, C, and so on. First, the accuracy of model C reaches 93.56%, which is better than that of model A or B. This suggests that training on a small Asian face dataset alone is far from enough. Although the MS-Celeb dataset has a different distribution from the target, it is large enough to learn more detail, which makes up for the inadequacy of model B. Of course, this applies only when no large-scale training dataset with the same distribution as the test data is available. Adding step 3, metric learning, improves accuracy by about 3–4 percentage points, as with models D and E, which indicates that the metric learning step is an effective means for face recognition. When all the training steps are combined, the accuracy reaches 97.78%, the best result among all combinations.
4 Conclusion
In this paper, we proposed a multiple-step model training method that is flexible and effective for face recognition. We applied data cleaning and data augmentation to network training and achieved results comparable to the state of the art on LFW. With the subsequent steps, we also achieved good performance on the demanding real application scene. We believe the method can be applied well in practice. However, this paper only provides an effective training strategy for applications with differing data and does not address model construction. In the future, we will work on model compression and time reduction to improve efficiency in applications.
References
1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, https://arxiv.org/abs/1512.03385 (2015)
2. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. CoRR, https://arxiv.org/abs/1506.02640 (2015)
3. Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. Proc. Adv. Neural Inf. Process. Syst. 27, 1988–1996 (2014)
4. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3476–3483 (2013)
5. Sun, Y., Wang, X., Tang, X.: Deeply learned face representations are sparse, selective, and robust. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2892–2900 (2015)
6. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708 (2014)
7. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Web-scale training for face identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2746–2754 (2015)
8. Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: Proceedings of the European Conference on Computer Vision, pp. 499–515. Springer (2016)
9. Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. CoRR, https://arxiv.org/abs/1411.7923 (2014)
10. Chen, D., Cao, X., Wen, F., Sun, J.: Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3025–3032. IEEE (2013)
11. Cao, X., Wipf, D., Wen, F., Duan, G.: A practical transfer learning algorithm for face verification. In: International Conference on Computer Vision (ICCV) (2013)
12. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007
13. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708 (2014)
14. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: Proceedings of the British Machine Vision Conference (2015)
15. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
16. Zhou, E., Cao, Z., Yin, Q.: Naive-deep face recognition: touching the limit of LFW benchmark or not? Technical report, arXiv:1501.04690
17. Sukhbaatar, S., Fergus, R.: Learning from noisy labels with deep neural networks. CoRR, https://arxiv.org/abs/1406.2080 (2014)
18. Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. CoRR, https://arxiv.org/abs/1412.6596 (2014)
19. Wu, X., He, R., Sun, Z., et al.: A light CNN for deep face representation with noisy labels. Computer Science (2016)
20. Wu, R., Yan, S., Shan, Y., et al.: Deep image: scaling up image recognition. arXiv preprint arXiv:1501.02876, 22, 388 (2015)
21. Dai, W., Yang, Q., Xue, G.R., et al.: Boosting for transfer learning. In: International Conference on Machine Learning, pp. 193–200. ACM (2007)
Acknowledgements
The authors of this paper are members of the Shanghai Engineering Research Center of Intelligent Video Surveillance. Our research was sponsored by the following projects: the National Natural Science Foundation of China (61403084, 61402116); Program of Science and Technology Commission of Shanghai Municipality (Nos. 15530701300, 15XD15202000); 2012 IoT Program of Ministry of Industry and Information Technology of China; Key Project of the Ministry of Public Security (No. 2014JSYJA007); the Project of the Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University (ESSCKF 2015-03); Shanghai Rising-Star Program (17QB1401000).
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Li, D., Zhang, X., Song, L., Zhao, Y. (2018). Multiple-Step Model Training for Face Recognition. In: Abawajy, J., Choo, KK., Islam, R. (eds) International Conference on Applications and Techniques in Cyber Security and Intelligence. ATCI 2017. Advances in Intelligent Systems and Computing, vol 580. Edizioni della Normale, Cham. https://doi.org/10.1007/978-3-319-67071-3_21
DOI: https://doi.org/10.1007/978-3-319-67071-3_21
Publisher Name: Edizioni della Normale, Cham
Print ISBN: 978-3-319-67070-6
Online ISBN: 978-3-319-67071-3
eBook Packages: Engineering, Engineering (R0)